Jörg Stryk (MVP - Dynamics NAV) about Performance Optimization & Troubleshooting with MS Dynamics NAV and MS SQL Server
Well, "Performance Optimization" in NAV & SQL is some sort of "Cornucopia of issues" – starting from Hardware Sizing, Platform Configuration, Index Tuning, SIFT Optimization, etc. … and – last but not least – the question: "How to prevent and solve blocking conflicts?"
At the "Directions EMEA 2008" Event in Paris I tried to give some feasible, practical advices to answer this question. The following should give some further explanations, in addition to the "Session Package" one could download from my website: http://www.stryk.info/directionsemea_blockdetection.zip
(Based on NAV 4.00 and SQL Server 2005)
### Changes 12.05.2008 ### >>>
Due to the great response I got on this BLOG I could further improve the scripts, thus the processes are more reliable, especially when it is about "translating" from "waitresource" into "table name". Additionally, now the relevant Index Names are shown, too.(Many thanks to Legosz (Mark) for his inspiration and support with this!)
### Changes 12.05.2008 ### <<<
"Locking" is actually some sort of reservation of a system-resource as a record, table, etc.. "Locking" is crucial to grant the consistency of data, and every database MUST lock at certain time at a certain extend. When we're talking about "locking problems" we actually mean "blocking problems"; means if two (or more) processes want to lock the same resource and a conflict is generated. Thus, we are encountering problems when a "lock" results in a "block". Hence, it is a matter of probability if/how/when such a conflict could occur. To reduce "blocking conflicts" it is necessary to reduce the probability of "locking conflicts" …
To minimize the risk of blocks it is necessary to keep the "lock granularity" (locking-level) as small as possible. Lock granules are for example Row, Range, Page, Index, Key, Table, etc.. The lower the granularity, the lower the probability of encountering blocks. For example, the probability that two processes at the same time want to work with the "Sales Header" table is much higher than the probability that these two processes want to work with the same record (the same document). Thus, if the "Sales Header" table would be locked, the risk of getting blocked if far higher than if just a single record was locked!
Usually SQL Server sets the optimal lock granularity, but with NAV this could be overruled (which is mostly a bad idea). It is a common misunderstanding that the "Always Rowlock" feature in NAV is required to lock on row-level. Regardless of this setting, SQL Server will ALWAYS start locking on the lowest possible granularity, e.g. row-level. If "Always Rowlock" is enabled, then NAV will add the query hint ROWLOCK which prevents "Lock Escalation": If SQL Server recognizes that the administration of multiple Row-Locks is costly (threshold is short before 1000 row-locks) it will escalate the locking level, means it will replace the many Row-Locks by e.g. few Range-Locks or Page-Locks, etc..
The ROWLOCK hint prevents this escalation and forces SQL Server to keep the Row-Lock granularity. Hence, "Always Rowlock" is indeed reducing blocking conflicts – due to the forced low granularity – but this could be at cost of the overall performance: all these locks have to be maintained and require a sufficient amount of RAM. I recommend to disable "Always Rowlock" and investigate and solve the arising blocking conflicts specifically, not wasting precious hardware resources.
With NAV before version 4.00 the lock granularity could also be set:
CREATE TABLE [$ndo$dbconfig] (config VARCHAR(1024)) GRANT SELECT ON [$ndo$dbconfig] TO [public] GO INSERT INTO [$ndo$dbconfig] VALUES ('DefaultLockGranularity=Yes')
(See also the "Application's Designers Guide")
The "Lock Granularity" is also influenced by the so called "Transaction Isolation Level" – the degree of how much a transaction is isolated from other transactions. With NAV the default level is READUNCOMMITED, means even data written by transactions that have not yet committed could be read from other transactions (aka "Dirty Reads").
With READUNCOMMITTED the risk of encountering blocks is minimal. The highest isolation level is SERIALIZABLE – here a transaction is completely isolated from others, hence it has the highest risk of generating blocking conflicts. With the NAV commands Record.LOCKTABLE or Record.FINDSET(TRUE) the TSQL statement SET TRANSACTION ISOLATION LEVEL is sent, thus the transaction is serialized. Further, after serialization NAV sends an UPDATELOCK hint which transforms the formerly set "Shared Locks" (S) into "Exclusive Locks" (X) (see "Books Online" for details).
Hence, serializing transactions could cause trouble, so it should be only done where it is really necessary, e.g. to avoid "Dirty Reads" (which is the case in all standard posting routines when Ledger Entries are created).
Automatic Block Detection
Before Blocks could be resolved it is necessary to know precisely the "Who is Who" about the conflicts; means "Who is blocking whom?", "When?" and "On which Resource?" – and of course: "How often does a conflict occur?". Solving blocking conflicts basically has to be done within the C/AL source code – and this task could be very complex and time consuming! Thus it is crucial to know about this "Who is Who" to fix the worse problems first, not spending time on singular events. The following describes one solution – there might be plenty – which I consider quite simple but smart:
Step 1 – Block Log Table
To collect the data about blocks we need a table ("ssi_BlockLog") for storage:
USE [Navision] GO CREATE TABLE [dbo].[ssi_BlockLog] CREATE TABLE [dbo].[ssi_BlockLog] ( [entry_no] bigint identity constraint [ssi_BlockCheck_Tab$pk_ci] primary key clustered, [timestamp] datetime, [db] varchar(128) collate database_default, [waitresource] varchar(128), [table_name] varchar(128) collate database_default, [index_name] varchar(128) collate database_default, [waittime] bigint, [lastwaittype] varchar(128), [spid] int, [loginame] varchar(128) collate database_default, [hostname] varchar(128) collate database_default, [program_name] varchar(128) collate database_default, [cmd] nvarchar(max) collate database_default, [query_plan] xml, [status] varchar(128) collate database_default, [cpu] bigint, [lock_timeout] int, [blocked by] int, [loginame 2] varchar(128) collate database_default, [hostname 2] varchar(128) collate database_default, [program_name 2] varchar(128) collate database_default, [cmd 2] nvarchar(max) collate database_default, [query_plan 2] xml, [status 2] varchar(128) collate database_default, [cpu 2] bigint, [block_orig_id] int, [block_orig_loginame] varchar(128) collate database_default )GO
Step 2 – Stored Procedure to save the data
A "Stored Procedure" "ssi_blockdetection" is used to extract the block-information from various DMV (Dynamic Management Views) and save it into table ssi_BlockLog (from step 1).
Step 3 – Job to save the data
DROP PROCEDURE [dbo].[ssi_blockdetection] GO
CREATE PROCEDURE [dbo].[ssi_blockdetection] @mode varchar(10) = 'loop', -- "loop" or "once" @threshold int = 1000, -- Block threshold in milliseconds @frequency int = 3, -- Check frequency in milliseconds @save tinyint = 0 -- save output to table ssi_BlockLog (0 = no, 1 = yes)with encryptionas
if @mode <> 'once' begin print '*********************************************************' print '*** STRYK System Improvement ***' print '*** Performance Optimization & Troubleshooting ***' print '*** (c) 2008, STRYK System Improvement, Jörg Stryk ***' print '*** www.stryk.info ***' print '*********************************************************' print ' Version 4.02, Date: 20.01.2009 ' print ''end
if (@mode not in ('loop', 'once')) begin raiserror ('ERROR: Invalid Parameter @mode: %s', 15, 1, @mode) returnendif (@threshold < 1) begin raiserror ('ERROR: Invalid Parameter @threshold: %i', 15, 1, @threshold) returnendif (@frequency < 1) begin raiserror ('ERROR: Invalid Parameter @frequency: %i', 15, 1, @frequency) returnendif (@save not in (0,1)) begin raiserror ('ERROR: Invalid Parameter @save: %i', 15, 1, @save) returnend
set nocount onset statistics io offdeclare @spid int, @spid2 int, @loginame varchar(128), @blocked_by int, @blocked_by_name varchar(128), @orig_id int, @orig_name varchar(128), @timestmp datetime, @i int
if @mode = 'once' goto start_check
while 1 = 1 begin
if exists (select * from sys.dm_exec_requests where [blocking_session_id] <> 0) begin print 'Checkpoint ' + convert(varchar(30), getdate()) if @save = 0 begin
select [db] = db_name(s1.[database_id]), [waitresource] = ltrim(rtrim(s1.[wait_resource])), [table_name] = object_name(sl.rsc_objid), [index_name] = si.[name], s1.[wait_time], s1.[last_wait_type], s1.[session_id], session1.[login_name], session1.[host_name], session1.[program_name], [cmd] = isnull(st1.[text], ''), [query_plan] = isnull(qp1.[query_plan], ''), session1.[status], session1.[cpu_time], s1.[lock_timeout], [blocked by] = s1.[blocking_session_id], [login_name 2] = session2.[login_name], [hostname 2] = session2.[host_name], [program_name 2] = session2.[program_name], [cmd 2] = isnull(st2.[text], ''), [query_plan 2] = isnull(qp2.[query_plan], ''), session2.[status], session2.[cpu_time] -- Process Requests from sys.dm_exec_requests (nolock) s1 outer apply sys.dm_exec_sql_text(s1.sql_handle) st1 outer apply sys.dm_exec_query_plan(s1.plan_handle) qp1 left outer join sys.dm_exec_requests (nolock) s2 on s2.[session_id] = s1.[blocking_session_id] outer apply sys.dm_exec_sql_text(s2.sql_handle) st2 outer apply sys.dm_exec_query_plan(s2.plan_handle) qp2 -- Sessions left outer join sys.dm_exec_sessions (nolock) session1 on session1.[session_id] = s1.[session_id] left outer join sys.dm_exec_sessions (nolock) session2 on session2.[session_id] = s1.[blocking_session_id] -- Lock-Info left outer join master.dbo.syslockinfo (nolock) sl on s1.[session_id] = sl.req_spid -- Indexes left outer join sys.indexes (nolock) si on sl.rsc_objid = si.[object_id] and sl.rsc_indid = si.[index_id] where s1.[blocking_session_id] <> 0 and (sl.rsc_type in (2,3,4,5,6,7,8,9)) and sl.req_status = 3 and s1.[wait_time] >= @threshold
end else begin
set @timestmp = getdate()
insert into [ssi_BlockLog] ([timestamp],[db],[waitresource],[table_name],[index_name],[waittime],[lastwaittype],[spid],[loginame],[hostname],[program_name],[cmd],[query_plan],[status],[cpu],[lock_timeout],[blocked by],[loginame 2],[hostname 2],[program_name 2],[cmd 2],[query_plan 2],[status 2],[cpu 2],[block_orig_id],[block_orig_loginame]) select @timestmp, [db] = db_name(s1.[database_id]), [waitresource] = ltrim(rtrim(s1.[wait_resource])), [table_name] = object_name(sl.rsc_objid), [index_name] = si.[name], s1.[wait_time], s1.[last_wait_type], s1.[session_id], session1.[login_name], session1.[host_name], session1.[program_name], [cmd] = isnull(st1.[text], ''), [query_plan] = isnull(qp1.[query_plan], ''), session1.[status], session1.[cpu_time], s1.[lock_timeout], [blocked by] = s1.[blocking_session_id], [login_name 2] = session2.[login_name], [hostname 2] = session2.[host_name], [program_name 2] = session2.[program_name], [cmd 2] = isnull(st2.[text], ''), [query_plan 2] = isnull(qp2.[query_plan], ''), session2.[status], session2.[cpu_time], [block_orig_id] = null, [block_orig_id] = null -- Process Requests from sys.dm_exec_requests (nolock) s1 outer apply sys.dm_exec_sql_text(s1.sql_handle) st1 outer apply sys.dm_exec_query_plan(s1.plan_handle) qp1 left outer join sys.dm_exec_requests (nolock) s2 on s2.[session_id] = s1.[blocking_session_id] outer apply sys.dm_exec_sql_text(s2.sql_handle) st2 outer apply sys.dm_exec_query_plan(s2.plan_handle) qp2 -- Sessions left outer join sys.dm_exec_sessions (nolock) session1 on session1.[session_id] = s1.[session_id] left outer join sys.dm_exec_sessions (nolock) session2 on session2.[session_id] = s1.[blocking_session_id] -- Lock-Info left outer join master.dbo.syslockinfo (nolock) sl on s1.[session_id] = sl.req_spid -- Indexes left outer join sys.indexes (nolock) si on sl.rsc_objid = si.[object_id] and sl.rsc_indid = si.[index_id] where s1.[blocking_session_id] <> 0 and (sl.rsc_type in (2,3,4,5,6,7,8,9)) and sl.req_status = 3 and s1.[wait_time] >= @threshold update [dbo].[ssi_BlockLog] set [table_name] = '- unknown -' where [table_name] is null
-- get block originator declare originator_cur cursor for select [blocked by], [loginame 2] from [dbo].[ssi_BlockLog] where [timestamp] = @timestmp for update open originator_cur fetch next from originator_cur into @blocked_by, @blocked_by_name while @@fetch_status = 0 begin set @i = 0 set @orig_id = @blocked_by set @orig_name = @blocked_by_name set @spid2 = @blocked_by while (@spid2 <> 0) and (@i < 100) begin if exists(select top 1 [blocked by] from [dbo].[ssi_BlockLog] where ([timestamp] = @timestmp) and ([spid] = @spid2)) begin select top 1 @spid = [blocked by], @loginame = [loginame 2] from [dbo].[ssi_BlockLog] where ([timestamp] = @timestmp) and ([spid] = @spid2) set @orig_id = @spid set @orig_name = @loginame set @spid2 = @spid end else set @spid2 = 0 set @i = @i + 1 -- "Emergency Exit", to avoid recursive loop end update [dbo].[ssi_BlockLog] set [block_orig_id] = @orig_id, [block_orig_loginame] = @orig_name where current of originator_cur fetch next from originator_cur into @blocked_by, @blocked_by_name end close originator_cur deallocate originator_cur
if @mode = 'once' return
waitfor delay @frequencyend
A SQL Server Agent Job will then execute the procedure "ssi_blockdetection" (from step 2) .
USE [msdb] GO BEGIN TRANSACTION DECLARE @ReturnCode INT SELECT @ReturnCode = 0 IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'STRYK System Improvement' AND category_class=1) BEGIN EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'STRYK System Improvement' IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback END DECLARE @jobId BINARY(16) EXEC @ReturnCode = msdb.dbo.sp_add_job @job_name=N'SSI: Block Detection', @enabled=1, @description=N'Automatic Block-Detection by STRYK System Improvement, http://www.stryk.info', @category_name=N'STRYK System Improvement', @owner_login_name=N'sa', @job_id = @jobId OUTPUT IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'blockdetection', @step_id=1, @cmdexec_success_code=0, @on_success_action=1, @on_fail_action=2, @subsystem=N'TSQL', @command=N'EXECUTE ssi_blockdetection @mode=''once'', @threshold=1, @frequency=1, @save=1', @database_name=N'Navision' -- Change Database Name on demand IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1 IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)' IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback COMMIT TRANSACTION GOTO EndSave QuitWithRollback: IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION EndSave:
(Remark: the TSQL which extracts the data from the system-tables might be improved, it's just an example – please feel free to improve, I would appreciate your support!)
Step 4 – Alert to monitor "Processes Blocked"
To automatically keep track about the blocks a SQL Server Agent Alert ("SSI: Block Detection") could monitor the SQL Server Performance Counter "SQLServer::General Statistics – Processes blocked"; further it responds by executing the SQL Agent Job:(updated 29.01.2010 - thanks Oleksandr C. for the hint!)USE msdbGOdeclare @instance varchar(128), @perfcon varchar(256)if @@servicename = 'MSSQLSERVER' -- Standard-Instance set @instance = 'SQLServer'else -- Named Instance set @instance = 'MSSQL$' + @@servicenameset @perfcon = @instance + N':General Statistics|Processes blocked||>|0'
EXEC sp_add_alert @name=N'SSI: Block Detection', @message_id=0, @severity=0, @enabled=1, @delay_between_responses=10, @include_event_description_in=0, @performance_condition= @perfcon, @job_name=N'SSI: Block Detection'GO
Now, if the number of "Processes blocked" raises above 0 – a block was encountered – this Alert is triggered and executes the SQL Server Agent Job ("SSI: Block Detection" from Step 2) which logs the block-information - the "Who is Who of Blocks"!
(added 29.01.2010:) By default the "Performance Sampling Interval" of the SQL Server Agent is 20 seconds, thus, regardless of the "delay_between_responses" setting the actual minimum delay is 20 secs. This is defined within a Registry Key and could be changed (if feasible!):USE masterGOEXEC xp_instance_regwrite N'HKEY_LOCAL_MACHINE', N'SOFTWARE\Microsoft\MSSQLServer\SQLServerAgent', N'PerformanceSamplingInterval', REG_DWORD, 10 -- 10 Seconds
Caution: Any Registry changes have to be handled with care!
THAT'S IT – AUTOMATIC BLOCK DETECTION ESTABLISHED!!!
Finally we could easily analyze this data, e.g. finding out who is cause or victim of blocks, which resources are affected, how long do the blocks take, etc.. For example:
SELECT [waitresource], [table_name], [index_name], [blocked_login] = [loginame], [blocking_login] = [loginame 2], [block_orig_loginame], [count] = COUNT([entry_no]),[max_duration] = MAX([waittime])FROM dbo.ssi_BlockLog--WHERE [timestamp] BETWEEN '17.04.2008 00:00' AND '17.04.2008 23:59'GROUP BY [waitresource], [table_name], [index_name], [loginame], [loginame 2], [block_orig_loginame]ORDER BY COUNT([entry_no]) DESCGO
SELECT * FROM ssi_BlockLog ORDER BY [entry_no] DESCGO
(Remark: these examples a created for a database called "Navision" – please change on demand)
Solving and Preventing Blocks
Well, we have to be aware that it is impossible to solve all kinds of potential blocking conflicts, as it is impossible to predict all probably – and improbably – circumstances of conflicting processes. Means: there will always be blocks, we only can reduce the problems down to a certain degree. Due to the way how certain processes are designed in NAV – e.g. posting routines – the success is limited to the number of concurrent users running these transactions in parallel. To reduce blocks these optimizations are necessary:
(see my BLOG http://dynamicsuser.net/blogs/stryk/archive/2007/10/28/included-columns-vs-sift-tables.aspx for further details) When SIFT data is updated (aggregation) the system has to establish tight locks to make sure only committed data is included when summing up the figures. The more aggregation levels (buckets) exist, and the higher this aggregation happens, the higher is the risk of encountering blocks. (Remark: With NAV 5.0 SP1 – Indexes Views – this is void)
If the index structure in insufficient usually transactions take much longer because "Index Scans" are performed instead of "Index Seeks". "Scanning" indexes – reading the leaf node level – is time consuming, and has a high risk of getting blocked, e.g. if a write transactions locks a record within the "to-be-scanned" range. Optimizing indexes speeds up transactions and reduced blocking conflicts.
It is important to optimize the "length"/duration of processes and transactions in NAV. The longer a transaction takes, the more locks are engaged (lock escalation), the higher the risk of causing/encountering blocks. Also, the longer transactions take, the more system resources are occupied. Shorting transactions with a COMMIT – at the right time and place! – will release locks and resources.
The faster a transaction could be processed, the lower is the risk of encountering blocks. So speeding up the system in any way (CPU, RAM, Disks, Network, etc.) could be quite feasible, too. When it is about "Index Optimization" especially the design of the Clustered Index could have major impact on blocks! The following should give few examples:
Table 357 "Document Dimension" Clustered Index (Standard): Table ID, Document Type, Document No., Line No., Dimension Code
Looking at the physical structure of the table, the records are actually stored in this way:
In NAV data from T357 is selected by querying the "Dimension" records e.g. from the "Sales Header" and "Sales Line" table.
Process A (simplified code):
DocDim.SETFILTER("Table ID", '36|37'); DocDim.SETRANGE("Document Type", Order); DocDim.SETRANGE("Document No.", 100);
Process B (simplified code):
DocDim.SETFILTER("Table ID", '36|37'); DocDim.SETRANGE("Document Type", Order); DocDim.SETRANGE("Document No.", 200);
In SQL this results in reading – and locking! – the whole range of records from "Header" to "Line". As soon as a write transaction starts, e.g. by posting the Order, "Exclusive Locks" are engaged. Hence, in this example Process A also locks the "Header" records from Order 200, even though it is a different document – a block occurs as soon as Process B tries to post Order 200! This problems gets worse the more documents were created, as the physical distance between "Header" and "Lines" get bigger and bigger …
By forcing an optimized physical order of the records by changing the "Clustered Index" these kind of block could be prevented:
Clustered Index (Optimized): Document No., Document Type, Table ID, Line No., Dimension Code
This optimizations improves the overall performance of posting transactions, they'll perform faster and encounter way less blocks!
Whenever a "Ledger Entry" is written in NAV, the following algorithm is used:
LedgerEntry.LOCKTABLE; LedgerEntry.FINDLAST; NewEntryNo := LedgerEntry."Entry No." + 1;
All these tables have a Primary Key/Clustered Index of "Entry No.". The first problem here is, that the transaction is serialized, means "Dirty Reads" are prevented, thus, if a is written it exclusively locks the last record – other processes are blocked and cannot add another record until the lock has been released. This behavior is intended to grant the consistent sequence in numbering of the records.
FOR ALL KINDS OF FISCALLY RELEVANT TABLES (e.g. G/L Entry, VAT Entry, etc.) THIS MUST NOT BE CHANGED – else you would jeopardize the correctness of the book-keeping!
But there are tables where this "sequence" is not important (e.g. "Change Log Entry", "Warehouse Entry", etc.) where we could improve things. The second problem is the query at all – the FINDLAST – to just retrieve a record to find the next "Entry No." (thankfully the FINDLAST just fetches one record, with the old FIND('+') is was even worse!) By enabling the "AutoIncrement" property of the field "Entry No.", it is not necessary to query the SQL Server for the number – as SQL Server will generate the "Entry No." automatically on INSERT! (see table 405 "Change Log Entry" for example!) But the third – and most important – problem is the physical order of the records.
By default they are store in sequence, new records are added to the end. When a new record is inserted – added at the end of the table – an Exclusive Lock on this row is established (ROW X) and an Intended-Exclusive Lock on the Page (PAG IX). When a second process tries to add another record, the PAG IX prevents that another ROW X is set on this page – the second process gets blocked!
ROW X / PAG IX
4 (New Record)
These conflicts could be reduced by introducing a different "Clustered Index" – using a new field of type GUID (Globally Unique Identifier) (to be created with method CREATEGUID).
Primary Key: "Entry No." (type Integer) Clustered Index: "UID" (type GUID)
5 (New Record)
The GUID values are not created in sequence, so actually records are not added to the end of the table, but inserted in between the records. If the table is big enough – to occupy lots of pages – the probability is very high that every insert happens on a different page, so the PAG IX does no harm anymore!
These two examples should give evidence about the importance of the physical structure of a table! Please regard, that these solution are actually quite contrary: in Example 1 a certain physical order was forced, and in Example 2 a physical dis-order (chaos) was the goal. Two opposite solutions – both to fix blocking problems.
This should point out that it is "tricky business" to find the ideal physical structure for some tables.
(Remark: please have in mind that all the examples above have advantages and disadvantages, so take care when implement this – anyway, the change in T357 is quite failsafe)
[@Mark/Bert: I would appreciate if you could share your experiences on this specific issue! Thanks!]
A sort of "first class" blocking situation is, when a so called "Deadlock" occurs:
Locking Resource #1
Locking Resource #2
A: Lock Grant, B: Lock Grant
A: Lock Wait – Blocked
B: Lock Wait - Blocked
Now we have a situation where two (or more) processes are holding locks, and are trying to also lock the resource from the other process – resulting in both processes are waiting for the other to release the lock.
While with the "native" C/SIDE Server both processes indeed would wait for eternity (at least in older versions), the SQL Server recognizes this situation and automatically resolves it: it chooses one process as "victim" (regarding CPU time, number of pages changed and – theoretically, but not with NAV – the DEADLOCK_PRIORITY) and "kills" that process (= cancel the transaction and rolling it back) so that the other(s) could continue. This usually takes 3 to 5 seconds. While with "normal" blocks a process is just forced to wait, a Deadlock terminates the processing (as LOCKTIMEOUT would do with a block) which could be a problem for e.g. unattended processes (NAS, Job Scheduler) if those cannot resume after the "kill".
Automatic Deadlock Detection
It is common knowledge that with the trace-flags 1204 and 1222 (or 1204 [, 1205] and 3605 in SQL 2000) deadlock information could be written into the SQL Error-Log. Unfortunately it is a real "pain" to analyze this Log, especially if lot of Deadlocks were encountered. With SQL 2005 its much more easier to gather this information!
With the SQL Profiler the "Deadlock Graph" event could be monitored. This DLG is based on a XML structure, which makes analysis simple.
Step 1 – Automatic Profiler Trace
As any SQL Profiler Trace, so a "Deadlock Graph" trace could be scripted. Thus, a TSQL script could be created to automatically start such a trace, e.g. when starting the SQL Agent Service, writing the DLG into a file:
USE [msdb] GO BEGIN TRANSACTION DECLARE @ReturnCode INT SELECT @ReturnCode = 0 IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'STRYK System Improvement' AND category_class=1) BEGIN EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'STRYK System Improvement' IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback END DECLARE @jobId BINARY(16) EXEC @ReturnCode = msdb.dbo.sp_add_job @job_name=N'SSI: Deadlock Trace', @enabled=1, @notify_level_eventlog=2, @notify_level_email=0, @notify_level_netsend=0, @notify_level_page=0, @delete_level=0, @description=N'Automatic Deadlock-Trace by STRYK System Improvement, http://www.stryk.info', @category_name=N'STRYK System Improvement', @owner_login_name=N'sa', @job_id = @jobId OUTPUT IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'start trace', @step_id=1, @cmdexec_success_code=0, @on_success_action=1, @on_success_step_id=0, @on_fail_action=2, @on_fail_step_id=0, @retry_attempts=0, @retry_interval=0, @os_run_priority=0, @subsystem=N'TSQL', @command=N'declare @rc int declare @TraceID int declare @maxfilesize bigint set @maxfilesize = 500 exec @rc = sp_trace_create @TraceID output, 0, N''C:\ssi_Deadlock_Trace'', @maxfilesize, NULL if (@rc != 0) goto error declare @on bit set @on = 1 exec sp_trace_setevent @TraceID, 148, 11, @on exec sp_trace_setevent @TraceID, 148, 12, @on exec sp_trace_setevent @TraceID, 148, 14, @on exec sp_trace_setevent @TraceID, 148, 1, @on declare @intfilter int declare @bigintfilter bigint exec sp_trace_setstatus @TraceID, 1 select TraceID=@TraceID goto finish error: select ErrorCode=@rc finish: go ', @database_name=N'master', @flags=4 IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1 IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name=N'Auto', @enabled=1, @freq_type=64, @freq_interval=0, @freq_subday_type=0, @freq_subday_interval=0, @freq_relative_interval=0, @freq_recurrence_factor=0 IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)' IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback COMMIT TRANSACTION GOTO EndSave QuitWithRollback: IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION EndSave:
Here the DLG are written to C:\ssi_Deadlock_Trace.trc.
THAT'S IT – AUTOMATIC DEADLOCK DETECTION ESTABLISHED!!!
To analyze the Deadlocks we could proceed like this:
Step 1 – Extract Deadlock Graphs
Once the Deadlock information has been gathered, the XML files could be extracted from the TRC file: Open TRC file in SQL Profiler and select "File – Export – Extract SQL Server Events – Extract Deadlock Events" (hint: ideally write all DLG into one XDL file).
Step 2 – Analyze XML
The extracted "Deadlock Graphs" are now saved in a XDL file (XML DeadLock). So, by simply renaming it into XML the data could be fed to ANY application or program that could deal with XML – this could be MS Excel in the easiest case! Here we could also learn the "Who is Who about Deadlocks", we could determine the processes involved into deadlocks and the resources (tables, indexes, etc.) which were affected. And we can COUNT the problems!
Solving and Preventing Deadlocks
In addition to the general advice to prevent/avoid blocks (see above) here the sequence of locking is important, too. A deadlock is the result of a kind of "cross-over-lock", hence if both (all) processes lock the resources in the same sequence a deadlock could not occur. This "sequence" could be defined by C/AL Code – when to lock which records, tables, etc. – or from an organizational point, by avoiding potentially conflicting processes (workflow optimization).
(Remark: I strongly recommend to check out the "NAV Troubleshooting Tools & Guide" for detailed information about how to define and verify locking orders and deadlock potentials)
Finally, an advice you'll find in some documentation from MS: "Avoid Deadlocks by locking a common master-resource". This means, that in the very beginning of a transaction this "master resource" is hard-locked, so all other processes are queued – blocked. Of course this total serialization prevents deadlocks, but actually at cost of any parallelism in the system – and high performance also means to have a high degree of parallelism! This option should be the very last line of defense … NOT recommended !
If Index- and SIFT-Optimization is done, it is primarily a matter of the C/AL Code to avoid blocks and deadlocks. Here even the "standard" NAV code has a pretty high potential of conflicts, which is quite difficult – sometimes impossible – to improve. Due to the way especially posting processes are designed in NAV there will always be blocks, it is just a matter of number of concurrent transactions and users, but sooner or later the limit would be reached …
Those issues which could be fixed could require a lot of "attention", it could be quite time consuming to implement the changes and fixes. To spend this time as efficient as possible it is crucial to precisely know where the blocks/deadlock are coming from – the "Who is Who" – and especially to count how often a problem occurs, to determine which are the worst problems to be fixed first, where the solution promises the highest gain in performance …
At last some advertisement: the solutions from above are taken from the "NAV/SQL Performance Toolbox". Here, just the "light" versions are published, the real tools are more sophisticated, of course!
Puuuuhhh … that was a long one … hope I could give you some valuable inspiration to deal with these matters … I appreciate your feedback, especially about how you're dealing with "Blocks & Deadlocks"!
### Changes 03.11.2008 ### >>>
I'm still getting a lot - throughout positive - response on this; also lots of recommendations how to further improve the process. So, I have updated the Code snippets, providing a further improved way of "Block Detection", e.g. getting the "waitresource" was enhanced and finding the "block originator" was added. (Thanks a lot to Rama Mathanmohan for his support!)
### Changes 03.11.2008 ### <<<
### Changes 02.02.2009 ### >>>
Minor enhancements in "Table" and "Procedure", now also saving the "Execution Plan".
### Changes 02.02.2009 ### <<<
CAUTION: Any Registry changes have to be handled with care!
I am new to this. Since I am coming from a Microsoft SQL Server eneonirmvnt, and to some other it may seem like the two versions should be very similar, and they are, to a certain degree, but they are also very different in some important and basic ways. Good read, thanks!
Little update (29.01.2010):
The "Alert" was improved (thanks to Oleksandr C. for the hint!) and I added a feature to change the Registry setting for the "Performance Sampling Interval".
Well, if you don't see blocks on SIFT tables you could be happy - but this does not necessarily mean your SIFT structure is optimized - there's always potential for improvement ;c)
Regarding the "Scan or Seek": YES, you can! I've updated the code examples, now the "Execution Plans" are saved, too!
To troubleshoot a process I count the occurring problems. Then I look into processes which ae causing lots of blocks (originator), but also those which suffer a lot from blocks (blocked user); thus I try to approach the problem from multiple sides ...
Great job on this tool - it seems fantastic! I'm not sure if I fully understand it's output though.
If SIFT is needs to be optimized, would we expect the wait resource to be one of the SIFT tables? If we don't see any SIFT tables as wait resources, does that mean the SIFT may be okay?
With your tool's output, can we tell if we have an index scan or seek?
When optimizing a process, it seems like we should find the login that is most frequently the "original blocking user" and find out what that user was doing. Then use the client monitor run the process that they are running, and use it's output to figure out how to optimize it. Is that right?
Thank you very much for this valuable contribution to the community.
@Bruno Johansson: WIth "Always Rolock" enabled the system will create remarkably more lock requests, SQL Server has to maintain much more locks than without forcing ROWLOCK. These locks are maintained in the "directly adressable memory space"; this is in a 32bit system the first 4 GB of RAM - net 2.7GB. The same address space is used for lot of other internal processes, thus a problem could arise if this RAM space is overloaded! Adding more RAM is pointless in this case ...
With a 64bit system 18.000.000.000 GB could be directly adressed, hence SQL Server could use all available RAM for maintaining locks, etc. - if there is sufficient hardware, "Always Rowlock" would not harm ...
@Jwilder: Well, so far I have throughout positive experiences by simply changing the "SQL Index" property of the PK!
Great write-up Jörg, thanks for sharing.
My thoughts on your comments about Always Rowlock is that if the only drawback is that I need more hardware then I would recommend my client to add the hardware since hardware is cheap. I can add 8GB of server RAM for approx. $600 and that kind of money does not go far in paying for highly skilled consulting services for analyzing code, testing the changes made and implementing them in both a TEST and PROD environment, etc. Same goes for adding more disks, etc.
Is that an incorrect approach in your mind?
Thanks again :)
Very interesting topic especially in regards to blocking and table 357. What is the best way to implement this?
1) Change SQL Index property of primary key
2) Create new key and mak it the clustered index
Hi, Jörg Stryk
Thanks for you explanation. With the autoincrement the writen mode its lot easier and faster, because we have less queries.
I think it's a good idea and it could solve many problems with the lock. Using the GUI in the tables standard another good idea, and I hope Microsoft make this changes quickly to solve one of our great problems.. :)
@Ricardo: Hmmm, it's somewhat difficult to explain this in writing ...
The idea behind "GUID" is this:
First, with using AutoIncrement you don't need to do the LOCKTABLE/FINDLAST stuff, thus you skip the queries - less load - and you don't have to serialize the transaction (reducing blocking conflicts).
So far, so good, but there might be stll a problem when writing data: with "Entry No." as CI (= physical sorting of recs) data is always written to the end of the table, thus there coould be a conflict if more processes want to write to the last page of this table.
GUID strings are looking like random figures, they are not created in sequence - and they are unique and could be used as PK, too.
With creating a CI on basis of a GUID field you actually force the system that data is always written on a different page - not always at the last page (as before = blocking) but on pages anywhere inbetween (less locking).
That is just the idea of it, please be aware that there can be lots of other things to regard: e.g. Index size (much larger with GUID), Fillfactors, Reading issues, disabling old "Entry No.", etc.
@jmetz: Thanks for the corrections!
realy good session, but some comments:
- your Deadlock example has a little mistake:
Process B locks first "Locking Resource #2" and then he should lock "Locking Resource #1".
- You said: "While with the "native" C/SIDE Server both processes indeed would wait for eternity"
-> the "native" C/SIDE Server did resolve the deadlock situation also.
Your comments and suggestions are very useful and important for a good performance in navision.
Could you explain to me in a little detail the use of GUID in a clustered index, like you suggest? I didn't understand how you use it.
When you put a record in a ledger entry you use the findlast without any filters to get the next entry no. for example:
entryno := ledgentry."entry no."; -- this is what the navision standard does
ledgentry."entry no." := entryno;
With the GUID what is the difference you suggest? Sorry my confusion about this issue.
And also a very good session, as I already told him in person .
I'll need to take a day off to read all of that, but thank you very much for the time and effort Jorg, and for sharing it with us, this is great work.