“Waits, Waits, Do Tell Me”* – how to analyze bottlenecks in SQL Server

This post has been republished via RSS; it originally appeared at: SQL Server Support articles.

First published on MSDN on Dec 10, 2012

“Waits” in SQL Server are often a major source of performance issues. "Waits" as they are known in the SQL Server world essentially indicate that a worker thread inside SQL Server is waiting for a resource to become available before it can proceed with executing. See, a thread is tasked with executing code, it blossoms and blooms, it is “alive” when it is executing code on the CPU. “Waits” on the other hand are stagnation, obstruction, delay – preventing a thread from “realizing its full potential” executing the task it is given. Therefore, waits are the "opposite" of execution, i.e. when a SQL Server thread is waiting (stuck behind a resource), it is necessarily not executing (running code on the CPU). When you think of “bottlenecks” in terms of performance, “waits” is what you are really thinking about: “where are things getting stuck?”, “where is the road becoming narrower?”, “where is the neck of the bottle?”

 

From my comparative experience with other software products (outside the database world), I think we are very fortunate that SQL Server is transparent about the roadblock it encounters. SQL Server uses Wait Types (scroll down in that article to view entire table) to show us the particular bottleneck a thread is waiting on. Wait types themselves have been discussed extensively in the blogosphere, so I won’t delve into it.

 

One key element that is sometimes omitted in bottleneck analysis is the “wait time”. Knowing what type of bottleneck a SQL thread is behind is indeed important. But how long the wait lasted is equally important. For example, a wait for 15 ms. is not something I lose sleep over, but a wait that lasts 5000 ms certainly gets my attention, especially if the same session is reporting it multiple times in the lifetime of a query execution.

 

Armed with this knowledge, you are ready to collect some data, aren't you? There are 4 data points, so to speak, that may be of interest:

 

1. Overall waits at the SQL Server – helps with understanding trends and what are most common waits

 

2. Individual sessions waits captured via periodic snapshots against sys.dm_exec_requests – helps with waits on specific sessions (for trending purposes) and overall idea where waits are occurring.

 

3. All waits occurring in a particular session – helps with exhaustive analysis of all the waits for a session

 

4. All waits occurring across all sessions – helps with exhaustive analysis of all waits occurring on all sessions. This is a fairly expensive data collection in terms of server impact and it terms of huge amounts of data that need to be analyzed. I generally discourage people from doing this.

 

1. Overall Waits at the SQL Server Level

 

(least specific but least expensive)

 

To understand the bottleneck trends in SQL Server, one can use Wait statistics Perfmon counters (in SQL Server 2008 or later for most common wait types) and DMV ( sys.dm_os_wait_stats )

 

You can find a technique to capture this type of data on Mark House’s blog . But essentially, the simplest for querying the DMV can be done this way:




declare @logtime datetime
set @logtime = getdate ()
select @logtime logtime, * into #wait_stats from sys.dm_os_wait_stats
waitfor delay '00:05:00'  --wait for 5 min and then capture another snapshot so that you can compare
set @logtime = getdate ()
insert into #wait_stats select @logtime,* from sys.dm_os_wait_stats


Now you can view the wait type difference and determine the longest-lasting bottlenecks. Be sure you exclude the commonly-ignorable wait types in your analysis.


select wait_type, (max(wait_time_ms)-MIN(wait_time_ms))/60000 DeltaMin, DATEDIFF(MINUTE,min(logtime), max(logtime)) as CollectionPeriodForThisWaitType,
(max(wait_time_ms)-MIN(wait_time_ms))/600.00 / DATEDIFF(MINUTE,min(logtime), max(logtime)) as PercentOfCollectionTIme
from #wait_stats
group by wait_type
order by DeltaMin desc

 

2. Individual sessions waits captured periodically via snapshots of DMVs

 

(more specific, but more expensive)

 

This method essentially captures periodic snapshots from DMVs such as sys.dm_exec_requests and shows what specific wait types each session is experiencing at the time of the capture. Thus if you capture multiple snapshots over time, you can observe somewhat of a trend and make inferences as to what is the predominant bottleneck for each session (or all sessions). This does allow you to focus on a specific session and therefore narrow your troubleshooting in case you know which query is problematic.

 

I don’t need to reinvent the wheel on this as far as scripts are concerned. The PSSDIAG tool does an excellent job collecting data, if you enable “SQL 2005 Perf Stats” or “SQL 2008 Perf Stats” (depending on SQL version) custom Diagnostic (on by default)

 

Also, you can manually collect the same data, without using PSSDIAG by running the Performance Statistics scripts yourself

 

In essence, though, these scripts perform a task similar to the following and you can use the following script to do the work:




set nocount on
while (1=1)
begin
select  session_id,wait_type,wait_time,text
from sys.dm_exec_requests r cross apply sys.dm_exec_sql_text(sql_handle)
waitfor delay'00:00:05'
end

 

 

3. All waits occurring in a particular session

 

(most precise but most expensive)

 

If you have narrowed the problem down to a specific query and you have indications that the query is slow due to waits then you can capture the entire set of wait types and corresponding wait durations for the lifetime of a query using Extended Events. This method works best if you can reproduce the issue at will.







Capture an Extended Event trace

drop eventsession AllWaitsForSingleSession on server
go
create event session AllWaitsForSingleSession on server
add event sqlos.wait_info
(action (sqlserver.sql_text)
where sqlserver.session_id=52 --NEED TO change SESSION_ID that matches your repro
--and duration >=5000 -- you may choose to only capture waits above a certain duration threshold
and opcode= 1 ),
add event sqlos.wait_info_external
(action (sqlserver.sql_text)
where sqlserver.session_id=52 --NEED TO change SESSION_ID that matches your repro
--and duration>=5000 --you can specify to capture only waits above 5 sec
and opcode= 1 )
add target package0.asynchronous_file_target (set filename=N'c:\waits.xel')
go
alter event session AllWaitsForSingleSession on server state=start



Reproduce Issue: Run the problem query now on session 52 (for example). Once the query completes, you can stop the Event session



alter event session AllWaitsForSingleSession on server state=stop

Data Analysis: Replace the XEL  file names

select * into #xe1 from sys.fn_xe_file_target_read_file ('C:\waits*.xel',null,null,null) as t
go
WITH xmldata([object_name],event_data)
as ( select [object_name], convert(xml,event_data) from #xe1 )
select event_data.value('( /event/@name)[1]' , 'varchar(40)' ) as EventName,
event_data.value('( /event/@timestamp)[1]' ,'datetime') as EventTime,
event_data.value('(/event/data/text)[1]','varchar(32)') as WaitType,
event_data.value('(/event/data/text)[2]','varchar(16)') as OpCode,
event_data.value('(/event/data/value)[3]','int') as Duration,
from xmldata

Overall, addressing waits when troubleshooting SQL performance problems will be the most common path to resolving such problems as waits are the most common cause for slow performance.




Namaste!

Joseph

* This is a play on "Wait, wait don't tell me" - NPR show

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.