​​​Let’s make it easy! How to read System Health and sp_server_diagnostics?​

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

 


Hi Everyone,


The system_health session is an extended event session that is included by default with SQL Server. This blog post goes over how to easily read and interpret System Health extended event and sp_server_diagnostics with an example. For this post, we will focus on what sp_server_diagnostics reports during the period of 3/15/2023 at 4:00am to 3/15/2023 at 11:20am.

 

Let us say you are being approached to figure out the reason for a glitch during this time window, even if you did not have any monitoring solution in place, you can still tap into system_health session since this is available by default on your SQL Server instance. 

 

1.0 Overview of system_health extended event and sp_server_diagnostics 

 

1.1 What is inside system health xevent? 

 

  • The sql_text and session_id for any sessions that encounter an error that has a severity >= 20. Database Engine error severities - SQL Server | Microsoft Learn
  • The sql_text and session_id for any sessions that encounter a memory-related error. The errors include 17803, 701, 802, 8645, 8651, 8657 and 8902.
  • A record of any non-yielding scheduler problems. These appear in the SQL Server error log as error 17883.
  • Any deadlocks that are detected, including the deadlock graph.
  • The callstacksql_text, and session_id for any sessions that have waited on latches (or other interesting resources) for > 15 seconds.
  • The callstacksql_text, and session_id for any sessions that have waited on locks for > 30 seconds.
  • The callstacksql_text, and session_id for any sessions that have waited for a long time for preemptive waits. The duration varies by wait type. A preemptive wait is where SQL Server is waiting for external API calls.
  • The callstack and session_id for CLR allocation and virtual allocation failures.
  • The ring buffer events for the memory broker, scheduler monitor, memory node OOM, security, and connectivity.
  • System component results from sp_server_diagnostics.
  • Instance health collected by scheduler_monitor_system_health_ring_buffer_recorded.
  • CLR Allocation failures.
  • Connectivity errors using connectivity_ring_buffer_recorded.
  • Security errors using security_error_ring_buffer_recorded.

 

 

 

1.2 What is inside the output of the internal stored procedure: Sp_Server_Diagnostics

 

Since the stored procedure sp_server_diagnostics is internal to SQL Server and cannot be scripted out, the section below expands on how to read the data column of each column from the output of this stored procedure and how to interpret the results.

The output of the stored procedure sp_server_diagnostics is logged in both system_health and sqldiag xevent sessions with different frequencies.

For System_health, the frequency is 5 minutes, and for SQLDiag, the default is 10 seconds with 30 seconds as the health check timeout.  

Data captured by sp_server_diagnostics. (Reference: sp_server_diagnostics (Transact-SQL) - SQL Server | Microsoft Learn) 

 

 

Column

Data type

Description

create_time

datetime

Indicates the time stamp of row creation. Each row in a single rowset has the same time stamp.

component_type

sysname

Indicates whether the row contains information for the SQL Server instance level component or for an Always On availability group:

System:

Collects data from a system perspective on spinlocks, severe processing conditions, non-yielding tasks, page faults, and CPU usage. This information is producing an overall health state recommendation.

 

Resource:

Collects data from a resource perspective on physical and virtual memory, buffer pools, pages, cache, and other memory objects. This information produces an overall health state recommendation.

 

query_processing:

Collects data from a query-processing perspective on the worker threads, tasks, wait types, CPU intensive sessions, and blocking tasks. This information produces an overall health state recommendation.

 

IO_subsystem:

Collects data on IO. In addition to diagnostic data, this component produces a clean healthy or warning health state only for an IO subsystem.

 

Events:

Collects data and surfaces through the stored procedure on the errors and events of interest recorded by the server, including details about ring buffer exceptions, ring buffer events about memory broker, out of memory, scheduler monitor, buffer pool, spinlocks, security, and connectivity . Events will always show 0 as the state.

 

<name of the availability group>:

Collects data for the specified availability group (if component_type = "Always On:AvailabilityGroup").

 

component_name

sysname

Indicates the name of component or the name of the availability group:

System, Resource, Query_processing, Io_subsystem, Events or <name of the availability group>

state

int

Indicates the health status of the component:
0 1 2 3

state_desc

sysname

Describes the state column. Descriptions that correspond to the values in the state column are:
0: Unknown

1: clean

2: warning

3: error

data

varchar (max)

Specifies data that is specific to the component.

 

 

 2.0 How to open multiple extended events on SSMS (SQL Server Management Studio)? 

 

  • Open SSMS and click on file> open> Merge extended event files.

JPSal_18-1680097609222.png

 

  • Next you will see a pop click on “add”. 
  • Select all the xevents for system_health (They can be captured with a tool like TSS, PSSDAIG, or SQL LogScout or you can also find the system_health xel files on the same log folder as the SQL errorlogs) 
  • Click “open” and then “ok”. 

 

JPSal_1-1679610242175.png

 

 

3.0 How to analyze the merged extended event files? 

Since the extended event has a running log of entries, the first thing to do is to apply a time filter to narrow the log entries down to the time of the issue. 

So, when you open the xevent files on the ribbon click on filter.  

 

JPSal_2-1679610242177.png

 

A popup will show up, click on set time filter> set the start time and end time to the hours that you want to review. 

 

Important!!!The xevent timestamp will show with your time zone, you will need to convert the time you can check the SQL error log to see the time zone. You also have the option to add the column timestamp (UTC). To add this column right click on the columns > click on choose column> look for timestamp (UTC) 

 

JPSal_3-1679610242179.png

 

JPSal_4-1679610242180.png

 

For this demo, we will only focus on the output from sp_server_diagnostics  and since the issue in this example happened at 6 am EST (UTC-4) and continue happening for sometime, we will start looking for events starting from 4 am (UTC -6) until 11:21 am as shown on the screenshot below. Next on the bottom we can use the filter “Name = sp_server_diagnostics_component_result” and “state not like clean”  

 

JPSal_5-1679610242181.png

 

After filtering it will give you a list of components that do not have a clean state. You can add the field as a column by right clicking the field and “show column in table.” 

 

JPSal_6-1679610242182.png

 

 From here we will look at each component and what they are reporting. You can group by component and sort by time. To group by right click on the column and click on group by this column.   

 

JPSal_7-1679610242184.png

 

And it will look something like this.

 

JPSal_8-1679610242187.png

 

4.0 What does each component report? 

Now that we have the data that we want to look at least see what each component reports on the data column. This column is on varchar (max)

Some of the counters are very straight forward some are not, but here are some conditions that would cause warnings or errors.

 

 

Component

HealthState

LogicalOperation

Check

Operator

Value

Notes

SYSTEM

WARNING

 

DumpCount

0

Are there any Dumps?

SYSTEM

WARNING

OR

LatchWarnings

3

Having issues with Latches

SYSTEM

WARNING

OR

SpinLockType

!=

SP_LAST [special spinlock]

Sick Spinlock

SYSTEM

WARNING

OR

BadPagesDetected

0

Bad memory?

SYSTEM

ERROR

 

DumpCount

100

Over 100 dumps?

SYSTEM

ERROR

AND

CurrentDumpCount - LastDumpCount

0

Have there been ANOTHER dump?

SYSTEM

ERROR

AND

CurrentDumpCount

LastDumpCount

Same as above

SYSTEM

ERROR

OR

MemoryScribblers

3

Has memory been overwritten by another process?

SYSTEM

ERROR

OR

SpinLockTypeAfterAv

!=

SP_LAST [special spinlock]

Spinlock Type after AV

RESOURCE

ERROR

 

ProcessOOM

120 Seconds

OOM for 2 minutes

RESOURCE

WARNING

 

PoolOOM

=

TRUE

Any pool out of memory

RESOURCE

WARNING

OR

LastResourceNotification

=

MEMPHYSICAL_LOW

Low physical memory

RESOURCE

WARNING

OR

LastResourceNotification

=

MEMVIRTUAL_LOW

Low virtual memory

RESOURCE

WARNING

OR

LastOOMExceptionCount

OOMExceptionCount

Have OOM exceptions?

RESOURCE

WARNING

OR

ProcessOOMCount

!=

0

Is the process itself out of memory

QUERYPROCESSOR

WARNING

 

PendingTasksCount

0

Pending tasks on schedulers

QUERYPROCESSOR

WARNING

OR

NonYieldTrackiongScheduler

!=

NULL

Is there a non-yielding scheduler

QUERYPROCESSOR

ERROR

 

isUnresolvedDeadlock

=

TRUE

Unresolvable deadlock?

QUERYPROCESSOR

ERROR

OR

IsDeadlockedSchedulers

=

TRUE

Any schedulers deadlocked?

IO

WARNING

 

LongIOs

0

Must be current - previous > 0 for warning

 

 

 

4.1 Component: SYSTEM 

Collects data from a system perspective on spinlocks, severe processing conditions, non-yielding tasks, page faults, and CPU usage.

The categories of information that are surfaced for the "system" component data are as follows:

 

  • CPU utilization (sqlCpuUtilization, systemCpuUtilization)
  • Page Faults (pageFaults)
  • Non-Yielding tasks (nonYieldingTasksReported)
  • Dumps (intervalDumpRequests, totalDumpRequests)
  • Access Violations (writeAccessViolationCount, isAccessViolationOccurred )
  • Latches (latchWarnings)
  • Spinlock activity (spinlockBackoffs, sickSpinlockType, sickSpinlockTypeAfterAv )

 

<system spinlockBackoffs="0" sickSpinlockType="none" sickSpinlockTypeAfterAv="none" latchWarnings="0" isAccessViolationOccurred="0" writeAccessViolationCount="0" totalDumpRequests="6" intervalDumpRequests="0" nonYieldingTasksReported="0" pageFaults="1438916" systemCpuUtilization="19" sqlCpuUtilization="14" BadPagesDetected="0" BadPagesFixed="0" LastBadPageAddress="0x0"/>

 

On the above example we can the following

 

  • We have totalDumpRequests="6" and pageFaults="1438916"
  • We can see that systemCpuUtilization="19" and sqlCpuUtilization="14" of the total 19.
  • We can conclude that we have high pageFaults="1438916" and totalDumpRequests="6" which should be always 0.
  • Page fault will not trigger a warning or error by itself but can give you an indication of what is going on with the server.
  • A page fault occurs when a program requests an address on a page that is not in the current set of memory resident pages.  What happens when a page fault occurs is that the thread that experienced the page fault is put into a Wait state while the operating system finds the specific page on disk and restores it to physical memory. We need to look at the memory, PLE and disk to confirm this because we would also get a soft page fault. https://techcommunity.microsoft.com/t5/ask-the-performance-team/the-basics-of-page-faults/ba-p/373120
  • To track paging, you should use the following counters:  “Memory\ Page Faults /sec”, “Memory\ Cache Faults /sec” and “Memory\ Page Reads /sec”.  The first two counters track the working sets and the file system cache.  The Page Reads counter allows you to track hard page faults.  If you have a high rate of page faults combined with a high rate of page reads (which also show up in the Disk counters) then you may have an issue where you have insufficient RAM given the high rate of hard faults.

 

 

4.2 Component: RESOURCE 

Collects data from a resource perspective on physical and virtual memory, buffer pools, pages, cache, and other memory objects. The data column output is divided into three areas of interest. 

 

The initial area of focus pertains to overall resource monitoring.

 

<resource lastNotification="RESOURCE_MEMPHYSICAL_LOW" outOfMemoryExceptions="0" isAnyPoolOutOfMemory="0" processOutOfMemoryPeriod="0">

 

 

The second segment of data derived from the "resource" component pertains to a memory report that encompasses Process and System counters and counts.

 

<memoryReport name="Process/System Counts" unit="Value">

<entry description="Available Physical Memory" value="49261801472"/>

<entry description="Available Virtual Memory" value="139206615916544"/>

<entry description="Available Paging File" value="48387641344"/>

<entry description="Working Set" value="483305889792"/>

<entry description="Percent of Committed Memory in WS" value="100"/>

<entry description="Page Faults" value="1809733786"/>

<entry description="System physical memory high" value="1"/>

<entry description="System physical memory low" value="0"/>

<entry description="Process physical memory low" value="0"/>

<entry description="Process virtual memory low" value="0"/></memoryReport>

 

The third memory report sourced from the "resource" component data comprises Memory Manager counter values.

 

<memoryReport name="Memory Manager" unit="KB">

<entry description="VM Reserved" value="1488779088"/>

<entry description="VM Committed" value="471862324"/>

<entry description="Locked Pages Allocated" value="0"/>

<entry description="Large Pages Allocated" value="0"/>

<entry description="Emergency Memory" value="1024"/>

<entry description="Emergency Memory In Use" value="16"/>

<entry description="Target Committed" value="471859208"/>

<entry description="Current Committed" value="471862328"/>

<entry description="Pages Allocated" value="451123768"/>

<entry description="Pages Reserved" value="361480"/>

<entry description="Pages Free" value="1376480"/>

<entry description="Pages In Use" value="106168864"/>

<entry description="Page Alloc Potential" value="342097376"/>

<entry description="NUMA Growth Phase" value="2"/>

<entry description="Last OOM Factor" value="0"/>

<entry description="Last OS Error" value="0"/></memoryReport></resource>

 

Last Notification looks to be the memory manager’s latest report.

 

  • RESOURCE_MEM_STEADY: No memory issues were observed.
  • RESOURCE_MEMPHYSICAL_HIGH: SQL Server can acquire memory because OS has sufficient free available memory.
  • RESOURCE_MEMPHYSICAL_LOW: SQL Server must return the memory as OS is having memory shortage.
  • RESOURCE_MEMVIRTUAL_LOW: The virtual address range for the SQL Server process is getting exhausted.

Some observations:  

  • In our example we have lastNotification="RESOURCE_MEMPHYSICAL_LOW" telling us that SQL Server cannot further consume more memory or SQL Server must return the memory as OS is having memory shortage. 
  • For our case we can see that the system does have memory "System physical memory high" value="1"/>  
  • But SQL Server has reached its limit "Target Committed" value="471859208"/> "Current Committed" value="471862328"/> 

 

 

4.3 Component: Query 

There are five helpful groupings of statistics into which the data can be categorized:

 

·        Query processing statistics

·        Top non-preemptive wait statistics by count

·        Top non-preemptive wait statistics by duration

·        Top preemptive waits by count

·        Top preemptive waits by duration

 

The initial segment will comprise data related to query processing statistics.

 

Returning a point of time measurement of max workers, actual workers created, idle workers, pending tasks, oldest pending task waiting time (I’ll be paying attention to this one) and scheduler deadlock and non-yielding counts.

 

<queryProcessing maxWorkers="960" workersCreated="579" workersIdle="402" tasksCompletedWithinInterval="1284604" pendingTasks="1" oldestPendingTaskWaitingTime="0" hasUnresolvableDeadlockOccurred="0" hasDeadlockedSchedulersOccurred="0" trackingNonYieldingScheduler="0x0">

 

The second part we have waits. Preemptive and nonPreemptive

 

  • Preemptive: refers to the type of multitasking where the operating system interrupts an executing task to allocate the CPU to another task
  • Non-Preemptive: refers to the type of multitasking where the CPU is allocated to a task and the task holds the CPU until it finishes or voluntarily relinquishes it.

 

We have four sets of results for the Top 10 wait stats.

Here is the output of the top non-preemptive waits, sorted by count based on the "waits" attribute.

 

<topWaits><nonPreemptive>

<byCount>

<wait waitType="MEMORY_ALLOCATION_EXT" waits="14771512961" averageWaitTime="0" maxWaitTime="2852"/>

<wait waitType="RESERVED_MEMORY_ALLOCATION_EXT" waits="1058047487" averageWaitTime="0" maxWaitTime="2100"/>

<wait waitType="ASYNC_NETWORK_IO" waits="493414963" averageWaitTime="0" maxWaitTime="5536"/>

<wait waitType="HADR_WORK_QUEUE" waits="245076271" averageWaitTime="62" maxWaitTime="15388"/>

<wait waitType="WRITELOG" waits="83482761" averageWaitTime="1" maxWaitTime="4632"/>

<wait waitType="PAGELATCH_EX" waits="49895014" averageWaitTime="0" maxWaitTime="1795"/>

<wait waitType="PARALLEL_REDO_WORKER_WAIT_WORK" waits="40554164" averageWaitTime="8" maxWaitTime="4790"/>

<wait waitType="PAGEIOLATCH_SH" waits="26346072" averageWaitTime="1" maxWaitTime="3287"/>

<wait waitType="UCS_SESSION_REGISTRATION" waits="17792927" averageWaitTime="0" maxWaitTime="161"/>

<wait waitType="PAGEIOLATCH_EX" waits="11518408" averageWaitTime="2" maxWaitTime="5653"/></byCount>

 

Here is the output of the top non-preemptive waits, sorted by duration.

 

<byDuration>

<wait waitType="HADR_WORK_QUEUE" waits="245076271" averageWaitTime="62" maxWaitTime="15388"/>

<wait waitType="VDI_CLIENT_OTHER" waits="451646" averageWaitTime="13548" maxWaitTime="60009"/>

<wait waitType="CLR_AUTO_EVENT" waits="35196" averageWaitTime="16700" maxWaitTime="10741940"/>

<wait waitType="PARALLEL_REDO_WORKER_WAIT_WORK" waits="40554164" averageWaitTime="8" maxWaitTime="4790"/>

<wait waitType="HADR_CLUSAPI_CALL" waits="2424632" averageWaitTime="121" maxWaitTime="868"/>

<wait waitType="FT_IFTSHC_MUTEX" waits="188496" averageWaitTime="1560" maxWaitTime="1221459"/>

<wait waitType="HADR_FILESTREAM_IOMGR_IOCOMPLETION" waits="585054" averageWaitTime="502" maxWaitTime="980"/>

<wait waitType="HADR_TIMER_TASK" waits="663915" averageWaitTime="442" maxWaitTime="20004"/>

<wait waitType="HADR_LOGCAPTURE_WAIT" waits="111501" averageWaitTime="2194" maxWaitTime="5044"/>

<wait waitType="ASYNC_NETWORK_IO" waits="493414963" averageWaitTime="0" maxWaitTime="5536"/></byDuration></nonPreemptive>

 

Here is the output of the top preemptive sorted by waits count.

 

<preemptive>

<byCount>

<wait waitType="PREEMPTIVE_XE_CALLBACKEXECUTE" waits="28305259" averageWaitTime="0" maxWaitTime="7"/>

<wait waitType="PREEMPTIVE_OS_AUTHENTICATIONOPS" waits="1286378" averageWaitTime="0" maxWaitTime="50"/>

<wait waitType="PREEMPTIVE_OS_GENERICOPS" waits="1083014" averageWaitTime="0" maxWaitTime="15006"/>

<wait waitType="PREEMPTIVE_OS_FILEOPS" waits="731347" averageWaitTime="0" maxWaitTime="2280"/>

<wait waitType="PREEMPTIVE_OS_CLOSEHANDLE" waits="267390" averageWaitTime="0" maxWaitTime="47"/>

<wait waitType="PREEMPTIVE_OS_AUTHORIZATIONOPS" waits="259924" averageWaitTime="0" maxWaitTime="46"/>

<wait waitType="PREEMPTIVE_OS_REVERTTOSELF" waits="255969" averageWaitTime="0" maxWaitTime="39"/>

<wait waitType="PREEMPTIVE_OS_QUERYCONTEXTATTRIBUTES" waits="255955" averageWaitTime="0" maxWaitTime="47"/>

<wait waitType="PREEMPTIVE_OS_DELETESECURITYCONTEXT" waits="249887" averageWaitTime="0" maxWaitTime="409"/>

<wait waitType="PREEMPTIVE_OS_LOOKUPACCOUNTSID" waits="181358" averageWaitTime="0" maxWaitTime="108"/></byCount>

 

 

Here is the output of the top preemptive sorted by duration.

 

<byDuration>

<wait waitType="PREEMPTIVE_OS_WRITEFILEGATHER" waits="698" averageWaitTime="1460" maxWaitTime="105279"/>

<wait waitType="PREEMPTIVE_OS_FILEOPS" waits="731347" averageWaitTime="0" maxWaitTime="2280"/>

<wait waitType="PREEMPTIVE_OS_GENERICOPS" waits="1083014" averageWaitTime="0" maxWaitTime="15006"/>

<wait waitType="PREEMPTIVE_OS_AUTHENTICATIONOPS" waits="1286378" averageWaitTime="0" maxWaitTime="50"/>

<wait waitType="PREEMPTIVE_OS_DELETESECURITYCONTEXT" waits="249887" averageWaitTime="0" maxWaitTime="409"/>

<wait waitType="PREEMPTIVE_OS_WRITEFILE" waits="57570" averageWaitTime="2" maxWaitTime="1211"/>

<wait waitType="PREEMPTIVE_OS_LOOKUPACCOUNTSID" waits="181358" averageWaitTime="0" maxWaitTime="108"/>

<wait waitType="PREEMPTIVE_OS_QUERYREGISTRY" waits="99563" averageWaitTime="0" maxWaitTime="76"/>

<wait waitType="PREEMPTIVE_XE_CALLBACKEXECUTE" waits="28305259" averageWaitTime="0" maxWaitTime="7"/>

<wait waitType="PREEMPTIVE_OS_CREATEFILE" waits="13676" averageWaitTime="2" maxWaitTime="314"/>

</byDuration></preemptive></topWaits>

 

You will also get intensive request session.

 

<cpuIntensiveRequests>

<request sessionId="1664" requestId="2" command="SELECT" taskAddress="0x2ce3f833848" cpuUtilization="99" cpuTimeMs="1266"/>

<request sessionId="1309" requestId="2" command="SELECT" taskAddress="0x2e9153c3088" cpuUtilization="99" cpuTimeMs="906"/>

<request sessionId="452" requestId="0" command="SELECT" taskAddress="0x30ec462cca8" cpuUtilization="99" cpuTimeMs="15"/>

<request sessionId="552" requestId="2" command="SELECT" taskAddress="0x2f007dab848" cpuUtilization="98" cpuTimeMs="10219"/>

<request sessionId="969" requestId="2" command="SELECT INTO" taskAddress="0x2f165f424e8" cpuUtilization="98" cpuTimeMs="875"/>

</cpuIntensiveRequests>

 

Next will be pending task.

 

<pendingTasks><entryPoint name="Process Command" count="1"/></pendingTasks>

 

And in case there were any blocking happening it will show up here.

 

<blockingTasks></blockingTasks></queryProcessing>

 

 

For our example we can see that we have one pending task pendingTasks="1" This means the scheduler was hung and 1 task were waiting on it to get a scheduler and run.

 

SQL Server pending tasks refer to a set of activities that are currently waiting to be executed by the SQL Server but are temporarily blocked due to various reasons such as resource contention, I/O latency, or other dependencies. These tasks are queued in a pending state until the resources become available to process them further. The pending tasks can be monitored and managed using various performance monitoring and troubleshooting tools in SQL Server.

 

4.4 Component: IO 

We have the I/O component here it will be very hard to get a warning, but we can see if there were any latch timeouts and long I/Os.

Long I/Os are considered when it takes 15 seconds or longer.

 

  • ioLatchTimeouts = very short time physical lock on a buffer while an IO operation is performed
  • totalLongIos = Must be current - previous > 0 for warning

 

<ioSubsystem ioLatchTimeouts="0" intervalLongIos="0" totalLongIos="0"><longestPendingRequests></longestPendingRequests></ioSubsystem>

 

 

4.5 Component: Event 

On the event component, we will collect data on the errors and events of interest recorded by the server, including details about ring buffer exceptions, ring buffer events about memory broker, out of memory, scheduler monitor, buffer pool, spinlocks, security, and connectivity. Events will always show 0 as the state. 

 

 

5.0 References: 

Use the system_health Session - SQL Server | Microsoft Learn

Database Engine error severities - SQL Server | Microsoft Learn

Under the covers with sp_server_diagnostics data (Part I) - Joe Sack (sqlskills.com)

Under the covers with sp_server_diagnostics data (Part II) - Joe Sack (sqlskills.com)

SQL SERVER - PREEMPTIVE Waits - Simple Definition - SQL Authority with Pinal Dave

SQL Server, Memory Manager object - SQL Server | Microsoft Learn

sp_server_diagnostics (Transact-SQL) - SQL Server | Microsoft Learn

Use DBCC MEMORYSTATUS to monitor memory usage - SQL Server | Microsoft Learn

How It Works: sp_server_diagnostics – spinlock backoffs - Microsoft Community Hub

Using sp_server_diagnostics to Monitor SQL Server (mssqltips.com)

performance - Suddenly SQL server restarted with pending task count increased - Database Administrators Stack Exchange

sys.dm_os_tasks (Transact-SQL) - SQL Server | Microsoft Learn

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.