Let’s make it easy! How to read System Health and sp_server_diagnostics?

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

Hi Everyone,

The system_health session is an extended event session that is included by default with SQL Server. This blog post goes over how to easily read and interpret System Health extended event and sp_server_diagnostics with an example. For this post, we will focus on what sp_server_diagnostics reports during the period of 3/15/2023 at 4:00am to 3/15/2023 at 11:20am.

Let us say you are being approached to figure out the reason for a glitch during this time window, even if you did not have any monitoring solution in place, you can still tap into system_health session since this is available by default on your SQL Server instance. 

1.0 Overview of system_health extended event and sp_server_diagnostics

1.1 What is inside system health xevent?

The sql_text and session_id for any sessions that encounter an error that has a severity >= 20. Database Engine error severities - SQL Server | Microsoft Learn
The sql_text and session_id for any sessions that encounter a memory-related error. The errors include 17803, 701, 802, 8645, 8651, 8657 and 8902.
A record of any non-yielding scheduler problems. These appear in the SQL Server error log as error 17883.
Any deadlocks that are detected, including the deadlock graph.
The callstack, sql_text, and session_id for any sessions that have waited on latches (or other interesting resources) for > 15 seconds.
The callstack, sql_text, and session_id for any sessions that have waited on locks for > 30 seconds.
The callstack, sql_text, and session_id for any sessions that have waited for a long time for preemptive waits. The duration varies by wait type. A preemptive wait is where SQL Server is waiting for external API calls.
The callstack and session_id for CLR allocation and virtual allocation failures.
The ring buffer events for the memory broker, scheduler monitor, memory node OOM, security, and connectivity.
System component results from sp_server_diagnostics.
Instance health collected by scheduler_monitor_system_health_ring_buffer_recorded.
CLR Allocation failures.
Connectivity errors using connectivity_ring_buffer_recorded.
Security errors using security_error_ring_buffer_recorded.

1.2 What is inside the output of the internal stored procedure: Sp_Server_Diagnostics?

Since the stored procedure sp_server_diagnostics is internal to SQL Server and cannot be scripted out, the section below expands on how to read the data column of each column from the output of this stored procedure and how to interpret the results.

The output of the stored procedure sp_server_diagnostics is logged in both system_health and sqldiag xevent sessions with different frequencies.

For System_health, the frequency is 5 minutes, and for SQLDiag, the default is 10 seconds with 30 seconds as the health check timeout.

Data captured by sp_server_diagnostics. (Reference: sp_server_diagnostics (Transact-SQL) - SQL Server | Microsoft Learn)

Column	Data type	Description
create_time	datetime	Indicates the time stamp of row creation. Each row in a single rowset has the same time stamp.
component_type	sysname	Indicates whether the row contains information for the SQL Server instance level component or for an Always On availability group: System: Collects data from a system perspective on spinlocks, severe processing conditions, non-yielding tasks, page faults, and CPU usage. This information is producing an overall health state recommendation. Resource: Collects data from a resource perspective on physical and virtual memory, buffer pools, pages, cache, and other memory objects. This information produces an overall health state recommendation. query_processing: Collects data from a query-processing perspective on the worker threads, tasks, wait types, CPU intensive sessions, and blocking tasks. This information produces an overall health state recommendation. IO_subsystem: Collects data on IO. In addition to diagnostic data, this component produces a clean healthy or warning health state only for an IO subsystem. Events: Collects data and surfaces through the stored procedure on the errors and events of interest recorded by the server, including details about ring buffer exceptions, ring buffer events about memory broker, out of memory, scheduler monitor, buffer pool, spinlocks, security, and connectivity . Events will always show 0 as the state. <name of the availability group>: Collects data for the specified availability group (if component_type = "Always On:AvailabilityGroup").
component_name	sysname	Indicates the name of component or the name of the availability group: System, Resource, Query_processing, Io_subsystem, Events or <name of the availability group>
state	int	Indicates the health status of the component: 0 1 2 3
state_desc	sysname	Describes the state column. Descriptions that correspond to the values in the state column are: 0: Unknown 1: clean 2: warning 3: error
data	varchar (max)	Specifies data that is specific to the component.

2.0 How to open multiple extended events on SSMS (SQL Server Management Studio)?

Open SSMS and click on file> open> Merge extended event files.

Next you will see a pop click on “add”.
Select all the xevents for system_health (They can be captured with a tool like TSS, PSSDAIG, or SQL LogScout or you can also find the system_health xel files on the same log folder as the SQL errorlogs)
Click “open” and then “ok”.

 3.0 How to analyze the merged extended event files?

Since the extended event has a running log of entries, the first thing to do is to apply a time filter to narrow the log entries down to the time of the issue.

So, when you open the xevent files on the ribbon click on filter. 

A popup will show up, click on set time filter> set the start time and end time to the hours that you want to review.

Important!!! The xevent timestamp will show with your time zone, you will need to convert the time you can check the SQL error log to see the time zone. You also have the option to add the column timestamp (UTC). To add this column right click on the columns > click on choose column> look for timestamp (UTC)

For this demo, we will only focus on the output from sp_server_diagnostics and since the issue in this example happened at 6 am EST (UTC-4) and continue happening for sometime, we will start looking for events starting from 4 am (UTC -6) until 11:21 am as shown on the screenshot below. Next on the bottom we can use the filter “Name = sp_server_diagnostics_component_result” and “state not like clean” 

After filtering it will give you a list of components that do not have a clean state. You can add the field as a column by right clicking the field and “show column in table.”

 From here we will look at each component and what they are reporting. You can group by component and sort by time. To group by right click on the column and click on group by this column.  

And it will look something like this.

4.0 What does each component report?

Now that we have the data that we want to look at least see what each component reports on the data column. This column is on varchar (max)

Some of the counters are very straight forward some are not, but here are some conditions that would cause warnings or errors.

Component	HealthState	LogicalOperation	Check	Operator	Value	Notes
SYSTEM	WARNING		DumpCount	>	0	Are there any Dumps?
SYSTEM	WARNING	OR	LatchWarnings	>	3	Having issues with Latches
SYSTEM	WARNING	OR	SpinLockType	!=	SP_LAST [special spinlock]	Sick Spinlock
SYSTEM	WARNING	OR	BadPagesDetected	>	0	Bad memory?
SYSTEM	ERROR		DumpCount	>	100	Over 100 dumps?
SYSTEM	ERROR	AND	CurrentDumpCount - LastDumpCount	>	0	Have there been ANOTHER dump?
SYSTEM	ERROR	AND	CurrentDumpCount	>	LastDumpCount	Same as above
SYSTEM	ERROR	OR	MemoryScribblers	>	3	Has memory been overwritten by another process?
SYSTEM	ERROR	OR	SpinLockTypeAfterAv	!=	SP_LAST [special spinlock]	Spinlock Type after AV
RESOURCE	ERROR		ProcessOOM	>	120 Seconds	OOM for 2 minutes
RESOURCE	WARNING		PoolOOM	=	TRUE	Any pool out of memory
RESOURCE	WARNING	OR	LastResourceNotification	=	MEMPHYSICAL_LOW	Low physical memory
RESOURCE	WARNING	OR	LastResourceNotification	=	MEMVIRTUAL_LOW	Low virtual memory
RESOURCE	WARNING	OR	LastOOMExceptionCount	>	OOMExceptionCount	Have OOM exceptions?
RESOURCE	WARNING	OR	ProcessOOMCount	!=	0	Is the process itself out of memory
QUERYPROCESSOR	WARNING		PendingTasksCount	>	0	Pending tasks on schedulers
QUERYPROCESSOR	WARNING	OR	NonYieldTrackiongScheduler	!=	NULL	Is there a non-yielding scheduler
QUERYPROCESSOR	ERROR		isUnresolvedDeadlock	=	TRUE	Unresolvable deadlock?
QUERYPROCESSOR	ERROR	OR	IsDeadlockedSchedulers	=	TRUE	Any schedulers deadlocked?
IO	WARNING		LongIOs	>	0	Must be current - previous > 0 for warning

4.1 Component: SYSTEM

Collects data from a system perspective on spinlocks, severe processing conditions, non-yielding tasks, page faults, and CPU usage.

The categories of information that are surfaced for the "system" component data are as follows:

CPU utilization (sqlCpuUtilization, systemCpuUtilization)
Page Faults (pageFaults)
Non-Yielding tasks (nonYieldingTasksReported)
Dumps (intervalDumpRequests, totalDumpRequests)
Access Violations (writeAccessViolationCount, isAccessViolationOccurred )
Latches (latchWarnings)
Spinlock activity (spinlockBackoffs, sickSpinlockType, sickSpinlockTypeAfterAv )

On the above example we can the following

We have totalDumpRequests="6" and pageFaults="1438916"
We can see that systemCpuUtilization="19" and sqlCpuUtilization="14" of the total 19.
We can conclude that we have high pageFaults="1438916" and totalDumpRequests="6" which should be always 0.
Page fault will not trigger a warning or error by itself but can give you an indication of what is going on with the server.
A page fault occurs when a program requests an address on a page that is not in the current set of memory resident pages. What happens when a page fault occurs is that the thread that experienced the page fault is put into a Wait state while the operating system finds the specific page on disk and restores it to physical memory. We need to look at the memory, PLE and disk to confirm this because we would also get a soft page fault. https://techcommunity.microsoft.com/t5/ask-the-performance-team/the-basics-of-page-faults/ba-p/373120
To track paging, you should use the following counters: “Memory\ Page Faults /sec”, “Memory\ Cache Faults /sec” and “Memory\ Page Reads /sec”. The first two counters track the working sets and the file system cache. The Page Reads counter allows you to track hard page faults. If you have a high rate of page faults combined with a high rate of page reads (which also show up in the Disk counters) then you may have an issue where you have insufficient RAM given the high rate of hard faults.

4.2 Component: RESOURCE

Collects data from a resource perspective on physical and virtual memory, buffer pools, pages, cache, and other memory objects. The data column output is divided into three areas of interest.

The initial area of focus pertains to overall resource monitoring.

The second segment of data derived from the "resource" component pertains to a memory report that encompasses Process and System counters and counts.

The third memory report sourced from the "resource" component data comprises Memory Manager counter values.

Last Notification looks to be the memory manager’s latest report.

RESOURCE_MEM_STEADY: No memory issues were observed.
RESOURCE_MEMPHYSICAL_HIGH: SQL Server can acquire memory because OS has sufficient free available memory.
RESOURCE_MEMPHYSICAL_LOW: SQL Server must return the memory as OS is having memory shortage.
RESOURCE_MEMVIRTUAL_LOW: The virtual address range for the SQL Server process is getting exhausted.

Some observations:

In our example we have lastNotification="RESOURCE_MEMPHYSICAL_LOW" telling us that SQL Server cannot further consume more memory or SQL Server must return the memory as OS is having memory shortage.
For our case we can see that the system does have memory "System physical memory high" value="1"/>
But SQL Server has reached its limit "Target Committed" value="471859208"/> "Current Committed" value="471862328"/>

4.3 Component: Query

There are five helpful groupings of statistics into which the data can be categorized:

· Query processing statistics

· Top non-preemptive wait statistics by count

· Top non-preemptive wait statistics by duration

· Top preemptive waits by count

· Top preemptive waits by duration

The initial segment will comprise data related to query processing statistics.

Returning a point of time measurement of max workers, actual workers created, idle workers, pending tasks, oldest pending task waiting time (I’ll be paying attention to this one) and scheduler deadlock and non-yielding counts.

The second part we have waits. Preemptive and nonPreemptive

Preemptive: refers to the type of multitasking where the operating system interrupts an executing task to allocate the CPU to another task
Non-Preemptive: refers to the type of multitasking where the CPU is allocated to a task and the task holds the CPU until it finishes or voluntarily relinquishes it.

We have four sets of results for the Top 10 wait stats.

Here is the output of the top non-preemptive waits, sorted by count based on the "waits" attribute.

Here is the output of the top non-preemptive waits, sorted by duration.

Here is the output of the top preemptive sorted by waits count.

Here is the output of the top preemptive sorted by duration.

</byDuration></preemptive></topWaits>

You will also get intensive request session.

</cpuIntensiveRequests>

Next will be pending task.

And in case there were any blocking happening it will show up here.

For our example we can see that we have one pending task pendingTasks="1" This means the scheduler was hung and 1 task were waiting on it to get a scheduler and run.

SQL Server pending tasks refer to a set of activities that are currently waiting to be executed by the SQL Server but are temporarily blocked due to various reasons such as resource contention, I/O latency, or other dependencies. These tasks are queued in a pending state until the resources become available to process them further. The pending tasks can be monitored and managed using various performance monitoring and troubleshooting tools in SQL Server.

4.4 Component: IO

We have the I/O component here it will be very hard to get a warning, but we can see if there were any latch timeouts and long I/Os.

Long I/Os are considered when it takes 15 seconds or longer.

ioLatchTimeouts = very short time physical lock on a buffer while an IO operation is performed
totalLongIos = Must be current - previous > 0 for warning

4.5 Component: Event

On the event component, we will collect data on the errors and events of interest recorded by the server, including details about ring buffer exceptions, ring buffer events about memory broker, out of memory, scheduler monitor, buffer pool, spinlocks, security, and connectivity. Events will always show 0 as the state.

5.0 References:

Use the system_health Session - SQL Server | Microsoft Learn

Database Engine error severities - SQL Server | Microsoft Learn

Under the covers with sp_server_diagnostics data (Part I) - Joe Sack (sqlskills.com)

Under the covers with sp_server_diagnostics data (Part II) - Joe Sack (sqlskills.com)

SQL SERVER - PREEMPTIVE Waits - Simple Definition - SQL Authority with Pinal Dave

SQL Server, Memory Manager object - SQL Server | Microsoft Learn

sp_server_diagnostics (Transact-SQL) - SQL Server | Microsoft Learn

Use DBCC MEMORYSTATUS to monitor memory usage - SQL Server | Microsoft Learn

How It Works: sp_server_diagnostics – spinlock backoffs - Microsoft Community Hub

Using sp_server_diagnostics to Monitor SQL Server (mssqltips.com)

performance - Suddenly SQL server restarted with pending task count increased - Database Administrators Stack Exchange