Understanding how Failover Clustering Recovers from Unresponsive Resources


First published on MSDN on Jan 24, 2013
In this blog I will discuss how Failover Clustering communicates with cluster resources, along with how clustering detects and recovers when something goes wrong.  For the sake of simplicity I will use a Virtual Machine as an example throughout this blog, but the logic is generic and applies to all workloads.

When a Virtual Machine is clustered, a cluster “Virtual Machine” resource is created which controls that VM.  The “Virtual Machine” resource and its associated resource DLL communicate with the VMMS service to tell the VM when to start and when to stop, and also perform health checks to ensure the VM is ok.

Resources all run in a component of the Failover Clustering feature called the Resource Hosting Subsystem (RHS).  These actions map to entry point calls that RHS makes to resources, such as Online, Offline, IsAlive, and LooksAlive.  You can find the full list of resource DLL entry-point functions here.

The entry points most often involved when a resource goes unresponsive and clustering needs to recover are LooksAlive and IsAlive, the health checks made to the resource.

    • LooksAlive is a quick, lightweight check that happens every 5 seconds by default.

    • IsAlive is a more thorough check that happens every 60 seconds by default.


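If you want to verify these intervals on your own cluster, here is a sketch of checking them from PowerShell, assuming the LooksAlivePollInterval and IsAlivePollInterval common properties on the resource type (values are in milliseconds):

(Get-ClusterResourceType "Virtual Machine").LooksAlivePollInterval
(Get-ClusterResourceType "Virtual Machine").IsAlivePollInterval
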
Health check calls to the resource continue constantly while resources are online.  If a resource returns a failure for the lightweight LooksAlive health check, RHS will then immediately do a more comprehensive health check and call IsAlive to see if the resource is really healthy.  A resource is considered failed as the result of an IsAlive failure.

Think of it like this… Every 60 seconds RHS calls IsAlive and basically asks the resource “Are you ok?”, and the resource responds “Yes, I am doing fine.”  This periodic health check goes on and on… until something happens to the resource and it doesn’t respond.  Think of it like a dropped call on your cell phone: how long are you willing to sit there going “Hello?  Hello?  Hello?” before you give up and call the person back?  Basically resetting the connection…

Failover Clustering has this same concept.  RHS will sit there waiting for the resource to respond to an IsAlive call, and eventually it will give up and need to take recovery action.  By default RHS will wait for 5 minutes for the resource to respond to an entry point call to it.  This is configurable with the resource DeadlockTimeout common property.

To modify the DeadlockTimeout property of an individual resource, you can use the following PowerShell command (the value is in milliseconds):


(Get-ClusterResource "Resource Name").DeadlockTimeout = 300000


Or if you want to modify the DeadlockTimeout for all resources of that type you can modify it at the resource type level with the following syntax (this example will be for all virtual machine resources):


(Get-ClusterResourceType "Virtual Machine").DeadlockTimeout = 300000

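Since the value is in milliseconds, the 300000 shown above is the 5 minute default.  To check the current setting before changing it, you can simply read the property back, for example:

(Get-ClusterResourceType "Virtual Machine").DeadlockTimeout
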

Resources are expected to respond to an IsAlive or LooksAlive within a few hundred milliseconds, so waiting 5 minutes for a resource to respond is a really long time.  Something pretty bad happened if a resource which normally responds in milliseconds suddenly takes longer than 5 minutes.  So it is generally recommended to stay with the default values.

If the resource doesn’t respond in 5 minutes, RHS decides that there must be something wrong with the resource and that it should take recovery action to get it back up and running.  Remember that the resource has gone silent; RHS has no idea what is wrong with it.  The only way to recover is to terminate the RHS process; RHS then restarts, which restarts the resource, and everything is back up and running.  You may also see the associated entries in the System event log:

 

Event ID 1230
Cluster resource ‘Resource Name’ (resource type ‘Resource Type Name’, DLL ‘DLL Name’) did not respond to a request in a timely fashion. Cluster health detection will attempt to automatically recover by terminating the Resource Hosting Subsystem (RHS) process running this resource.

 

Event ID 1146
The cluster Resource Hosting Subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually associated with recovery of a crashed or deadlocked resource.
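
If you prefer to search for these entries from PowerShell rather than Event Viewer, a query along the following lines should work (a simple sketch using Get-WinEvent against the System log):

Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 1230, 1146 }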


The next layer of protection is that when clustering issues a request to terminate the RHS process, it will wait four times the DeadlockTimeout value (which equates to 20 minutes by default) for the RHS process to terminate.  If RHS does not terminate in 20 minutes, clustering will deem that the server has some serious health issues and will bugcheck the server to force failover and recovery.  The bugcheck code will be Stop 0x0000009E (Parameter1, Parameter2, 0x0000000000000005, Parameter4).  Note that Parameter3 will always be a value of 0x5 if it is the result of an RHS process failing to terminate.

This is the way clustering is designed to work… it is monitoring the health of the system, it detects something is wrong, and recovers.  This is a good thing!

Summary of Recovery Behavior:

    1. RHS calls an entry point on a resource
    2. RHS waits DeadlockTimeout (5 minutes) for resource to respond
    3. If resource does not respond, Cluster Service terminates RHS process to recover from unresponsive resource
    4. Cluster Service waits DeadlockTimeout x 4 (20 minutes) for the RHS process to terminate
    5. If RHS process does not terminate, Cluster Service calls NetFT to bugcheck the node to recover from RHS termination failure
    6. NetFT bugchecks the node with a STOP 0x9e

 

Impact of RHS Recovery


The Resource Hosting Subsystem (RHS) is the process which hosts resources; on any given node, multiple resources that are currently online may share a common RHS process.  For example, if you had 5 clustered VMs running on the same node, all the resources associated with those VMs would be running in the same RHS process.

There are some side effects from terminating the RHS process when a resource goes unresponsive.  If there are multiple resources hosted on that node, they may be hosted in the same RHS process.  That means when RHS terminates and restarts to recover an individual resource, all resources being hosted in that specific RHS process are also restarted.  With Windows Server 2008 R2 if you have 5 VMs running on a node, all 5 VMs are going to get restarted.

If a resource becomes unresponsive and causes an RHS crash, the cluster service will deem that specific resource to be suspect and that it needs to be isolated.  Think of it as, one strike and you are out!  The cluster service will automatically set the resource common property SeparateMonitor to mark that resource to run in its own dedicated RHS process, so that in the event that the resource becomes unresponsive again, it will not affect others.  This setting is also configurable: you can either manually enable a resource to run in its own RHS process, or disable a resource from running in its own RHS process as the result of having had an issue in the past which is now addressed.

To modify the SeparateMonitor property of an individual resource (a value of 0 returns the resource to the shared RHS process), you can use the following PowerShell command:


(Get-ClusterResource "Resource Name").SeparateMonitor = 0

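Conversely, to manually place a resource in its own dedicated RHS process, set the property to 1:

(Get-ClusterResource "Resource Name").SeparateMonitor = 1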

The impact of running resources in their own dedicated RHS process is that each RHS process consumes a little more system resources.  If you open Task Manager you will see a series of “Failover Cluster Resource Host Subsystem” processes running, each of which consumes a few MB of RAM.

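You can also enumerate them from PowerShell.  Assuming the executable name is rhs.exe, something like the following lists each RHS process and its memory usage:

Get-Process rhs | Select-Object Id, WorkingSet
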
In general clustering will self-manage misbehaving resources.  Resources will be given a chance to play nicely with everyone else, and if they don’t they will be automatically isolated to minimize impact.  So it is generally recommended to stay with the default values.

Improvements in Windows Server 2012


There are some feature enhancements in Windows Server 2012 to mitigate the impact of non-responsive resource recovery.

Resource Re-attach:  When a resource goes unresponsive the RHS process will recycle just as before, but any healthy resources in a running state will re-attach to the new RHS process without having to be restarted.  This means the impact from recovery is reduced: in the earlier example, just the 1 VM gets restarted and the other 4 are not impacted.

    • The resource DLL must support resource re-attach to be compatible with this new feature.  In-box resource types such as Virtual Machine and Physical Disk have been enhanced in Windows Server 2012 to take advantage of this new feature.

Isolation of Core resources:  Resources are now segmented by default into multiple RHS processes to keep application resource deadlocks from impacting core cluster functionality:

    • All in-box resources (in ClusRes.dll) run in a dedicated Core RHS process
    • All “Physical Disk” resources run in a dedicated Storage RHS process
    • 3rd party resources run in a dedicated RHS process

Additionally, resources can still be marked with the SeparateMonitor property to run in their own dedicated RHS process in Windows Server 2012, as they could in previous releases.

How to Troubleshoot RHS Recovery


Everything we have discussed in this blog to this point has described the expected behavior of how Failover Clustering recovers when something goes wrong with a resource and it becomes unresponsive.  Now the most important question… What do you do about it?

Troubleshooting Steps:

    1. Open Event Viewer and look for an Event ID 1230

 

    2. Identify the date / time as well as the resource name and resource type

 

    3. Generate the cluster.log with the Get-ClusterLog cmdlet (see the example after this list)

 

    4. Go to the C:\Windows\Cluster\Reports folder and open the Cluster.log file

 

    5. Using the time stamp from the Event ID 1230, find the point of the failure

        • Reminder:  The event log is in local time and the cluster.log is in GMT.  With Windows Server 2012 you can use Get-ClusterLog -UseLocalTime to generate the Cluster.log in local time, which makes correlating with the event log easier.


 

    6. Identify which entry point was being called on the resource.  There will be a log entry similar to:
      ERR   [RHS] RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource 'ResourceName'.
      INFO  [RHS] Enabling RHS termination watchdog with timeout 1200000 and recovery action 3.
      ERR   [RHS] Resource ResourceName handling deadlock. Cleaning current operation and terminating RHS process.

 

    7. Look up what that entry point for that resource type is attempting to do.  For in-box resources you will find them documented here:
      KB914458 – Behavior of the LooksAlive and IsAlive functions for the resources that are included in the Windows Server Clustering component of Windows Server 2003.

 

    8. Now that you understand what entry point was being called, to which resource, and when, you need to investigate the underlying component.  For example:

        a. The Physical Disk resource IsAlive will effectively attempt to enumerate the file system, so you should troubleshoot your storage subsystem to determine why I/Os are not completing.

        b. The File Server LooksAlive will attempt to retrieve the properties of the SMB shares, so you should troubleshoot the Server service.


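As referenced in step 3, the cluster log can be generated for all nodes from PowerShell.  On Windows Server 2012 the -UseLocalTime switch keeps the time stamps aligned with the System event log; the -Destination parameter is optional and this example assumes a C:\Temp folder exists:

Get-ClusterLog -UseLocalTime -Destination C:\Temp
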


Advanced Troubleshooting:

    1. When RHS recovers from a deadlock it will generate a Windows Error Reporting report and a user-mode dump of the RHS process.  With the user-mode dump you can determine what the resource DLL was attempting to do when it failed to respond.  See this blog for details on how to debug the user-mode dump to troubleshoot the resource: http://blogs.msdn.com/b/ntdebugging/archive/2011/05/30/what-is-in-a-rhs-dump-file-created-by-windows-error-reporting.aspx

        • Note:  KB article 914458 will generally provide sufficient information on what the resource DLL was attempting to do, so this step should not normally be necessary.


 

    2. A user-mode dump alone may not be enough to pinpoint root cause, so you can also configure RHS to bugcheck the box and generate a full memory dump.  This can be enabled by setting the following registry DWORD to a value of 3:
      HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\Failover Clusters\DebugBreakOnDeadlock
      Starting with Windows Server 2012 R2 set the following registry DWORD to a value of 3:
      HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters\DebugBreakOnDeadlock

NOTE: DebugBreakOnDeadlock will only create a dump if the RHS process itself deadlocks, not a resource.  When a resource deadlocks, RHS will attempt to terminate it.  As part of the termination, it should create a WER report with a small heap and process dump.  Once those are completed, RHS will terminate the resource.  If the termination succeeds, RHS itself does not deadlock and no additional dump is created, so this setting may not be needed.
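
As a sketch, the Windows Server 2012 R2 value can be set from an elevated PowerShell prompt as follows (this creates or overwrites the DWORD under the ClusSvc parameters key):

New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters" -Name DebugBreakOnDeadlock -Value 3 -PropertyType DWord -Force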

The key take-away is that RHS recovery is expected behavior for a resource that has become unresponsive.  To address the root cause you need to dig into which resource is failing; by understanding what it was attempting to do, you can identify why it didn’t respond.

For additional information on troubleshooting resources that result in RHS recovery, see the blog below.  Microsoft support is also available to assist in advanced debugging to help you identify root cause.

Additional Resources


Resource Hosting Subsystem (RHS) In Windows Server 2008 Failover Clusters
http://blogs.technet.com/b/askcore/archive/2009/11/23/resource-hosting-subsystem-rhs-in-windows-server-2008-failover-clusters.aspx

Thanks!
Elden Christensen
Principal PM Manager
Clustering & High-Availability
Microsoft
