User initiated manual failover on SQL Managed Instance

High availability is a fundamental part of SQL Managed Instance (MI) platform that works transparently for your database applications.  As such, failovers from primary to secondary nodes in case of node degradation or fault detection, or during our regular monthly software updates are an expected occurrence for all applications using SQL Managed Instance in Azure.


 


In August 2020, we have released a new feature user-initiated manual failover allowing to manually trigger a failover on SQL Managed Instance using PowerShell or CLI commands, or through invoking an API call. Manually initiated failover on a managed instance will be an equivalent of the automated failover for high availability and software patches initiated automatically by the service. 


 


Manually invoking a failover on MI will help test end-to-end applications for fault resiliency on automatic failovers in case of planned or unplanned events before deploying to production. In addition to testing how failover impacts existing database sessions, it can also help verify if it changes the end-to-end performance due to changes in the network latency. In some cases, if performance issues are encountered on SQL MI, manually invoking a failover to a new node can help mitigate the performance issue.


 


Because the restart operation is intrusive and a large number of them could stress the platform, only one user-initiated manual failover call is allowed every 30 minutes for each managed instance (Update: Engineering team is looking to reduce this value to 15 minutes in the next few months).


 








Ensuring that your applications are failover resilient prior to deploying to production will help mitigate the risk of application faults in production and will contribute to application availability for your customers.

 


How is High Availability (HA) implemented on a Managed Instance?


 


Azure SQL Managed Instance (MI) is offered in two service tiers, one is Business Critical (BC) and the other one is GP (General Purpose). Both service tiers offer High Availability (HA), with different technical implementations, as follows:


 















  • HA for SQL Managed Instance BC (Business Critical) service tier was built based on AlwaysOn Availability Groups (AG) technology, resulting in such MI consisting of the total of 4 nodes – one primary and three secondary R/O replicas. In case of a failover, one of the secondary replicas becomes primary. This type failover typically takes only a few short seconds.


business-critical-service-tier.png


  • HA for SQL Managed Instance GP (General Purpose) service tiers was based on multiple redundancy of the storage layer and it is based on a single primary node. In case of a failover, a new node is taken from the pool of standby nodes, and the storage is re-attached from the old to the new primary node. This type of failover typically takes under a minute.


general-purpose-service-tier.png

 


Using the user-initiated manual failover functionality, manually initiating a failover on MI BC service tier will result in a failover of the primary node to one of the three secondary nodes. As secondary read-only nodes on the MI BC service tier can be used for read scale-out from a single node (out of three read-only secondary nodes), the user initiated manual failover capability allows also a failover of read-only replica. This means that users can manually failover the read scale-out from the current to one of the two other available read-only secondary nodes.


 


Manually initiating a failover on MI GP service tier will result in deallocation of the primary node, and allocation of a new node from the pool of available nodes, and reattachment of the storage from the old to the new node.


 


How to initiate a manual failover on SQL Managed Instance?


 


Using PowerShell


 


The minimum version of Az.Sql needs to be v2.9.0 (download link), or use Azure Cloud Shell from the Azure portal that always has the latest PowerShell version available.


 


Use PS command Invoke-AzSqlInstanceFailover with the following example to initiate failover of the primary node, applicable to both BC and GP service tier:












PowerShell



Invoke-AzSqlInstanceFailover -ResourceGroupName “ResourceGroup01” -Name “ManagedInstance01”



 


Use the following PS command to failover read secondary node, applicable to BC service tier only:












PowerShell



Invoke-AzSqlInstanceFailover -ResourceGroupName “ResourceGroup01” -Name “ManagedInstance01” -ReadableSecondary



 


Using CLI


 


Ensure to have the latest CLI scripts installed.


 


Use az sql mi failover CLI command with the following example to initiate failover of the primary node, applicable to both BC and GP service tier:












CLI



az sql mi failover -g myresourcegroup -n myinstancename



 


Use the following CLI command to failover read secondary node, applicable to BC service tier only:












CLI



az sql mi failover -g myresourcegroup -n myinstancename –replica-type ReadableSecondary



 


Using Rest API


 


For advanced users who would perhaps like to automate failovers of their SQL Managed Instances for purposes of implementing continuous testing pipeline, or automated performance mitigators, this can be accomplished through initiating failover through an API call, see Managed Instances – Failover REST API for details.


 


To initiate failover using REST API call, first generate the Authentication Token. One way to do that is to use a Postman client. Initiating the API call from any other client should generally work as well. This token is used as Authorization property in the header of API request and it is mandatory.


 


The following is an example of the API URI to call:











API URI
https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Sql/managedInstances/{managedInstanceName}/failover?replicaType=Primary&api-version=2019-06-01-preview

 


The following are API call properties that can be passed in the call:






























API property



Parameter



subscriptionId



Subscription ID to which managed instance is deployed



resourceGroupName



Resource group that contains managed instance



managedInstanceName



Name of managed instance



replicaType



(Optional) (Primary|ReadableSecondary)


 


This is the type of replica to be failed over: primary or readable secondary.
If not specified, failover will be initiated on the primary replica by default.



api-version



Static value and currently needs to be “2019-06-01-preview”



 


API response will be one of the following two:



  • 202 Accepted

  • One of the 400 request errors.


 


Track the operation status


 



 








Note: Completion of the failover process (not the actual short unavailability) might take several minutes at a time in case of high-intensity workloads. This is because the instance engine is taking care of all current transactions on the primary and catch up on the secondary, prior to being able to failover.

 


Monitoring the failover


 


To monitor the progress of user initiated manual failover, execute the following T-SQL query in your favorite client (such is SSMS) on SQL Managed Instance. It will read the system view sys.dm_hadr_fabric_replica_states and report replicas available on the instance. Refresh the same query after initiating the manual failover.


 











T-SQL

SELECT DISTINCT replication_endpoint_url, fabric_replica_role_desc FROM sys.dm_hadr_fabric_replica_states



 


Monitor failover for BC (Business Critical) Service tier


 


Prior to initiating the failover, your output will indicate the current primary replica on MI BC containing one primary and three secondaries in the AlwaysOn Availability Group.


 


Replica 001.PNG


 



  • In this example we see the primary and three secondary replicas on MI BC node, each assigned to an internal IP address starting with 10.0.0.X. The primary node in this example has been allocated to the internal IP address 10.0.0.16.

  • Upon execution of a failover, running this query again would need to indicate a change of the primary node.


Replica 002.PNG


 



  • In this particular example, we can see that before the failover, the primary node IP was 10.0.0.16, whereas after the failover this node became secondary, and the new primary node became node with the IP 10.0.0.22.


Monitor failover for GP (General Purpose) Service tier


 


You will not be able to see the same output with MI GP service tier as the one above shown for MI BC as this service tier is based on a single node only. Your T-SQL query output for MI GP service tier will show a single node before and after the failover, something as the following:


 


gp 01.PNG


 



  • Upon the failover of MI GP, if you refresh the T-SQL query there will be no availability (loss of connectivity) from your client until the node failover has been executed (typically under a minute), after which the query will show the same IP of the primary replica.

  • While it will not be possible to see the change of the primary replica to a different one through T-SQL query, the loss of connectivity to MI GP during the failover will be the indication of the failover execution.


 


Functional limitations


 



  • Throttle mechanism is implemented to guard from potentially too many failovers. As such, you can initiate one failover on the same MI every 30 minutes. If this is the case, there will be an error message shown when attempting to initiate a manual failover within this protected time frame.


Image 03.PNG


 



  • For BC instances there must exist quorum of replicas for the failover request to be accepted. This means that failover can be executed only in the case all replicas are healthy. If this is not the case, and if one of replicas is unhealthy or being rebuilt, you will not be able to manually initiate a failover at such time.

  • It is not possible to specify which readable secondary replica to initiate the failover on. This is because MI allows automatically for only a single read-only replica to be available to customers.


Disclaimer


 


Please note that products and options presented in this article are subject to change. This article reflects the user-initiated manual failover option available for Azure SQL Managed Instance in August, 2020.


 


Closing remarks


 


If you find this article useful, please like it on this page and share through social media.


 


To share this article, you can use the Share button below, or this short link: https://aka.ms/mifailover-techblog.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.