Monitor replication lag for Auto-Failover Groups in SQL Managed Instance

This post has been republished via RSS; it originally appeared at: Azure Database Support Blog articles.

The auto-failover groups feature allows you to manage the replication and failover of a group of databases on a server or all databases in a managed instance to another region. It is a declarative abstraction on top of the existing active geo-replication feature, designed to simplify deployment and management of geo-replicated databases at scale.

To monitor the replication lag between primary and secondary instances, and last replication time of the secondary databases, we can use the DMV sys.dm_geo_replication_link_status.

Run the below query on your primary instance:

SELECT

link_guid

, partner_server

, partner_database

, last_replication

, replication_lag_sec

FROM sys.dm_geo_replication_link_status;

Each row represents an information for each database under the SQL instance.

replication_lag_sec : Shows the time difference in seconds between the last_replication value and the timestamp of that transaction's commit on the primary based on the primary database clock. This value is available on the primary database only.
last_replication : Shows the timestamp of the last transaction's acknowledgement by the secondary based on the primary database clock. This value is available on the primary database only.

Moreover, both primary and secondary instances are required to have the same service tier. If the primary database is experiencing a heavy write workload, a secondary with lower compute size may not be able to keep up with it. That will cause redo lag on the secondary, and potential unavailability of the secondary. To mitigate these risks, active geo-replication will throttle the primary's transaction log rate if necessary, to allow its secondaries to catch up.

We can monitor the replication lag over time in SQL Managed Instance by creating a scheduled agent job to run and capture the lag into a table every ex: 5 min for a day or two, to check how the lag time can be different along with the workload.

Short steps can be:

1) Create a table for example named monitor_lag

2) Create a job with the below step and set it to be run every 5 minutes.

INSERT INTO monitor_lag

SELECT

partner_database, last_replication,last_commit,replication_lag_sec

FROM sys.dm_geo_replication_link_status

--Where partner_database = 'DB name' – If you need to monitor a specific database.

3) Review the lag time over time.

Enjoy!

Leave a Reply Cancel reply