This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.
If you have been managing a hierarchy for some time and have dealt with replication issues before, I assume you already understand the DRS initialization process. If you feel you need some guidance, refer to:
What remains obscure, and is often called a black box, is the DRS message flow and the synchronization of the replication groups while the site is ACTIVE.
In short, what does the flow look like day to day when things are healthy? You create a package in the console on the CAS and it replicates to the primary site, or vice versa.
How does this happen?
The 100-foot view we have described before is this: when data is inserted at a given site, we extract the changes via Change Tracking, convert them into messages, and send them via Service Broker.
That is fine as a high-level overview. At CSS there is no high-level overview; we have to know the internals inside out.
Even when the site is ACTIVE, much more needs to happen for each replication group to show as ACTIVE individually.
So the million-dollar question is: after Init has finished and the site is ACTIVE, how do we monitor the synchronization and changes of the groups?
Note: The concepts below are presented to clarify how DRS works. Do not run any modification queries in production without expert guidance. For any recommendations or issues, please open a ticket with Microsoft CSS.
DRS Sync Process – On the initiating site
Part 1 : To decide whether to send changes for a group
- RCMCtrl regularly calls the THREAD_InvokeDrsSyn() function as part of its regular loop.
- That function calls the stored procedure spDRSInitiateSynchronizations, which checks whether the difference between the last send time and the current time is greater than the synchronization interval of the group.
- If it is, we have to send the changes for this group, so the procedure initiates the synchronization by starting a dialog on the ConfigMgrDRSMsgBuilderQueue.
- The message type 'DRS_StartMsgBuilder' marks the beginning of replication and is sent for each replication group. Note that the message is only put in the queue if there is no message for that group in the queue already. A message already in the queue means the previous sync has not yet completed.
- When the message is placed in the ConfigMgrDRSMsgBuilderQueue, it invokes the queue's activation procedure, spDRSMsgBuilderActivation.
- As part of this procedure, we check whether sys.dm_exec_requests already has the context of the current replication group; if it does, we log the message '<ReplicationGroup> is already being processed.'
- If nothing is pending for the group in question (meaning no previous sync is still in progress), we call the main extraction routine:
EXEC dbo.spDRSSendChangesForGroup @ReplicationGroup
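The decision described in Part 1 can be sketched in a few lines of Python. This is only an illustrative model, not the product's T-SQL: the `GroupState` class, its field names, and the `queued_groups` set are hypothetical stand-ins for whatever spDRSInitiateSynchronizations actually reads.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Set


@dataclass
class GroupState:
    # Hypothetical model of one replication group's send state.
    name: str
    last_send_time: datetime
    sync_interval: timedelta


def should_start_sync(group: GroupState, now: datetime, queued_groups: Set[str]) -> bool:
    """Return True when a DRS_StartMsgBuilder-style message should be queued."""
    # Condition 1: more time than the group's sync interval has passed.
    interval_elapsed = (now - group.last_send_time) > group.sync_interval
    # Condition 2: no message for this group is already waiting in the queue
    # (a waiting message means the previous sync has not finished).
    already_queued = group.name in queued_groups
    return interval_elapsed and not already_queued
```

For example, a group last sent five minutes ago with a one-minute interval qualifies, unless its name is already in `queued_groups`.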
Part 2 : The main extraction process on the initiating site
This procedure extracts the changes and sends them as messages. The process runs as follows [expanding on the flow chart steps]:
- Checking which sites should not be sent changes:
It checks whether the last two syncs for this replication group have completed [an incomplete sync has a NULL SyncCompleteTime]. We will learn later how SyncCompleteTime gets updated. For sites where they have not completed, it puts the site in the temp table @SitesNotToSend and does not send changes for this replication group to those sites.
It logs the following in Vlogs:
'Not sending changes to sites <ReceivingSiteCode> for Replication group <ReplicationGroup> since last 2 syncs to these sites have not completed.'
Here is a sample of the TSQL code that does that –
-- get sites which have not returned a sync complete within the throttle window
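As a rough Python illustration of the same throttle check (not the product's T-SQL), the sketch below flags sites whose two most recent syncs both have no completion time. The `history` dictionary, the function name, and the reading that both of the last two syncs must be incomplete are assumptions made for the example.

```python
from datetime import datetime
from typing import Dict, List, Optional


def sites_not_to_send(
    history: Dict[str, List[Optional[datetime]]], window: int = 2
) -> List[str]:
    """Return site codes whose last `window` syncs all lack a SyncCompleteTime.

    `history` maps a site code to its SyncCompleteTime values, oldest first;
    None plays the role of NULL (sync not yet completed).
    """
    blocked = []
    for site, complete_times in history.items():
        recent = complete_times[-window:]
        # Only throttle when a full window of syncs exists and none completed.
        if len(recent) == window and all(t is None for t in recent):
            blocked.append(site)
    return sorted(blocked)
```

A site with fewer than two recorded syncs is never throttled in this model, which is a design choice of the sketch rather than a documented behavior.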
- We check whether there are any sites for this replication group for which the dialog handle has changed while the previous sync has not completed.
If so, we log the following entry in Vlogs:
'The dialog handle for sites @PendingSyncSitesWithChangedDialog has changed and previous sync has not completed. Not processing sync for them during this cycle. Will try again in the next cycle.'
Here is the code which checks this:
A LastSendResult of -20 means exactly this: the dialog handle changed while previous syncs were not done. Code later in the procedure sets this value.
Here is the full key for LastSendResult in DRSSendHistory:
-99: Snapshot isolation error
-20: Sync handle changed while previous syncs were not done
-10: LastSendVersion changed while sync was in progress
-3: Replication group needs re-init and invalid subscription is sent by message builder
-2: Received exception while sending DRS_SyncStart or DRS_SyncEnd
-1: Any other error during sync
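When reading DRSSendHistory rows during troubleshooting, it can help to keep this key in a small lookup. The Python sketch below simply encodes the list above; the function name is hypothetical, and any value not in the list is reported as unlisted rather than guessed at.

```python
# Key for LastSendResult, copied from the list above.
LAST_SEND_RESULT = {
    -99: "Snapshot isolation error",
    -20: "Sync handle changed while previous syncs were not done",
    -10: "LastSendVersion changed while sync was in progress",
    -3: "Replication group needs re-init and invalid subscription is sent by message builder",
    -2: "Received exception while sending DRS_SyncStart or DRS_SyncEnd",
    -1: "Any other error during sync",
}


def describe_last_send_result(code: int) -> str:
    """Translate a LastSendResult value into the description from the key."""
    return LAST_SEND_RESULT.get(code, f"Unlisted LastSendResult value: {code}")
```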
- Compiling the changes to be sent:
To extract the changes we call the stored procedure spGet<Table>Changes (“Site” replication pattern), the function fnGet<Table>Changes (“global” replication pattern), or the function fnGet<Table>ChangesSec (“global_proxy” replication pattern).
If DebugLogging is enabled, Vlogs records useful information about the extraction.
To enable DebugLogging we can alter the function as below:
From Vlogs (Debug Enabled):
- The extracted changes are saved into #SiteTrackingTable (“Site” replication pattern) or #TrackingTable (“global” or “global_proxy” replication pattern). For example, the ‘SoftwareInventory’ table, being a site data table, has the stored procedure SCCM_DRS.spGetSoftwareInventoryChanges, which is called to extract the data into #SiteTrackingTable.
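The naming convention above can be captured as a small lookup. This Python sketch only derives the routine name and tracking table from the replication pattern as described in the text; the function name is hypothetical and nothing here calls into SQL Server.

```python
from typing import Tuple

# Replication pattern -> (routine name template, tracking table), per the text above.
_EXTRACTION = {
    "site": ("spGet{table}Changes", "#SiteTrackingTable"),
    "global": ("fnGet{table}Changes", "#TrackingTable"),
    "global_proxy": ("fnGet{table}ChangesSec", "#TrackingTable"),
}


def extraction_target(pattern: str, table: str) -> Tuple[str, str]:
    """Return (routine name, tracking table) for a table in a given pattern."""
    try:
        template, tracking_table = _EXTRACTION[pattern.lower()]
    except KeyError:
        raise ValueError(f"Unknown replication pattern: {pattern!r}")
    return template.format(table=table), tracking_table
```

For the SoftwareInventory example, `extraction_target("Site", "SoftwareInventory")` yields the spGetSoftwareInventoryChanges name paired with #SiteTrackingTable.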
- Call sproc spDRSSendStartMsg to mark “starting message send”.
From Vlogs (Debug Enabled) on initiator site:
Once the sync start message is processed on the receiving site, we log.
- If there are changes in #TrackingTable, walk through #TrackingTable to build messages and call proc spDRSSendDataMsg to send them out. If there are changes in #SiteTrackingTable and the current replication group’s replication pattern is “Site”, walk through #SiteTrackingTable to build messages and call proc spDRSSendBinaryDataMsg to send them out.
From Vlogs (Debug Enabled) on initiator site:
Once the SyncData message is processed on the receiving site, we log.
- Call proc spDRSSendEndMsg to mark “ending message send”.
From Vlogs (Debug Enabled) on initiator site:
Once the SyncEnd message is processed on the receiving site, we log.
It is important to note that sending a SyncEnd message alone does not mark the completion of the sync. The sync is not complete until the initiating site receives a SyncComplete message from the receiving site, which sends it after receiving the SyncEnd message.
So when we send the SyncEnd message from the initiating site, we update DRSSendHistory (leaving ProcessedTime and SyncCompleteTime as NULL).
Note that the @SyncID for each sync of a replication group is generated by the built-in NEWID() function, which creates a GUID. It stays the same throughout a given sync; the next sync gets a new GUID.
Note that ProcessedTime and SyncCompleteTime are not set in DRSSendHistory on the sending site by the procedure highlighted above. If you check the record, those two columns are NULL. So how do they get updated?
spDRSSendEndMsg sends the ‘DRS_SyncEnd’ message type with the @SyncId to the target site.
- On the receiving site, when the message arrives in the ConfigMgrDRSqueue, it invokes the spDRSActivation procedure, running as part of MessageHandlerService, which processes the DRS_SyncEnd message and updates the DRS_MessageActivity_Receive table with the last sync version received.
- We already know that part, but the receiving site also sends a SyncComplete message back to the initiator site, which is what marks the sync as complete.
It is in ProcessSyncEnd() as well
- spDRSSendSyncComplete on the receiving site calls EXEC dbo.spSendRcmServiceBrokerMessage @Msg, 'DRS_SyncComplete', 9, @SiteCode
- Note that DRS_SyncComplete is a ConfigMgrRCMQueue message, so the receiving site sends it to the ConfigMgrRCMQueue on the sending site.
- ConfigMgrRCMQueue receives the message, and the activation procedure for the queue, [spRCMActivation], is called.
Here is the logic for processing DRS_SyncComplete on the initiating site.
- As the final step, spDRSUpdateSendHistory simply updates ProcessedTime and SyncCompleteTime on the initiator site for the current sync, as below.
- That finally marks a complete sync, and only then is the replication group deemed ACTIVE; otherwise it won’t be ACTIVE.
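To make the SyncEnd / SyncComplete handshake concrete, here is a minimal Python model of the initiator's side. It is a sketch under stated assumptions, not the product's logic: the `Initiator` and `SendHistoryRow` classes and their method names are hypothetical, `uuid.uuid4()` stands in for NEWID(), and both timestamps are set to the same value for simplicity.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class SendHistoryRow:
    # Hypothetical stand-in for one DRSSendHistory record.
    sync_id: uuid.UUID
    replication_group: str
    target_site: str
    processed_time: Optional[datetime] = None      # None models NULL
    sync_complete_time: Optional[datetime] = None  # None models NULL


class Initiator:
    def __init__(self) -> None:
        self.history: Dict[uuid.UUID, SendHistoryRow] = {}

    def begin_sync(self) -> uuid.UUID:
        # One GUID per sync, reused for every message of that sync.
        return uuid.uuid4()

    def record_sync_end(self, sync_id: uuid.UUID, group: str, site: str) -> None:
        # Sending SyncEnd writes the row but leaves both times unset.
        self.history[sync_id] = SendHistoryRow(sync_id, group, site)

    def on_sync_complete(self, sync_id: uuid.UUID, now: datetime) -> bool:
        # Only a SyncComplete carrying a known sync id closes the sync.
        row = self.history.get(sync_id)
        if row is None:
            return False
        row.processed_time = now
        row.sync_complete_time = now
        return True

    def is_complete(self, sync_id: uuid.UUID) -> bool:
        row = self.history.get(sync_id)
        return row is not None and row.sync_complete_time is not None
```

In this model a sync whose SyncComplete never arrives, or arrives with an id the initiator does not recognize, stays incomplete indefinitely.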
Part 3 : Tracking the Sync Handles in Action
Knowing how these handles relate is very important for troubleshooting, because a mismatch here will not let a sync complete.
What can cause the handles to mismatch?
- A snapshot restore of the DB to an older date
- Someone manually cleaning the SSB_Dialogpool while the current sync is in progress.
So let’s try to track them for, say, the ConfigMgrDRS queue communications for NormalPriority messages for Configuration Data.
On the initiator primary site, I check the current handles for all conversations for global data.
These correspond to all the global groups that are allowed to send using normal priority, and we can confirm it here.
Now, how do we find which handle and group ID are currently being used by Configuration Data?
That’s the second one in the first list we shared above.
Note that this GroupID is not unique across instances, meaning that we cannot search for the same group ID on the receiving site.
There is in fact a different ID that is unique, [Conversation_ID], and we need to find the corresponding [Conversation_ID] for this group ID.
Now we can take this conversation_id and search on the receiving site.
From there we get the conversation_handle, and for things to work it should match the handle stored in DRS_MessageActivity_Receive, as we learned above.
And there we go: it matches perfectly, which means the sync completed just fine.
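The lookup chain just walked through (group ID on the initiator, then conversation ID, then handle on the receiver, then comparison with the stored handle) can be modelled as below. This is an illustrative Python sketch only: the dictionaries stand in for query results you would gather by hand, and the function and parameter names are hypothetical.

```python
import uuid
from typing import Dict


def handles_match(
    group_id: str,
    initiator_conversations: Dict[str, str],  # group id -> conversation_id (initiator side)
    receiver_handles: Dict[str, str],         # conversation_id -> conversation_handle (receiver side)
    stored_handle: str,                       # handle recorded in DRS_MessageActivity_Receive
) -> bool:
    """Follow group id -> conversation_id -> handle and compare with the stored handle."""
    conversation_id = initiator_conversations.get(group_id)
    if conversation_id is None:
        return False
    current_handle = receiver_handles.get(conversation_id)
    if current_handle is None:
        return False
    # Compare as GUIDs so letter case and surrounding braces do not matter.
    return uuid.UUID(current_handle) == uuid.UUID(stored_handle)
```

A False result corresponds to the mismatch case, where the stored handle no longer refers to the conversation currently in use.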
So it should now be clear that when a sync starts and we use a conversation handle, we store it in the DRS_MessageActivity_Receive table. If someone flushes the dialog pool (manually, or by restoring a snapshot of an old DB), this breaks the sync, and the handle stored there will not match the new handle.
As a result, we will see a sync that never completes in DRSSendHistory, where ProcessedTime and SyncCompleteTime never get populated.
There are ways out of this state: RLA remediates it with (exec spFlushSSBDialogPool), which has the logic to recover from it, but do not run this manually; always use RLA.
If this doesn’t fix the issue, engage Microsoft CSS for a deep dive; they might run some manual queries to get you out of this situation.