The Persistent Chat database connection was lost – Event ID 53530

This post has been republished via RSS; it originally appeared at: Skype for Business Blog articles.

First published on TECHNET on Apr 16, 2016
We have seen the below issues with customers using PCHAT/Group Chat;

Issue 1: LYNC client would time and again display an error stating Persistent Chat lost connection to the Database.

Issue 2: Pchat servers Crash/Hang or loose connection to the SQL database Often

Issue 3: We see Event ID 53530 Fairly regularly on the PCHAT Servers with Error Description "The Persistent Chat database connection was lost"

Issue 4: We also see the following Warning Events on the Persistent Chat servers time and again,

Notice: Info 9/29/2014 10:38:29 AM customer.com domain is no longer busy and can send us incoming messages normally.

Notice: Warning 9/29/2014 10:38:13 AM customer.com domain is too busy to send us incoming messages.

Notice: Info 9/29/2014 10:38:01 AM customer.com domain is no longer busy and can send us incoming messages normally.



We have come across Multiple cases where customers with large AD infrastructure were seeing one of the above Issues/Symptoms on their PCHAT Servers.

Although some of these warnings self-correct themselves within a very short period of time we tried to focus on finding out why we see these warnings/errors. Because it seemed like every time these errors occur  they would cause some or the other issue and the behavior was very Intermittent.



While working on the issue with help of SQL engineers we found that there are several Instances where we see SQL Deadlocks.

When we Investigated further we found that a Certain Stored procedure [dbo].[procProcessADUpdates] seems to be trying to read certain Tables in SQL for a span of about 5-10 minute and this causes the deadlocks.

We then tried to correlate to see if the time when the Deadlocks happen matches with the time we see the errors in the Event logs on the Persistent Chat servers.

To find this we ran SQL PSSDIAG only configured to log instances where this specific Stored Procedure [dbo].[procProcessADUpdates] ran and later analyzed the Results.



Our Priority was to establish a confirm cause and effect relationship between the Errors happening on Persistent Chat and Events taking place on SQL at the same time.

We were able to identified a set Pattern on when the Errors appear on the Persistent Chat server.

The errors happen for a continuous interval of 10 minutes and they do not happen after that for about 101-15 minutes, then they reappear and constantly happen for another period of 10 minutes.



We collected Event logs from the Persistent Chat server and Traces for ProcessADUpdate stored procedure on SQL server.

These errors happened on the Persistent Chat server several times.

The Stored Procedure ran on SQL several times.



The Traces on SQL were collected between 10 am and 4 PM

Between 10 am and 4 PM there were several times when the stored procedure ProcessADUpdate on SQL ran but did not complete



First Occurrence of Stored Procedure running on SQL

Start time                                End time

10:12:09.080                          10:22:09.103



We saw the Event errors on PERSISTENT CHAT Server several times within this 10 minute window when the stored procedure was running between 10.12 am and 10.22 am.

Pic1

If you see the Highlighted Section above you will see that the events happened between 10.13 am and stopped at 10.20 am. This is exactly the time when the Stored procedure started and stopped. After 10.20 am we did not have any events till 10.33 am.



The next error event was logged at 10.33 am and we kept getting these errors until 10.41 am

Pic2

So we saw the errors again between 10.33 and 10.41 am and this is exactly the time when the stored procedure started again.

Start Time         End Time

10:32:26.223    10:42:26.237



Below are all the times when the Stored procedure Ran on SQL

Duration          Start time                                  End time                                           Object name

600022417      2015-01-29 10:12:09.080       2015-01-29 10:22:09.103            procProcessADUpdates

600013203      2015-01-29 10:32:26.223       2015-01-29 10:42:26.237            procProcessADUpdates

600104326      2015-01-29 10:52:44.880       2015-01-29 11:02:44.983            procProcessADUpdates

600015001      2015-01-29 11:12:56.007       2015-01-29 11:22:56.023            procProcessADUpdates

600022547      2015-01-29 11:33:15.580       2015-01-29 11:43:15.603            procProcessADUpdates

600022349      2015-01-29 11:53:29.437       2015-01-29 12:03:29.460            procProcessADUpdates

600024296      2015-01-29 12:13:33.150       2015-01-29 12:23:33.173            procProcessADUpdates

600058078      2015-01-29 12:34:17.830       2015-01-29 12:44:17.887            procProcessADUpdates

600044520      2015-01-29 12:54:58.053       2015-01-29 13:04:58.097            procProcessADUpdates

600045646      2015-01-29 13:15:04.677       2015-01-29 13:25:04.723            procProcessADUpdates

600071117      2015-01-29 13:35:07.493       2015-01-29 13:45:07.563            procProcessADUpdates

600039783      2015-01-29 13:57:46.727       2015-01-29 14:07:46.767            procProcessADUpdates

600014348      2015-01-29 14:18:06.810       2015-01-29 14:28:06.823            procProcessADUpdates

600014660      2015-01-29 14:38:25.600       2015-01-29 14:48:25.613            procProcessADUpdates

600018793      2015-01-29 14:58:44.330       2015-01-29 15:08:44.347            procProcessADUpdates

600008529      2015-01-29 15:18:51.800       2015-01-29 15:28:51.807            procProcessADUpdates

600011609      2015-01-29 15:39:00.673       2015-01-29 15:49:00.683            procProcessADUpdates



And on all these times we saw errors on the Persistent Chat servers exactly.

This gave us a Proof that the running of this stored procedure to some extent is directly proportional to the Error Events on the FE server in OUR case..

Which then led us to find out more information on the Stored Procedure that was causing this issue,



About sproc ‘inprocProcessADUpdates’

In the Persistent Chat server, an ActiveDirectoryWatcher’s job is to periodically (default is every 10 minutes) query the AD for any updates and if there are any, the corresponding principals are then updated in the Lync DB and also updates the AD sync cookie. The sproc ‘inprocProcessADUpdates’ finds the principals that were recorded as out of sync with AD and updates the data in the Lync database. The default timeout given for executing this SPROC is 10 minutes which is hardcoded in the code. By ‘hardcoded’ I mean this value cannot be customized. While working with the Product developers we learnt that “this command could take a long time to process based on the complexity of the data”. This is the Reason why a Time Out value was configured to ensure the SPROC does not keep running forever. The Time Out Value as mentioned is 10 minutes. It is NOT recommended to change the Time Out value of this SPROC. However we do Have the ability to Change the Frequency of its Run, meaning we can change how often we want it to Run. Based on the code this can be done by modifying the AD sync frequency timer by setting the adsynchfreq attribute in the server’s configuration file (.exe.config).



About sproc ‘procGetDirtyPrincipalIds’

The sproc gets a list of all principals that are dirty (principals that are out of sync with AD) in the Lync DB and need to be refreshed from AD. It gets batch of dirty principals from the DB at a time and updates those. Default batch size is 20 but is configurable by setting “adupdate/batchsize” in the config file for the server.



Based on the above information about these Stored procedures we learnt the following,

  1. There is no way to modify the time out time of the stored procedure ‘inprocProcessADUpdates’

  2. We do have the ability to modify how often it runs.




Recommendation:

If you are Witnessing the same symptoms as described in the beginning of this Article under Symptoms. Then It is possible you are also running into this problem.

In that case We would recommend the following.



Monitor the Run Cycle of the Stored Procedure [dbo].[procProcessADUpdates]

To confirm that this Stored Procedure is creating an issue intermittently in your PCHAT Environment you can engage a SQL engineer and Run PSSDIAG for the Stored Procedure [dbo].[procProcessADUpdates]

You can run this for the entire day. At the end of the day Note down The Total number of times the SPROC ran and then find out How many times out of that did it Time out at the 10 Minute Mark.

You will be able to see the Exact start and Stop times of the SPROC and for the instances where it Ran for entire 10 Minutes and timed out, Check the Application Event logs on the PCHAT server to see if you had any warnings/error events relate to database connectivity being lost/unavailable at the same time.

Even If you don’t see Errors/warning in the Event logs but you do see that the SPROC ran Multiple Times and timed out after 10 minutes on most occasions then I would still consider this to be a problem that can cause database connectivity issues, Performance issues etc.



Chances are that in a Large AD infrastructure you may encounter this scenario where this Particular SPROC does not complete its execution in 10 minutes and hence times out. The problem is That every time this SPROC runs it Locks certain SQL tables and these tables become unavailable for that amount of time until the SPROC finishes. As a result if some other Application/Sproc tries to access these tables at the same time it will NOT be able to and this will cause Intermittent connectivity issues. You may not be able to find a way to speed up the SPROC such that it completes every time within 10 Minutes or Never times out. Hence to Solve this issue one approach you can use is to Modify the Frequency of How Often this SPROC should run.



By Design This SPROC is programmed to Run Every 10 Minutes. Since this is a Very Intensive Operation it is Ideal NOT to run this SPROC every 10 Minutes specially in a Large AD environment.

So in Short, If you find out that This SPROC Runs Multiple times in your Set up and Does NOT complete every time successfully but Times OUT often then It is Advisable to Change the Frequency of how Often it Runs.

To do this you can follow the Below Instructions,



Modify the Frequency of the SPROC [dbo].[procProcessADUpdates]



Step 1:

Stop LYNC Services on the PCHAT Server

Stop-cswindowsService



Step 2:

Modify the Frequency of the Stored procedure

On the SQL server hosting the Persistent Chat Database (MGC)

Expand the MGC database

Tables

Select dbo.tblConfig

Right Click dbo.tblConfig and select edit top 200 Rows

On the results window under ConfigContent you will see a Long XML file

In this section Modify the Attribute (numerical value) between <adsynchfreq> 10  </adsynchfreq> In your Environment this value will be set at 10 change it to 480 this will mean that the Stored Procedure will run every 8 hours.



[NOTE: There is NO Set Guidance on what this Value should be changed to. You have to make this decision based on your Requirement. Every time you make a change in your Environment related to PCHAT (for example - add a user to Room, Create a Room etc.) This Value Decides how soon will that change take effect. If you do not make changes that often to the PCHAT Settings than it is best to keep this Value as high as possible to avoid Unnecessary running of the Stored Procedure as it is a very intensive Operation. In My example I have set the duration to 480 minutes which is 8 Hours]



Once you make this change Exit out of SQL Server Management studio and re-open it and verify the change stays.

Refer Screen Shot below for reference

pic3



Step 3:

Start LYNC services on the Persistent Chat Servers

Once the Above changes are made you would need to Restart the Persistent Chat services on the Persistent Chat server that is currently Live in your environment.

You can either Restart the Persistent Chat server itself or just restart the Persistent Chat services by running the following commands from the windows powershell

Start-cswindowsService



Step 4:

Note Down the EXACT Time when the Server/Services are restarted. (Very Important)

This is the time that will be used as the Reference point by the Stored Procedure to decide when should it run Next.

For Example If you set the <Adsynchfreq> to 480 (8 hours) and you restarted the Server at 1 PM. Then the Stored Procedure will Run After 8 hours from this point onwards so the Next run will be at 9 PM.

I would recommend you restart the Persistent Chat service at a time that will make sure that the Sproc does not run during normal/busy business hours.



Step 5

Repeat Steps 1-5 On all PCHAT Servers in your Environment.



THE END





Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.