Persistent Chat rooms lock up and become unavailable when a user is added to or removed from a room or category

This post has been republished via RSS; it originally appeared at: Skype for Business Blog articles.

First published on TECHNET on Feb 16, 2018

The latest update for Lync Server 2013 (July 2017) and the May 2017 update for Skype for Business Server 2015 each include a fix for this issue.

Even with these updates (or a later CU) installed, the issue can persist in the environment. When it does, events like the following are logged:




Log Name:      Lync Server
Source:        LS Persistent Chat Server
Date:          10/10/2017 1:01:02 PM
Event ID:      53508
Task Category: (1098)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PCHATServer.contoso.com
Description:
Failed to release the admin lock. Administrative command processing cannot proceed.








Log Name:      Lync Server
Source:        LS Persistent Chat Server
Date:          10/10/2017 5:44:36 AM
Event ID:      53555
Task Category: (1098)
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      PCHATServer.contoso.com
Description:
An inconsistent state between the server cache and the database was detected and the server cache will be reloaded.
The Persistent Chat server will reload its cache from the database.
Cause: This can be caused by Persistent Chat servers failing to communicate with each other.








Log Name:      Lync Server
Source:        LS Persistent Chat Compliance Server
Date:          10/8/2017 1:29:31 PM
Event ID:      53106
Task Category: (1097)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PCHATServer.contoso.com
Description:
Unable to save message 10/8/2017 8:24:59 PM PART ma-chan://contoso.com/6f41dceb-69ae-434a-9699-123e8eb5f675  0 39000 to database due to exception:
CmdID: c5409a64-b11d-4d49-90f5-fa694cd4555f The server could not restore db connection within the allowed time (00:10:00) using connection string: Data Source=sql01.contoso.com\RTC;Initial Catalog=mgccomp;Integrated Security=SSPI;Failover Partner=sql02.contoso.com\RTC. at
at Microsoft.Rtc.Internal.Chat.Server.ServerCommon.Database.DbCommand.executeUntilSuccessOrTimeout[TR](Fun`2 executeDelegate, RetryInfo retryInfo)
at Microsoft.Rtc.Internal.Chat.Server.ServerCommon.Database.DbCommand.executeImp[TR](Fun`2 executeDelegate, Int32 retryTimeoutInMs)
at Microsoft.Rtc.Internal.Chat.Server.ServerCommon.Database.DbCommand.ExecuteNonQuery(Int32 retryTimeoutInMs)
at Microsoft.Rtc.Internal.Chat.Server.Compliance.ComplianceDataAccess.Save(RawComplianceData data)
at Microsoft.Rtc.Internal.Chat.Server.Compliance.ComplianceServer.Save(RawComplianceData data).







This issue stems from design and scalability. When Persistent Chat servers were designed, it was not expected that users would be added or removed on a continual basis. In addition, to ensure that only participants who belong to a chat room have access, the server verifies the permissions for every user, every category, and every chat room, even when only a single user is added or removed. This works well in small environments, but it does not scale as usage grows.

To address this, we added a new flag that changes the behavior: when it is set, no checks are performed and the membership change is simply applied. What does this mean in daily usage? Today, if a user is removed from a chat room, the client loses access to that room immediately. With the flag set, businesses trade that immediacy for performance and for avoiding SQL lock-ups: access to the chat room may not be revoked until the client signs out and signs back in.
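As a rough illustration of why re-verifying everything on each change stops scaling, here is a toy estimate in Python. It is only a sketch: the function name, the `skip_checks` parameter, the multiplicative cost model, and the sample numbers are all assumptions made for illustration, not the server's actual algorithm or measured figures.

```python
def permission_checks_per_change(users: int, categories: int, rooms: int,
                                 skip_checks: bool = False) -> int:
    """Toy estimate of the permission checks triggered by ONE membership change.

    Assumption: every user is re-verified against every category and every
    chat room. With skip_checks=True the change is applied with no checks.
    """
    if skip_checks:
        return 0
    return users * (categories + rooms)


# Hypothetical deployment sizes, chosen only to show the growth.
small = permission_checks_per_change(users=200, categories=5, rooms=40)
large = permission_checks_per_change(users=20_000, categories=50, rooms=4_000)

print(small)  # 9000
print(large)  # 81000000
```

Under this model a single add or remove in the larger deployment costs several thousand times more work than in the small one, which is the kind of growth that the flag avoids by skipping the checks entirely.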




RESOLUTION:




Connect to the MGC database in your environment and retrieve the contents of the dbo.tblConfig table. It should look like this:





configLabel    configPoolID                            configContent
pool           9CFB3493-89B2-447C-8487-9C19C13E1694    <?xml version="1.0"....
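Retrieving that row can be sketched in Python as follows. This is a minimal sketch rather than a supported tool: the function name `fetch_pool_config` is hypothetical, and it assumes a DB-API connection that accepts `?` parameter markers (for example, one opened with pyodbc against the MGC database). The table and column names are the ones shown in the output above.

```python
def fetch_pool_config(conn):
    """Return (configPoolID, configContent) pairs for rows labelled 'pool'.

    `conn` is assumed to be a DB-API 2.0 connection whose driver uses
    qmark ('?') parameter style.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT configPoolID, configContent "
        "FROM dbo.tblConfig WHERE configLabel = ?",
        ("pool",),
    )
    return [(str(pool_id), content) for pool_id, content in cur.fetchall()]


# Hypothetical usage, assuming pyodbc and a SQL Server ODBC driver are installed
# (server name and driver name are placeholders, adjust to your environment):
#   import pyodbc
#   conn = pyodbc.connect(
#       "DRIVER={ODBC Driver 17 for SQL Server};"
#       "SERVER=<your SQL instance>;DATABASE=mgc;Trusted_Connection=yes"
#   )
#   for pool_id, content in fetch_pool_config(conn):
#       print(pool_id, content[:60])
```

Reading the table this way is non-destructive, so it is a reasonable first step before deciding whether to change anything.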





We are interested in the configContent column. It should look like this:








<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<configuration version="1">
  <pool>
    <db>
      <retry_ms>600000</retry_ms>
      <lossdetection_ms>120000</lossdetection_ms>
    </db>
    <channelserver>
      <ADConnect>
        <GlobalCatalog>
          <findgc>True</findgc>
          <host></host>
          <adsynchfreq>480</adsynchfreq>
        </GlobalCatalog>
      </ADConnect>
      <adupdate>
        <batchsize>5000</batchsize>
        <sleeptime_ms>10000</sleeptime_ms>
        <accesspoll_ms>604800000</accesspoll_ms>
        <accesspoll_size>50</accesspoll_size>
        <accesspoll_enabled>False</accesspoll_enabled>
      </adupdate>
      <serverbackchat>
        <cache_size_limit>2500000</cache_size_limit>
      </serverbackchat>
      <watermarks>
        <batch_message_count_max>20</batch_message_count_max>
        <async_send_max>100</async_send_max>
        <async_send_max_lo>90</async_send_max_lo>
        <outbound_queue_max>100000</outbound_queue_max>
        <outbound_queue_max_lo>90000</outbound_queue_max_lo>
        <low_priority_queue_max>500</low_priority_queue_max>
        <inbound_queue_size_max>10000</inbound_queue_size_max>
        <channelinvitemax>50</channelinvitemax>
      </watermarks>
    </channelserver>
    <webservice>
      <maxchunksizeinkb>1024</maxchunksizeinkb>
    </webservice>
  </pool>
</configuration>








Note the <serverbackchat> section. Edit the contents to insert the line <notify_users>0</notify_users> inside it, directly after <cache_size_limit>, as shown in the modified configuration that follows. Once this is done, we recommend restarting the Persistent Chat services.












<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<configuration version="1">
  <pool>
    <db>
      <retry_ms>600000</retry_ms>
      <lossdetection_ms>120000</lossdetection_ms>
    </db>
    <channelserver>
      <ADConnect>
        <GlobalCatalog>
          <findgc>True</findgc>
          <host></host>
          <adsynchfreq>480</adsynchfreq>
        </GlobalCatalog>
      </ADConnect>
      <adupdate>
        <batchsize>5000</batchsize>
        <sleeptime_ms>10000</sleeptime_ms>
        <accesspoll_ms>604800000</accesspoll_ms>
        <accesspoll_size>50</accesspoll_size>
        <accesspoll_enabled>False</accesspoll_enabled>
      </adupdate>
      <serverbackchat>
        <cache_size_limit>2500000</cache_size_limit>
        <notify_users>0</notify_users>
      </serverbackchat>
      <watermarks>
        <batch_message_count_max>20</batch_message_count_max>
        <async_send_max>100</async_send_max>
        <async_send_max_lo>90</async_send_max_lo>
        <outbound_queue_max>100000</outbound_queue_max>
        <outbound_queue_max_lo>90000</outbound_queue_max_lo>
        <low_priority_queue_max>500</low_priority_queue_max>
        <inbound_queue_size_max>10000</inbound_queue_size_max>
        <channelinvitemax>50</channelinvitemax>
      </watermarks>
    </channelserver>
    <webservice>
      <maxchunksizeinkb>1024</maxchunksizeinkb>
    </webservice>
  </pool>
</configuration>
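The edit shown above can also be sketched programmatically. The following Python snippet is a minimal illustration using only the standard library; the function name `add_notify_users` is hypothetical, and it assumes the configContent value has already been read into a string. It does not write anything back to the database.

```python
import xml.etree.ElementTree as ET

# ElementTree drops the XML declaration when serializing to str,
# so it is re-added by hand to match the original content.
DECLARATION = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>'


def add_notify_users(config_xml: str, value: str = "0") -> str:
    """Return config_xml with <notify_users> set inside <serverbackchat>."""
    root = ET.fromstring(config_xml)  # root is <configuration>
    backchat = root.find("./pool/channelserver/serverbackchat")
    if backchat is None:
        raise ValueError("no <serverbackchat> element found")

    node = backchat.find("notify_users")
    if node is None:
        # Appended as the last child, i.e. after <cache_size_limit>.
        node = ET.SubElement(backchat, "notify_users")
    node.text = value

    return DECLARATION + "\n" + ET.tostring(root, encoding="unicode")
```

Running the function twice leaves a single <notify_users> element, so it is safe to apply to a configuration that may already have been edited. Keep a copy of the original configContent before changing it.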








For user removals, you may be able to run Revoke-CsClientCertificate for the removed user; the user is then signed out from all endpoints that do not use UCWA. They can then sign in again and continue using the service. Note that this cmdlet may disrupt any calls, conferences, or IM conversations the user is in.





Please check your business requirements and the available trade-offs to decide if you want to proceed with altering the configuration. Also note that a service restart is required.
