How to isolate and troubleshoot synchronize latency issue between master and slave for MySQL

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

In most cases, the latency between master and slave is caused by performance issue. Therefore, when investigating the synchronize lag, checking if workload on master increased will be a good start. In this article, you will learn how to start investigation and isolate the possible cause that impact the replicating latency.

 

Before start, it is critical to understand how replication in MySQL works. MySQL supports different types of synchronization. Azure Database for MySQL only supports the original type of synchronization called asynchronous replication(applied for both data-in replication and read-replica) , which one server acts as a master, while one or more other servers  act as slaves (max 5 replicas supported).

When replicating starts, a master server will write replicating events into the binary log (This log will only record the committed transactions). This is why enabling binary log is required when configuring replication at first place.

On slave servers, there will be two threads running for each replica, one called SQL thread, and the other is IO thread.

  • IO thread will connect to master server and request updated binary log to the master server. Once this thread received the binary log updates, it will save the update in a local log called relay log on a slave server.
  • SQL thread will read the relay log and apply the data change on slave servers.

The mechanism can be referred as following:

 

io.png

 

You can monitor the replication status from Portal Metrics blade. When observed latency occurred and is currently happening, an output of [SHOW SLAVE STATUS;] will give you enough information to understand what may cause the latency.

 

A sample output should be like below:

*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: testserver.mysql.database.azure.com
Master_User: replicationuser@testserver
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000191
Read_Master_Log_Pos: 103978138
Relay_Log_File: relay_bin.000568
Relay_Log_Pos: 13469095
Relay_Master_Log_File: mysql-bin.000191
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: 
Replicate_Ignore_DB: 
Replicate_Do_Table: 
Replicate_Ignore_Table: mysql.plugin
Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: mysql.\\_\\_%
Last_Errno: 0
Last_Error: 
Skip_Counter: 0
Exec_Master_Log_Pos: 13468882
Relay_Log_Space: 103978599
Until_Condition: None
Until_Log_File: 
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File: c:\\work\\azure_mysqlservice.pem
Master_SSL_CA_Path: 
Master_SSL_Cert: c:\\work\\azure_mysqlclient_cert.pem
Master_SSL_Cipher: 
Master_SSL_Key: c:\\work\\azure_mysqlclient_key.pem
Seconds_Behind_Master: 648
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error: 
Last_SQL_Errno: 0
Last_SQL_Error: 
Replicate_Ignore_Server_Ids: 
Master_Server_Id: 943553508
Master_UUID: eba67879-7232-11e9-b113-6c4d3f52df68
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Master_Retry_Count: 86400
Master_Bind: 
Last_IO_Error_Timestamp: 
Last_SQL_Error_Timestamp: 
Master_SSL_Crl: 
Master_SSL_Crlpath: 
Retrieved_Gtid_Set: 
Executed_Gtid_Set: 
Auto_Position: 0
Replicate_Rewrite_DB: 
Channel_Name: 
Master_TLS_Version: 
1 row in set (0.00 sec)

 

There will be many information in the output but normally only below columns will be focused:

  • Slave_IO_State: this will tell what the current IO thread status is. Normally, the status will be like "Waiting for master to send event" if it is synchronizing. However, if you see status like "Connecting to master", it means the slave lost connections to the master. Please check if the master is running or firewall blocks the connection. 
  • Master_Log_File: this will tell which binary log file the master is writing.
  • Read_Master_Log_Pos: this represents which position in above binary log file the master is writing. 
  • Relay_Master_Log_File: this represents which binary log file the slave server is reading from master.
  • Slave_IO_Running: this indicates whether IO thread is running. It should be "Yes". If "NO", then most likely the replication is broken.
  • Slave_SQL_Running: this indicates whether SQL thread is running. It should be "Yes". If "NO", then most likely the replication is broken.
  • Exec_Master_Log_Pos: this tells you which position of the above Relay_Master_Log_File the slave is applying. If there is latency, this position sequence should be smaller than Read_Master_Log_Pos.
  • Relay_Log_Space: the upper limit of relay log size. You can check the size by querying [show global variables like "relay_log_space_limit";].
  • Seconds_Behind_Master: this means replication latency in seconds.
  • Last_IO_Errno: IO thread error code if any. The code can be referred at https://dev.mysql.com/doc/refman/5.7/en/server-error-reference.html.
  • Last_IO_Error: IO thread error message if any.
  • Last_SQL_Errno: SQL thread error code if any. The code can be referred at https://dev.mysql.com/doc/refman/5.7/en/server-error-reference.html.
  • Last_SQL_Error:  SQL thread error message if any.
  • Slave_SQL_Running_State: this indicates the current SQL thread status. Please note that "System lock" shown in this state is a normal behavior. And as the example shown above, "Waiting for dependent transaction to commit" is also normal, which means that it is waiting for master to update committed transactions.

Now let's see how to use the above information to isolate the latency cause.

 

If below scenario met, then it is most likely that internet latency is the cause. As seen, the IO thread is running and waiting on master, and master has already written on binary log file #20 while slave can only get file #10. Since the IO thread connects to master via TCP/IP, the only impact factor here is the internet speed.

Slave_IO_State: Waiting for master to send event
Master_Log_File: the binary file sequence is larger then Relay_Master_Log_File, e.g. mysql-bin.00020
Relay_Master_Log_File: the file sequence is smaller then above, e.g. mysql-bin.00010

 

If below scenario met, then it is most likely that the workload on master is too heavy to burden slave. For the case below, though the slave get the binary log behind the master, the slave IO thread indicated that the relay log space is full already. So it is not the internet speed to make the slave get binary log delay because the slave has already been trying to catch up everything it can but the updated binary log size is exceeded the upper limit of relay log space.

Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space
Master_Log_File: the binary file sequence is larger then Relay_Master_Log_File, e.g. mysql-bin.00020
Relay_Master_Log_File: the file sequence is smaller then above, e.g. mysql-bin.00010

 

If below scenario met, then it is most likely that the slave is getting slower somehow. For below cases, both IO and SQL thread are running well and the slave is reading the same binary log file as the master writes. However, there is still latency occurred. Normally, this could be caused by slave server itself. Please check if the CPU consumption or IOPS of slave is high.

Slave_IO_State: Waiting for master to send event
Master_Log_File: The binary log file sequence equals to Relay_Master_Log_File, e.g. mysql-bin.000191
Read_Master_Log_Pos: The position of master server written to the above file is larger than Relay_Log_Pos, e.g. 103978138
Relay_Master_Log_File: mysql-bin.000191
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: The position of slave reads from master binary log file is smaller than Read_Master_Log_Pos, e.g. 13468882
Seconds_Behind_Master: There is latency and the value here is greater than 0

Please note that there is one exception for the scenario above. At Azure, the MySQL replication is optimized for high concurrency environment and so the parameter binlog_group_commit_sync_delay is configured. It controls how many microseconds the binary log commit waits before synchronizing the binary log file. The benefit is that, instead of applying every transaction committed immediately, the master will send the binary log update in bulk. It will reduce IO on slave and so to improve performance. However, for some edge cases, which is not a high-concurrent situation, e.g. one connection only run one transaction and then the connection closed, this setting of binlog_group_commit_sync_delay will bring serious latency because IO thread is waiting for bulk binary log update while only one transaction is committed, which means the waiting effort is in vain. 

 

In most cases, the latency cause will fall into the above scenarios. I hope the above information helps. Please feel free to post comments or reach out to me for a discussion.

 

Thank you!

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.