Why Restarting Azure Database for PostgreSQL took long?

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

PostgreSQL is a transactional database, and it keeps a record for all DML operations and transactions like Update, Insert, and Delete in WAL (Write-Ahead Log) file. WAL logs are PostgreSQL transaction log files used to ensuring data integrity. This log is written after changes to have been applied to the record and eventually once reaching the checkpoint threshold the log file will be flushed from memory to storage to save it permanently.

A checkpoint is a point in the write-ahead log sequence at which all data files have been updated to reflect the information in the log. All data files will be flushed to disk. Refer to WAL Configuration for more details about what happens during a checkpoint.

Why an Azure Database for PostgreSQL restarts?

The Azure Database for PostgreSQL – Single Server service provides a guaranteed high level of availability with the financially backed service level agreement (SLA) of 99.99% uptime. Azure Database for PostgreSQL provides high availability during planned events such as user-initiated scale compute operation, and when unplanned events such as underlying hardware, software, or network failures occur. Azure Database for PostgreSQL can quickly recover from most critical circumstances, ensuring virtually no application down time when using this service. Refer to High availability in Azure Database for PostgreSQL – Single Server

Causes of server restarts:

Planned Server Restarts

User initiated management operation (scale Vcores, pricing tier, etc.)
User Initiated restarts: Users may restart their server if they are performing a server update that requires restart such as changing a static server parameter.
Planned Maintenance: Azure Database for PostgreSQL performs periodic maintenance to keep your managed database secure, stable, and up to date. During maintenance, the server gets new features, updates, and patches. Refer to: Planned maintenance notification in Azure Database for PostgreSQL. You may sign up for notification to be prepared for this scheduled maintenance using this tutorial.

Unplanned downtime
Unplanned downtime can occur because of unforeseen failures, including underlying hardware fault, networking issues, and software bugs. If the database server goes down unexpectedly,

What causes long recovery on Azure Database for PostgreSQL
Recent checkpoints are critical for fast server recovery. Once a restart happens, either it was a new instance (failover to healthy instance) or same instance (in-place restart) will connect to disk that has all logs, all WAL logs after the last successful checkpoint need to be applied to the data pages before the server starts to accept connections again. Those logs are called REDO logs and will be applied via the recovery operation. Applying a WAL log runs through the following steps:

Reading WAL file to see all the transactions inside it and all database objects and correspondently their data pages/blocks that need to be updated.
Fetching those pages/blocks into memory
Updating them using WAL file content.

Recovery time depends on how recent the last checkpoint was and the amount of data inside those log files, that said, the best practice is that application developer needs to avoid log running transactions and tune checkpoint frequency to avoid long recovery.

Checkpoint Frequency:

Checkpoint frequency can be adjusted by configuring server parameters. controlled by checkpoint_completion_target which determines the total time between checkpoints. Another parameter that you may consider is bgwriter_delay which specifies the delay between activity rounds for the background writer.

Long running transactions:
Long running transactions are queries that are running for too long which impact database perfromance and can potentially cause issues during restarts. You may check all running transactions by querying pg_stat_activity, to list queries which running for more than 3 minutes, use the following query:

select current_timestamp-query_start as runtime,

datname,usename, query FROM pg_stat_activity

where state='active' and current_timestamp-query_start> '3 min'

order by 1 desc;

Please note that you can kill any long running PID using pg_terminate_backend. Let’s say you have PID “12345” and you want to kill this process, you may simply run the following query to kill it.

select pg_terminate_backend(pid)

from pg_stat_activity

where pid = ‘12345’;

If you want to kill all process on the server, run the following command:

SELECT pg_terminate_backend(pg_stat_activity.pid)

FROM pg_stat_activity

WHERE pg_stat_activity.datname = 'TARGET_DB' -- ← change this to your DB

AND pid <> pg_backend_pid();

How to Prepare for Azure Database for PostgreSQL planned server restart

Now that we have learned what causes long recovery, we will need to prepare for planned restarts either it was user initiate or system initiated.

Ensure no long running transactions.
As discussed earlier recovery rolls back all inflight but uncommitted transactions at the time of restart, and during this time the database is unavailable for additional requests. If there are any large transactions to rollback this activity takes time proportional to the size of the active transactions, check for long running transactions before initiating a restart and/or before maintenance window.
Stop or reduce the application intensity workload.
This should reduce the amount of traffic on the server, if you are planning to resize your server or know that maintenance is happening on your server, consider either stopping your application or reduce the amount of workload at that time which will significantly reduce the down time for the server.
Force a manual Checkpoint.
Manual checkpoint or even a recursive manual checkpoint helps tremendously in reducing recovery time. Please note that checkpoint is per database and not per server, if you have more than one database in your Azure Database for PostgreSQL server, consider running double checkpoints back-to-back before initiating the restart.

Please note if your server is under status restarting, scaling, restoring please avoid initiating another restart, this will cause longer recovery. You may check the server status on Azure Portal Overview blade for your Azure Database for PostgreSQL Server.

Leave a Reply Cancel reply