No matter where you are running production PostgreSQL—on a VM or with PaaS in the cloud—at a high level there are four key system metrics to keep your eye on: CPU, memory, IO, and storage used. On a regular day, these metrics can be the leading indicators of something going wrong on your Postgres server. And when you're troubleshooting a thorny issue, they can point you in the right direction to focus your investigation.
Let's look at what each utilization metric might be telling you about Postgres, and how you can dive deeper:
CPU—not all high CPU is bad
A healthy average for CPU is typically below 75%; occasional brief spikes above that are fine. If, however, you find that CPU is steadily increasing over a longer period, or is frequently at 85% or higher, you should investigate. A CPU that busy will result in slower queries.
If CPU starts to grow unexpectedly, one likely suspect is connections. Is there an increase in the number of connections to the database? You should also look at the frequency of new database connections: creating many short-lived connections imposes significant overhead on the server, and, as it happens, short and frequent connections are the hallmark of most web applications. Use a connection pooler like pgbouncer to mediate between your apps and the database.
When CPU is high, it's helpful to narrow down which queries are running. You can do this in three different ways:
- The pg_stat_activity view provides rich details on each running Postgres process, including current query, state, and wait events. pg_stat_activity is on every Postgres server without any additional setup.
- Another useful tool is pg_stat_statements, an extension that ships with Postgres. You need to preload it (by adding it to shared_preload_libraries) before you can use it. While pg_stat_activity provides general process info, pg_stat_statements dives deeper into per-query statistics, such as the minimum and maximum time spent.
- In addition to the two views above, Azure Database for PostgreSQL users also have access to Query Store for historical query stats. Query Store chunks query statistics into time windows so you can see how a query’s performance changes over time. For example, if you know that CPU% started climbing at 10am, you can compare statistics from queries before 10am versus afterwards, making it easier to narrow down where to investigate.
We recommend Query Store as the preferred query statistics tool for Azure Database for PostgreSQL customers because it adds less overhead to the server and provides a historical view.
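If you want a quick starting point, a query along these lines pulls the statements that have accumulated the most execution time out of pg_stat_statements (on Postgres 13 and later the timing columns are named total_exec_time and mean_exec_time instead):

```sql
-- Top 10 statements by total execution time, from the pg_stat_statements extension.
-- On Postgres 13+ substitute total_exec_time / mean_exec_time for total_time / mean_time.
SELECT queryid,
       calls,
       total_time,
       mean_time,
       left(query, 80) AS query_snippet
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```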
You should also watch out for any transaction that runs longer than you expect it to. The transaction may be stuck and hogging CPU. Querying pg_stat_activity will help you identify the current long-running processes on your Postgres server.
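For example, something along these lines lists transactions that have been open for more than five minutes; the cutoff is arbitrary, so adjust it to what is normal for your workload:

```sql
-- Transactions that have been open longer than five minutes, oldest first.
SELECT pid,
       usename,
       state,
       wait_event_type,
       wait_event,
       now() - xact_start AS transaction_age,
       left(query, 80)    AS query_snippet
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY transaction_age DESC;
```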
Memory—don’t forget to free up unused memory
Connections are a frequent culprit when memory is tied up. Typically, every Postgres connection, even an idle one, occupies about 10MB of memory. You may not know you have idle connections because they are being created on your behalf by your framework of choice. To check the number of active and idle connections to your database, you can again turn to pg_stat_activity.
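A simple aggregation by state is usually enough, for example:

```sql
-- Count connections by state; a large 'idle' bucket is usually a sign of unpooled connections.
SELECT state, count(*)
FROM pg_stat_activity
WHERE state IS NOT NULL   -- rows without a state belong to background processes
GROUP BY state
ORDER BY count(*) DESC;
```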
Connection pooling can help you keep idle memory in check. But remember that not all connection poolers are created equal. A Postgres-aware pooler like pgbouncer releases connections when they are no longer needed, keeping the number of idle connections down.
While you investigate what's holding on to memory, consider the number of schemas that your connections are accessing. Actual per-connection memory can be much larger than the 10MB estimate above, because a connection caches information about each table it has encountered.
Memory is also consumed by the different workers that run on your server, including background workers and autovacuum workers. Remember that unlike shared_buffers, which is a fixed amount of shared memory for the whole server, parameters like work_mem, maintenance_work_mem, and autovacuum_work_mem apply per process. This means the setting is multiplied across the number of active processes that need that kind of memory.
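To see how these parameters are currently set on your server, you can query pg_settings, for example:

```sql
-- Current values of the shared and per-process memory parameters discussed above.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem',
               'maintenance_work_mem', 'autovacuum_work_mem');
```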
IO—a low cache hit ratio could show the cause of high IO
Use pg_stat_statements (or Query Store) to identify which queries have the most effect on IO. They will be the queries that have the highest number of blocks read, written, or dirtied. You can use Postgres’s EXPLAIN ANALYZE to examine the actual execution plans for these queries, watching out for those that do full table scans on large tables.
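A sketch of that approach: rank statements by the shared blocks they touch, then run EXPLAIN (ANALYZE, BUFFERS) on the suspects. (The block counters below have the same names on newer Postgres versions, even where the timing columns were renamed.)

```sql
-- Statements that read, dirty, or write the most shared blocks (pg_stat_statements).
SELECT queryid,
       calls,
       shared_blks_read,
       shared_blks_dirtied,
       shared_blks_written,
       left(query, 80) AS query_snippet
FROM pg_stat_statements
ORDER BY shared_blks_read + shared_blks_dirtied + shared_blks_written DESC
LIMIT 10;

-- Then inspect the plan and buffer usage of a suspect query, for example:
-- EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
```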
For read-heavy queries, ideally your cache should be well utilized, reducing expensive access to disk. You can calculate your cache hit ratio (the ratio of blocks read from cache to total blocks read) from Postgres's block-level statistics.
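One way to compute it, using the pg_statio_user_tables statistics view:

```sql
-- Share of table blocks served from shared buffers rather than read from disk.
SELECT sum(heap_blks_hit)  AS blocks_from_cache,
       sum(heap_blks_read) AS blocks_from_disk,
       sum(heap_blks_hit)::float
         / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS cache_hit_ratio
FROM pg_statio_user_tables;
```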
A low cache hit ratio may indicate the cause of high IO on your server. Looking at your cache hit ratio alongside your query plans will help you gauge whether a little more cache will help fit queries in memory or if the queries themselves need to be optimized.
Storage—high storage utilization means you’re only a spike away from trouble
Of all the metrics, storage is the one I would not want to see consistently sitting at a high percentage. If storage is filling up and you're sure it's not due to increased inserts from your workload, double-check autovacuum. Vacuuming is how Postgres tidies up its multiversion concurrency control (MVCC) system. If autovacuum isn't keeping up with the rate of change in a database, dead rows accumulate and tables bloat. Dead rows can also slow down your queries. Use this SQL query to identify potential bloat:
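```sql
-- Live versus dead tuples per table; a high n_dead_tup relative to n_live_tup suggests bloat.
SELECT relid, relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables;
```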
The pg_stat_user_tables view also provides information about how recently and how frequently autovacuum has run on your tables. You can customize autovacuum settings for your whole server or individually for busy tables that need vacuuming more frequently.
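For instance, you can check when (auto)vacuum last ran on each table, and then tighten the settings for an especially busy table (the table name below is just a placeholder):

```sql
-- When and how often vacuum and autovacuum have run, never-vacuumed tables first.
SELECT relname, last_vacuum, last_autovacuum, vacuum_count, autovacuum_count
FROM pg_stat_user_tables
ORDER BY last_autovacuum NULLS FIRST;

-- Example of per-table tuning for a hypothetical busy table: trigger autovacuum
-- once 2% of its rows are dead instead of the default 20%.
ALTER TABLE my_busy_table SET (autovacuum_vacuum_scale_factor = 0.02);
```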
If you use replication or logical decoding, growing storage utilization can be a sign that the Postgres write-ahead log (WAL), which is how changes are shipped to replicas and decoding clients, is not being consumed and therefore cannot be deleted. If your server is busy, this can quickly fill up storage.
Check the lag in bytes of your replicas/standbys; on Postgres 10 and later, a query along these lines will show it:
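```sql
-- Replication lag in bytes for each standby (Postgres 10 and later).
SELECT application_name,
       client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```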
or on Postgres 9.6 and below:
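```sql
-- The same calculation with the pre-10 function and column names.
SELECT application_name,
       client_addr,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
FROM pg_stat_replication;
```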
And see if you have any unused replication slots, since an inactive slot holds on to WAL until it is dropped; a query along these lines will list them:
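```sql
-- Replication slots that no consumer is currently connected to.
SELECT slot_name, slot_type, database, active, restart_lsn
FROM pg_replication_slots
WHERE NOT active;
```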
Alerts—keeping an eye on metrics for you
Maybe you already knew or agree that CPU, memory, IO, and storage utilization are four important metrics to keep your eye on. In that case, you definitely have alerts configured to monitor these, right? No…? Why not?
You probably don't have someone hired to sit all day staring at metric graphs watching for danger. Alerts are that 'someone' you can hire. You should be using a monitoring framework with your Postgres server that offers alert configuration. (Azure Monitor is that framework for Azure Postgres users).
Now that you’ve resolved to use alerts, how should you decide the thresholds and trigger points? A basic approach would be to set an alert at 85% utilization for all four metrics. That is at least better than having no alerts at all. You’ll be notified (just) before things go haywire and maybe you’ll be able to get ahead of them. Maybe.
But consider how quickly you can respond to an alert—both the time it takes to see the alert and the length of time needed for resolving the issue. You want to be notified with enough time to troubleshoot and mitigate before the situation goes from bad to worse.
To set even better alerts, make them fit your Postgres server’s utilization. Start by looking back at a week to a month's worth of metrics. Understand what your utilization looks like when things are normal versus when there's something wrong. Pay attention to averages, but also observe whether your metrics tend to have sharp spikes or sustained peaks.
Don’t agonize long over this though. Set up some alerts and you can fine-tune them later as you observe whether they are working for you.
---
Perhaps after taking Postgres’s temperature and troubleshooting you’ll find out that there’s no rogue process—your Postgres server is just busier than it used to be, and needs more CPU, memory, IO, or storage to keep going. By understanding these four system metrics, you’ll be able to provision the right amount of resources to meet your workload’s growth.