PostgreSQL: Making Index Creation Faster

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

This blog post is your guide to speeding up index creation in Azure Database for PostgreSQL - Flexible Server. It covers practical tips and strategies for index management, from tuning server parameters for optimal performance to cleaning up excess space on tables.

 

Together, these practices ensure a smooth and efficient index creation experience. You will also learn the importance of monitoring and resolving potential blockers that could stall index creation, as well as how to time these operations strategically to minimize their impact on other database activities.

 

Reading this post will arm you with the knowledge and tools needed to avoid common pitfalls and keep your databases running efficiently. Whether you are a seasoned database administrator or just starting out, these insights will help you improve database performance through effective index management.

 

Here are the best practices covered in this post; each one is explained in detail below.

  1. Two important parameters to tune
    • maintenance_work_mem
    • Parallelizing the index creation process
  2. Cleaning table bloat

  3. Identifying and Resolving Blockers

  4. Scheduling index creation

 

Index Creation

The following best practices will help you expedite the index creation process.

 

1. Two important parameters to tune

 

maintenance_work_mem  

maintenance_work_mem controls the amount of memory allocated to maintenance tasks. The default value for this parameter in Azure Database for PostgreSQL - Flexible Server varies by SKU, and the upper limit is '2097151kB' (2 GB). You can increase the parameter value through the Azure portal to speed up maintenance activities such as adding indexes or vacuuming.


Here is an example showing the impact of maintenance_work_mem. Using the code below, you can create an index on a 4 GB table with the following structure.

 

Create the table:

CREATE TABLE IF NOT EXISTS test_data
( a1 integer, b1 integer, c1 integer, d1 integer, e1 text, f1 text, g1 date );

Load roughly 4 GB of data into the table:

INSERT INTO test_data
SELECT n, n, n, n, 'abcdefghijklmnopqrstuvwxyz', 'abcdefghijklmnopqrstuvwxyz', now()
FROM generate_series(1,20000000) n;

Check the size of the table:

SELECT relname AS table_name, pg_size_pretty(pg_relation_size(relid)) AS "Actual Size"
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

postgres=# \timing
Timing is on.
postgres=# SHOW maintenance_work_mem;
 maintenance_work_mem
----------------------
 274MB
(1 row)

Time: 0.254 ms
postgres=# CREATE INDEX idx_test_data_id ON test_data(a1, b1, c1, d1);
CREATE INDEX
Time: 21987.553 ms (00:21.988)
postgres=> SET maintenance_work_mem='2097151kB';
SET
Time: 44.165 ms
postgres=# SHOW maintenance_work_mem;
 maintenance_work_mem
----------------------
 2097151kB
(1 row)

Time: 0.229 ms
postgres=# CREATE INDEX idx_test_data_id ON test_data(a1, b1, c1, d1);
CREATE INDEX
Time: 19396.686 ms (00:19.397)

 

Creating the index with maintenance_work_mem='274MB' (the default value for a 4 vCore SKU) took around 21s 988ms, while the same operation with maintenance_work_mem='2097151kB' (2 GB) took 19s 397ms.

 

The Azure portal does not accept a value of '2097152kB' (2 GB) when changing this server parameter, which is why '2097151kB' is used above. If you set the parameter within a session, you do not need to convert to these units.
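As a sketch of the session-level approach (reusing the test_data table and index name from the example above), standard memory units work directly:

```sql
-- Session-level override: applies only to the current connection,
-- and accepts standard units like '2GB' (unlike the portal).
SET maintenance_work_mem = '2GB';

-- Verify the effective value before building the index.
SHOW maintenance_work_mem;

-- The index build in this session can now use up to 2 GB for sorting.
CREATE INDEX idx_test_data_id ON test_data (a1, b1, c1, d1);
```

A session-level SET is also the safer choice when only one large build needs the extra memory, since it avoids raising the limit for every maintenance task on the server.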

 

Parallelizing the index creation process

 

Parallel index creation is supported in PostgreSQL 11 and later. It is controlled by the parameter max_parallel_maintenance_workers, which sets the number of workers available to parallelize maintenance tasks. Increasing this parameter improves index creation time in most cases. The demonstration below shows the runtime with max_parallel_maintenance_workers set to 2 and 8 respectively.

 

postgres=> SET max_parallel_maintenance_workers=2;
SET
Time: 26.144 ms
postgres=# SHOW max_parallel_maintenance_workers;
 max_parallel_maintenance_workers
----------------------------------
 2
(1 row)

postgres=# CREATE INDEX idx_test_data_id ON test_data(a1, b1, c1, d1);
CREATE INDEX
Time: 21987.553 ms (00:21.988)
postgres=> SET max_parallel_maintenance_workers=8;
SET
Time: 25.960 ms
postgres=# SHOW max_parallel_maintenance_workers;
 max_parallel_maintenance_workers
----------------------------------
 8
(1 row)

postgres=# CREATE INDEX idx_test_data_id ON test_data(a1, b1, c1, d1);
CREATE INDEX
Time: 12818.791 ms (00:12.819)
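Note that the number of workers actually launched is also capped by cluster-wide limits; raising max_parallel_maintenance_workers alone may not help if the other settings are lower. A quick sketch for inspecting the relevant limits (standard PostgreSQL parameters, no Azure-specific assumptions):

```sql
-- Workers used for a parallel index build are bounded by all of these:
SHOW max_parallel_maintenance_workers;  -- per-operation cap for maintenance tasks
SHOW max_parallel_workers;              -- cluster-wide cap on parallel workers
SHOW max_worker_processes;              -- total background worker slots available
```

If CREATE INDEX launches fewer workers than expected, check whether the cluster-wide caps are the limiting factor.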

 

 

2. Cleaning table bloat

 

Before creating an index, make sure your table isn't wasting space (bloat), which accumulates when rows are deleted or updated. Clearing out this unnecessary space by running a VACUUM operation on the table helps organize its storage.

 

VACUUM (ANALYZE, VERBOSE) <tablename>;

 

You can introduce bloat on the table using the UPDATE script below, which generates roughly 18% bloat if you followed the data load in the example above. Autovacuum runs only once bloat exceeds autovacuum_vacuum_scale_factor, which defaults to 0.2 (20%).
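Autovacuum's trigger point can be computed explicitly: it vacuums a table once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * live tuples. A sketch that surfaces that threshold for the test table:

```sql
-- Dead tuples needed before autovacuum kicks in for the table:
-- threshold = autovacuum_vacuum_threshold
--             + autovacuum_vacuum_scale_factor * n_live_tup
SELECT relname,
       n_live_tup,
       n_dead_tup,
       current_setting('autovacuum_vacuum_threshold')::int
         + current_setting('autovacuum_vacuum_scale_factor')::float * n_live_tup
         AS autovacuum_trigger_point
FROM pg_stat_user_tables
WHERE relname = 'test_data';
```

This explains why the 18% bloat introduced below does not trigger autovacuum on its own: it stays just under the default 20% scale factor.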

 

UPDATE test_data SET g1=now() WHERE a1 <= 3600000;

 

Here is the query to check the dead-tuple percentage (dead_pct) for the table.

 

SELECT schemaname, relname, n_dead_tup, n_live_tup,
       round(n_dead_tup::float/n_live_tup::float*100) AS dead_pct,
       autovacuum_count, last_vacuum, last_autovacuum, last_autoanalyze, last_analyze
FROM pg_stat_user_tables
WHERE n_live_tup > 0 AND relname = 'test_data'
ORDER BY n_live_tup DESC;

 

You can now compare running the CREATE INDEX query with 18% bloat and with 0% bloat. First, with 18% bloat:

 

postgres=# SELECT schemaname, relname, n_dead_tup, n_live_tup,
postgres-#        round(n_dead_tup::float/n_live_tup::float*100) AS dead_pct,
postgres-#        autovacuum_count, last_vacuum, last_autovacuum, last_autoanalyze, last_analyze
postgres-# FROM pg_stat_user_tables
postgres-# WHERE n_live_tup > 0
postgres-# ORDER BY n_live_tup DESC;
-[ RECORD 1 ]----+------------------------------
schemaname       | public
relname          | test_data
n_dead_tup       | 3672951
n_live_tup       | 19927074
dead_pct         | 18
autovacuum_count | 5
last_vacuum      |
last_autovacuum  | 2024-02-26 23:17:24.984945+00
last_autoanalyze | 2024-02-26 23:17:25.538443+00
last_analyze     |

Time: 28.900 ms
postgres=# CREATE INDEX idx_test_data_id ON test_data(a1, b1, c1, d1);
CREATE INDEX
Time: 17547.189 ms (00:17.547)
postgres=# VACUUM (ANALYZE, VERBOSE) test_data;
INFO:  vacuuming "public.test_data"
INFO:  scanned index "idx_test_data_id" to remove 3600000 row versions
DETAIL:  CPU: user: 1.93 s, system: 0.15 s, elapsed: 2.97 s
INFO:  "test_data": removed 3600000 row versions in 48000 pages
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.67 s
INFO:  index "idx_test_data_id" now contains 20000000 row versions in 76901 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "test_data": found 0 removable, 11903000 nonremovable row versions in 206707 out of 314667 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 2640
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 2.75 s, system: 1.18 s, elapsed: 6.36 s.
INFO:  vacuuming "pg_toast.pg_toast_17462"
INFO:  "pg_toast_17462": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 2640
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  analyzing "public.test_data"
INFO:  "test_data": scanned 30000 of 314667 pages, containing 1903250 live rows and 0 dead rows; 30000 rows in sample, 19962999 estimated total rows

 

Now run the CREATE INDEX command again with the table at 0% bloat.

 

postgres=# SELECT schemaname, relname, n_dead_tup, n_live_tup,
postgres-#        round(n_dead_tup::float/n_live_tup::float*100) AS dead_pct,
postgres-#        autovacuum_count, last_vacuum, last_autovacuum, last_autoanalyze, last_analyze
postgres-# FROM pg_stat_user_tables
postgres-# WHERE n_live_tup > 0
postgres-# ORDER BY n_live_tup DESC;
-[ RECORD 1 ]----+------------------------------
schemaname       | public
relname          | test_data
n_dead_tup       | 0
n_live_tup       | 19999803
dead_pct         | 0
autovacuum_count | 1
last_vacuum      |
last_autovacuum  | 2024-02-21 22:29:49.513376+00
last_autoanalyze | 2024-02-21 22:29:49.930922+00
last_analyze     |

Time: 32.305 ms
postgres=# CREATE INDEX idx_test_data_id ON test_data(a1, b1, c1, d1);
CREATE INDEX
Time: 12873.053 ms (00:12.873)

 

Executing the CREATE INDEX command with 18% bloat took around 17s 547ms to complete, whereas without bloat the index creation completed in 12s 873ms.

 

3. Identifying and Resolving Blockers

 

Monitor the index creation process to make sure nothing is blocking it. You can run the following query to see whether there are any blocking processes and which queries they are running:

 

To demonstrate a blocker, run the INSERT query from the beginning of this post in one session while trying to create an index at the same time in a second session.

 

Session 1:

INSERT INTO test_data
SELECT n, n, n, n, 'abcdefghijklmnopqrstuvwxyz', 'abcdefghijklmnopqrstuvwxyz', now()
FROM generate_series(1,20000000) n;

Session 2:

CREATE INDEX idx_test_data_id ON test_data(a1, b1, c1, d1);

Session 3:

postgres=> SELECT activity.pid, activity.usename, activity.query,
postgres->        blocking.pid AS blocking_id, blocking.query AS blocking_query
postgres-> FROM pg_stat_activity AS activity
postgres-> JOIN pg_stat_activity AS blocking ON blocking.pid = ANY(pg_blocking_pids(activity.pid));
-[ RECORD 1 ]--+------------------------------------------------------------
pid            | 5849
usename        | postgres
query          | CREATE INDEX idx_test_data_id ON test_data(a1, b1, c1, d1);
blocking_id    | 27582
blocking_query | INSERT INTO test_data                    +
               | SELECT n,                                +
               | n,                                       +
               | n,                                       +
               | n,                                       +
               | 'abcdefghijklmnopqrstuvwxyz',            +
               | 'abcdefghijklmnopqrstuvwxyz',            +
               | now() FROM generate_series(1,20000000) n;
Time: 30.898 ms
postgres=>

 

As shown in the output above, you will have to end the 'blocking_id' process to allow the CREATE INDEX to finish its operation.

 

To do that, use the command below to terminate the blocking process.

 

SELECT pg_terminate_backend(27582);

 

This removes the obstacle and lets the index creation proceed; resolving any blockers you find promptly keeps index creation fast.
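If several sessions hold conflicting locks, you can terminate every blocker of the stalled CREATE INDEX in one statement. A sketch, reusing pid 5849 from the example above (substitute the pid of your own index-building session):

```sql
-- Terminate every backend currently blocking pid 5849 (the CREATE INDEX session).
-- pg_blocking_pids() returns the array of pids holding conflicting locks.
SELECT pg_terminate_backend(blocker)
FROM unnest(pg_blocking_pids(5849)) AS blocker;
```

Be deliberate about this: pg_terminate_backend rolls back whatever the blocking session was doing, so confirm the blocking_query first.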

 

To monitor the progress of the index creation, you can use the query below.

 

SELECT * FROM pg_stat_progress_create_index;
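The view reports per-phase block counts; a sketch that turns them into a rough completion percentage for the current phase:

```sql
-- Approximate progress of a running CREATE INDEX, by phase.
-- nullif() avoids division by zero in phases that report no block totals.
SELECT pid,
       phase,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct_blocks_done
FROM pg_stat_progress_create_index;
```

Note the percentage resets as the build moves between phases (scanning the table, sorting, building the index), so watch the phase column alongside it.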

 

4. Scheduling Index creation

 

Pick a time when the server is not busy to create the index. Running the CREATE INDEX operation when the workload is low reduces its effect on other database operations and speeds up the index creation itself.

 

Conclusion

 

As a database administrator, you can follow the above best practices to create PostgreSQL indexes in an efficient and optimized way, improving the performance of your database and the speed of data retrieval.

 

If you have any questions, don't hesitate to reach out to our Postgres team at Microsoft at Ask Azure DB for PostgreSQL.
