Using partitioned tables or multiple tables to improve big deletes performance

This post has been republished via RSS; it originally appeared at: Azure Database Support Blog articles.

 

There are cases in which the time It takes to perform the delete operations are higher than what is acceptable, even if there are no resources constraint and the execution plan is good. In such a cases, we may consider reviewing the process.

 

I will show two options to achieve this, one using multiple tables and other using a partitioned table, which will not only improve the performance, but also reduce transaction log consumption, table fragmentation and eliminate the need to scale up the database and reduce the costs or prevent increasing It.

 

I have provided some examples, but please keep in mind you should implement your own process and test It in a non-production environment first.

 

For a simple demonstration, we will consider a table called ‘Log’ which may contain several columns, but the ones we are most interested in are: ‘ID’, that is an int identity(1,1) column and ‘insert_time’, which is a datetime column that stores when the data was inserted. The only index present is the primary key PK_Log on ID. We will perform the clean-up task once a month and delete everything that is older than 2 months.

 

  1. Store the data by month in different tables

 

We could switch between 3 tables once a month and truncate the one that contains the older data. We can have a view with the union all of the 3 tables if necessary.

 

If the application only writes to this table and It is only read manually when necessary, it may not be necessary application changes. However, if the application also needs to read the data, it may be necessary to make a few changes to have It read from the view or It would only show the most recent data (< 1 month). We need to take this in consideration when performing this change.

 

EXAMPLE

 

--Initial setup:

--

--1. Create two additional tables to store the older data per month, as for example: Log_1month, Log_2month

--

--2. Grant permissions to the new tables

--

--3. Create a new index for insert_time on the tables.

--CREATE INDEX Log_insert_time ON Log (insert_time);

--CREATE INDEX Log_1month_insert_time ON Log_1month (insert_time);

--CREATE INDEX Log_2month_insert_time ON Log_2month (insert_time);

--

--4. Create a view to select the 3 tables if necessary, as for example:

--CREATE VIEW dbo.Log_select

--AS

--select * from dbo.[Log]

--UNION ALL

--select * from dbo.[Log_1month]

--UNION ALL

--select * from dbo.[Log_2month]

--

--5. Update Statistics 

 

DECLARE @minDate DATE;

DECLARE @limitDate DATE;

 

--We will remove everything that is older than 2 months, so that is going to be the limitDate:

SET @limitDate = cast(DATEADD(month,-2,CURRENT_TIMESTAMP) AS DATE);

 

--Checking what is the newest data in the table log_2month:

SELECT @minDate = max(insert_time) from dbo.log_2month;

 

print @minDate

print @limitDate

 

--If the table log_2month has data newer than 2 months, which is the retention period, the process is not executed

IF (@minDate <= @limitDate or @minDate is NULL)

BEGIN

       truncate table dbo.[Log_2month]

 

       BEGIN TRANSACTION

              EXEC sp_rename 'dbo.Log_2month','Log_new'; 

              EXEC sp_rename 'dbo.Log_1month','Log_2month';

              EXEC sp_rename 'dbo.Log','Log_1month';

              EXEC sp_rename 'dbo.Log_new','Log';

 

              --Change the identity of the table to continue from the ID the other one was

              declare @currentSeedValue int;

              declare @changeIdent_sql nvarchar(max);

 

              set @currentSeedValue = IDENT_CURRENT( 'dbo.Log_1month' );

              set @changeIdent_sql = 'DBCC CHECKIDENT (''dbo.Log'', RESEED, ' + cast(@currentSeedValue as varchar) + ')';

 

              exec sp_executesql @changeIdent_sql;

       COMMIT;

END

ELSE

       print 'Please double check you need to run this process, as It might have been executed recently!'

 

 

  1. Partition the current table or create a new partitioned table.

 

It is important to point out that, since on Azure SQL DB you do not have control on where the data is physically stored, we can only configure the partitions to use the primary filegroup.

 

If we partition by insert_time, keeping the primary key in the ID, we will need to first recreate the primary key as nonclustered. It would also require to, every time we perform the clean-up process, drop the primary key before performing the truncate and recreate It afterwards, because It will not be a partitioned index, so It won’t support this operation. In terms of efficiency, I believe this is not the best option, so I recommend partitioning by ID.

 

If we partition by the ID, it will not be as precise as the insert_time when we need to perform the truncate. We might have to leave a partition with data that has already reached the retention period, because It might also have newer data. The amount of data will depend on the range of each partition, so this can be mitigated by having smaller ranges.

 

We could partition the current table or create a new partitioned table: The creation of a new partitioned table would avoid the maintenance window, but It would require to grant permissions and rename the tables, so the new one can assume the place of the old one and the application start writing to It. We could then keep the old table for the historical data until the retention period is reached.

 

EXAMPLE (Partitioning the existing table)

 

--I create the partitions for every 100.000 records, but you can reduce the range as much as you would like

CREATE PARTITION FUNCTION [PF_PARTITIONBYID](int) AS RANGE RIGHT

FOR VALUES ('100000','200000','300000','400000','500000','600000', '700000','800000','900000','1000000',

            '1100000','1200000','1300000','1400000','1500000', '1600000','1700000','1800000','1900000','2000000');

 

--As It's not possible to manage where the data will be physically stored on Azure SQL DB, you have to set all to primary

CREATE PARTITION SCHEME [PS_PARTITIONBYID]

AS PARTITION [PF_PARTITIONBYID]

ALL TO ([PRIMARY]);

 

--You can recreate the PK as the clustered partitioned index

alter table dbo.[Log] DROP CONSTRAINT [PK_Log]

 

ALTER TABLE dbo.[Log] ADD CONSTRAINT [PK_Log] PRIMARY KEY CLUSTERED  (id)

   WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,

         ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON PS_PARTITIONBYID(id)

 

--You can create a partitioned index on insert_time and id to make the search for the partitions to remove/truncate faster. 

CREATE INDEX Log_insert_time ON Log (insert_time, id);

 

--To check the partitions

SELECT SCHEMA_NAME(o.schema_id) + '.' + OBJECT_NAME(i.object_id) AS [object]

       , p.partition_number

       , i.index_id

       , p.rows

       , CASE boundary_value_on_right

              WHEN 1 THEN 'less than'

              ELSE 'less than or equal to'

          END AS comparison

       , rv.value

FROM sys.partitions p

INNER JOIN sys.indexes i

ON p.object_id = i.object_id

AND p.index_id = i.index_id

INNER JOIN sys.objects o

ON p.object_id = o.object_id

INNER JOIN sys.partition_schemes ps

       ON ps.data_space_id = i.data_space_id

INNER JOIN sys.partition_functions f

       ON f.function_id = ps.function_id

INNER JOIN sys.destination_data_spaces dds

       ON dds.partition_scheme_id = ps.data_space_id AND dds.destination_id = p.partition_number

INNER JOIN sys.filegroups fg

       ON dds.data_space_id = fg.data_space_id

LEFT OUTER JOIN sys.partition_range_values rv

       ON f.function_id = rv.function_id AND p.partition_number = rv.boundary_id

WHERE o.object_id = OBJECT_ID('dbo.Log')

 

 

Note: For a new table, it would be basically the same steps, since we will first need to create a copy, without data, of the old one. The only thing that would change would be the name of the table in which we would be creating the partition and the rename of both tables at the end.

 

EXAMPLE (Truncating the partition)

 

--Get the list of partitions to be removed, based on the 2 months retention period

 

DECLARE _id int;

declare @truncate_sql nvarchar(max)

declare @merge_sql nvarchar(max)

SELECT _id = max(id) from dbo.Log where insert_time < cast(DATEADD(month,-2,CURRENT_TIMESTAMP) As Date);

 

SELECT @truncate_sql = 'TRUNCATE TABLE dbo.[Log] WITH (PARTITIONS(' + CAST(min(partition_number) AS VARCHAR) + ' TO ' + CAST(max(partition_number) AS VARCHAR) + '))'

, @merge_sql = 'ALTER PARTITION FUNCTION [PF_PARTITIONBYID]() MERGE RANGE (' + cast(max(rv.value) as varchar) + ')'

FROM sys.partitions p

INNER JOIN sys.indexes i

ON p.object_id = i.object_id

AND p.index_id = i.index_id

INNER JOIN sys.objects o

ON p.object_id = o.object_id

INNER JOIN sys.partition_schemes ps

       ON ps.data_space_id = i.data_space_id

INNER JOIN sys.partition_functions f

       ON f.function_id = ps.function_id

INNER JOIN sys.destination_data_spaces dds

       ON dds.partition_scheme_id = ps.data_space_id AND dds.destination_id = p.partition_number

INNER JOIN sys.filegroups fg

       ON dds.data_space_id = fg.data_space_id

LEFT OUTER JOIN sys.partition_range_values rv

       ON f.function_id = rv.function_id AND p.partition_number = rv.boundary_id

WHERE i.index_id < 2 AND o.object_id = OBJECT_ID('dbo.Log') AND rv.value <= _id

 

print 'max_id: ' + cast(_id as varchar)

print ''

print 'truncate command: ' + @truncate_sql

print ''

print 'merge command:' + @merge_sql

 

--I will leave the executions commented, so there is no risk of running by mistake

--exec sp_executesql @truncate_sql

--exec sp_executesql @truncate_sql

 

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.