Lesson Learned #482: Identifying Potential Duplicate Statistics

This post has been republished via RSS; it originally appeared at: Azure Database Support Blog articles.

Some time ago, we encountered a support case where a customer experienced significant delays in updating auto-created and user-created statistics. I would like to share the insights gained from this experience, including the underlying causes of the issue and the potential solutions we identified and implemented to address the problem effectively.

The Scenario

Consider the following table:

CREATE TABLE EmployeesStats ( EmployeeID INT PRIMARY KEY IDENTITY(1,1), LastName VARCHAR(255), FirstName VARCHAR(255), Department VARCHAR(255) );

We added rows to this table and rebuilt the clustered index:

-- Insert sample data BEGIN TRANSACTION; DECLARE @i int = 0; WHILE @i < 1000000 BEGIN INSERT INTO EmployeesStats (LastName, FirstName, Department) VALUES (CONCAT('LastName', @i), CONCAT('FirstName', @i), CASE WHEN @i % 5 = 0 THEN 'Engineering' ELSE 'Marketing' END); SET @i = @i + 1; END; COMMIT TRANSACTION; ALTER INDEX [PK__Employee__7AD04FF149345346] ON EmployeesStats rebuild

To obtain the total number of rows filtered by the Engineering department, we executed the following T-SQL:

select Department, COUNT(*) FROM EmployeesStats WHERE Department = 'Engineering' group by Department

With auto-create statistics enabled in our database, SQL Server created an auto-created statistic using the Department column for this table.

Following this, we found that the performance was not optimal, and upon reviewing the execution plan, we discovered that creating an index on the Department column would improve performance by approximately 82%. Consequently, the index was created:

-- Creating a nonclustered index CREATE NONCLUSTERED INDEX [Index_Departament] ON [dbo].[EmployeesStats] ([Department]);

At this point, we observed the following:

Two statistics were present: _WA_Sys_00000004_09C5A64A and Index_Departament.
The execution plan would utilize and update (if needed, with auto-update statistics enabled) the Index_Departament statistics.
The _WA_Sys_00000004_09C5A64A statistic would never be used.
However, when our customer used sp_updatestats or another tool that required updating all auto/user-created statistics, this statistic would be recalculated, which could be time-consuming depending on the number of rows.

To address this, we advised our customer to check for potentially duplicated statistics and to review each one to determine whether it could be deleted. As an example, we provided the following T-SQL to identify all statistics for this table where a column is involved in any statistics, allowing for later review and decision-making on whether to retain or delete them.

It's important to note that manual review is often the best approach to deciding on the maintenance of statistics, as a column may be used in multiple definitions. Additionally, it's crucial to test any changes in a test environment before implementation to avoid potential performance issues:

WITH StatsColumns AS ( SELECT s.object_id, s.stats_id, s.name AS StatisticName, s.auto_created, s.user_created, c.name AS ColumnName, sc.stats_column_id FROM sys.stats s INNER JOIN sys.stats_columns sc ON s.stats_id = sc.stats_id AND s.object_id = sc.object_id INNER JOIN sys.columns c ON sc.column_id = c.column_id AND sc.object_id = c.object_id ), DuplicatedStats AS ( SELECT sc1.object_id, sc1.ColumnName, STRING_AGG(sc1.StatisticName, ', ') WITHIN GROUP (ORDER BY sc1.StatisticName) AS StatisticsNames, COUNT(*) AS NumberOfDuplicatedStats FROM StatsColumns sc1 GROUP BY sc1.object_id, sc1.ColumnName HAVING COUNT(*) > 1 ) SELECT OBJECT_NAME(ds.object_id) AS TableName, ds.ColumnName, ds.StatisticsNames, ds.NumberOfDuplicatedStats FROM DuplicatedStats ds WHERE ds.object_id = OBJECT_ID('EmployeesStats') ORDER BY TableName, ds.ColumnName;

Here's an example of the output

Enjoy!

The Scenario

Leave a Reply Cancel reply