Sampling can produce less accurate statistics if the data is not evenly distributed

This post has been republished via RSS; it originally appeared at: SQL Server Support articles.

First published on MSDN on Jul 09, 2010

Recently I worked with a very knowledgable customer who called in and wanting to know things about statistics.  This is because he noticed that his query would get inaccurate cardinality estimate due to ‘inaccurate histogram.    Specifically, he has questioned why AVG_RANGE_ROWS would be very high when he did 10% sampling.  But it became very low (almost distinct) when he used 100% sampling.

In order to illustrate the issue, let me create a fake table and populate data using this script:

create database dbStats
go
go
alter database dbstats set recovery simple
go
use dbStats
go
create table t(c1 uniqueidentifier)

go
set nocount on
begin tran
declare @i int = 0
declare @id1 uniqueidentifier = newid()
while @i< 8000000
begin

declare @id uniqueidentifier
if @i % 100 = 0
set @id = NEWID()
insert into t values (NEWID())
insert into t values (@id)
if @i < 2000000
insert into t values (@id1)
set @i +=1
if (@i % 100000 = 0)
begin
commit tran

Technorati Tags: Performance


begin tran
end
end
commit tran
go

create index ix on t(c1)
go

If you update the statistics with 10% sampling, you will get histogram 1 (below).  But if you update statistics with 100% sampling, you will get histogram 2.  Note that one major difference is that AVG_RANGE_ROWs are much higher in histogram 1 than in histogram 2.

In fact, as you increase sampling rate from 10% to a larger number, the AVG_RANGE_ROWS  will gradually decrease.  First of all, AVG_RANGE_ROWS basically means for any value within a histogram step, how many duplicates are there.   If you have a value of 2 for AVG_RANGE_ROWS, it means for any given value within the histogram step,  it will have 2 duplicates.   SQL Server optimizer uses this to do cardinality estimate for the values falling within a histogram step.

In order to explain what’s going on, let’s take a look at data first.   The data is constructed in a way that is not evenly distributed.   The column has 8 million  distinct values that only appear once.  There is one value that appears  2 million times.  Then there are 80,000 values that appear 100 times within the table.

So overall, the data is very selective.  Out of 18 million rows, there are more than  distinct 8 million values.   The customer’s argument is that for any given value that falls within a histogram, SQL really should estimate less than 2 rows.  AVG_RANGE_ROWS should be less than 2 rows.

When you do 100% sampling, the AVG_RANGE_ROWS is 1.8 or 2.  So it’s accurate.  But with 10% sampling, most AVG_RANGE_ROWS is 18 (much higher).

The reason is that the data is not evenly distributed.   If the data is truly evenly distributed, 10% sampling and 100% sampling will produce similar results.   But if some values appear way more than other values, sampling produce less accurate results.  This is because the more frequent values will have higher chance of being selected to compute statistics. At final stage, the values are scaled up to produce the ‘inflated’ statistics.   This eventually ends up with a statistics histogram that tells SQL Server that data is less selective than really is.

What’s the solution?

There are a couple of things. If you can afford fullscan (100%) or increasing sampling, do that.   If you can’t, you may have to rely in index hints for some queries.

Histogram 1 (10% sampling)

RANGE_HI_KEY                         RANGE_ROWS    EQ_ROWS       DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
------------------------------------ ------------- ------------- -------------------- --------------
ADA7D301-B912-4927-BC84-0000730876EE 0             1             0                    1
1D4D7A35-9E7B-46E7-BF87-01B486B2166B 113919.4      1005.627      6442                 17.6839
2F22F6DE-A893-4826-A8B8-04E5A31C5819 187018.4      1005.627      11932                15.67406
2602AD30-5365-4DE6-BEFB-07E79CB221F0 193601.4      1005.627      11077                17.47708
ADC703C1-2CD6-4107-9177-09268149C7C6 73038.45      1005.627      4597                 15.88914
968BE242-C378-4FDA-9A6F-0B883FCBAB52 169844.2      1005.627      9031                 18.80737
87B47694-B775-4414-B638-103D45D87C84 303451.8      1005.627      17569                17.27227
7DE3FC0A-2A74-4C90-B0E3-12177F27B7F2 110587.5      1005.627      6711                 16.47944
7B9DB839-7F96-46A6-821D-130BC4345A90 66637.23      1005.627      3668                 18.16596
BFC1684B-EB83-4D20-B0DF-150805AC5A8E 109729.3      1005.627      7341                 14.9474

Histogram 2 (100% sampling)

RANGE_HI_KEY                         RANGE_ROWS    EQ_ROWS       DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
------------------------------------ ------------- ------------- -------------------- --------------
7972667E-CC57-42F4-A414-00000190A98E 0             1             0                    1
CFF7C764-41AA-4E89-B802-00AA81D1D512 39129         100           21012                1.862222
473C80D5-A9CC-4186-B4B3-093CCCEAD064 538744        100           270949               1.988359
61E3899D-74CD-4090-9E76-0A031A54EFD6 44973         100           24480                1.837132
E857ABD0-DA6D-4367-92C7-1D98186AF634 1224175       100           617701               1.981825
8863C50F-160D-4DD3-A1C4-1E24D9DE79D3 31413         100           17454                1.799759
3AF1D87E-805E-49B6-9870-301658472C85 1109208       100           565698               1.960778
20BEE350-A654-43F3-BB84-30A978964057 33231         100           17985                1.847706
443AC49E-FE4B-4E32-9FF9-47A667FF1E0F 1451375       100           725309               2.001044
5BAFE28B-6AF7-42D8-BB9E-48C3D1FB3F7B 67121         100           35342                1.899185
FE43EE54-4D31-4CCF-B231-4D6E02B3F852 295111        100           147502               2.000726
8EC2F5D1-50BC-4452-8F9A-4DFD046D3E5B 32713         100           17566                1.862291
C636084E-7EC5-48B6-A273-5451C9EE8269 393640        100           199600               1.972144
75DE7AB9-EEE9-4426-A907-54E4F00BA1C1 33538         100           18292                1.833479
5F5A4E68-3573-4EB3-8ABC-5ACA7ABE7C73 361499        100           185873               1.944871
48BC6F49-D712-4B7B-A91D-5BC822179B2F 65751         100           31497                2.087532
B22C40FE-1F99-47F8-872D-ACFB95246E52 5078911       100           2564212              1.980691

Jack Li |  Senior Escalation Engineer | Microsoft SQL Server Support


Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.