Today, I worked on a service request where our customer wanted to download a large result set from Azure SQL Database to on-premises. The table has several blob columns (XML, varchar(max), text) and millions of rows. In this situation, I would like to share with you several tests that I ran and how I reduced the download time.
Initial points
- Try to increase the packet size in your connection string to a higher value (see the example after this list).
- Instead of the Proxy connection policy, use the Redirect connection policy to improve the connection.
- When using redirection, remember to use the latest drivers, because some older drivers are not able to use redirection.
- As this is pure data processing, if possible use the Premium or Business Critical tier to reduce the I/O latency.
- On-premises, try to place the data and log files in different locations to improve the I/O.
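As an illustration of the first point, here is a minimal sketch of how a C# application could raise the packet size, assuming the Microsoft.Data.SqlClient driver; the server, database and credential values are placeholders:

```
using Microsoft.Data.SqlClient;

class ConnectionFactory
{
    // Placeholder server, database and credentials.
    public static SqlConnection Create()
    {
        var builder = new SqlConnectionStringBuilder
        {
            DataSource     = "yourserver.database.windows.net",
            InitialCatalog = "yourdatabase",
            UserID         = "youruser",
            Password       = "yourpassword",
            // Larger network packets help when moving large blob values;
            // the client default is 8000 bytes.
            PacketSize     = 16384
        };
        return new SqlConnection(builder.ConnectionString);
    }
}
```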
In Azure SQL Database, I created a table and filled it with data:
- Basically, I created a table with an ID column plus five large NAME columns holding the blob data (a sketch of the structure is shown after this list).
- Running a data-generation INSERT query multiple times, I got around 7 million rows (also sketched after this list).
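A minimal sketch of what the source table and the data-generation step could look like is below. The column names come from the queries later in the post, and for simplicity every NAME column is declared as VARCHAR(MAX), although the real table mixed XML, varchar(max) and text columns; the connection string is a placeholder.

```
using Microsoft.Data.SqlClient;

class SetupSourceTable
{
    // Assumed structure: ID plus five large NAME columns.
    const string CreateTableSql = @"
        CREATE TABLE dbo.Destination
        (
            ID    INT NOT NULL PRIMARY KEY,
            NAME1 VARCHAR(MAX) NULL,
            NAME2 VARCHAR(MAX) NULL,
            NAME3 VARCHAR(MAX) NULL,
            NAME4 VARCHAR(MAX) NULL,
            NAME5 VARCHAR(MAX) NULL
        );";

    const string SeedRowSql = @"
        INSERT INTO dbo.Destination (ID, NAME1, NAME2, NAME3, NAME4, NAME5)
        VALUES (1, REPLICATE('A', 4000), REPLICATE('B', 4000),
                REPLICATE('C', 4000), REPLICATE('D', 4000), REPLICATE('E', 4000));";

    // Each execution doubles the row count by re-inserting the existing rows
    // with shifted IDs, so a handful of runs produces millions of rows.
    const string DoubleRowsSql = @"
        INSERT INTO dbo.Destination (ID, NAME1, NAME2, NAME3, NAME4, NAME5)
        SELECT ID + (SELECT MAX(ID) FROM dbo.Destination),
               NAME1, NAME2, NAME3, NAME4, NAME5
        FROM dbo.Destination;";

    static void Main()
    {
        using var conn = new SqlConnection("<azure-sql-connection-string>"); // placeholder
        conn.Open();

        new SqlCommand(CreateTableSql, conn).ExecuteNonQuery();
        new SqlCommand(SeedRowSql, conn).ExecuteNonQuery();

        for (int i = 0; i < 23; i++)   // 2^23 is roughly 8.4 million rows; adjust as needed
            new SqlCommand(DoubleRowsSql, conn) { CommandTimeout = 0 }.ExecuteNonQuery();
    }
}
```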
On-premises:
- I developed a small C# application that implements 3 different processes:
- The first process read the whole table from Azure SQL Database and downloaded the data using bulk copy, but the time spent was high; the transfer rate I saw was only about 100-200 mb/s (a sketch of this baseline process is shown after this list).
- The second process compressed the data with the COMPRESS function before downloading it (see the second sketch after this list). Basically, the idea was:
- Create a staging table, [_M$_Destination_X], with the same columns as the source but typed to hold the compressed (VARBINARY) values.
- Execute the query INSERT INTO [_M$_Destination_X] (ID,NAME1,NAME2,NAME3,NAME4,NAME5) SELECT ID,COMPRESS(NAME1) AS NAME1,COMPRESS(NAME2) AS NAME2,COMPRESS(NAME3) AS NAME3, COMPRESS(NAME4) AS NAME4, COMPRESS(NAME5) AS NAME5 FROM Destination
- Download the compressed data using bulk copy.
- Uncompress the data at the destination by running the following T-SQL: INSERT INTO [Destination] (ID,NAME1,NAME2,NAME3,NAME4,NAME5) SELECT ID,DECOMPRESS(NAME1) AS NAME1,DECOMPRESS(NAME2) AS NAME2,DECOMPRESS(NAME3) AS NAME3, DECOMPRESS(NAME4) AS NAME4, DECOMPRESS(NAME5) AS NAME5 FROM [_M$_Destination_X]
- The second process worked very well, because compressing and decompressing the data let me work around the network bottleneck. But it ran on a single thread; what happens when there are millions and millions of rows? In this situation, I modified the source code to use a configurable number of threads (for example, running 10 threads in parallel, each one reading 150,000 rows), as in the third sketch after this list:
- Every thread read 150,000 rows using the following T-SQL (each thread with its own OFFSET value): INSERT INTO [_Tmp100] (ID,NAME1,NAME2,NAME3,NAME4,NAME5) SELECT ID,COMPRESS(NAME1) AS NAME1,COMPRESS(NAME2) AS NAME2,COMPRESS(NAME3) AS NAME3, COMPRESS(NAME4) AS NAME4, COMPRESS(NAME5) AS NAME5 FROM Destination ORDER BY ID OFFSET 0 ROWS FETCH NEXT 150000 ROWS ONLY
- Using bulk copy, I transferred the data to the on-premises server.
- Finally, by running the following query, I was able to uncompress the data and save it into the destination table: INSERT INTO [Destination] (ID,NAME1,NAME2,NAME3,NAME4,NAME5) SELECT ID,DECOMPRESS(NAME1) AS NAME1,DECOMPRESS(NAME2) AS NAME2,DECOMPRESS(NAME3) AS NAME3, DECOMPRESS(NAME4) AS NAME4, DECOMPRESS(NAME5) AS NAME5 FROM [_Tmp100]
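Below are three sketches of how the processes described above could be coded. They are illustrations only, assuming Microsoft.Data.SqlClient, the table names used in this post and placeholder connection strings, not the exact source of the application. First, the baseline process: stream the whole table with a SqlDataReader and write it locally with SqlBulkCopy.

```
using Microsoft.Data.SqlClient;

class BaselineDownload
{
    static void Main()
    {
        const string azureConn = "<azure-sql-connection-string>";   // placeholder
        const string localConn = "<on-premises-connection-string>"; // placeholder

        using var source = new SqlConnection(azureConn);
        source.Open();

        // Stream every row (including the large blob columns) from Azure SQL Database.
        using var cmd = new SqlCommand(
            "SELECT ID, NAME1, NAME2, NAME3, NAME4, NAME5 FROM dbo.Destination", source)
            { CommandTimeout = 0 };
        using var reader = cmd.ExecuteReader();

        // Bulk copy the stream into the on-premises table.
        using var bulk = new SqlBulkCopy(localConn)
        {
            DestinationTableName = "dbo.Destination",
            BulkCopyTimeout = 0,
            BatchSize = 10000,
            EnableStreaming = true   // avoid buffering the blobs in memory
        };
        bulk.WriteToServer(reader);
    }
}
```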
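Second, the compressed variant. For brevity this sketch streams the COMPRESS output straight into the on-premises staging table instead of materializing it in an intermediate table on the Azure side first, as the original steps describe; the assumed structure of the staging table is shown in the comment.

```
using Microsoft.Data.SqlClient;

class CompressedDownload
{
    /* Assumed staging table on-premises (COMPRESS returns VARBINARY(MAX)):
       CREATE TABLE [_M$_Destination_X]
       (
           ID    INT NOT NULL,
           NAME1 VARBINARY(MAX) NULL, NAME2 VARBINARY(MAX) NULL,
           NAME3 VARBINARY(MAX) NULL, NAME4 VARBINARY(MAX) NULL,
           NAME5 VARBINARY(MAX) NULL
       ); */

    // Compress the blob columns on the Azure side so fewer bytes cross the network.
    const string CompressSelect = @"
        SELECT ID, COMPRESS(NAME1) AS NAME1, COMPRESS(NAME2) AS NAME2,
               COMPRESS(NAME3) AS NAME3, COMPRESS(NAME4) AS NAME4,
               COMPRESS(NAME5) AS NAME5
        FROM dbo.Destination";

    // Decompress locally into the final table (a CAST back to the original
    // data types may be needed, depending on the column definitions).
    const string DecompressInsert = @"
        INSERT INTO dbo.Destination (ID, NAME1, NAME2, NAME3, NAME4, NAME5)
        SELECT ID, DECOMPRESS(NAME1), DECOMPRESS(NAME2), DECOMPRESS(NAME3),
               DECOMPRESS(NAME4), DECOMPRESS(NAME5)
        FROM [_M$_Destination_X]";

    static void Main()
    {
        const string azureConn = "<azure-sql-connection-string>";   // placeholder
        const string localConn = "<on-premises-connection-string>"; // placeholder

        using (var source = new SqlConnection(azureConn))
        {
            source.Open();
            using var cmd = new SqlCommand(CompressSelect, source) { CommandTimeout = 0 };
            using var reader = cmd.ExecuteReader();

            using var bulk = new SqlBulkCopy(localConn)
            {
                DestinationTableName = "[_M$_Destination_X]",
                BulkCopyTimeout = 0,
                EnableStreaming = true
            };
            bulk.WriteToServer(reader);
        }

        // One pass on the local server expands the compressed rows into the destination table.
        using var local = new SqlConnection(localConn);
        local.Open();
        new SqlCommand(DecompressInsert, local) { CommandTimeout = 0 }.ExecuteNonQuery();
    }
}
```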
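Third, a sketch of the parallel, chunked variant. Here the fan-out is done with Parallel.For rather than hand-managed threads: each worker pulls one 150,000-row chunk, compressed on the Azure side with OFFSET/FETCH, and bulk copies it into a single local staging table. The original application used a configurable thread count and its own staging-table naming.

```
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

class ParallelCompressedDownload
{
    const string azureConn = "<azure-sql-connection-string>";   // placeholder
    const string localConn = "<on-premises-connection-string>"; // placeholder

    static void Main()
    {
        const int chunkSize = 150_000;
        const int totalRows = 7_000_000;            // approximate source row count
        int chunks = (totalRows + chunkSize - 1) / chunkSize;

        // Cap the number of concurrent downloads (10 parallel readers in the test above).
        var options = new ParallelOptions { MaxDegreeOfParallelism = 10 };

        Parallel.For(0, chunks, options, chunk =>
        {
            // Each worker compresses and fetches its own 150,000-row window.
            string query = $@"
                SELECT ID, COMPRESS(NAME1) AS NAME1, COMPRESS(NAME2) AS NAME2,
                       COMPRESS(NAME3) AS NAME3, COMPRESS(NAME4) AS NAME4,
                       COMPRESS(NAME5) AS NAME5
                FROM dbo.Destination
                ORDER BY ID
                OFFSET {chunk * chunkSize} ROWS FETCH NEXT {chunkSize} ROWS ONLY";

            using var source = new SqlConnection(azureConn);
            source.Open();
            using var cmd = new SqlCommand(query, source) { CommandTimeout = 0 };
            using var reader = cmd.ExecuteReader();

            // Append the chunk to the local compressed staging table.
            using var bulk = new SqlBulkCopy(localConn)
            {
                DestinationTableName = "[_Tmp100]",
                BulkCopyTimeout = 0,
                EnableStreaming = true
            };
            bulk.WriteToServer(reader);
        });

        // A final pass decompresses everything into the destination table.
        using var local = new SqlConnection(localConn);
        local.Open();
        new SqlCommand(@"
            INSERT INTO dbo.Destination (ID, NAME1, NAME2, NAME3, NAME4, NAME5)
            SELECT ID, DECOMPRESS(NAME1), DECOMPRESS(NAME2), DECOMPRESS(NAME3),
                   DECOMPRESS(NAME4), DECOMPRESS(NAME5)
            FROM [_Tmp100]", local) { CommandTimeout = 0 }.ExecuteNonQuery();
    }
}
```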
In the end, I was able to reduce the time spent on this process by hours. I also gathered other lessons learned on the on-premises side to speed up the process and reduce PAGEIOLATCH contention, but that will be for another post.
Enjoy!