Maximizing Throughput with TVP

This post has been republished via RSS; it originally appeared at: DataCAT articles.

First published on MSDN on Sep 23, 2013

Authored by AzureCAT team

As of September 1, 2013, we decided to retire the SQLCAT.COM site and use MSDN as the primary vehicle for posting new SQL Server content. This was done to minimize reader confusion and to streamline content publication. The MSDN SQLCAT blogs already include most SQLCAT.COM content and will continue to be updated with more SQLCAT learnings. You can also find a collection of our work in the SQLCAT Guidance eBooks.

To make the transition to MSDN smoother, we are in the process of reposting a few of SQLCAT.COM's blogs that are still very actively viewed. Following is a reposting of one of the SQLCAT.COM blogs that still draws significant interest. You can find this and other guidance in the SQLCAT's Guide to Relational Engine eBook.

Also, follow us on Twitter, as we normally use our handles @SQLCAT and @WinAzureCAT to announce news and new content.

Author: Steve Howard

Technical Reviewers: Silvano Coriani, Sanjay Mishra, James Podgorski, Mark Simms


This technical note looks at whether to use the SqlBulkCopy API or table-valued parameters (TVPs) in a customer scenario encountered as part of a CAT engagement. Which is better depends on several considerations discussed below. TVPs offer performance optimizations that other bulk operations do not allow, and these optimizations may allow TVP performance to exceed that of other bulk operations by an order of magnitude, especially for a pattern where subsets of the data are frequently updated.

Executive Summary

TVPs and MERGE operations make a powerful combination for minimizing round trips and inserting and updating batches of data with high throughput. Parallel operations on naturally independent sets of data can be performed efficiently this way. The TVP makes optimizations possible that are not possible with bulk insert or other operation types. To get the most out of the operation, you must optimize your underlying table as well as your method for inserting and updating the data. The principles followed in this case emphasize these points:

·       Do not create artificial keys with IDENTITY when it is not necessary; doing so creates a point of contention under heavy, parallel insert operations.

·       If old data key values will not expire, use a MERGE operation instead of DELETE and INSERT. This minimizes data operations (rebalancing and page splits) and the amount of data that must be replicated. If old data key values will expire, then test two operations, a MERGE followed by a deletion of only the expired keys, rather than a DELETE and INSERT of the full data set.

·       If not all the data will be changed, modify the “WHEN MATCHED” portion of the MERGE statement to also check that the data that may change has changed, and only update the data that is actually changed. This minimizes the number of rows of data that are actually modified, and thus minimizes the amount of data that must be replicated to secondaries in Windows Azure SQL Database environments.

Although these are best practices in any environment, they become increasingly important in a shared environment such as Windows Azure SQL Database.
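The "MERGE followed by a deletion of only the expired keys" pattern from the second bullet can be sketched as follows. This is a minimal illustration only: the table, type, and column names are borrowed from the examples later in this post, and @DeviceID is an assumed parameter identifying the device being processed.

```sql
-- Assumes dbo.Table1(DeviceID, SubCondition, Value, SubValue) and a
-- table-valued parameter @TableParam holding the incoming set for one device.
MERGE dbo.Table1 AS target
USING @TableParam AS source
    ON target.DeviceID = source.DeviceID
   AND target.SubCondition = source.SubCondition
WHEN MATCHED THEN
    UPDATE SET Value = source.Value, SubValue = source.SubValue
WHEN NOT MATCHED THEN
    INSERT (DeviceID, SubCondition, Value, SubValue)
    VALUES (source.DeviceID, source.SubCondition, source.Value, source.SubValue);

-- Then remove only the expired keys: rows for this device that no
-- longer appear in the incoming set.
DELETE t
FROM dbo.Table1 AS t
WHERE t.DeviceID = @DeviceID
  AND NOT EXISTS (SELECT 1
                  FROM @TableParam AS s
                  WHERE s.SubCondition = t.SubCondition);
```

This touches only the rows that actually changed or expired, instead of deleting and re-inserting the full set.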


In a recent engagement, a performance problem was encountered while inserting and updating data in Windows Azure SQL Database. The scenario was:

·       Data from hundreds of thousands of devices needs to be stored in Windows Azure SQL Database

·       Each device stores approximately 8000 rows of configuration data across three tables in the database

·       Data for each device is updated approximately once per day

·       Only the most current data is stored in Windows Azure SQL Database

·       The data must be processed at a sustained rate of six devices per second (approximately 48,000 rows per second)

The first approach tried was to delete the data for each device first, then use the BulkCopy API to insert the new rows. Several worker role instances were used to scale out the processing of the data into the database. However, when run against an Azure SQL Database, this did not deliver the performance the scenario demanded.

The second approach was to use Table Valued Parameters (TVPs) with stored procedures to do the processing. In the stored procedures, the data was validated first. Next, all existing records were deleted for the device being processed, and then the new data was inserted. This did not perform better than the previous bulk insert option.
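For readers unfamiliar with the mechanics, a TVP is passed to a stored procedure through a user-defined table type. A minimal sketch of the pattern follows; the type, procedure, and column names here are simplified stand-ins, not the engagement's actual objects.

```sql
-- Hypothetical table type describing one batch of device rows.
CREATE TYPE dbo.DeviceRowType AS TABLE
(
    DeviceID     BIGINT         NOT NULL,
    SubCondition NVARCHAR(4000) NOT NULL
);
GO

-- The procedure receives the whole batch as a single READONLY parameter,
-- so validation and the data modification happen in one round trip.
CREATE PROCEDURE dbo.ProcessDeviceRows
    @Rows dbo.DeviceRowType READONLY
AS
BEGIN
    SELECT COUNT(*) AS RowsReceived FROM @Rows;
END
GO

-- Calling from T-SQL. From ADO.NET, a DataTable or IEnumerable<SqlDataRecord>
-- is bound to the parameter with SqlDbType.Structured.
DECLARE @r dbo.DeviceRowType;
INSERT INTO @r VALUES (1, N'ConditionA'), (1, N'ConditionB');
EXEC dbo.ProcessDeviceRows @Rows = @r;
```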

We were able to improve the process to meet the performance demands by optimizing the tables themselves and the stored procedures, in order to minimize the lock contention, latch contention, and Windows Azure SQL Database-specific contention the process initially encountered.

Optimizing the Process

Several optimizations were made to this process.

First, the underlying tables contained identity columns, and data needed to be inserted in sets from several different processes. This created both latch and lock contention. Latch contention arose because each insert was performed only on the last page of the index, and several processes were trying to insert into the last page simultaneously. Lock contention arose because the identity column was the primary key and clustered index key, so every process had to wait for its identity value to be generated, and then only one at a time could insert. To remedy this contention, other values within the data were used as the primary key. In our example, we found a composite key of DeviceID and SubCondition in one table, and a combination of three columns in the second and third tables, that could be used to maintain entity integrity. Since the IDENTITY column was not really necessary, it was dropped.
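Last-page insert contention of this kind typically surfaces as PAGELATCH_EX waits. A quick diagnostic sketch (not part of the original engagement) is to check the instance wait statistics:

```sql
-- PAGELATCH_EX concentrated on a few hot pages is the classic signature of
-- concurrent inserts all landing on the last page of an ever-increasing key.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGELATCH%'
ORDER BY wait_time_ms DESC;
```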

An example of the optimization of the table is:

Original Table Definition:

CREATE TABLE dbo.Table1
(
    RecordID     BIGINT         NOT NULL IDENTITY(1, 1),
    DeviceID     BIGINT         NOT NULL,
    SubCondition NVARCHAR(4000) NOT NULL,
    Value        NVARCHAR(MAX)  NOT NULL,
    SubValue     TINYINT        NOT NULL,
    PRIMARY KEY CLUSTERED
    (
        RecordID ASC
    )
)

Optimized Table Definition:

CREATE TABLE dbo.Table1
(
    DeviceID     BIGINT         NOT NULL,
    SubCondition NVARCHAR(4000) NOT NULL,
    Value        NVARCHAR(MAX)  NOT NULL,
    SubValue     TINYINT        NOT NULL,
    PRIMARY KEY CLUSTERED
    (
        DeviceID ASC,
        SubCondition ASC
    )
)

The second suboptimal part of the described process was that the stored procedure deleted the old data for a device first, then re-inserted the new data for the device. This doubles the maintenance work on the data structure: deletes can trigger rebalancing operations on an index, and inserts can result in page splits as pages fill. Making two data modifications, each with implicit maintenance work, should be avoided when the data operation can be done with one operation. A MERGE operation can be used in place of a DELETE then INSERT, provided the updated set of data will not omit previous rows of data. In other words, this works if no SubConditions for any DeviceID in the example table will expire.

Any time data is modified in Windows Azure SQL Database, it must be replicated to two replicas. The DELETE then INSERT method was inefficient here as well, since both the delete and the insert operations must be replicated to the Azure SQL Database replicas. Using a MERGE with only the WHEN MATCHED and WHEN NOT MATCHED conditions eliminates this double operation, and thus eliminates half of the data replication, but it still modifies every row of data. In this scenario, at most 10% of the incoming data would actually differ from the existing data. By adding an additional check to the MATCHED condition so that it reads "WHEN MATCHED AND (source.Value <> target.Value OR source.SubValue <> target.SubValue)", only the rows that contained actual data differences were modified, which means that only the data that actually changed in the incoming set needed to be replicated to the secondaries. Making this modification minimized SE_REPL_SLOW_SECONDARY_THROTTLE, SE_REPL_ACK, and other SE_REPL_* wait types.
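In Azure SQL Database, these replication waits can be observed with the database-scoped wait statistics DMV. A diagnostic sketch (run before and after a change to compare):

```sql
-- High SE_REPL_* waits indicate the primary is being throttled while it
-- waits for secondary replicas to acknowledge replicated changes.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_db_wait_stats
WHERE wait_type LIKE 'SE_REPL%'
ORDER BY wait_time_ms DESC;
```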

The last area of optimization was to ensure efficient joining with minimal chance of contention among processes. This action was taken because the optimizer tended to scan both the source and target tables to perform a merge join when processing the MERGE operation. This was not only inefficient, but caused significant lock contention. To eliminate this contention, the query was hinted with "OPTION (LOOP JOIN)".

An example of the MERGE written to minimize the amount of data that must be processed into the tables is:

CREATE PROCEDURE [dbo].[TVPInsert_test]
    @TableParam TVPInsertType_test READONLY
AS
BEGIN
    MERGE dbo.Table1 AS target
    USING @TableParam AS source
        ON target.DeviceID = source.DeviceID
        AND target.SubCondition = source.SubCondition
    WHEN MATCHED AND
        (source.Value != target.Value
         OR source.SubValue != target.SubValue) THEN
        UPDATE SET Value = source.Value, SubValue = source.SubValue
    WHEN NOT MATCHED THEN
        INSERT (DeviceID, SubCondition, Value, SubValue)
        VALUES (source.DeviceID, source.SubCondition,
                source.Value, source.SubValue)
    OPTION (LOOP JOIN);
END


The table-valued type definition created for use with this stored procedure:

CREATE TYPE TVPInsertType_test AS TABLE
(
    DeviceID     BIGINT         NOT NULL,
    SubCondition NVARCHAR(4000) NOT NULL,
    Value        NVARCHAR(MAX)  NOT NULL,
    SubValue     TINYINT        NOT NULL,
    PRIMARY KEY CLUSTERED
    (
        DeviceID ASC,
        SubCondition ASC
    )
)

NOTE: Check the properties of joins before using join hints. Loop joins can explode in cost when scans, including range scans, are performed on the inner table (the table accessed second). However, the join in this MERGE is primary key to primary key. In this case, there is no chance of range scans on the inner table, and therefore the risk of cost explosion with the loop join is eliminated.

Testing the Optimization

Performance was tested by running multiple concurrent processes to process data. Elapsed time was measured as only the time it took to execute the three stored procedures for the in-processing of the new data. In different tests, the amount of data changed in each incoming data set varied so that measurements could be taken with 10%, 18%, 25%, or 100% modified data. Since 10% was determined to be the most that would ever be seen on any particular processing day, the changed percentage of 10% was used as the main indicator of the amount of improvement the optimizations would yield, and other percentages were used to give an indication of what might happen should an exceptional day produce much different data.

To test headroom, the tests were run with 4, 8, 12, and 16 concurrent processes. Eight was considered the number of worker roles that would normally be processing data, so this was the main test of record. Testing with 12 and 16 concurrent processes allowed us to determine whether adding worker roles improved or hurt throughput, and thus to evaluate whether bursts above the normal level of processing could be handled by scaling out the worker role tier.

In the tests, data was processed with no delay between data sets, and the elapsed time to process the data into the database was recorded. Initially, 8000 test runs were performed and statistics taken on them. However, the number of runs was reduced to 1000 when comparing just the original stored procedures with the optimized stored procedures, because the original method produced so much contention that the value of the optimizations became obvious with the smaller number of runs.
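A per-batch timing harness similar in spirit to these tests can be sketched in T-SQL. This is illustrative only; the actual measurements were taken from the worker-role processes, and the single-row batch here is a placeholder.

```sql
-- Build a (placeholder) batch using the table type defined above.
DECLARE @batch TVPInsertType_test;
INSERT INTO @batch VALUES (1, N'Condition1', N'Value1', 1);

-- Time one execution of the optimized procedure.
DECLARE @t0 DATETIME2 = SYSUTCDATETIME();
EXEC dbo.TVPInsert_test @TableParam = @batch;
SELECT DATEDIFF(MILLISECOND, @t0, SYSUTCDATETIME()) AS ElapsedMs;
```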

The comparison between stored procedures with 8 concurrent processes was:


[Table: average elapsed time, standard deviation, and MS/Device for the Original Stored Procedures and Tables versus the Optimized Stored Procedures and Tables; the numeric results did not survive republication.]

* MS/Device was calculated as the average time divided by the number of concurrent processes. It can be read as "On average, one device was processed every ____ milliseconds."
