This post has been republished via RSS; it originally appeared at: DataCAT articles.
First published on MSDN on Jul 20, 2018
Reviewed by: Kun Cheng, Borko Novakovic, Jovan Popovic, Denzil Ribeiro, Rajesh Setlem, Arvind Shyamsundar, Branislav Uzelac
In this article, we describe database storage architecture on Azure SQL Database Managed Instance (MI), for General Purpose (GP) instances specifically. We also provide a set of best practices to help optimize storage performance.
At the time of this writing (July 2018), Managed Instance, both General Purpose and Business Critical, is in public preview. Some aspects of MI storage architecture will likely change as MI progresses from the current preview to general availability and beyond. This article reflects the current state of the offering.
Database storage architecture on MI GP
MI GP uses Azure Premium Storage to store database files for all databases, except for the tempdb database. From the perspective of the database engine, this storage type is remote, i.e. it is accessed over the network, using Azure network infrastructure. To use Azure Premium Storage, MI GP takes advantage of SQL Server native capability to use database files directly in Azure Blob Storage . This means that there is not a disk or a network share that hosts database files; instead, file path is an HTTPS URL, and each database file is a page blob in Azure Blob Storage.
Since Azure Premium Storage is used, its performance characteristics, limits, and scalability goals fully apply to MI GP. The High-performance Premium Storage and managed disks for VMs documentation article includes a section describing Premium Storage disk limits . While the topic is written in the context of VMs and Azure disks, which is the most common usage scenario for Azure Premium Storage, the documented limits are also applicable to blobs. As shown in the limits table in the documentation, the size of the blob determines the maximum IOPS and throughput that can be achieved against the blob. For MI GP, this means that the size of a database file determines the maximum IOPS and throughput that is achievable against the file.
The disk/blob size shown in the limits table is the maximum size for which the corresponding limit applies. For example, a blob that is > 64 GB and <= 128 GB (equivalent to a P10 disk) can achieve up to 500 IOPS and up to 100 MB/second throughput.
The current implementation of MI GP does not use blobs smaller than 128 GB (P10). The system will use 128 GB (P10) blobs even for very small database files, to avoid negative performance impact that would be likely with smaller blob sizes (P4 and P6). Additionally, when allocating blobs in Premium Storage, MI always uses the maximum blob size within a storage performance tier. For example, if database file size is 900 GB, MI GP will use a 1 TB (P30) blob for that file. In other words, blob size is snapped up to the maximum size of each storage performance tier. If the file grows above that limit, the system automatically increases blob size to the maximum of the next performance tier. Conversely, if the file is shrunk below the maximum size of a performance tier, blob size is automatically reduced as well.
For billing purposes, MI uses the configured instance file size limit (8 TB or less), not the larger blob size allocated in Azure Premium Storage.
As mentioned earlier, the tempdb database is not using Azure Premium Storage. It is located on the local SSD storage, which provides very low latency and high IOPS/throughput. This article focuses on databases that use remote storage.
Managed Instance Business Critical (BC) instances do not use remote Azure Premium Storage, but instead use local SSD storage. Storage performance considerations discussed in this article do not apply to BC instances.
Azure Storage throttling
When the IOPS or throughput generated by the database workload exceed the limits of a database file/blob, storage throttling occurs. For MI GP instances, a typical symptom of storage throttling is high IO latency. It is possible to see IO latency spike to hundreds of milliseconds when being throttled, while without throttling, average storage latency at the OS level would be in the 2-10 millisecond range. Another symptom of storage throttling is long PAGEIOLATCH waits (for data file IO) and WRITELOG waits (for the transaction log file IO). Needless to say, when storage throttling is occurring, there is a substantial negative effect on database performance.
If you observe the symptoms of storage throttling when running workloads on MI GP, we recommend increasing database file size to get more IOPS/throughput from Azure Premium Storage. More specific recommendations, including a script to determine if storage IO generated by the database workload approaches Azure Premium Storage IOPS/throughput limits, are provided further in the article.
Storage limits on MI GP
There are two storage-related limits that should be considered when using MI GP.
The first limit is on the total size of all database files on the instance. System databases including tempdb are counted towards this limit, and all data files and transaction log files are considered. Currently, this limit is 8 TB per instance, though a customer can configure their instance to use a lower limit. Note that this limit applies specifically to file size as it appears in the size column in the sys.master_files DMV, not to the space actually used to store data within each file.
The second limit is the Azure Premium Storage limit on the maximum space in a storage account, which is currently limited to 35 TB. Each MI GP instance uses a single Premium Storage account.
The up to 8 TB file size limit implies that if there are many databases on the instance, or if a database has many files, then it may be necessary to reduce the size of individual database files to stay within this limit. In that case, the IOPS/throughput against these files would be reduced as well. The more files an MI instance has, the more pronounced is the impact of instance file size limit on storage performance. In the worst case, the size of individual database files may have to be reduced to the point where they all end up using 128 GB (P10) blobs, even if the workload could take advantage of better performance available with larger files.
Whether this limit is relevant for a particular workload depends on its storage performance needs. Many workloads will perform sufficiently well even with smaller database files. The IOPS/throughput that can be obtained for a particular file size can be determined in advance from the Azure Premium Storage limits table.
In the context of migration to MI, customers can measure actual IOPS/throughput on the source system being migrated, to see if it is necessary to increase database file size on MI GP to provide comparable storage performance. For SQL Server on Windows, use the Disk Transfers/sec and Disk Bytes/sec PerfMon counters, for IOPS and throughput respectively. For SQL Server on Linux, use iostat, looking at the sum of r/s and w/s values for IOPS, and the sum of rKB/s and wKB/s for throughput. Alternatively, the sys.dm_io_virtual_file_stats () DMF can be used as well. The script referenced in the best practices section below provides an example of using this DMF to determine IOPS and throughput per database file.
The fact that MI GP does not use blobs smaller than 128 GB (P10), coupled with the fact that the blob size is snapped up to the maximum size of its performance tier, can make the 35 TB limit relevant for some MI GP deployments. As described in documentation, when there are many database files with sizes smaller than the sizes of the blobs they use, then the 35 TB blob size limit may be reached sooner than the 8 TB file size limit. To see the total file and blob sizes for an instance, customers can use the following query. It returns a single row with two columns, one showing the total file size, where the 8 TB limit is relevant, and the other showing the total blob size, where the 35 TB limit is relevant:
WITH DatabaseFile AS
SELECT database_id AS DatabaseID,
CAST(size * 8. / 1024 / 1024 AS decimal(12,4)) AS FileSizeGB
SELECT SUM(FileSizeGB) AS FileSizeGB,
DatabaseID <> 2,
CASE WHEN FileSizeGB <= 128 THEN 128
WHEN FileSizeGB > 128 AND FileSizeGB <= 256 THEN 256
WHEN FileSizeGB > 256 AND FileSizeGB <= 512 THEN 512
WHEN FileSizeGB > 512 AND FileSizeGB <= 1024 THEN 1024
WHEN FileSizeGB > 1024 AND FileSizeGB <= 2048 THEN 2048
WHEN FileSizeGB > 2048 AND FileSizeGB <= 4096 THEN 4096
For example, this result set shows different file and blob total sizes for a MI GP instance:
Comparison with SQL Server in Azure IaaS VM
In a typical deployment of SQL Server on an Azure IaaS VM (SQL VM), database files are placed on Azure disks attached to the VM. Sometimes, a Storage Spaces pool (on Windows) or an LVM volume (on Linux) is created using multiple disks, to aggregate IOPS/throughput provided by multiple disks, and to increase storage size. There are notable differences between this model, and the model used by MI GP. For SQL VMs, available IOPS/throughput are shared among all database files using a disk (or a storage pool), whereas for MI GP, each file/blob gets its own allocation of IOPS/throughput.
Each model has its pros and cons. When there are multiple databases on the instance, MI GP makes it possible to provide a fixed allocation of IOPS/throughput per database file, avoiding the “noisy neighbor” effect of other databases on the instance. On the other hand, each database on a SQL VM can take advantage of higher aggregated IOPS/throughput provided by multiple disks, assuming that all databases do not create spikes in storage utilization at the same time.
One other important difference between MI GP and SQL VM is that the per-VM IOPS/throughput limits, documented for each VM type as Max uncached disk throughput: IOPS / MBps , do not apply to MI. When we talked about throttling earlier in this article, we referred specifically to Azure Premium Storage throttling, and not VM-level throttling. VM-level throttling can occur on SQL VM when cumulative IO requests against all Azure disks attached to the VM exceed the per-VM limits. However, MI GP does not use Azure disks for database files; therefore, the per-VM disk limits are not applicable, and MI instances do not experience VM-level throttling in the way that SQL VMs do.
Performance impact of database file size on MI GP – an example
To illustrate the performance impact of database file size on MI GP, we ran a series of tests using a generic OLTP workload on an 8 vCore instance with 8 TB storage. This example is not intended as a benchmark , and the numbers shown should not be taken as representative for MI GP in general . Storage performance is highly dependent on workload characteristics, and the results obtained using the same MI instance type and storage layout used in these tests, but a different workload, may be very different. These results are shown only to demonstrate the relative difference in performance when different file size configurations on MI GP are used.
In these tests, we used a database with 3 data files and 1 log file. In the initial test, the size of every file was 32 GB (as noted earlier, the system actually uses 128 GB P10 blobs in this case). In following tests, we first increased the size of the transaction log file, and then gradually increased the size of data files. Finally, we increased the size of all files to over 1 TB (to use 2 TB P40 blobs) to get maximum storage performance possible with this file layout. We observed the changes in workload throughput and behavior using metrics such as batch requests/second (BRPS), CPU utilization percentage, PAGEIOLATCH and WRITELOG wait percentage relative to all other non-ignorable waits, and write latency for data files and log file (measured using the sys.dm_io_virtual_file_stats () DMF).
At the storage level, the workload had a 60/40 read/write ratio with cold cache. The database was small enough to fit in memory, so during the test, that ratio changed to nearly 100% writes as most of the data pages became cached in the buffer pool. For this reason, we show only write latency in test results.
Test results are presented in the following table:
|File layout||Average BRPS||CPU utilization (%)||WRITELOG waits (%)||PAGEIOLATCH waits (%)||Average data write latency (ms/write)||Average log write latency (ms/write)|
|3 x 32 GB data files, 32 GB log file||1457||10||21||70||132||16|
|3 x 32 GB data files, 512 GB log file||1732||12||13||85||477||4.5|
|3 x 256 GB data files, 512 GB log file||2422||12||12||80||155||4.7|
|3 x 512 GB data files, 512 GB log file||2706||12||12||77||165||4.4|
|3 x 1.1 TB data files, 2 TB log file||7022||46||74||4||49||4.5|
We can see that for this workload, incrementally growing data and log files up to 512 GB in the first four tests provided gradual improvements. Using 512 GB files, workload throughput (BRPS) was almost twice as high as with the 32 GB files used initially. But increasing the size of all files from 512 GB to over 1 TB provided a 4.8x increase in workload throughput, compared to the initial test throughput. In this last test, removing the bottleneck in data file IO drastically increased both BRPS and CPU utilization, and shifted the dominant wait type from PAGEIOLATCH to WRITELOG as many more write transactions were processed.
To reiterate, the degree of performance improvement from increasing file size on MI GP is workload-dependent. For some workloads, sufficient performance levels can be achieved without using large files. Furthermore, other factors besides storage performance could affect the overall workload performance. For example, increasing file size on a CPU-constrained MI GP instance would not necessarily provide a significant performance increase.
Storage performance best practices for MI GP
In closing, we present a set of storage performance best practices for MI GP instances. These are not absolutes. We encourage the readers to understand the reasons behind these best practices that were discussed earlier in the article, and consider each recommendation in the context of their specific application/workload.
1. Determine if the storage IO generated by the database workload is exceeding Azure Premium Storage limits. A script that examines IOPS/throughput per database file over a period of time, and compares them against the limits, is available on GitHub. If storage IO is near the limits, allocate more IOPS/throughput for the database as described below.
2. When possible, increase space allocation at the MI instance level. This will allow you to increase the size of individual database files, resulting in higher IOPS/throughput limits per file.
3. Increase the size of individual database files as much as possible to utilize available instance space, even if a large portion of each file would remain empty. But consider Azure Premium Storage limits as well. For example, if the size of a file is 150 GB, then increasing it up to 256 GB would not provide any additional IOPS/throughput. MI already uses a 256 GB (P15) blob in this case. This larger file would take some of the instance storage space that may be better used to increase IOPS/throughput for other database files on the same instance. In this example, to get the advantage of the next storage performance tier for this file, it would be necessary and sufficient to increase file size to 257 GB. MI GP would then snap up the blob size for this file to 512 GB (P20).
4. For many write-intensive workloads, increasing the size of the transaction log file provides higher performance improvement than increasing the size of data files.
5. Make all data files equally sized with the same growth increment. While this recommendation is not specific to MI, it is particularly relevant here. On MI GP, a single small file may negatively impact performance of the entire workload, due to storage throttling becoming more likely for that small file.
6. Avoid using many small files for a database. While the aggregate IOPS/throughput from many small files may be comparable to the IOPS/throughput provided by fewer large files, each individual small file is more likely to encounter storage throttling, which affects the entire workload.
7. Consider not only increasing file size, but also adding data files. For example, consider an MI instance with a single database. If this database has a single data file, it can get at most 7500 IOPS for data IO, by making that file larger than 1 TB in size (a P40 or P50 blob would be used depending on the exact file size). But using N files, each larger than 1 TB, would provide up to 7500*N IOPS for data IO, and 7500 IOPS for log IO. For example, you could use six 1.1 TB data files, and one 1.1 TB log file. The total file size would be 7.7 TB, while the total blob size would be 14 TB, remaining within the 8 TB and 35 TB limits respectively. Keep in mind that simply adding empty files to an existing database would not improve performance for most workloads; it is also necessary to reallocate existing data over all files.
In this article, we show how database file size and layout can have a significant performance impact on MI GP instances. We provide recommendations to help customers achieve optimal storage performance, and to avoid common pitfalls. This article applies to General Purpose MI instances. The complexities and limitations we describe do not apply to Business Critical instances, which are recommended for mission critical workloads with high storage performance requirements. If sufficient storage performance cannot be achieved on a General Purpose instance even after following best practices described above, then switching to a Business Critical instance is recommended.
Update 2018-12-18: As of this writing, both the General Purpose and Business Critical MI service tiers implement log rate governance at the instance level. As documented in the Managed Instance resource limits article, this limits log throughput to 22 MB/s per instance for General Purpose instances, and to 3 MB/s per vCore for Business Critical instances. When the maximum instance log rate is reached, one of the dominant wait types will be INSTANCE_LOG_RATE_GOVERNOR. For write intensive workloads such as bulk data loads, this limits write throughput regardless of the transaction log blob size.
There are two potential workarounds to improve data load performance on MI. One is to use tempdb as a temporary staging area for data loads. Tempdb is not subject to instance log rate governance, and provides minimal logging as well. Another workaround is to use clustered columnstore indexes (CCI), which reduce the volume of transaction log writes due to high data compression.