First published on MSDN on Sep 05, 2017
Authored by Murshed Zaman
Reviewed by: John Hoang, Dimitri Furman, Mike Weiner, Sanjay Mishra
We often get questions from customers related to Azure SQL DW being a good fit for moving a particular database to the cloud. Some of the typical questions are:
I have a database on-premises. I would like to move it into the cloud because that seems to be the future direction. Can I use Azure SQL DW for it?
I am running out of storage space on my on-premises system. I would like to move my database to the cloud. Is Azure SQL DW an option?
My database size is bigger than what Azure SQL DB supports today, scaling out seems like a good idea. Can I use Azure SQL DW?
In this blog, we would like to clarify some of the concepts around RDBMS usage related to OLTP and OLAP workload, Symmetric Multiprocessing (SMP) and Massively Parallel Processing (MPP), workload patterns and anti-patterns, focusing on Azure SQL DW.
What a Data Warehouse Is Not
Data Warehouse (or OLAP) workload is very different than
online transaction processing
(OLTP) workload with very different indexing strategy and data access pattern.
Transaction processing systems, aka operational systems, or OLTP systems, can be categorized by real time transactions that perform atomic writes and reads . Think about an order processing systems where the backend is a database for a web ordering tool. A customer logs in, buys one or many items and checks out – the interaction between the application and the database during a user session is a series of transactions. Another example of transaction processing is the workload generated by the ATM machines of a bank. Airline ticketing system is another example of an OLTP system, or order processing system.
Usually used by hundreds if not thousands of users concurrently, OLTP systems are created for high volume inserts/updates/deletes and point lookup reads. An outage in your ordering system may be the cause of losing money, thus high availability of your application and backend database is a must.
What a Data Warehouse Is
On the other hand, traditionally data warehouse workloads are write once and read many times. The writes usually happen in a batch fashion, instead of many small inserts (like in OLTP systems). And the reads usually produce some sort of aggregated result, instead of producing individual records.
Called by many acronyms data warehouse (DW or DWH), or enterprise data warehouse (EDW), or Online Analytical Processing (OLAP) systems, such systems usually integrate data from many transactional systems for reporting, analysis, and decision support purposes. Data is extracted, transformed and loaded (this is known as an ETL process) from many disparate systems to create a data warehouse.
Data warehouses sometimes may have many stages, before data can be easily analyzed. Three commonly used stages that I have observed at customer site are described below:
Staging database store – Here data is copied from transactional systems, usually for the last day/hour. This is a temporary store for transactional data.
Operation data store (ODS) – transactional data from the staging store is validated and added to historical data unchanged or slightly modified from transactional systems.
Star Schema conformed databases – ODS data is transformed into “facts” and “dimensions”. On an entity-relationship diagram (ERD), a fact table appear in the middle, and the dimension tables are on the periphery, making it look like a star.
Other variations of this pattern of stages also exists, for example, ODS only or star schema only.
Workload patterns on a data warehouse can be characterized as:
Batch loading of data
Transformed into fact and dimension tables
Complex queries involving multiple table joins
Aggregates over a certain dimension key (example: date or customer)
Following are some examples of complex questions a data warehouse could answer:
How many customers ordered a certain item within a month to see if inventory levels are sufficient?
What day of week people withdraw the most amount of money from an ATM?
What was the cost of a promotional pricing (marketing) vs. how many tickets were sold to a certain destination (how much money was made) for the airline?
Azure SQL DW, MPP and SMP
Azure SQL DW is a Massively Parallel Processing (MPP)
service. It is a service because Microsoft maintains the infrastructure and software patching to make sure it’s always on up to date hardware and software on Azure. The service makes it easy for a customer to start loading their tables on day one and start running queries quickly and allows scaling of compute nodes when needed.
In an MPP database, table data is distributed among many servers (known as compute or slave nodes), and in many MPP systems shared-nothing storage subsystems are attached to those servers. Queries come through a head (or master) node where the location metadata for all the tables/data blocks resides. This head node knows how to deconstruct the query into smaller queries, introduce various data movement operations as needed, and pass smaller queries on to the compute nodes for parallel execution. Data movement is needed to align the data by the join keys from the original query. The topic of
data movement in an MPP system
is a whole another blog topic by itself, that we will tackle in a different blog. Besides Azure SQL DW, some other examples of a MPP data warehouses are Hadoop (Hive and Spark), Teradata, Amazon RedShift, Vertica, etc.
The opposite of MPP is SMP (Symmetric Multiprocessing) which basically means the traditional one server systems. Until the invention of MPP we had SMP systems. In database world the examples are traditional SQL Server, Oracle, MySQL etc. These SMP databases can also be used for both OLTP and OLAP purposes.
One quick way to remember the difference is that you can
an SMP system by adding processors with more CPU cores or faster CPU cores, add more memory, and use a faster I/O subsystem. For an MPP system you can
by adding more compute nodes (which have their own CPU, memory and I/O subsystems). There are physical limitations to scaling up a server at which point scaling out is more desirable depending on the workload.
Azure SQL DW Anti-Patterns
Before we discuss what workload is good for Azure SQL DW, let’s discuss what Azure SQL DW is not good for. Azure SQL DW is not a good match for the following scenarios:
High volume of small reads and writes
Frequent changing of schema
Row by row processing
JSON, XML data and Spatial, Struct, Array and Map data types
Power BI direct query requiring dashboard performance
High concurrency of queries (eg. hundreds of thousands of concurrent queries)
Small datasets (less than 250GB)
Disaster recovery with stringent RPO and RTO
We have already covered the first three points at the beginning of our blog. Let’s cover some more not so obvious points listed above.
Multi-Tenant Data Warehouse
Azure SQL DW is not a good fit for solutions that share the same data warehouse for multiple customers. Though Azure SQL DW allows separate schemas, development cost and complexity for a single database with separate schemas for multiple customers is quite high, can be a security nightmare if compromised. It is impossible to restore a single schema if something goes wrong for a particular customer. Also, SQL Server security features such as Always Encrypted, Row Level Security, Column Level Encryption are not present in Azure SQL DW as of this writing.
As PaaS service, Azure SQL DW makes it easy for customers to create and load their data in the data warehouse. Thus, currently it has a simple implementation of workload management that is not customizable. To be successful, a multi-tenant database would need a customizable workload management per tenant workload.
Frequent changing of schema
Azure SQL DW is a strongly typed RDBMS (Relational Database Management System). Like other traditional databases, it is a schema-on-write system where you create the schema first and then write the data. Reading this data back can be done using common SQL language. On the other extreme, schema-on-read allows loading the data first and then shifts the retrieval of data on to the developer who has to write code to retrieve the data.
Frequent changing of schema falls somewhere in between. There is operational complexity related to implementing frequent schema changes in Azure SQLDW and in other traditional RDBMS (SQL Server, Oracle, etc.). If the nature of your data warehouse is such that your schema changes frequently or your upstream systems are not standardized into a certain schema, you will have to conform to a common schema while doing transformation (ETL).
Row by row processing
Azure SQL DW is not good for row by row processing of data. As an example, if your join criteria in your query is such that you need a scalar UDF to match the column equality (or inequality) you will do row by row processing. In this case, you are not getting the benefit out of your MPP system which is tuned to favor set based operations.
JSON, XML data and Spatial, Struct, Array and Map data types
At the time of the writing of this blog, Azure SQL DW doesn’t support JSON or XML data types or functions. You will have to create a tabular structure (with schema) out of your JSON or XML data before you can load it into Azure SQL DW. Azure SQL DW also currently doesn’t support Spatial, Struct, Array and Map data types. If the prevalence of data for your data warehouse is JSON, XML or if you need the complex data types listed above, you are better off choosing some other database products.
High Concurrency and Dashboard queries
Azure SQL DW is not suitable for queries that come from a BI Dashboards (Power BI or Tableau) reports. This is mainly because dashboard queries require very low response time (often milliseconds to 1 or 2 seconds) and there are many queries that get executed to refresh a single BI dashboard.
Currently Azure SQL DW doesn’t have plan caching mechanism in place. Each query, whether it was run previously or not, has to go through plan search space to find an execution plan, before it starts executing. Coupled with that, if data movement is needed to satisfy a query, query time often diminishes the expectation of a dashboard query behavior. Also, not unlike other MPP systems, SQL DW has a limit to how many queries can run at the same time, which can also contribute to slower query return to dashboard reports.
If you need dashboard query response time, the recommendation is to create an SSAS or Azure AS or Tableau cube from the SQL DW data, to satisfy these dashboard query requirements.
Power BI direct query mode
Using Power BI direct query with Azure SQL DW is not recommended. The reasons are almost the same as in the previous section. Power BI direct query mode will perform poorly with large data size of Azure SQL DW.
The recommendation is again the same as the previous section. If the Power BI dashboard is going against summarized tables (small data) that is properly indexed and distributed, you may be able to run with direct query mode against Azure SQL DW. You may want to create these summarized table so that you do not utilize joins when returning data to Power BI dashboard. Always test with the highest volume of data to make sure if it will work with your reporting need.
Azure SQL DW is an MPP system which does a great job with lots of data and complex queries. You will not get the advantage of an MPP system if the size of your data in your data warehouse is too small (less than 250GB).
For data size of less than 250GB use SQL Server or Azure SQL DB.
Disaster Recovery with stringent RPO or RTO
Azure SQL DW does not have the capabilities to automatically replicate data to another hot/standby Azure SQL DW system for disaster recovery purposes. It however provides an automatic
of the data for every 24 hours. In case of a data center outage, you will be able to restore this backup to a
paired data center
. You may incur some data loss if you have updated your SQL DW tables after the last geo back up was created. Also, since this geo-redundant backup is copied to a Standard storage, depending on how much data you have, restore can take a long time. So if your data warehouse has stringent Recovery Point Objective (RPO) and Recovery Time Objective (RTO), Azure SQL DW may not be able to provide it to you.
Azure SQL DW Patterns
SQL DW is an MPP data warehouse service. It was created for data warehouse workload. In Azure SQL DW, complex jobs (queries or load) are broken down into pieces and executed in parallel. Thus, using Azure SQL DW, your large data loads and complex queries will finish faster. To gain performance from your data warehouse on Azure SQL DW, please follow the guidance around
table design pattern
loading patterns and
Scale out compute
If you have a data warehouse that has reached the limit of your SMP hardware (single server), you may be thinking of moving the warehouse to a more expensive and beefier hardware. Sometimes this hardware can be very costly with a support expiring after a number of years. You also have to plan for future capacity when procuring this hardware. Hardware procurement can take many months.
If that is the case and you want to save yourself time and headache, consider migrating to Azure SQL DW. Azure SQL DW allows independent scaling of compute. Compute nodes can be scaled up or down based on your need. As an example, if you are doing nightly ETL, you may want to consider scaling up your Azure SQL DW to finish ETL faster in a specified window, even if your data volume increases in the future.
If you need more processing power for a couple of days right after month-end, you can scale up your Azure SQL DW higher to finish work faster and then scale it back down for the rest of the month.
Similarly, Azure SQL DW can also be paused when not in use. If this fits your criteria, you will save money by pausing your Azure SQL DW when the users are not using it. To be clear here, the money that you save is for the compute engines which is the biggest cost on your bill, you still pay for the data that is on the storage. If you have Azure SQL DW dev/test systems, they often can be paused/resumed to save cost too.
If you have reached your on-premises or cloud storage capacity for your data warehouse workload, you may want to consider moving to Azure SQL DW. Azure SQL DW today is a petabyte scale data warehouse and with continuous improvement the size limitation will only go higher. Usually the larger the data size, the better experience you will have with Azure SQL DW when comparing to your SMP system.
Often, we have seen customers using their Azure SQL DW for consolidating their data warehouses or data marts in one database. This can simplify a DBAs job. It can also allow querying data that crosses the departmental data mart boundaries, if the need arises. Database consolidation can save you money.
For consolidation, you need to use schemas in a single database, rather than multiple databases, as containers for tables because Azure SQL DW doesn’t support cross database queries. Also, database consolidation brings the concern of query concurrency limit, and potential access control issues. Consider these issues and limits before you go too far down the path of consolidating your data warehouses (or marts) into one database.
ETL vs ELT
Azure SQL DW is fast at loading data when PolyBase is used because it loads data in parallel on all the nodes. It is often advised to load the data as is and then do the transform using set based TSQL language inside the data warehouse, where the data processing takes advantage of the MPP nodes. It often leads to performance gain when compared to loading using a SSIS or other similar ETL tools. This is because ETL solutions using SSIS and tools like it are often single threaded when transforming data. In this case, with PolyBase, extract-load-transform (ELT) is preferred over extract-transform-load (ETL).
If the source of the data is Hadoop or Data Lake which has gained popularity in the recent years, it is often the case that Hive or Spark is used as the transformation engine, loading processed data into SQL DW for fast access. The processing of long running transformation (often referred to as ETL) jobs in Hive or Spark is very useful, as the job processors have built-in resiliency and restartability.