Power Up with Power BI and Lakehouse in Azure Databricks: part 3 - Tuning Azure Databricks SQL


This is the third post in a 3-part blog series on Power BI with Azure Databricks SQL authored by Andrey Mirskiy and Diego Fanesi. Read Part 1: Power Up your BI with Microsoft Power BI and Azure Databricks Lakehouse: part 1 - Essentials and Part 2: Power Up your BI with Microsoft Power BI and Lakehouse in Azure Databricks: part 2 - Tuning Power BI

 

In the previous part of this series, we discussed Power BI optimization techniques that help your reports and dashboards perform better. In this part we focus on tuning your Delta Lake tables and Azure Databricks SQL Warehouses to achieve greater performance.

 

We previously discussed how to use Power BI efficiently on top of the Databricks Lakehouse. However, a well-designed and efficient Lakehouse is itself the foundation of overall performance and a good user experience. Below we cover recommendations for the physical layout of Delta tables, for data modeling, and for Databricks SQL Warehouses.

 

These tips and techniques have proven effective in our field experience. We hope you find them relevant for your Lakehouse implementations too.

 

Delta tables

Because your data model (tables and columns) is driven by business requirements and your data modeling approach (e.g. dimensional model, Data Vault), you do not have many opportunities to tune the model itself. However, you can achieve greater performance by optimizing the physical layout of your tables.

 

Use Delta

First of all, we recommend using Delta format for your tables in the Lakehouse. Delta provides performance optimizations such as data skipping, dynamic file pruning, and many others. 

Combined with the highly performant Photon engine, Delta tables deliver much better performance than plain Parquet tables.
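If you still have existing Parquet tables, they can be converted in place or copied into new Delta tables. Below is a minimal sketch; the table names sales_parquet and sales_delta are hypothetical.

-- Convert an existing (unpartitioned) Parquet metastore table to Delta in place.
CONVERT TO DELTA sales_parquet;

-- Or create a new Delta table from the Parquet data; on Databricks, Delta is the
-- default table format, so USING DELTA simply makes the choice explicit.
CREATE TABLE sales_delta
USING DELTA
AS SELECT * FROM sales_parquet;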

 

Partitioning

There is a common misconception that table partitioning always helps performance. While this has been true for years, or even decades, in the world of legacy on-premises data warehouses and in cloud data lakes based on Parquet files, it is not always the case with Delta tables. Delta Lake maintains table metadata that enables fast query performance even without partitioning in most cases. Therefore, we do not recommend partitioning tables smaller than 1 TB.
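As a quick sanity check before introducing partitions, you can look at the current table size reported in the Delta transaction log. A minimal sketch, using a hypothetical table name sales_fact:

-- sizeInBytes and numFiles in the output help judge whether the table
-- is anywhere near the ~1 TB threshold discussed above.
DESCRIBE DETAIL sales_fact;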

 

When you do apply partitioning to larger tables, keep each partition at 10 GB or more; partitions as large as 1 TB are still acceptable. This ensures that Delta OPTIMIZE and Z-Ordering can still compact your ingested data effectively: every partition should contain at least 10 active Parquet files.

 

Auto Generated Columns

When picking the right partition column, you may need to derive one from an existing column. For example, an events table may have a timestamp column while you only want to partition at the year, or year and month, level. In this case you would create a new computed column holding the year and month and partition by it.

 

The issue in this case is that users may filter only on the timestamp and omit an explicit filter on the partition column, ending up with a full table scan. You can mitigate this by using Delta auto generated columns (see the Databricks documentation on generated columns).

CREATE TABLE events (
  eventId BIGINT,
  data STRING,
  eventType STRING,
  eventTime TIMESTAMP,
  year INT GENERATED ALWAYS AS (YEAR(eventTime)),
  month INT GENERATED ALWAYS AS (MONTH(eventTime)),
  day INT GENERATED ALWAYS AS (DAY(eventTime))
)
PARTITIONED BY (eventType, year, month, day)

 

Note that the recommendation on minimum table size for partitioning still applies to the example above. We recommend estimating your partition sizes when deciding on a partitioning strategy, for example as sketched below.
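One rough way to do this is to count rows per candidate partition value and relate that to the overall table size from DESCRIBE DETAIL. A sketch based on the events table above:

-- Row counts per candidate partition; many very small or heavily skewed
-- partitions suggest choosing a coarser partitioning scheme.
SELECT eventType,
       YEAR(eventTime) AS year,
       COUNT(*)        AS row_count
FROM events
GROUP BY eventType, YEAR(eventTime)
ORDER BY row_count DESC;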

 

By using auto generated columns, you do not need to add processing logic for the extra column in your ingestion code, and Spark can deduce the derived value for the partition column even when queries filter only on the timestamp column.

 

File Size

By default, the Delta engine automatically tunes file size based on table size. Delta uses query history and table size to determine the best file size for your use case. However, if you have just created new tables for a proof of concept, Delta will not have enough data to optimize the file size right away. In that case you can consider tuning the file size manually.

 

There are a few considerations here. First, while Delta has no query history to consider yet, it can still see the table size, so you can use a simple table property to hint at the type of workload you run on the table. That is what delta.tuneFileSizesForRewrites does: when set to true, it tells Delta to optimize for frequent updates and deletes by selecting smaller file sizes.

 

ALTER TABLE mytable SET TBLPROPERTIES (delta.tuneFileSizesForRewrites = True);

 

Alternatively, you can set the target file size to a specific value. Below is an example of manually setting it to 32 MB (33554432 bytes).

 

ALTER TABLE mytable SET TBLPROPERTIES (delta.targetFileSize = 33554432);

 

Z-Order

Z-Ordering is another optimization technique, similar in spirit to database indexes but without creating additional files or data structures to parse and process. Instead, data within files is organized so that related values are colocated, which boosts the data skipping algorithm and speeds up queries at runtime. Below is an example of applying Z-Ordering to a table.

 

OPTIMIZE mytable ZORDER BY (joinkey1, predicate2);

 

Tables should be Z-Ordered on the columns most frequently used in WHERE or JOIN predicates. However, it is not recommended to use more than 5 columns for Z-Ordering. Data types also impact join performance: joining on string keys is less performant than joining on integers, even when Z-Ordering is applied.

 

Table Statistics

Adaptive Query Execution (AQE) uses table statistics to choose the proper join type and apply other query optimizations. Therefore, it is important to keep table statistics up to date, which can be done by running ANALYZE TABLE.

 

ANALYZE TABLE mytable COMPUTE STATISTICS FOR ALL COLUMNS;

 

However, Delta only collects statistics for the first 32 columns of a table. This means column ordering matters: make sure all columns used in WHERE clauses or joins are among the first 32 columns. Also remember that Z-Ordering should only be applied to columns within those first 32 columns.

 

You can use a table property to expand statistics collection beyond 32 columns. However, it should never be set to hundreds of columns, as that would make the Delta metadata significantly larger and more time consuming to process, affecting every query on the table. Below is an example of setting it to 40 columns.

 

ALTER TABLE mytable SET TBLPROPERTIES (delta.dataSkippingNumIndexedCols = 40);

 

Data model

 

Date/Time dimension

Power BI requires a date dimension table for date/time intelligence functions. Though Power BI offers multiple options for generating date tables, we recommend creating a persisted date table in Delta Lake. This approach enables Power BI to generate concise queries that Azure Databricks SQL can execute more efficiently.

 

create table default.date_dim as
with date_range as (
 select explode(sequence(to_date('2010-01-01'), current_date(), interval 1 day)) as date
)
select
 date,
 cast(date_format(date, "yyyyMMdd") as int) as date_key,
 cast(date_format(date, "yyyy") as int) as year,
 cast(date_format(date, "yyyyMM") as int) as year_month
from date_range

 

For the time dimension we recommend a separate Delta table at the required time granularity, e.g. hours only. Having separate date and time dimensions provides better data compression, better query performance, and more flexibility for end users.
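As an illustration, below is a minimal sketch of an hour-level time dimension, following the same pattern as the date dimension above; the table name default.time_dim is just an example.

create table default.time_dim as
with hour_range as (
 select explode(sequence(0, 23)) as hour
)
select
 hour,
 case when hour < 12 then 'AM' else 'PM' end as am_pm
from hour_range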

 

Aggregate tables

A common performance optimization technique is to pre-aggregate data and persist the results for faster queries in BI tools such as Power BI. In Azure Databricks there are several options for creating aggregate tables.

First, you can use the well-known CREATE TABLE AS SELECT statement in your data preparation pipelines; such tables naturally belong to the gold layer of your Lakehouse.

 

CREATE OR REPLACE TABLE q13_aggregated AS
SELECT Avg(ss_quantity)           AS avg_ss_quantity,
       Avg(ss_ext_sales_price)    AS avg_ss_ext_sales_price,
       Avg(ss_ext_wholesale_cost) AS avg_ss_ext_wholesale_cost,
       Sum(ss_ext_wholesale_cost) AS sum_ss_ext_wholesale_cost
FROM   tpcds_sf1_delta.store_sales,
       tpcds_sf1_delta.store,
       tpcds_sf1_delta.customer_demographics,
       tpcds_sf1_delta.household_demographics,
       tpcds_sf1_delta.customer_address,
       tpcds_sf1_delta.date_dim
WHERE  s_store_sk = ss_store_sk

 

You can also leverage Delta Live Tables (DLT) to create and maintain aggregate tables. DLT provides a declarative framework for building reliable, maintainable, and testable data processing pipelines, and it can maintain your materialized aggregate tables automatically.

CREATE LIVE TABLE q13 AS
SELECT Avg(ss_quantity)           AS avg_ss_quantity,
       Avg(ss_ext_sales_price)    AS avg_ss_ext_sales_price,
       Avg(ss_ext_wholesale_cost) AS avg_ss_ext_wholesale_cost,
       Sum(ss_ext_wholesale_cost) AS sum_ss_ext_wholesale_cost
FROM   tpcds_sf1_delta.store_sales,
       tpcds_sf1_delta.store,
       tpcds_sf1_delta.customer_demographics,
       tpcds_sf1_delta.household_demographics,
       tpcds_sf1_delta.customer_address,
       tpcds_sf1_delta.date_dim
WHERE  s_store_sk = ss_store_sk

 

Aggregate tables are especially beneficial for large datasets and complex calculations. Organizations can optimize query execution and reduce processing times, resulting in faster data retrieval, more efficient reporting, and hence a better end-user experience.

 

Primary and Foreign Keys

To improve Power BI model quality and developer productivity, we recommend defining Primary and Foreign Keys on your Lakehouse tables. Although Primary and Foreign Keys are informational only (not enforced) in Azure Databricks SQL, Power BI can leverage them to automatically create table relationships in models.

This capability is expected to become available in the May 2023 Power BI update. Primary and Foreign Keys are created by applying constraints to Delta tables, as sketched below.
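For illustration, here is a sketch of such constraints on hypothetical dimension and fact tables; PRIMARY KEY and FOREIGN KEY constraints require Unity Catalog, and the key columns must be declared NOT NULL.

-- Informational only: the constraints are not enforced at write time.
ALTER TABLE dim_customer ADD CONSTRAINT pk_dim_customer PRIMARY KEY (customer_key);

ALTER TABLE fact_sales ADD CONSTRAINT fk_sales_customer
 FOREIGN KEY (customer_key) REFERENCES dim_customer (customer_key);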

Note that with the Assume Referential Integrity option enabled on table relationships, Power BI generates INNER JOINs in its SQL queries, which can lead to better query performance in Azure Databricks SQL. Proper configuration of table relationships in Power BI can therefore improve report performance.

 

Push down calculations to Databricks SQL

As discussed in the previous part of this blog series, pushing down calculations to Azure Databricks SQL can sometimes improve overall performance by minimizing the number of SQL queries and simplifying calculations on the BI tool side. This is especially true in the case of Power BI.

In simple cases, calculations can be wrapped in a view; in more complex cases we recommend persisting the results in Gold-layer tables or leveraging Materialized Views.
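As a simple illustration, below is a sketch of a view that pushes a margin calculation into Azure Databricks SQL instead of computing it in a Power BI measure; the table and column names are hypothetical.

CREATE OR REPLACE VIEW sales_margin_v AS
SELECT order_id,
       order_date,
       sales_amount - cost_amount AS gross_margin,
       -- NULLIF avoids divide-by-zero errors under ANSI SQL mode.
       (sales_amount - cost_amount) / NULLIF(sales_amount, 0) AS gross_margin_pct
FROM fact_sales;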

 

Azure Databricks SQL Warehouse

To render reports, Power BI generates SQL queries and sends them to an Azure Databricks SQL Warehouse over an ODBC connection. Therefore, it is important to know the available options for achieving great performance and, hence, a great user experience.

 

Disk Cache

Azure Databricks SQL uses the Disk Cache to accelerate data reads by copying data files to the nodes' local storage. This happens automatically when SQL queries execute, but you can also warm the cache proactively by running CACHE SELECT statements to improve the performance of future queries. Note that Azure Databricks automatically detects changes in the underlying data, so there is no need to refresh the cache after data loads.
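For example, a cache warm-up statement could look like the following; the column list is illustrative and should match the columns your reports actually scan.

CACHE SELECT ss_sold_date_sk, ss_store_sk, ss_ext_sales_price
FROM tpcds_sf1_delta.store_sales;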

 

Serverless Query Result Cache

Apart from the Disk Cache, Azure Databricks SQL has a Query Result Cache, which stores the results of SELECT queries and returns them faster on subsequent executions. In the Databricks SQL Serverless SKU this feature goes further: the result cache is shared across SQL Warehouses and the clusters within them, so a result cached on one cluster is available on all clusters and even on other SQL Warehouses. This dramatically improves performance and user experience for highly concurrent BI workloads.

 

Intelligent Workload Management

Power BI may generate multiple SQL queries per report, at least one per visual. While some queries are quite complex and process data from large fact tables, others are trivial and select data from smaller fact or dimension tables.

To accommodate such a mix of queries, Azure Databricks SQL uses a dual queuing system that prioritizes small queries over large ones, so small queries are not blocked behind large ones. This improves overall query throughput and BI report performance.


 


 

Concurrency and Sizing

As previously mentioned, Power BI may generate multiple SQL queries per report. Hence, even with a small number of reports you may observe tens or even hundreds of concurrent queries hitting your Azure Databricks SQL Warehouse. To achieve good performance for all users, your SQL Warehouse should be configured with the proper size and scaling.

General guidance is:

  • Use more clusters (scale out) to handle more concurrent users and queries.
  • Use a larger cluster size (scale up) for larger datasets.


 

Choose the right Azure Databricks SQL SKU

Last but not least, Azure Databricks SQL is available in three SKUs: Classic, Pro, and Serverless. Choosing the right SKU is important when planning your solution for future workloads.

  • Serverless, generally available since May 3rd 2023, includes all the additional features of the Pro SKU and is generally the most performant option for BI use cases, regardless of the tool used to query the Lakehouse. With compute capacity fully managed by Azure Databricks, customers get their SQL Warehouse clusters instantly, without waiting for cluster provisioning. Features like Intelligent Workload Management and Serverless Query Result Caching enable great performance for highly concurrent BI workloads with hundreds or even thousands of users.
  • Pro provides multiple additional features on top of the Classic SKU that directly impact performance. With Predictive I/O, Photon I/O, Materialized Views, and Python UDFs, you can achieve better report performance when querying data directly from the Lakehouse without caching it in memory in the BI tool, which means faster and better business decisions.
  • Classic provides all standard SQL capabilities, including full ANSI SQL compliance, the Photon engine, and Unity Catalog integration. It can be a good choice for simple scenarios, such as scheduled Power BI dataset refreshes, where you do not need top performance and cluster startup time is not an issue.

 

Conclusion

In conclusion, optimizing the performance of your Power BI dashboards on top of your Databricks Lakehouse requires a combination of tuning techniques that are specific to Power BI and techniques that are specific to Databricks. We have discussed a range of optimization techniques that can help you improve the performance of your dashboards, including logical table partitioning, Cloud Fetch, Azure Databricks SQL Native Query support, and pushing complex formulas to Azure Databricks SQL. It is important to note that while these techniques can be effective, basic performance improvement best practices such as data filtering, aggregation, and reducing the number of visuals on a page also apply.

 

To get the best performance from your Lakehouse, finding the right balance between the complexity of your queries, the size of your data, and the complexity of your dashboard is crucial. By implementing a combination of techniques and following best practices, you can ensure that your Power BI dashboards on top of your Databricks Lakehouse deliver insights and value quickly and efficiently. 

 

Finally, the Azure Databricks SQL Pro and Serverless SKUs are currently covered by an extended promotional offer, with potential cost savings of up to 30% depending on the Azure region. It is a great chance to try the latest features of Databricks SQL at a discounted price and discover the full potential of your data.

 

And in case you missed it, read Part 1: Power Up your BI with Microsoft Power BI and Azure Databricks Lakehouse: part 1 - Essentials and Part 2: Power Up your BI with Microsoft Power BI and Lakehouse in Azure Databricks: part 2 - Tuning Power BI
