Power Up your BI with Microsoft Power BI and Azure Databricks Lakehouse: part 1 – Essentials

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

This post was authored by Andrey Mirskiy and Diego Fanesi.

 

Welcome to the first post of our series on how to use Azure Databricks SQL with Power BI. In this article, we will focus on how to connect to a Databricks SQL Warehouse from Power BI and query your Delta Lake to populate your dashboards. This scalable approach allows you to efficiently run queries on large data volumes, generating more business value from your datasets. 

 

Many customers have adopted this architecture for their Business Intelligence workloads and are running Azure Databricks with Power BI today.

In the next sections, we will cover the architectures you can implement with Power BI and provide criteria for choosing between them. We will then walk through each one, showing you how to quickly set up the connection between the two solutions.

 

Architectures

In a Lakehouse architecture, you want your data lake to be at the center of your data strategy, implementing data engineering, BI and ML use cases on top of your data lake, without the need of maintaining separate systems and exporting data into external data warehouse solutions. 

 

When you are looking at querying your Lakehouse efficiently with Power BI, there are mainly two strategies to consider: 

 

In DirectQuery mode, Power BI connects to an existing Azure Databricks SQL Warehouse and forwards each query generated by a report directly to Azure Databricks. In Import mode, Power BI imports each table into its own storage and queries the data from inside the BI solution. A third mode, Composite, allows DirectQuery and Import to be combined within the same Power BI model - more on this technique in another blog post. 

 

As you can imagine, Import mode requires you to re-import the data as it changes, while DirectQuery requires no such maintenance. However, DirectQuery has drawbacks of its own for some specific use cases. 

 

To help you choose between the two approaches, here is a list of pros and cons to consider:

 

 

Import mode

Pros

  • Very low latency: advised for cases where you need a very interactive experience.
  • Does not require the Azure Databricks SQL Warehouse to be running once the data has been imported.

Cons

  • Recommended only for relatively small or static datasets, e.g., reference data or historical aggregated data. The maximum dataset size is defined by your Power BI SKU (1 GB-400 GB).
  • Data may not be up-to-date; results are a snapshot from the past rather than the current data in the data lake.
  • Data is duplicated across the two systems, and both copies require maintenance.
  • Data must be governed both in Databricks and in Power BI.

DirectQuery mode

Pros

  • Scales to datasets of virtually any size, leveraging the power of Azure Databricks SQL to query your data efficiently.
  • Queries are executed on the data available in the data lake at that moment.
  • No need to refresh a separate data store.

Cons

  • The Azure Databricks SQL Warehouse must be up and running whenever a user requests a report. We therefore recommend Azure Databricks SQL Serverless for BI workloads, which starts SQL Warehouse cluster(s) in seconds.
  • Queries are affected by network latency, making the report interface less responsive.

 

The two modes are described in more detail in the official Power BI documentation on dataset modes.

It is worth mentioning that Import mode provides the best performance once data is loaded into the in-memory cache. Therefore, Import mode is recommended whenever the data volume and data load time fit within these limitations and your use case requirements.
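The decision logic above can be sketched as a simple rule of thumb. This is an illustrative helper, not an official Power BI API; the function name and parameters are ours, and the SKU limit default is only a rough figure (about 1 GB on Pro, up to 400 GB on Premium capacities).

```python
def recommend_storage_mode(size_gb: float, needs_live_data: bool,
                           sku_limit_gb: float = 1.0) -> str:
    """Rule-of-thumb mode picker (illustrative only, not a Power BI API).

    sku_limit_gb is the model size limit of your Power BI SKU.
    """
    if needs_live_data:
        # Results must reflect the data lake right now: push queries down.
        return "DirectQuery"
    if size_gb <= sku_limit_gb:
        # Small, static data fits in-memory: best interactive latency.
        return "Import"
    # Too large to import within the SKU limit.
    return "DirectQuery"
```

In practice you would also weigh refresh duration and governance overhead, which this sketch ignores.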

 

Now that we know how to choose between them, let’s look at an actual implementation of both.

 

Connecting Power BI to Azure Databricks SQL

Let’s start by connecting your Azure Databricks SQL Warehouse to Power BI using Partner Connect. Before you start, make sure you have already created the SQL Warehouse you want to use in Azure Databricks SQL: click SQL Warehouses, then Create SQL Warehouse.

[Screenshot: creating a SQL Warehouse in Azure Databricks SQL]

 

We recommend the Serverless or Pro SKU for Azure Databricks SQL Warehouses in BI use cases, as these workloads generally demand high concurrency and fast query performance. Serverless provides the best experience, with low-latency autoscaling and near-instant cluster startup. 

 

Now click on Partner Connect and select Power BI from the partner list. A side panel will open, asking you to select the SQL Warehouse from a dropdown and download a file. 

 

The downloaded file is the configuration file for the connection. Open it with Power BI Desktop and your connection will be configured automatically. 
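If you prefer to configure the connection manually rather than via Partner Connect, Power BI's Azure Databricks connector asks for the same two values you can copy from the warehouse's Connection Details tab: Server Hostname and HTTP Path. A minimal sketch of assembling them (the function name and example values are our own; the `/sql/1.0/warehouses/<id>` path shape is the documented format for SQL Warehouses):

```python
def connection_details(workspace_host: str, warehouse_id: str) -> dict:
    """Values to paste into Power BI's Azure Databricks connector dialog.

    workspace_host: e.g. "adb-1234567890123456.7.azuredatabricks.net"
    warehouse_id:   the id shown in the SQL Warehouse's Connection Details tab
    """
    return {
        "server_hostname": workspace_host,
        # SQL Warehouses expose an HTTP path of this documented shape:
        "http_path": f"/sql/1.0/warehouses/{warehouse_id}",
    }
```

Either route ends up with the same connection; Partner Connect simply pre-fills these values for you.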

 

On the Power BI side, you will be asked to select the Data Connectivity mode. That’s where you can select Import or DirectQuery. See screenshot below.

 

[Screenshot: selecting the Data Connectivity mode in Power BI]

 

Choose the option that matches the architecture you decided on. An interface will then let you select the tables you want to use. After clicking Load, you are ready to start building your BI report.

 

[Screenshot: selecting tables in the Power BI Navigator]

 

 

Import Mode: Refreshing Tables

We have already introduced Import mode and the need to refresh tables when using it, so let's see how to refresh them. When you open a table, you can see when it was last refreshed. 

 

[Screenshot: table refresh status in Power BI]

 

There are multiple ways to refresh tables, and you can also set a schedule so refreshes happen automatically, without human involvement. See the official Power BI documentation on data refresh for details.
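Beyond the Power BI service UI, refreshes can also be triggered programmatically through the Power BI REST API's refreshes endpoint. A hedged sketch using only the Python standard library (the group and dataset ids, and the Azure AD token acquisition, are assumptions you must supply; the call requires the `Dataset.ReadWrite.All` permission):

```python
import json
import urllib.request

API_ROOT = "https://api.powerbi.com/v1.0/myorg"

def refresh_url(group_id: str, dataset_id: str) -> str:
    # Documented endpoint: POST .../groups/{groupId}/datasets/{datasetId}/refreshes
    return f"{API_ROOT}/groups/{group_id}/datasets/{dataset_id}/refreshes"

def trigger_refresh(aad_token: str, group_id: str, dataset_id: str) -> None:
    """Kick off an on-demand dataset refresh via the Power BI REST API."""
    req = urllib.request.Request(
        refresh_url(group_id, dataset_id),
        data=json.dumps({"notifyOption": "NoNotification"}).encode(),
        headers={
            "Authorization": f"Bearer {aad_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)  # raises on non-2xx responses
```

This is handy for kicking off a refresh at the end of a Databricks ETL job, so imported tables are updated as soon as new data lands in the lake.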

 

 

More on DirectQuery Mode 

In DirectQuery-based architectures, we want to set fact tables to DirectQuery mode because they are usually huge and do not fit into memory. But what about dimension tables? Dimension tables are typically used to populate Power BI visuals such as filters and slicers. If dimension tables are also in DirectQuery mode, those visuals will be populated and updated with higher latency, making the report interface slower and less responsive. 

 

The recommendation in such cases is to use Dual mode for dimension tables. This ensures that Power BI uses in-memory cached data for these tables when rendering filter and slicer visuals, while still generating SQL queries that join fact tables with dimension tables to retrieve up-to-date data. The result is more efficient SQL queries and fresh data from the data lake. In other words, when populating filters and slicers a Dual table behaves as in Import mode, and when querying data it behaves as in DirectQuery mode - hence the name "Dual" mode. 

 

Keep fact tables in DirectQuery mode to ensure that queries for larger tables are pushed down to Azure Databricks SQL, allowing you to leverage the performance benefits of the Azure Databricks SQL engine.

 

Dimension tables in Dual mode will need to be refreshed following the same guidance provided for Import mode. Schedule refreshes to keep your dimension tables current.

 

The table below summarizes the recommended storage modes for different table types; see the official Power BI documentation on storage mode for more details.

 

Table type          Recommended Storage Mode
Large Fact Table    DirectQuery
Dimension Tables    Dual (not Import!)
Aggregation Table   Import (or DirectQuery)

 

Note that after setting a table to Import mode, you cannot revert it to DirectQuery without recreating the table in the Power BI model, which may require updating table relationships and DAX formulas.

 

Wrap up

This article discussed how to use Power BI together with Azure Databricks SQL to query your Lakehouse, and covered the advantages and disadvantages of both Import and DirectQuery modes. 

 

While you can use this content as a guideline for implementing your Power BI + Azure Databricks SQL architecture, additional tuning might be required to maximize report performance and improve the user experience.


Our next two blog posts, "Power Up your BI with Microsoft Power BI and Azure Databricks Lakehouse: part 2 - Tuning Power BI" and "Power Up your BI with Microsoft Power BI and Azure Databricks Lakehouse: part 3 - Tuning Azure Databricks SQL" will focus on enhancing both Power BI and Azure Databricks SQL Warehouse performance. Stay tuned for tips and tricks to boost your BI solution's efficiency and effectiveness.
