Delta Lake on Azure

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

This is a guest post by Nikhil Gupta, Partner Solutions Architect, Databricks

As customers move toward the lakehouse paradigm, one necessary requirement is that data be stored in an open-source format that provides transaction support, schema enforcement and governance, and BI support. Delta meets all of these requirements, making it a best-in-class format for storing your data in Azure Data Lake Store.

Delta is an open-source storage layer on top of your data lake that brings ACID transaction capabilities to big data workloads. In a nutshell, Delta Lake is built on the Apache Parquet format together with a transaction (change) log mechanism. This lets Delta Lake overcome the difficulties plain Parquet has with deletes, upserts, and merges, while providing additional capabilities such as time travel.
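
To make the idea of "data files plus a transaction log" more concrete, here is a toy sketch in plain Python. It is not Delta Lake's actual implementation or file layout; the class `ToyDeltaTable` and its methods are hypothetical names invented for illustration. It only shows why an append-only log of "add file" and "remove file" actions over immutable files makes deletes and time travel possible.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaTable:
    """Toy model (an assumption, not Delta's real format):
    immutable data files plus an append-only commit log."""

    files: dict = field(default_factory=dict)  # file name -> rows; never mutated once written
    log: list = field(default_factory=list)    # one entry per version: {"add": [...], "remove": [...]}

    def _write_file(self, rows):
        # Each write produces a brand-new file; existing files are never edited.
        name = f"part-{len(self.files):05d}"
        self.files[name] = list(rows)
        return name

    def _commit(self, add=(), remove=()):
        # A commit is a single log entry, so its adds and removes become visible together.
        self.log.append({"add": list(add), "remove": list(remove)})
        return len(self.log) - 1  # version number of this commit

    def append(self, rows):
        return self._commit(add=[self._write_file(rows)])

    def delete(self, predicate):
        # Rewrite only the files that contain matching rows.
        add, remove = [], []
        for name in self.active_files():
            rows = self.files[name]
            kept = [r for r in rows if not predicate(r)]
            if len(kept) != len(rows):
                remove.append(name)
                if kept:
                    add.append(self._write_file(kept))
        return self._commit(add=add, remove=remove)

    def active_files(self, version=None):
        # Replay the log up to `version` to find the files in that snapshot.
        last = len(self.log) - 1 if version is None else version
        active = []
        for commit in self.log[: last + 1]:
            active = [f for f in active if f not in commit["remove"]]
            active.extend(commit["add"])
        return active

    def read(self, version=None):
        return [row for name in self.active_files(version) for row in self.files[name]]


table = ToyDeltaTable()
v0 = table.append([{"id": 1}, {"id": 2}])
v1 = table.append([{"id": 3}])
v2 = table.delete(lambda r: r["id"] == 2)

latest_ids = sorted(r["id"] for r in table.read())               # current snapshot
earlier_ids = sorted(r["id"] for r in table.read(version=v1))    # "time travel" to v1
```

Because old files are kept and only the log changes, reading an earlier version is just a matter of replaying fewer log entries.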

As the architecture diagram below shows, once the data in the data lake is stored in Delta format, it can be accessed by a variety of Azure services.

How does Delta integrate with other Azure Services?

Azure Databricks: Azure Databricks natively supports Delta Lake and adds enhanced capabilities such as Delta caching. You can query the Delta Lake using SQL, Python, R, or Scala. We recommend the following blog posts for more insight into Delta Lake with Azure Databricks:
- Productionizing Machine Learning with Delta Lake
- Diving Into Delta Lake: Unpacking The Transaction Log
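
As an illustration of combining SQL and Python on a Delta table, the following sketch builds a `MERGE INTO` statement for an upsert and runs it through a Spark session. The table names, the key column, and the helper names `build_merge_sql` and `upsert` are hypothetical; the sketch assumes the source and target share the same columns and that the identifiers come from trusted code, since they are interpolated directly into the SQL text.

```python
def build_merge_sql(target: str, source: str, key: str) -> str:
    """Build a Delta Lake MERGE statement that upserts `source` into `target`.

    Assumes both tables have the same columns, so `UPDATE SET *` and
    `INSERT *` can copy every column. Identifiers are not escaped here.
    """
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert(spark, target: str, source: str, key: str):
    """Run the merge on an existing SparkSession (for example, the `spark`
    object that a Databricks notebook provides)."""
    return spark.sql(build_merge_sql(target, source, key))


# Hypothetical usage inside a notebook, where `spark` already exists:
# upsert(spark, target="events", source="events_updates", key="event_id")
```

Rows in the source that match on the key update the target; the rest are inserted, all within a single transaction on the target table.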
Azure Synapse Analytics: Azure Synapse Analytics (Spark component in public preview) is compatible with Linux Foundation Delta Lake, so you can use Synapse Spark to read and write data stored in Delta format in your data lake. The current version of Delta Lake included with Azure Synapse has language support for Scala, PySpark, and .NET.
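
A minimal PySpark sketch of reading and writing Delta data in the data lake might look like the following. The helper names (`adls_path`, `write_delta`, `read_delta`) and the container, account, and folder names in the usage comments are placeholders, not values from this post. The functions take an existing Spark session or DataFrame as an argument, on the assumption that the runtime (Synapse Spark or Azure Databricks) already has Delta Lake and storage access configured.

```python
def adls_path(container: str, account: str, relative_path: str) -> str:
    """Build an abfss:// URI for a folder in Azure Data Lake Storage Gen2."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{relative_path.lstrip('/')}"


def write_delta(df, path: str, mode: str = "append"):
    """Write a Spark DataFrame to `path` in Delta format."""
    df.write.format("delta").mode(mode).save(path)


def read_delta(spark, path: str, version=None):
    """Read a Delta table from `path`.

    If `version` is given, use Delta's `versionAsOf` read option to load
    that earlier version of the table (time travel).
    """
    reader = spark.read.format("delta")
    if version is not None:
        reader = reader.option("versionAsOf", version)
    return reader.load(path)


# Hypothetical usage in a notebook where `spark` and a DataFrame `df` already exist:
# path = adls_path("mycontainer", "mystorageaccount", "delta/events")
# write_delta(df, path)
# current = read_delta(spark, path)
# original = read_delta(spark, path, version=0)
```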
Azure Data Factory: Azure Data Factory (ADF) supports Delta Lake in the following ways:
- The Copy activity supports the Azure Databricks Delta Lake connector, which copies data from any supported source data store to an Azure Databricks Delta Lake table, and from a Delta Lake table to any supported sink data store.
- ADF mapping data flows support the generic Delta Lake format on Azure Storage as both source and sink, so you can read and write Delta files for code-free extract, transform, and load (ETL). Mapping data flows run on the managed Azure Integration Runtime (IR).
- Azure Databricks Notebook activities support orchestrating your code-centric ETL or machine learning workloads on top of Delta Lake.
Azure HDInsight: Azure HDInsight supports both Apache Spark and Hive. You can connect to a Delta Lake by downloading the relevant open-source drivers.
Power BI: With the new Azure Databricks Power BI connector, you can query Delta Lake tables directly using Azure Databricks clusters.

Next Steps
Get started using Delta Lake by attending an upcoming Azure Databricks event.
