Delta Lake on HDInsight


Introduction

Azure HDInsight is a managed, full-spectrum, open-source analytics service in the cloud for enterprises. The HDInsight Apache Spark cluster is a parallel-processing framework that supports in-memory processing; it is based on open-source Apache Spark.

Apache Spark keeps evolving; its efficiency and ease of use make it a preferred big data tool among data engineers and data scientists. A few essential features are still missing from Spark, one of them being ACID (Atomicity, Consistency, Isolation, Durability) transactions. Most databases support ACID out of the box, but when it comes to the storage layer (for example, ADLS Gen2) it is hard to provide the same level of ACID guarantees that databases offer.

 

The Delta layer is a compute layer that sits on top of your storage layer. Delta Lake uses versioned Parquet files to store your data in your cloud storage. Alongside those versions, Delta Lake also stores a transaction log to keep track of all the commits made to the table or blob store directory, which is what provides ACID transactions.

 

This blog is not about Delta Lake itself; it focuses on how you can leverage Delta with an HDInsight Spark cluster, with a few code snippets and the required configurations.

 

Before we jump into code and the required configurations, it is a good idea to check your Spark version in the Ambari user interface of the HDI cluster. You need to pick the right Delta Lake version based on your cluster's Spark version. The following table lists Delta Lake versions and their compatible Apache Spark versions:

 

HDI Version | Spark Version | Delta Lake Version
----------- | ------------- | ------------------
4.0         | Spark 2.4.4   | < 0.7.0
5.0         | Spark 3.1.2   | 1.0.x

 

HDInsight - Delta Lake Configuration

 

Before we jump into code and configurations, we need to look at these two extensibility configurations provided by Spark:

 

  1. spark.sql.extensions – Configures Spark Session extensions by providing the name of the extension class.
  2. spark.sql.catalog.spark_catalog – Configures a custom catalog implementation. You can find the current catalog implementation from the CatalogManager via spark.sessionState.catalogManager.currentCatalog. Spark 3.x uses SessionCatalog as the default catalog.
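
As a quick sanity check, you can inspect the active catalog from an interactive spark-shell session on the cluster. A minimal sketch (CatalogManager is a Spark 3.x API, so this applies to HDI 5.0):

```scala
// Run inside spark-shell, where a SparkSession named `spark` already exists.
val currentCatalog = spark.sessionState.catalogManager.currentCatalog
println(currentCatalog.name()) // prints "spark_catalog" for the default session catalog
```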

To use Delta Lake with Spark 3.x on HDI 5.0, you need to configure the SQL extensions and the Delta Lake catalog with the following values:

 

Configuration Property          | Delta Lake Value                                | Description
------------------------------- | ----------------------------------------------- | -----------
spark.sql.extensions            | io.delta.sql.DeltaSparkSessionExtension          | An extension for Spark SQL that activates the Delta SQL parser to support Delta SQL grammar.
spark.sql.catalog.spark_catalog | org.apache.spark.sql.delta.catalog.DeltaCatalog  | Replaces Spark's default catalog with Delta Lake's DeltaCatalog.

 

The above configurations need to be provided as part of the Spark configuration before any Spark session is created. Apart from these Spark configurations, the Spark application uber jar should bundle the Delta Lake dependency.
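
For illustration, here is a minimal sketch of setting both properties programmatically when building the session (equivalently, they can be passed with --conf on spark-submit):

```scala
import org.apache.spark.sql.SparkSession

// Both properties must be set before the session is created;
// setting them on an existing session has no effect on the parser or catalog.
val spark = SparkSession.builder()
  .appName("delta-on-hdinsight")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()
```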

 

When working with Spark 2.4.x on HDI 4.0, we only need to supply the Delta Lake dependency; no additional Spark configurations are required. We do, however, need to shade com.fasterxml.jackson to avoid class-loading conflicts caused by duplicate classes on the cluster classpath.
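
As an illustration, if you build the uber jar with sbt-assembly, the dependency and shading could look like the following sketch (the build tool choice and the exact Delta version are assumptions; any Delta release below 0.7.0 matches Spark 2.4.x per the table above):

```scala
// build.sbt (sketch) -- assumes the sbt-assembly plugin is enabled

// Delta Lake < 0.7.0 is the line compatible with Spark 2.4.x on HDI 4.0.
libraryDependencies += "io.delta" %% "delta-core" % "0.6.1"

// Relocate jackson classes inside the uber jar so they do not clash with
// the jackson version already present on the cluster classpath.
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "shadedjackson.@1").inAll
)
```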

 

Example Code

You can clone the example code from GitHub; the code is written in Scala. You can run the example code using any one of these options:

  1. Copy the application jar to the Azure Storage blob associated with the cluster, then either:
    1. SSH to the headnode and run spark-submit from there, or
    2. submit the jar through the Livy API.
  2. Use Azure Toolkit for IntelliJ.

The example application will generate stdout logs and Delta Lake Parquet files with commit logs. The output examples are listed on GitHub.
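
If you want to verify the setup with something smaller than the full example, a minimal Delta round trip looks like the following sketch (the ABFS path is a placeholder; point it at storage attached to your cluster):

```scala
// Placeholder path on cluster-attached storage.
val path = "abfss://<container>@<account>.dfs.core.windows.net/delta/demo"

// Write a small DataFrame as a Delta table ...
val df = spark.range(0, 5).toDF("id")
df.write.format("delta").mode("overwrite").save(path)

// ... and read it back. The _delta_log directory created next to the
// Parquet files holds the commit log that provides the ACID guarantees.
spark.read.format("delta").load(path).show()
```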

 

Summary

Delta Lake is an open-source storage framework that extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling. Delta Lake is fully compatible with Apache Spark APIs. Since the HDInsight Spark cluster is an installation of the Apache Spark library onto an HDInsight Hadoop cluster, you can use compatible Delta Lake versions to take advantage of Delta Lake on HDInsight.
