Demystifying Data Ingestion in Azure Synapse Data Explorer


Author: @devangshah, Principal Program Manager in the Azure Synapse Customer Success Engineering (CSE) team.

 

Introduction

Following up on the previous blog, Demystifying Azure Synapse Data Explorer, this post continues the Demystifying Data Explorer series with a focus on data ingestion.

 

Azure Synapse Data Explorer provides high-speed, low-latency data ingestion, enabling real-time analytics on streaming data. The graphic below shows the data formats, ingestion methods, in-place transformation, tuning, and monitoring capabilities available in Azure Synapse Data Explorer to achieve high-speed, low-latency ingestion throughput.

 

[Figure: data ingestion capabilities in Azure Synapse Data Explorer (formats, methods, in-place transformation, tuning, and monitoring)]

 

Example with a business scenario

Continuing the ACME Chemicals example from the previous blog, here’s the data ingestion scenario:

  1. ACME Chemicals has an app developed using C# that generates logs, and it would like to store these logs in Azure Synapse Data Explorer to analyze the root cause of performance issues and bottlenecks.
  2. ACME Chemicals has sensors in its plants that send measurement values every second. This time-series data, in JSON format containing arrays of JSON objects, needs to be streamed via Event Hub into Azure Synapse Data Explorer. ACME Chemicals also needs to migrate one year of historical time-series data from its plants, stored as Parquet files in Azure Data Lake Store Gen 2, into Azure Synapse Data Explorer. This data is used for near real-time predictive analytics.

 

Key aspects of data ingestion in Data Explorer

Based on the graphic above and the explanation of each of its components, we will derive the most relevant methods to ingest the logs, time-series, and historical data into Azure Synapse Data Explorer.

  1. Data Formats: Azure Synapse Data Explorer supports ingestion from 14 (and growing) different file formats, including the most widely used formats such as Parquet, JSON, CSV, W3CLOGFILE, etc. We will soon announce support for the Delta Parquet format as well. A complete list of supported formats is available at Data formats supported by Azure Data Explorer for ingestion | Microsoft Learn.
  2. Ingestion Methods: Azure Synapse Data Explorer enables ingesting data in streaming and batch modes and includes a comprehensive portfolio of connectors and plugins to ingest data.
    1. In streaming mode, you can achieve near real-time latency for a small set of data per table. The maximum data size per request is 4 MB (uncompressed).
    2. For bigger files, Azure Synapse Data Explorer supports batch ingestion, which can handle file sizes up to 1 GB (uncompressed). The batch ingestion mode supports setting thresholds on how a batch is sealed; by setting lower thresholds, one can also achieve streaming-like data ingestion using micro-batching. See more details in the Ingestion Tuning section below. With batch ingestion, in our internal benchmarks and external customer implementations, Azure Synapse Data Explorer achieves ingestion throughput of 200-250 MB per second on a two-node, 16-core cluster.
    3. Connectors, Plugins, SDKs: Azure Synapse Data Explorer integrates with many Azure-native and open-source technologies to support ingestion. ADX supports the following methods in both batch and streaming ingestion modes.


- SDKs: ADX provides SDKs for popular languages and frameworks such as .NET, Java, Python, Go, and Node.js.

- Azure Native Services: ADX creates data connections with Azure Event Hubs and Azure IoT Hub, supporting streaming as well as batch ingestion with Managed Identities.

- Open-source technologies: ADX is also available as a Kafka sink connector and an OpenTelemetry collector exporter.
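Note that streaming ingestion must be enabled both on the cluster (in the Azure portal or via ARM) and on the target database or table through a streaming ingestion policy. A minimal sketch, using a hypothetical database and table name:

    // Enable streaming ingestion on a database (applies to all of its tables)
    .alter database AcmeTelemetry policy streamingingestion enable

    // Or enable it for a single table only
    .alter table AppLogs policy streamingingestion enable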

 

The following methods are supported only in batch ingestion mode.


- Azure Native Services: Batches of data can be ingested from Azure Data Factory, Synapse Pipelines, and Azure Stream Analytics.

- Cloud Data Sources: Using Azure Event Grid, blobs in Azure Storage accounts and Azure Data Lake Store Gen 2 can be ingested in batches into Azure Data Explorer. Azure Synapse Data Explorer also supports ingestion from AWS S3 buckets.

- Open-source technologies: Batch ingestion of data is supported by Apache Spark (using the Kusto Spark connector), Logstash, and the Telegraf agent.

- Tools: Azure Synapse Data Explorer provides a UX-based ingestion wizard to ingest data from a local computer, Azure Storage Account blob containers, Azure Data Lake Store Gen 2 containers, and Azure Event Hub. ADX also provides LightIngest to ingest historical data into Data Explorer.

Depending on your data source, data format, and latency requirements, you can choose one or a combination of these methods to ingest data into Data Explorer. You can also use the comparison table to choose the most suitable ingestion method for your scenario.
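As an illustration, for ad-hoc exploration you can also issue a direct ingestion control command from a query window. Direct commands bypass the batching service, so they suit testing rather than production pipelines; the table name and storage URL below are hypothetical:

    // Pull a single Parquet blob into a table; the h prefix hides the URL in logs
    .ingest into table HistoricalMeasurements (
        h'https://acmestorage.blob.core.windows.net/history/plant1-2021.parquet;impersonate'
    ) with (format='parquet')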

 

  3. In-place Transformation: Azure Synapse Data Explorer also provides in-place data transformation through the Update Policy feature. Put differently, Azure Synapse Data Explorer can also be considered a simple Extract-Load-Transform tool for performing high-speed, low-latency transformations on streaming data. Some examples of such in-place transformations are:
    • Extracting complex JSON messages into a relational table structure
    • Parsing log and text files to extract meaningful data
    • Parsing messages to extract key-value pairs
    • Data cleaning
    • Data grooming and curation
    • Distributing a raw stream of data into multiple target tables

Please refer to Update policy - Azure Data Explorer | Microsoft Learn for more information.
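To make this concrete, here is a sketch of an update policy for the ACME scenario: raw JSON messages land in a source table, and a function expands the nested array of readings into individual rows in a target table. All table, column, and JSON field names are illustrative:

    // Landing table for the raw JSON messages arriving from Event Hub
    .create table RawEvents (Raw: dynamic)

    // Target table with a relational structure
    .create table Measurements (PlantId: string, Timestamp: datetime, Sensor: string, Value: real)

    // Function that expands the nested JSON array into one row per reading
    .create function ExpandMeasurements() {
        RawEvents
        | mv-expand reading = Raw.readings
        | project PlantId = tostring(Raw.plantId),
                  Timestamp = todatetime(reading.ts),
                  Sensor = tostring(reading.sensor),
                  Value = toreal(reading.value)
    }

    // Update policy: run the function on every new ingestion into RawEvents
    .alter table Measurements policy update
        @'[{"IsEnabled": true, "Source": "RawEvents", "Query": "ExpandMeasurements()", "IsTransactional": true}]'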

  4. Ingestion Tuning
    1. Ingestion Data Mapping: Data mappings are used during ingestion to map incoming data to columns inside tables. Data Explorer supports different types of mappings, both row-oriented (CSV, JSON, AVRO, and W3CLOGFILE) and column-oriented (Parquet and ORC). Please refer to Data mappings - Azure Data Explorer | Microsoft Learn for examples of creating data mappings for CSV and JSON files.
    2. Ingestion Properties: With some ingestion methods, Azure Synapse Data Explorer also lets you provide ingestion properties that aid the ingestion process. For example, when ingesting historical data, you can use the ‘creationTime’ property to instruct the cluster to take the original creation time from the dataset instead of the date on which the data is ingested; by default, the cluster records the ingestion time. Please refer to Data ingestion properties for Azure Data Explorer | Microsoft Learn for the complete list of ingestion properties.
    3. Batching Policy: During the ingestion process, the service optimizes for throughput by batching small ingress data chunks together before ingestion. Batching reduces the resources consumed by the ingestion process and avoids the post-ingestion work needed to optimize the small data shards produced by non-batched ingestion. For more information on setting up a batching policy, please refer to Ingestion Batching policy optimizes batching in Azure Data Explorer | Microsoft Learn.
    4. Cluster Sizing: The size of your Azure Synapse Data Explorer cluster influences the ingestion throughput you can achieve. Because the ingestion process, query serving, and background data-management processes share the same compute resources, it is advisable to right-size your cluster based on the data ingestion volume expected per day.
    5. Capacity Policy: Azure Synapse Data Explorer sets default limits on the number of concurrent ingestions that can be performed on each cluster, as well as the percentage of CPU reserved for ingestion. These limits can be altered using control commands; a few of these tuning knobs are illustrated in the sketch below.
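The following sketch illustrates a few of these tuning knobs with hypothetical names and values; the exact schemas and limits are in the linked documentation:

    // (1) A JSON ingestion mapping that maps incoming fields to table columns
    .create table Measurements ingestion json mapping "MeasurementsMapping"
        '[{"column":"PlantId","path":"$.plantId","datatype":"string"},{"column":"Timestamp","path":"$.ts","datatype":"datetime"},{"column":"Value","path":"$.value","datatype":"real"}]'

    // (2) The creationTime ingestion property on a direct ingestion command,
    //     so historical data keeps its original creation time
    .ingest into table Measurements (h'https://acmestorage.blob.core.windows.net/history/2021-06.parquet;impersonate')
        with (format='parquet', creationTime='2021-06-01T00:00:00Z')

    // (3) A batching policy that seals a batch after 30 seconds, 500 items,
    //     or 500 MB of raw data, whichever comes first (micro-batching)
    .alter table Measurements policy ingestionbatching
        '{"MaximumBatchingTimeSpan":"00:00:30","MaximumNumberOfItems":500,"MaximumRawDataSizeMB":500}'

    // (4) A capacity policy change that raises the ingestion concurrency limits
    .alter-merge cluster policy capacity
        '{"IngestionCapacity":{"ClusterMaximumConcurrentOperations":512,"CoreUtilizationCoefficient":0.75}}'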
  5. Ingestion Monitoring
    1. Ingestion Insights: In the Insights blade of the Azure portal, in Azure Synapse Data Explorer’s sidebar menu, you can get granular insights into the completion of the ingestion process. All successful and failed ingestions are reported here, by database and by table. For batch ingestion, it also provides detailed component-level charts to monitor the batch ingestion process through its various stages. Please note that you need to export Diagnostic Settings for the categories Succeeded ingestion, Failed ingestion, and Ingestion batching to a Log Analytics workspace to generate these insights.
    2. Ingestion Failures: Azure Synapse Data Explorer also provides a control command to get the list of all ingestion failures; an example follows. Please refer to Ingestion failures - Azure Data Explorer | Microsoft Learn for more information, to Ingestion error codes in Azure Data Explorer | Microsoft Learn when using SDKs to ingest data, and to Streaming ingestion failures - Azure Data Explorer | Microsoft Learn when using streaming ingestion.
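For example, the following control command lists recent ingestion failures; the time filter and projected columns are illustrative:

    // List ingestion failures from the last day, including the failure kind
    // (permanent vs. transient) and the error details
    .show ingestion failures
    | where FailedOn > ago(1d)
    | project FailedOn, Database, Table, FailureKind, Details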

 

Example: Business scenario solved

  1. Using the C# SDK, ACME Chemicals can adapt its existing app to ingest logs directly into Azure Synapse Data Explorer.
  2. Using a data connection between Azure Synapse Data Explorer and Event Hub, ACME Chemicals can stream its time-series data in JSON format. Using an ingestion mapping, ACME Chemicals can store some of the JSON fields directly as columns in the table. For the nested arrays, ACME Chemicals can use update policies to expand these JSON arrays and store them as individual records.
  3. Using the LightIngest tool, ACME Chemicals can ingest the historical data from Azure Data Lake Store Gen 2 into Azure Synapse Data Explorer. ACME Chemicals can generate the necessary LightIngest commands using the ingestion wizard. As this is historical data, ACME Chemicals will also need to provide a creation time pattern so that Azure Synapse Data Explorer can partition the data appropriately.
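A sketch of such a LightIngest invocation follows (shown wrapped across lines for readability; run it as a single line). The cluster, database, table, container, and creation time pattern are all hypothetical; check the LightIngest documentation for the exact flags:

    LightIngest "https://ingest-acmecluster.westus.kusto.windows.net;Fed=True"
        -db:AcmeTelemetry
        -table:Measurements
        -source:"https://acmestorage.blob.core.windows.net/history;<storage-account-key>"
        -pattern:"*.parquet"
        -format:parquet
        -creationTimePattern:"'plant-data-'yyyy-MM-dd"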

 

Summary

This blog summarizes the various aspects that contribute to streaming data ingestion pipelines with Azure Synapse Data Explorer. With support for a wide variety of data formats, data sources, ingestion methods, and streaming and batch modes, Data Explorer can unlock complex data ingestion and transformation scenarios on log and time-series data that arrives at high velocity, in high volume, and in diverse formats.

 

Read on to learn more about how ingestion works in Data Explorer.

Ingestion in Data Explorer is performed by two services: (1) the Data Management (DM) service and (2) the Engine service. A typical deployment of Azure Data Explorer contains two sets of compute nodes, one for the DM service and another for the Engine. The Engine service is responsible for processing the incoming raw data and serving user queries. The DM service is responsible for connecting the Engine to the various data pipelines, orchestrating and maintaining continuous data ingestion from these pipelines (including robust handling of failure and backpressure conditions), and invoking the periodic data-grooming tasks on the Engine cluster.

In a typical Azure Synapse Data Explorer cluster, the most frequent transactions are related to data ingestion (which creates a new shard and appends a shard reference to the table metadata) and retention (which deletes a shard when it falls out of the chronological “sliding window”). The detailed steps of the data ingestion transaction are:

  1. An ingestion command arrives at the Admin node; it specifies the target table and the list of data sources (e.g., a list of CSV file URLs).
  2. The Admin node finds an available Data node to perform the processing and forwards the ingestion command to this node.
  3. The Data node fetches and processes the data sources, creates a new shard, writes its artifacts to blob storage, and returns the description of the new (uncommitted) shard to the Admin node.
  4. The Admin node adds the new shard reference to the table metadata and commits the delta snapshot of the database metadata.

When the Engine starts processing a new query, it takes a snapshot of the current metadata and attaches it to the query until the query completes.

 

For a deep dive into this, please refer to Azure Synapse Data Explorer Technical Whitepaper.

 

Our team publishes blogs regularly; you can find all of them here: https://aka.ms/synapsecseblog

For a deeper understanding of Synapse implementation best practices, please refer to our Success By Design (SBD) site: https://aka.ms/Synapse-Success-By-Design

 

 

 
