Performance Tuning ADF Data Flow Sources and Sinks

Azure Data Factory Data Flows perform data transformation (ETL) at cloud scale. This blog post takes a look at the performance of different source and sink types. I've summarized our findings below, based on performance tests of several source & sink pairs:

Scenario 1

  • Source: Delimited Text Blob Store
  • Sink: Azure SQL DB
  • File size: 421 MB, 74 columns, 887k rows
  • Transforms: Single derived column to mask 3 fields
  • Time: 4 mins end-to-end using memory-optimized 80-core debug Azure IR
  • Recommended settings: Current partitioning used throughout

Scenario 2

  • Source: Azure SQL DB Table
  • Sink: Azure SQL DB Table
  • Table size: 74 columns, 887k rows
  • Transforms: Single derived column to mask 3 fields
  • Time: 3 mins end-to-end using memory-optimized 80-core debug Azure IR
  • Recommended settings: Source partitioning on SQL DB Source, current partitioning on Derived Column and Sink

Scenario 3

  • Source: Delimited Text Blob Store
  • Sink: Delimited Text Blob store
  • File size: 74 columns, 887k rows
  • Transforms: Single derived column to mask 3 fields
  • Time: 2 mins end-to-end using a memory-optimized 80-core debug Azure IR (see the IR configuration sketch after this list)
  • Recommended settings: Leaving the default/current partitioning throughout allows ADF to scale partitions up or down based on the size of the Azure IR (i.e., the number of worker cores)
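
All three scenarios above ran on the same memory-optimized, 80-core debug Azure IR. As a reference point, here is a minimal sketch of what that compute looks like when defined as JSON, written as a plain Python dict; the IR name is hypothetical and the property names follow the ADF REST/ARM schema as I recall it, so verify them against a definition exported from your own factory.

    # Minimal sketch of an Azure Integration Runtime definition (plain dict
    # mirroring the ADF REST/ARM payload), sized like the IR used above.
    # Names and properties are assumptions -- check your factory's exported JSON.
    import json

    azure_ir = {
        "name": "DataFlowsMemOpt80IR",  # hypothetical IR name
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    "dataFlowProperties": {
                        "computeType": "MemoryOptimized",  # vs. "General" / "ComputeOptimized"
                        "coreCount": 80,                   # Spark worker cores for the data flow
                        "timeToLive": 10,                  # minutes to keep the debug cluster warm
                    },
                }
            },
        },
    }

    print(json.dumps(azure_ir, indent=2))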

File-based Source / Sink

[Image: perf1.png]

  • Set "current partitioning" on source & sink to allow data flows to leverage native Spark partitioning. This will allow the performance to scale proportionally with an increase in core counts.
  • Pre- and post-processing operations such as "save as single file", "clear folder", and "delete files" will add time to your ETL process.
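
The proportional-scaling point above lends itself to a quick back-of-the-envelope check. The helper below is a rough Python sketch, not ADF API code; it assumes near-linear scaling with core count, which real jobs only approximate because cluster startup, source/sink limits, and data skew all take their share.

    def estimate_runtime_minutes(baseline_minutes, baseline_cores, target_cores):
        """Rough runtime estimate assuming transformation work scales ~linearly
        with worker cores. Illustrative only: ignores cluster startup time,
        source/sink throughput limits, and data skew."""
        return baseline_minutes * (baseline_cores / float(target_cores))

    # Example: Scenario 3 ran in ~2 minutes on an 80-core memory-optimized IR.
    # A hypothetical 16-core IR would take at least roughly this long:
    print(estimate_runtime_minutes(2.0, 80, 16))  # -> 10.0 (optimistic lower bound)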

Azure SQL DB Source / Sink

[Image: perf2.png]

  • SQL DB Source
    • Use "Source" partitioning under Optimize and set the number of partitions equal to the number of cores you are using. Use a high-cardinality column or set of columns as the partition column.
    • Use "Input query" to minimize the data, columns, and for pushdown functions.
  • SQL DB Sink
    • Make sure the Azure SQL DB tier is large enough to give your ETL job sufficient resources when writing to the database.
    • Adding cores to your job scales performance proportionally, but you will ultimately be limited by how quickly the database can ingest the writes.
    • Use current partitioning.
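
To make the source bullets concrete, the sketch below captures the kind of values you would enter on the SQL DB source's Optimize tab and source options: a partition count matched to the IR core count, a high-cardinality partition column, and an input query that trims rows and columns so the filtering is pushed down to the database. The table, column, and key names are hypothetical, and the dict is just a stand-in for the UI fields, not an actual ADF payload.

    IR_CORE_COUNT = 80  # match the core count of the Azure IR running the flow

    # Hypothetical "Source" partitioning choices (Optimize tab on the SQL DB source).
    source_partition_settings = {
        "partitionOption": "Source",            # use source partitioning
        "numberOfPartitions": IR_CORE_COUNT,    # one partition per worker core
        "partitionColumn": "CustomerId",        # hypothetical high-cardinality column
    }

    # Hypothetical input query: read only the columns the flow needs and push the
    # row filter down to Azure SQL DB instead of filtering inside the data flow.
    input_query = """
    SELECT CustomerId, Email, Phone, TaxId, LoadDate
    FROM dbo.Customers
    WHERE LoadDate >= DATEADD(day, -1, GETDATE())
    """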

Synapse SQL DW Source / Sink

[Image: perf5.png]

  • Synapse DW Source & Sink
    • Always use "Enable Staging" and increase core count to minimize data processing times
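
Staging for Synapse (SQL DW) is switched on from the Execute Data Flow activity, pointing at a Blob/ADLS location the service can use for bulk loading. The sketch below is a plain-dict approximation of that activity's typeProperties; the activity, data flow, linked-service, and folder names are hypothetical, and the exact property layout should be confirmed against pipeline JSON exported from your own factory.

    # Illustrative Execute Data Flow activity settings (plain dict, not SDK code)
    # with staging enabled for a Synapse SQL DW source/sink.
    execute_data_flow_activity = {
        "name": "RunSynapseFlow",                       # hypothetical activity name
        "type": "ExecuteDataFlow",
        "typeProperties": {
            "dataFlow": {
                "referenceName": "MaskColumnsFlow",     # hypothetical data flow name
                "type": "DataFlowReference",
            },
            "staging": {                                # the "Enable Staging" toggle
                "linkedService": {
                    "referenceName": "StagingBlobStore",  # hypothetical Blob/ADLS linked service
                    "type": "LinkedServiceReference",
                },
                "folderPath": "staging/dataflows",      # hypothetical container/path
            },
            "compute": {
                "computeType": "MemoryOptimized",
                "coreCount": 80,                        # raise this to shorten processing time
            },
        },
    }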

Cosmos DB Source / Sink

[Image: perf4.png]

Make use of the "Throughput" option on the Cosmos DB source and sink to increase the request-unit (RU) throughput available to the ETL job, as sketched below.
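
As a rough illustration, the dicts below stand in for the Cosmos DB source and sink options in the data flow UI; the throughput value is the RU/s budget the job may consume, and the property and dataset names here are assumptions rather than the exact ADF schema.

    # Illustrative Cosmos DB source/sink options (plain dicts, not actual ADF
    # payloads). The "throughput" values are the RU/s budget granted to the job;
    # property names are assumptions -- check the data flow script for the real ones.
    cosmos_source_options = {
        "dataset": "CosmosOrders",        # hypothetical Cosmos DB dataset
        "throughput": 10000,              # RU/s available while reading
    }

    cosmos_sink_options = {
        "dataset": "CosmosOrdersMasked",  # hypothetical target container dataset
        "throughput": 10000,              # RU/s available while writing
    }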
