Azure Data Factory Data Flows perform data transformation ETL at cloud scale. This blog post looks at the performance of different source and sink types. The findings below are based on performance tests of several source and sink pairs:
Scenario 1
- Source: Delimited Text Blob Store
- Sink: Azure SQL DB
- File size: 421 MB, 74 columns, 887k rows
- Transforms: Single derived column to mask 3 fields
- Time: 4 mins end-to-end using memory-optimized 80-core debug Azure IR
- Recommended settings: Current partitioning used throughout
Scenario 2
- Source: Azure SQL DB Table
- Sink: Azure SQL DB Table
- Table size: 74 columns, 887k rows
- Transforms: Single derived column to mask 3 fields
- Time: 3 mins end-to-end using memory-optimized 80-core debug Azure IR
- Recommended settings: Source partitioning on SQL DB Source, current partitioning on Derived Column and Sink
Scenario 3
- Source: Delimited Text Blob Store
- Sink: Delimited Text Blob store
- File size: 74 columns, 887k rows
- Transforms: Single derived column to mask 3 fields
- Time: 2 mins end-to-end using memory-optimized 80-core debug Azure IR
- Recommended settings: Leaving default/current partitioning throughout allows ADF to scale partitions up or down based on the size of the Azure IR (i.e., the number of worker cores). A sketch of creating such an IR follows below.
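All three scenarios ran on a memory-optimized, 80-core debug Azure IR. As a rough illustration (not the post's own setup script), here is how such an integration runtime could be created with the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, and runtime names are placeholders.

```python
# Sketch: create a memory-optimized, 80-core Azure IR for data flows
# using the azure-mgmt-datafactory SDK. All resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties,
    IntegrationRuntimeDataFlowProperties,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

ir = ManagedIntegrationRuntime(
    compute_properties=IntegrationRuntimeComputeProperties(
        data_flow_properties=IntegrationRuntimeDataFlowProperties(
            compute_type="MemoryOptimized",  # matches the IR used in the tests above
            core_count=80,
        )
    )
)

client.integration_runtimes.create_or_update(
    "<resource-group>", "<factory-name>", "DataFlowsMemOpt80",
    IntegrationRuntimeResource(properties=ir),
)
```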
File-based Source / Sink
- Set "current partitioning" on source & sink to allow data flows to leverage native Spark partitioning. This will allow the performance to scale proportionally with an increase in core counts.
- Pre- and post-processing operations such as "save as single file", "clear folder", and "delete files" will add time to your ETL process.
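To make the file-based guidance concrete, here is a hedged sketch of a minimal delimited-text-to-delimited-text data flow registered with the azure-mgmt-datafactory Python SDK. The dataset names, column names, and masking expression are placeholders, and the script is a hand-written approximation of what the ADF UI generates; because no partitioning option is set on any transformation, the flow runs with current partitioning.

```python
# Sketch: a minimal CSV-to-CSV mapping data flow with a masking derived
# column, registered via the azure-mgmt-datafactory SDK. Dataset and column
# names are placeholders; with no partitioning set anywhere, the flow uses
# current (native Spark) partitioning.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DataFlowResource,
    MappingDataFlow,
    DataFlowSource,
    DataFlowSink,
    Transformation,
    DatasetReference,
)

# Hand-written approximation of an ADF data flow script.
script = """
source(output(id as string, ssn as string, email as string, phone as string),
    allowSchemaDrift: true,
    validateSchema: false) ~> Source1
Source1 derive(ssn = sha2(256, ssn),
    email = sha2(256, email),
    phone = sha2(256, phone)) ~> MaskFields
MaskFields sink(allowSchemaDrift: true,
    validateSchema: false) ~> Sink1
"""

flow = MappingDataFlow(
    sources=[DataFlowSource(name="Source1",
                            dataset=DatasetReference(type="DatasetReference",
                                                     reference_name="SourceCsv"))],
    sinks=[DataFlowSink(name="Sink1",
                        dataset=DatasetReference(type="DatasetReference",
                                                 reference_name="SinkCsv"))],
    transformations=[Transformation(name="MaskFields")],
    script=script,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
client.data_flows.create_or_update(
    "<resource-group>", "<factory-name>", "MaskCsvToCsv",
    DataFlowResource(properties=flow),
)
```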
Azure SQL DB Source / Sink
SQL DB Source
- Use "Source" partitioning under Optimize and set the number of partitions equal to the number of cores you are using. Use a high-cardinality column or set of columns as the partition column.
- Use "Input query" to minimize the data, columns, and for pushdown functions.
SQL DB Sink
- Make sure you are using a SQL DB tier large enough for your ETL job, so the database has sufficient resources to absorb the writes (a tier-scaling sketch follows this list).
- Adding cores to your job scales performance proportionally, but you will always be throttled by the database's ability to serialize the incoming data.
- Use current partitioning.
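One way to guarantee the sink database has enough resources is to scale it up for the duration of the job. Here is a sketch using pyodbc and standard T-SQL; the server, credentials, database name, and the 'P6' tier are placeholders.

```python
# Sketch: temporarily scale up the sink Azure SQL DB before a heavy ETL run.
# Server, credentials, database name, and the 'P6' tier are placeholders.
import pyodbc

# ALTER DATABASE cannot run inside a transaction, hence autocommit=True,
# and the service objective change is issued from the master database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=master;"
    "UID=<user>;PWD=<password>",
    autocommit=True,
)
# The statement returns immediately; the scale operation completes
# asynchronously on the service side.
conn.execute("ALTER DATABASE [<etl-target-db>] MODIFY (SERVICE_OBJECTIVE = 'P6');")
conn.close()
```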
Synapse SQL DW
Synapse DW Source & Sink
- Always use "Enable staging" and increase the core count to minimize data processing times (see the sketch below).
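In a pipeline, "Enable staging" corresponds to the staging settings on the Data Flow activity, which point to a Blob or ADLS linked service used for PolyBase-style loads. A rough sketch with azure-mgmt-datafactory models; all names are placeholders.

```python
# Sketch: a Data Flow activity with staging enabled for a Synapse source or
# sink, plus an explicit compute size. All names are placeholders.
from azure.mgmt.datafactory.models import (
    ExecuteDataFlowActivity,
    ExecuteDataFlowActivityTypePropertiesCompute,
    DataFlowReference,
    DataFlowStagingInfo,
    LinkedServiceReference,
)

activity = ExecuteDataFlowActivity(
    name="LoadSynapseDW",
    data_flow=DataFlowReference(type="DataFlowReference",
                                reference_name="MaskAndLoad"),
    # "Enable staging": loads are staged through Blob/ADLS storage.
    staging=DataFlowStagingInfo(
        linked_service=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="StagingBlobStore"
        ),
        folder_path="staging/dataflows",
    ),
    # More cores shorten processing time; scale this with your data volume.
    compute=ExecuteDataFlowActivityTypePropertiesCompute(
        compute_type="MemoryOptimized", core_count=80
    ),
)
```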
Cosmos DB Source / Sink
- Use the "Throughput" option on the Cosmos DB source and sink to increase the RU/s available to the ETL job; a sketch of the equivalent manual scaling follows.
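The "Throughput" option temporarily raises the RU/s applied to the collection while the data flow runs. For containers provisioned with dedicated throughput, the same effect can be applied manually with the azure-cosmos SDK; the endpoint, key, names, and RU values below are placeholders.

```python
# Sketch: manually raise Cosmos DB throughput around an ETL run, mirroring
# what the data flow "Throughput" option does for the duration of the job.
# Endpoint, key, names, and RU values are placeholders; replace_throughput
# only works on containers provisioned with dedicated throughput.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
container = client.get_database_client("<db>").get_container_client("<container>")

container.replace_throughput(40000)   # scale up before the ETL job
# ... run the data flow ...
container.replace_throughput(4000)    # scale back down afterwards
```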