Azure Data Factory Patterns and Features for the Azure Well-Architected Framework

The Azure Well-Architected Framework (WAF) helps ensure that Azure workloads are reliable, stable, and secure while meeting SLAs for performance and cost. Its five tenets are Cost Optimization, Reliability, Operational Excellence, Performance Efficiency, and Security.

In this blog post, we’ll focus on features and patterns to implement within Azure Data Factory (ADF) that align with the Azure Well-Architected Framework for data workloads.

 


 

 

The following table summarizes features and patterns you can implement within ADF that help support the WAF tenets:

 

 

| Feature or pattern | Cost Optimization | Reliability | Operational Excellence | Performance Efficiency | Security |
|---|---|---|---|---|---|
| Pipeline quality, resiliency, and error handling | X | X | X | | |
| Metadata-driven pipelines | | X | X | | |
| Triggers | | X | X | | |
| Pipeline and data flow performance | X | X | X | X | |
| Performance SLAs – cost vs. performance | X | | X | X | |
| Right-sizing Azure IRs | X | X | | X | |
| Right-sizing self-hosted integration runtimes | X | X | X | X | |
| Optimize compute sources leveraged by ADF | X | X | X | X | |
| Leverage Git integration with Data Factory | | X | X | | |
| Managed VNet, private endpoints, and Private Link | | | X | | X |
| Permissions | | | X | | X |
| Use managed identity for Azure resource access | | | X | | X |
| Store credentials and other values in Azure Key Vault | | | X | | X |
| Training on ADF | X | X | X | X | X |

 

Pipeline quality, resiliency, and error handling

Pipeline quality, resiliency, and error handling ensure that activities and pipelines succeed and that failures are handled appropriately. ADF is billed by usage; pipelines and pipeline activities incur costs even when a failure occurs and the expected results are not delivered. Failures also delay delivery of the data your organization depends on. Design pipelines to minimize failures, monitor for errors and potential issues, and handle failures so that appropriate action is taken. High-quality, resilient pipelines support WAF Cost Optimization, Reliability, and Operational Excellence. Below are some ADF features that can help:


  • Use Validation Activity to ensure a pipeline only continues execution if a reference dataset exists, meets specific criteria, or a timeout has been reached.
  • Use Activity Dependencies to redirect pipeline activity failures so you can handle errors, send notifications, wait, or stop and fail the pipeline (see the sketch below).
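As an illustration, here is a minimal sketch of what such an on-failure dependency might look like in a pipeline's JSON (the ADF code view), written as a Python dict so it stays runnable; the pipeline name, activity names, and webhook URL are hypothetical.

```python
import json

# Hypothetical pipeline: a Copy activity followed by a notification activity
# that runs only when the copy fails ("Failed" dependency condition).
pipeline = {
    "name": "pl_ingest_sales",  # hypothetical name
    "properties": {
        "activities": [
            {
                "name": "CopySalesData",
                "type": "Copy",
                "typeProperties": {},  # source/sink settings omitted for brevity
            },
            {
                "name": "NotifyOnFailure",
                "type": "WebActivity",
                "dependsOn": [
                    {"activity": "CopySalesData", "dependencyConditions": ["Failed"]}
                ],
                "typeProperties": {
                    "url": "https://example.com/alert-webhook",  # hypothetical endpoint
                    "method": "POST",
                    "body": {"message": "CopySalesData failed"},
                },
            },
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

The same dependency conditions (Succeeded, Failed, Skipped, Completed) can be combined to branch a pipeline into success and failure paths instead of letting an error silently stop the run.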

 

Metadata-driven pipelines

Metadata-driven pipelines leverage dynamic expressions and parameters in ADF, making your Data Factory artifacts reusable and reducing the need to create new activities when a new data source needs to be loaded or transformed. This allows you to add new entities to your ingestion workflow without making changes to your Data Factory. Metadata-driven pipelines support Cost Optimization by reducing development time, and Reliability and Operational Excellence by following a proven pattern with less code to maintain and fewer errors.
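A minimal sketch of this pattern is shown below, assuming a Lookup activity that returns the list of tables to load from a control table and a parameterized source dataset; all pipeline, dataset, and metadata column names are hypothetical.

```python
import json

# Hypothetical metadata-driven load: the Lookup returns rows such as
# {"schemaName": "dbo", "tableName": "Sales"}; the ForEach copies each one.
pipeline = {
    "name": "pl_metadata_driven_load",
    "properties": {
        "activities": [
            {
                "name": "GetTableList",
                "type": "Lookup",
                "typeProperties": {"firstRowOnly": False},  # control-table dataset omitted
            },
            {
                "name": "ForEachTable",
                "type": "ForEach",
                "dependsOn": [{"activity": "GetTableList", "dependencyConditions": ["Succeeded"]}],
                "typeProperties": {
                    # Iterate over the rows returned by the Lookup
                    "items": {"value": "@activity('GetTableList').output.value", "type": "Expression"},
                    "activities": [
                        {
                            "name": "CopyOneTable",
                            "type": "Copy",
                            "inputs": [{
                                "referenceName": "ds_source_parameterized",  # hypothetical dataset
                                "type": "DatasetReference",
                                "parameters": {
                                    # Dataset parameters bound to the current metadata row
                                    "schemaName": "@item().schemaName",
                                    "tableName": "@item().tableName",
                                },
                            }],
                            "typeProperties": {},  # sink dataset and copy settings omitted
                        }
                    ],
                },
            },
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

Adding a new table then becomes a new row in the control table rather than a new pipeline or activity.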

Triggers

Triggers execute your pipelines according to schedules or events. Schedules can be wall-clock based (schedule triggers) or tumbling windows. Events include the addition or deletion of an Azure Blob Storage file or events sent via Azure Event Grid, triggering your data pipelines when the event occurs. Configure triggers according to your source data processes and data latency needs. Triggers support WAF Reliability and Operational Excellence by getting data to your end users or processes in a timely and efficient manner.
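For illustration, here are sketches of a schedule trigger and a blob-created event trigger as they might appear in the ADF code view; the trigger names, recurrence, pipeline reference, storage account scope, and blob path are all placeholder values.

```python
import json

# Hypothetical hourly schedule trigger
schedule_trigger = {
    "name": "tr_hourly",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Hour",
                "interval": 1,
                "startTime": "2024-01-01T00:00:00Z",
                "timeZone": "UTC",
            }
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "pl_ingest_sales", "type": "PipelineReference"}}
        ],
    },
}

# Hypothetical storage event trigger: fires when a new blob lands in the sales folder
blob_event_trigger = {
    "name": "tr_on_new_sales_file",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>",
            "blobPathBeginsWith": "/landing/blobs/sales/",
            "events": ["Microsoft.Storage.BlobCreated"],
        },
        "pipelines": [
            {"pipelineReference": {"referenceName": "pl_ingest_sales", "type": "PipelineReference"}}
        ],
    },
}

print(json.dumps([schedule_trigger, blob_event_trigger], indent=2))
```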

 

Performance SLAs – cost vs performance

Consider business requirements when it comes to data latency versus ingestion and transformation performance. What are the dependencies among ingested data entities? How quickly must data be available in its final form? Do your end users need real-time or near real-time data, or just scheduled updates? What are the performance expectations when querying data? Data infrastructure may need to be scaled up or out to meet critical business needs. Keep this in mind as you read through the rest of this article on building WAF into ADF.

 

Pipeline activities for performance

The Copy data activity is one of the primary pipeline activities. Improve performance by moving raw data to Azure before transforming it, or by leveraging a staged copy within your Copy data activity when source data is compressed, when ingesting data into Azure Synapse Analytics via PolyBase, when copying data to or from Snowflake, or when ingesting data from Amazon Redshift or HDFS.
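For example, a staged copy for a PolyBase load into Azure Synapse Analytics might be configured roughly as sketched below; the source and sink types are one possible combination, and the staging linked service and path are hypothetical.

```python
import json

# Hypothetical Copy activity: Parquet source loaded into Azure Synapse Analytics
# via PolyBase, using Blob storage as an interim staging area.
copy_activity = {
    "name": "CopyToSynapseStaged",
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "ParquetSource"},
        "sink": {"type": "SqlDWSink", "allowPolyBase": True},
        # Staged copy: data is first written to the staging store, then loaded into the sink
        "enableStaging": True,
        "stagingSettings": {
            "linkedServiceName": {"referenceName": "ls_staging_blob", "type": "LinkedServiceReference"},
            "path": "staging-container/adf",
        },
    },
}

print(json.dumps(copy_activity, indent=2))
```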


Also ensure your data sources and/or sinks are optimized for performance, depending on the type of data. For example, source data file size, partitioning, and folder structure can improve ingestion performance according to the data or file type. Pipeline activity performance supports WAF Performance Efficiency and Cost Optimization.

 

Dataflow transformations for performance

The Microsoft ADF team has published a series of articles on mapping data flow performance. Start with the overview, Mapping data flow performance and tuning guide, and continue on to optimizing sources, optimizing sinks, optimizing transformations, and optimizing pipeline performance of your data flows. Optimizing data flows supports WAF Performance Efficiency and Cost Optimization.

 

Right-sizing Azure IRs and TTL

Azure integration runtimes (Azure IRs) utilized in pipeline activities (such as Copy data) scale automatically. Integration runtimes for data flow activities can be configured to use a set number of cores or memory-optimized clusters in the settings of the Data flow pipeline activity.


Time to live (TTL) keeps Spark clusters alive to run subsequent pipelines, eliminating the time needed to spin up a new Spark cluster for the next data flow run.
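As a rough illustration, a custom Azure IR for data flows with a fixed core count, compute type, and TTL might look like the sketch below; the IR name and values are illustrative, not recommendations.

```python
import json

# Hypothetical Azure IR sized for data flows, kept warm for 10 minutes between runs
azure_ir = {
    "name": "ir-dataflows-general-8",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "General",  # or "MemoryOptimized"
                    "coreCount": 8,
                    "timeToLive": 10,          # minutes the Spark cluster stays alive for reuse
                },
            }
        },
    },
}

print(json.dumps(azure_ir, indent=2))
```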

 

Ensuring Azure IRs are right-sized for your data flow activities and leveraging TTL support WAF Cost Optimization, Reliability, and Performance Efficiency.

 

Right-sizing Self-Hosted Integration Runtimes

Self-hosted integration runtimes (SHIRs) are used to move data to or from data sources that are on-premises or in an Azure VNet. Ensure that SHIRs are sized properly: they should meet your performance requirements without being underutilized. SHIRs may need to be scaled up, scaled out, or downsized over time.

 

Ensuring self-hosted IRs are scaled appropriately for your copy data activities supports WAF Cost Optimization, Reliability, and Performance Efficiency.

 

Optimize compute sources leveraged by ADF

Data Factory includes pipeline activities that process data on compute environments other than the Azure IR or SHIR. These are:

| Compute environment | Activities |
|---|---|
| On-demand HDInsight cluster or your own HDInsight cluster | Hive, Pig, Spark, MapReduce, Hadoop Streaming |
| Azure Batch | Custom |
| ML Studio (classic) | ML Studio (classic) activities: Batch Execution and Update Resource |
| Azure Machine Learning | Azure Machine Learning Execute Pipeline |
| Azure Data Lake Analytics | Data Lake Analytics U-SQL |
| Azure SQL, Azure Synapse Analytics, SQL Server | Stored Procedure |
| Azure Databricks | Notebook, Jar, Python |
| Azure Function | Azure Function activity |

Make sure these compute environments are right-sized to meet your data latency needs, and consider dynamic scaling where applicable. Right-sizing these compute environments supports WAF Cost Optimization, Reliability, and Performance Efficiency.
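As one illustration, the sketch below shows a Databricks Notebook activity together with the linked service settings that define cluster size and autoscaling for a job cluster; the workspace URL, node type, worker counts, and notebook path are hypothetical, and authentication settings are omitted.

```python
import json

# Hypothetical Azure Databricks linked service with an autoscaling job cluster
databricks_linked_service = {
    "name": "ls_databricks_jobcluster",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://adb-1234567890123456.7.azuredatabricks.net",  # hypothetical workspace
            "newClusterVersion": "13.3.x-scala2.12",
            "newClusterNodeType": "Standard_DS3_v2",
            "newClusterNumOfWorker": "2:8",  # assumed min:max autoscale range
        },
    },
}

# Hypothetical Notebook activity that runs on the cluster defined above
notebook_activity = {
    "name": "TransformWithNotebook",
    "type": "DatabricksNotebook",
    "linkedServiceName": {"referenceName": "ls_databricks_jobcluster", "type": "LinkedServiceReference"},
    "typeProperties": {"notebookPath": "/Shared/transform_sales"},
}

print(json.dumps([databricks_linked_service, notebook_activity], indent=2))
```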

 

Leverage Git Integration with Data Factory

Leverage Azure Data Factory Git integration with Azure DevOps or GitHub for source control. Git integration offers a superior authoring experience, allowing you to save your changes as you build your pipelines and data flows rather than developing directly against the ADF service. Git integration also enables source code versioning, continuous integration and delivery (CI/CD), and automated publishing when deploying code changes to other environments. Git integration in ADF supports WAF Reliability and Operational Excellence.
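A rough sketch of the Git configuration as it appears on the factory resource (Azure DevOps in this case) is shown below; the account, project, repository, and branch names are hypothetical.

```python
import json

# Hypothetical factory resource with an Azure DevOps (VSTS) Git configuration
factory = {
    "name": "adf-analytics-dev",
    "properties": {
        "repoConfiguration": {
            "type": "FactoryVSTSConfiguration",  # use FactoryGitHubConfiguration for GitHub
            "accountName": "contoso",
            "projectName": "data-platform",
            "repositoryName": "adf-pipelines",
            "collaborationBranch": "main",
            "rootFolder": "/",
        }
    },
}

print(json.dumps(factory, indent=2))
```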

 

Managed VNet, Private Endpoints and Private Link

Managed VNet deploys Azure integration runtimes into a virtual network that is managed by Microsoft. Data sources and sinks are then connected through private endpoints, keeping your data isolated and secure. Azure Private Link for Data Factory provides secure communication between your Data Factory and your VNet. These features ensure that all traffic runs on Microsoft’s private backbone network rather than over the public internet. Managed VNet, private endpoints, and Private Link support WAF Operational Excellence and Security.
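As a rough sketch (resource IDs and names are hypothetical), an Azure IR associated with the managed virtual network and a managed private endpoint to a storage account might look like this:

```python
import json

# Hypothetical Azure IR placed in the factory's managed virtual network
managed_vnet_ir = {
    "name": "ir-managed-vnet",
    "properties": {
        "type": "Managed",
        "managedVirtualNetwork": {"referenceName": "default", "type": "ManagedVirtualNetworkReference"},
        "typeProperties": {"computeProperties": {"location": "AutoResolve"}},
    },
}

# Hypothetical managed private endpoint to a storage account's blob service
managed_private_endpoint = {
    "name": "mpe-datalake-blob",
    "properties": {
        "privateLinkResourceId": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>",
        "groupId": "blob",
    },
}

print(json.dumps([managed_vnet_ir, managed_private_endpoint], indent=2))
```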

 

Permissions

Set permissions for your Data Factory following the principle of least privilege, contributing to WAF Operational Excellence and Security.

 

Use Managed Identity for Azure Resource access

Managed identities give services like ADF access to other Azure resources that support Azure Active Directory authentication, eliminating the need to manage credentials. For example, you can grant your Azure Data Factory access to read secrets from an Azure Key Vault. Managed identities support WAF Operational Excellence and Security.
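For example, an Azure Key Vault linked service definition contains only the vault URL, because ADF authenticates with its managed identity rather than a stored credential; the vault name below is hypothetical, and the identity still needs permission to read secrets from the vault.

```python
import json

# Hypothetical Key Vault linked service; no credential appears in the definition
# because ADF's managed identity is used to authenticate to the vault.
key_vault_linked_service = {
    "name": "ls_keyvault",
    "properties": {
        "type": "AzureKeyVault",
        "typeProperties": {"baseUrl": "https://contoso-adf-kv.vault.azure.net/"},
    },
}

print(json.dumps(key_vault_linked_service, indent=2))
```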

 

Store credentials and other values in Azure Key Vault

When managed identities are not an option for your data stores, store credentials as secrets in Azure Key Vault. Secrets are encrypted, and their values are not exposed in ADF. You can also create secrets to store other critical information needed by your Data Factory activities and access them as needed in your pipelines.

 

Secret names can be parameterized as well in ADF, allowing you to change the secret name to be accessed for different ADF environments.
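Below is a hedged sketch of this pattern, assuming an Azure SQL Database linked service whose password comes from Key Vault and whose secret name is supplied as a parameter; all names and the connection string are hypothetical.

```python
import json

# Hypothetical linked service: connection string without the password,
# password pulled from Key Vault using a parameterized secret name.
sql_linked_service = {
    "name": "ls_azuresql",
    "properties": {
        "parameters": {"secretName": {"type": "String"}},
        "type": "AzureSqlDatabase",
        "typeProperties": {
            "connectionString": "Server=tcp:contoso.database.windows.net;Database=sales;User ID=adfuser;",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {"referenceName": "ls_keyvault", "type": "LinkedServiceReference"},
                "secretName": "@linkedService().secretName",  # resolved per environment
            },
        },
    },
}

print(json.dumps(sql_linked_service, indent=2))
```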


 

Storing credentials and other values in Key Vault supports WAF Operational Excellence as well as Security, since passwords, access keys, or other values can be changed without having to make changes in your ADF environment.

 

Training

Well-trained data engineers know how to build and deploy cost-optimized pipelines that are performant, reliable, and secure and that meet goals for operational excellence. Microsoft offers free resources for ADF training, such as the Azure Data Factory modules and learning paths on Microsoft Learn.

 

The Well-Architected Framework provides guidelines to help Azure customers build secure, performant, reliable, and cost-effective workloads.

 

Applying the Azure WAF to your ADF workloads is critical and should be considered during initial architecture design and resource deployment. But how do you ensure that your ADF environment still meets WAF as workloads grow and evolve? Read the next article in this series, Monitoring Data Factory for the Azure Well-Architected Framework.

 

Azure Data Factory is an evolving tool with new features being added every month and new patterns being ideated as well. Do you have features or patterns that you love and have helped your workloads become more robust or efficient? We'd love to hear about them! Please post in the comments!

 
