ForEach Loop to CopyData and Capture Metadata in ControlTable – Part 1 (Environment Setup)


This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

Azure Data Factory is Azure's cloud ETL service for scale-out serverless data integration and data transformation. Customers often leverage Data Factory for bulk-load or incremental-load work. The two activities most commonly used for these purposes are the Copy activity, which copies data between data stores located on-premises and in the cloud, and the ForEach activity, which defines a repeating control flow in your pipeline: it iterates over a collection and executes the specified activities in a loop, much like the foreach looping structure in programming languages. One of the most common and frequent scenarios is copying data from multiple folders and mapping it to multiple databases, schemas, and tables.

[Image: solution overview]

This blog is inspired by the blog "Build metadata driven pipelines in Azure Data Factory" published by our FastTrack engineer Olga Molocenco. In this series we will also capture the Copy data activity's metadata in the same control table. Developers often need this to debug errors during execution, instead of tracing back through the activities in the GUI to validate what went wrong.

[Screenshot: execution errors]

Note: The solution captures only Copy data activity errors.
It will not capture metadata if the initial Lookup activity itself fails.

About this solution template

This lab was built to address the following scenario:

Copy data recursively from an ADLS Gen2 account containing multiple folders to multiple databases within an Azure SQL server. The mapping is retrieved via a Lookup activity from a metadata control table residing within the same Azure SQL server. The Copy data activity is executed within a ForEach loop, and at the end of each iteration a Script activity updates the metadata in the same control table.

Note: You may be tempted to recreate the resources using your own naming conventions and data types.
Please do not do so when setting up the lab for the very first time.
Doing so may result in unexpected formatting or execution errors, which may also cause validation to fail.
The same applies to the naming conventions used for the pipeline and its sub-activities.
Once you understand the lab's components and their inter-dependencies thoroughly, feel free to modify it for your own purposes.

 

Environment Setup

1. ADLS Gen2 account (I named my account “incrementalsource”)

Create a storage account that supports hierarchical namespace (you may choose your own name) with the following sub-folders (use the same naming convention for sub-folders and files). This will serve as the source from which data will be copied.

[Screenshot: storage account sub-folders]

Create a directory within each folder with the same name. For example:

[Screenshot: directory with the same name inside each folder]

The text files will be organized as follows within each of these containers

[Screenshot: text file layout within each container]

Download the text files from here

 

2. Azure SQL server (I named my server “incrementalsql”)

Create an Azure SQL server (you may choose your own name) that will contain the HR, finance, and ITops databases along with an incrementalmetadata database (use the same naming convention for databases).

[Screenshot: databases on the Azure SQL server]

Any basic or serverless tier database with minimum storage will do for this lab.

Use the CREATE SCHEMA script to create the hradmin and general schemas in the HR database and the inventory schema in the ITops database (use the same schema names).

Syntax : 

CREATE SCHEMA [schema_name]

Download the create table scripts from here. Ensure you run them against the right databases.
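Given the schema names above, the CREATE SCHEMA script boils down to statements like the following sketch. Run each batch against the database noted in the comment:

```sql
-- Run against the HR database:
CREATE SCHEMA [hradmin];
GO
CREATE SCHEMA [general];
GO

-- Run against the ITops database:
CREATE SCHEMA [inventory];
GO
```

The finance database needs no extra schema here; its tables can live in the default dbo schema.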

 

3. Inserting Metadata Table rows

By default, the Copy activity maps source data to sink by column names in a case-sensitive manner. If the sink doesn't exist, for example when writing to files, the source field names are persisted as sink names. If the sink already exists, it must contain all columns being copied from the source. This default mapping supports flexible schemas and schema drift from source to sink from execution to execution: all the data returned by the source data store can be copied to the sink.

 

As you may have noticed, some column names in the text files (SourceFiles) are different from those defined in the corresponding target tables (TargetTables). This was done on purpose: your source location may contain columns that are not declared with the same name in the target tables. With a static source and a static sink, this mapping is easy to configure via the GUI. But here we are running a loop where the source and sink keep changing, as do the column names and data types. We have therefore created parameterized JSON scripts, stored in the metadata table under the TargetJson column, which take care of the mapping when the pipeline is executed. The Copy activity uses them to perform schema and data type mapping from source data to sink data.
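For illustration, a value stored in the TargetJson column follows Data Factory's TabularTranslator format for explicit mapping. The column names below are hypothetical placeholders, not the lab's actual columns:

```json
{
  "type": "TabularTranslator",
  "mappings": [
    {
      "source": { "name": "emp_name", "type": "String" },
      "sink": { "name": "EmployeeName", "type": "String" }
    },
    {
      "source": { "name": "dept", "type": "String" },
      "sink": { "name": "Department", "type": "String" }
    }
  ]
}
```

Each entry pairs one source column with the sink column (and data type) it should land in, which is exactly what the GUI generates when you map a static source to a static sink.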

How did we create these custom scripts?

You can essentially create the content by mapping a single static source to a static sink. When you select the Import schemas option, it populates the columns and data types for you, as shown below.

[Screenshot: Import schemas populating the column mappings]

Once this is done, click the view source code button.

[Screenshot: view source code button]

The entire section that appears as the value of the translator property, highlighted below, is the content used to create these customized scripts for the lab.

[Screenshot: translator section highlighted in the source code]

The resultant file may look something like this

[Screenshot: resultant JSON file]

The JSON files that were used as values for the TargetJson column are available here.

We will now insert data into the MetadataControlTable using this script.
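As a hedged sketch of what that script does (the actual column names are defined by the downloadable script; those below are assumptions), each row pairs a source location with its target database, schema, and table, plus the mapping JSON:

```sql
-- Hypothetical sketch; the real script defines the actual columns.
INSERT INTO dbo.MetadataControlTable
    (SourceFolder, TargetDatabase, TargetSchema, TargetTable, TargetJson)
VALUES
    ('hr', 'HR', 'hradmin', 'Employees',
     '{"type":"TabularTranslator","mappings":[...]}');
```

The Lookup activity reads these rows, and the ForEach loop feeds each row's values into the Copy data activity's source, sink, and translator settings.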

 

Our environment is now ready for setting up Azure Data Factory, which will be covered in the next article. Finally, in the last article of this series, we will show how to create the pipeline with the activities running in loops.

 

References:

Don't forget to share a like if this helps.

Credit: Thanks to Gary Lee, Nathan Widdup, and Paul Burpo for reviews and guidance.

FastTrack for Azure:  Move to Azure efficiently with customized guidance from Azure engineering. 

FastTrack for Azure – Benefits and FAQ | Microsoft Azure 
