Part 4 – Unlock the Power of Azure Data Factory: A Guide to Boosting Your Data Ingestion Process



Background 

This post is the next in the series Unlock the Power of Azure Data Factory: A Guide to Boosting Your Data Ingestion Process. It also overlaps with, and is included in, the series on YAML Pipelines. All code snippets and final templates can be found in my GitHub repository TheYAMLPipelineOne. For the actual Data Factory, we will leverage my adf_pipelines_yaml_ci_cd repository. 

 

Introduction 

After reading parts 1-3 of Unlock the Power of Azure Data Factory, one may be left wondering how to take what was provided and convert it to enterprise scale. Terminology and expectations are key, so let's outline what we would like to see from an enterprise-scale deployment: 

  • Write once reuse across projects. 
  • Individual components can be reused. 
  • Limited manual intervention. 
  • Easily updated. 
  • Centralized definition. 

Depending on where your organization is in its pipeline and DevOps maturity, this may sound daunting. Have no fear: we will walk through how to achieve this with YAML templates for Azure Data Factory. By the end of this piece, you should be well equipped to create a working pipeline for Azure Data Factory in a matter of minutes. 

 

Setup 

To support the enterprise-scale goals outlined above, I recommend keeping your YAML templates in a separate repository that resides outside of your Data Factory repository. This checks the boxes for a centralized definition, write once and reuse across projects, easy updates, and reusable individual components. For more context, check out my post on Azure DevOps Pipelines: Practices for Scaling Templates.

 

Each of our Data Factories will have a dedicated CI/CD pipeline that references the separate repository holding the YAML templates. This can be achieved natively in Azure DevOps via a repository resource (see the sketch below). Furthermore, it is not unheard of for larger organizations to have a "DevOps team" or a team responsible for pipeline deployments. If that is the case in your organization, you can think of this team as "owning" the centralized repository. 
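As a minimal sketch, the consuming pipeline declares the template repository as a resource and gives it an alias (here `templates`, matching the `@templates` references used later in this post). The project/repository name and branch shown are assumptions for illustration; adjust them to your environment:

resources:
  repositories:
    - repository: templates                  # alias used in template references, e.g. stages/adf_build_stage.yml@templates
      type: git                              # Azure Repos Git repository
      name: MyProject/TheYAMLPipelineOne     # assumed project/repo name
      ref: refs/heads/main                   # assumed branch of the template repository to use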

 

Templating Up  

For those who have read my posts on this topic, either on the Microsoft Health and Life Sciences blog or on my personal blog, this shouldn't be a new concept. For those unaware, "Templating Up" is the process of breaking the individual build/deploy steps down into tasks. 

 

This process gives us the individual tasks required for a build/deployment and shows us at a glance which tasks can be repeated. An additional benefit of breaking things down in this manner is that we are left with task templates that can be reused outside our given project. More on that in a minute. 

 

Recapping from the Part 3 post, here are the Microsoft Azure DevOps tasks, and their associated documentation, which the process will require: 

  1. Install Node - In order to leverage the Node Package Manager (npm), it makes sense to have Node installed on the agent. 
  2. Install npm - Now that Node is installed, it's time to install npm so we can execute our package. 
  3. Run npm package - This is where the "magic" happens and the creation of our ARM template occurs. The nuance here is that the package requires the resource ID of a Data Factory; if adhering to a true CI/CD lifecycle, this should be the deployed DEV instance, since that is what we promote to future environments. 
  4. Publish Pipeline Artifacts - This task takes the output from the npm package execution, as well as code in the repository, and creates a pipeline artifact. This is key, as this artifact is what we will use in our deployment stages. 

The steps required for deployment: 

  1. Stop Azure Data Factory Triggers – This is the PrePostDeploymentScript.ps1 PowerShell script, included in the published build artifact, which ensures our triggers are not executing during deployment. 
  2. Azure Resource Manager (ARM) Template Deployment – The ARM template published as part of the build process is now deployed to an environment. We need the option to supply a parameter file as well as override parameters if needed. 
  3. Start Azure Data Factory Triggers – After a successful deployment we want to start the Azure Data Factory triggers using the same script we used to stop them. 

Now for perhaps the most important step in templating up: identifying which steps are unique to Azure Data Factory. The answer, surprising as it might sound, is that none of them are. Installing Node, installing npm, executing npm, publishing pipeline artifacts, executing PowerShell, and running ARM deployments are all platform-agnostic operations. Let's keep that in mind as we start to template this out. 

 

Build Tasks

When creating reusable task templates, the goal is for each template to perform exactly one task. Even if that task is as simple as five lines of code, the template should contain just that one task, since the task is the lowest foundational block of a YAML pipeline. By scoping and limiting the template to a single task, we maximize the chance that the same task can be reused across multiple pipelines. 

node_install_task.yml 

 

 

parameters:
  - name: versionSpec
    type: string
    default: '16.x'

steps:
  - task: NodeTool@0
    inputs:
      versionSpec: ${{ parameters.versionSpec }}
    displayName: 'Installing Node Version ${{ parameters.versionSpec }}'

 

 

Defaulting the parameter to a recent version saves developers from having to define it, while still giving future implementations the ability to override it.
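As a quick illustration of this pattern, a calling pipeline can simply omit the parameters block to accept the default, or override it when a different version is required (the version value here is hypothetical):

steps:
  - template: ../tasks/node_install_task.yml
    parameters:
      versionSpec: '18.x'   # hypothetical override; omit the parameters block to accept the '16.x' default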

npm_install_task.yml 

 

 

parameters:
  - name: verbose
    type: boolean
    default: false
  - name: packageDir
    type: string
    default: '$(Build.Repository.LocalPath)'

steps:
  - task: Npm@1
    inputs:
      command: 'install'
      verbose: ${{ parameters.verbose }}
      workingDir: ${{ parameters.packageDir }}
    displayName: 'Install npm package'

 

 

One thing I try to do with these tasks is to expose all available inputs as parameters and set their defaults there. In this case we are doing it with the 'verbose' parameter. Again, this task can easily be reused by any pipeline that requires an npm install. 

npm_custom_task.yml 

 

 

parameters:
  - name: customCommand
    type: string
    default: ''
  - name: packageDir
    type: string
    default: ''
  - name: displayName
    type: string
    default: ''

steps:
  - task: Npm@1
    displayName: ${{ parameters.displayName }}
    inputs:
      command: 'custom'
      customCommand: ${{ parameters.customCommand }}
      workingDir: ${{ parameters.packageDir }}

 

 

 

This task requires the location of the package.json that would have been created as part of setting up your repository for Data Factory CI/CD. For a refresher, the package.json is shown below. 
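A minimal sketch, assuming the standard @microsoft/azure-data-factory-utilities setup documented by Microsoft; the version number is illustrative:

{
  "scripts": {
    "build": "node node_modules/@microsoft/azure-data-factory-utilities/lib/index"
  },
  "dependencies": {
    "@microsoft/azure-data-factory-utilities": "^1.0.0"
  }
}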

 

The astute reader will notice this feels very similar to the npm install task directly above; both leverage the Npm@1 task. The difference is that when the command is set to 'custom', the 'customCommand' input becomes required.

 

Thus, our task template requires a different set of inputs, and that delineation is what justifies a second task template rather than reusing the install template. 

ado_publish_pipeline_task.yml 

 

 

parameters:
  - name: targetPath
    type: string
    default: '$(Build.ArtifactStagingDirectory)'
  - name: artifactName
    type: string
    default: 'drop'

steps:
  - task: PublishPipelineArtifact@1
    displayName: 'Publish Pipeline Artifact ${{ parameters.artifactName }}'
    inputs:
      targetPath: ${{ parameters.targetPath }}
      artifact: ${{ parameters.artifactName }}

 

 

 

This task attaches the compiled artifact to the pipeline, letting later stages reuse it. It is fundamental for any artifact-based deployment in Azure DevOps. 

 

Task Summary  

Looking back on these tasks, I want to emphasize something: they are 100% agnostic of Data Factory. That means we've just created tasks that can be reused for anything from JavaScript builds that leverage npm to infrastructure deployments that need the publish pipeline artifact task. 

 

Build Job 

Since this is just a build process, we need a single job template that calls each of these tasks in order. Something to consider with Azure DevOps jobs is that, by default, jobs run in parallel, while the steps within a job run sequentially. This job acts as the orchestrator of these tasks, and to ensure optimal reusability we define the various inputs as parameters. 

adf_build_job.yml 

 

 

parameters:
  - name: packageDir
    type: string
    default: ''
  - name: dataFactoryResourceID
    type: string
    default: ''
  - name: regionAbrv
    type: string
    default: ''
  - name: serviceName
    type: string
    default: ''
  - name: environmentName
    type: string
    default: ''
  - name: adfDir
    type: string
    default: ''

jobs:
  - job: 'adf_${{ parameters.serviceName }}_${{ parameters.environmentName }}_${{ parameters.regionAbrv }}_build'
    steps:
      - template: ../tasks/node_install_task.yml
      - template: ../tasks/npm_install_task.yml
        parameters:
          packageDir: ${{ parameters.packageDir }}
      - template: ../tasks/npm_custom_command_task.yml
        parameters:
          packageDir: ${{ parameters.packageDir }}
          displayName: 'Run ADF NPM Utility'
          customCommand: 'run build export ${{ parameters.adfDir }} ${{ parameters.dataFactoryResourceID }}'
      - template: ../tasks/ado_publish_pipeline_task.yml
        parameters:
          targetPath: ${{ parameters.packageDir }}
          artifactName: 'ADFTemplates'

 

 

For this to work across Data Factories, the biggest pieces to parameterize are the working directory where the package.json is located and the resource ID of the Data Factory the utility will run against. Effectively, the Data Factory resource ID will belong to the Data Factory in your lowest environment. This ties back to the concept covered in Part 2, where we promote an application or package from the lowest environment through to production. Exposing these values as parameters is what lets this one job serve multiple Data Factories. 

 

Build Stage 

For building the artifacts we will use across environments, we require only one stage. This stage's purpose is, first, to ensure our Data Factory changes compile correctly and, second, to produce the reusable ARM templates, associated parameters, and necessary scripts to be leveraged across future deployment stages (Azure environments). This stage requires only one job: the build-and-publish job we outlined above. 

adf_build_stage.yml 

 

 

parameters:
  - name: serviceName
    type: string
    default: 'SampleApp'
  - name: packageDir
    type: string
    default: '$(Build.Repository.LocalPath)/adf_scripts'
  - name: adfDir
    type: string
    default: '$(Build.Repository.LocalPath)/adf'
  - name: baseEnv
    default: 'dev'
  - name: baseRegion
    default: 'eus'

stages:
  - stage: '${{ parameters.serviceName }}_build'
    variables:
      - template: ../variables/azure_global_variables.yml
      - template: ../variables/azure_${{ parameters.baseEnv }}_variables.yml
    jobs:
      - template: ../jobs/adf_build_job.yml
        parameters:
          environmentName: ${{ parameters.baseEnv }}
          dataFactoryResourceID: '/subscriptions/${{ variables.azureSubscriptionID }}/resourceGroups/${{ variables.resourceGroupAbrv }}-${{ parameters.serviceName }}-${{ parameters.baseEnv }}-${{ parameters.baseRegion }}/providers/Microsoft.DataFactory/factories/${{ variables.dataFactoryAbrv }}-${{ parameters.serviceName }}-${{ parameters.baseEnv }}-${{ parameters.baseRegion }}'
          serviceName: ${{ parameters.serviceName }}
          regionAbrv: ${{ parameters.baseRegion }}
          packageDir: ${{ parameters.packageDir }}
          adfDir: ${{ parameters.adfDir }}

 

 

 

This stage requires some arguments, which in this case are supplied as defaulted parameters. The reason a defaulted parameter is used instead of a variable is that it gives any calling pipeline the ability to override the value while keeping it optional. It is extremely helpful if all Data Factories follow the same folder pattern. In this case, 'adf' is the folder Data Factory is mapped to for source control and 'adf_scripts' is where the package.json lives, along with the parameter files for the various environments. 
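As a point of reference, under these assumptions the Data Factory repository would be laid out roughly like this (folder names follow the defaults above; the contents shown are illustrative):

adf_pipelines_yaml_ci_cd/
├── adf/                            # folder the Data Factory instance is mapped to for source control
│   ├── pipeline/
│   ├── dataset/
│   └── ...
├── adf_scripts/                    # home of package.json and the environment parameter files
│   ├── package.json
│   └── parameters/
│       ├── dev.eus.parameters.json
│       └── tst.eus.parameters.json
└── yaml/
    └── adf_pipelines_template.yml  # pipeline definition covered later in this post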

 

Parameter definitions: 

  • serviceName: Name used for the UI description and as the environment-agnostic name of the Data Factory. 
  • packageDir: Location of the package.json file required for the npm task. 
  • adfDir: The directory in the repo that Azure Data Factory is mapped to. 
  • baseEnv: The npm package requires a running Data Factory instance; this specifies which environment's instance to use. 
  • baseRegion: The npm package requires a running Data Factory instance; this specifies which region's instance to use. 

 

 

 

ARM Template Parameters 

Hopefully you have followed the steps outlined in Part 3 on how to create and store a separate parameter file for each environment. A sketch of what one such file might look like follows. 
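Based on the naming convention used by the deployment job later in this post (artifact/parameters/<environment>.<region>.parameters.json), a dev parameter file such as parameters/dev.eus.parameters.json might look like this minimal sketch; the factory name assumes a 'adf' abbreviation and the 'adfdemo' service name, and any additional parameters (linked service connection strings, for example) are omitted:

{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "factoryName": {
      "value": "adf-adfdemo-dev-eus"
    }
  }
}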

 

Deployment 

Templating a deployment stage is its own art. A build stage, by definition, runs once and generates an artifact. A deployment stage template, on the other hand, will run more than once, once per environment and region. Ultimately, we want to define the steps once and run them against multiple environments. 

 

A seasoned professional with pipeline experience may chime in here and point out that certain steps, load testing for example, will run only in a test environment and not in dev or production. They are correct, and I want to acknowledge that. There is a way to accommodate this; however, I will not be covering it here. 

 

To recap we will want our deployment stage to execute a job with the following tasks: 

  1. Run the PrePostDeploymentScript.ps1 with the parameters required to stop the data factory triggers. 
  2. Deploy the ARM template. 
  3. Run the PrePostDeploymentScript.ps1 with the parameters required to start the data factory triggers. 

Variables 

Unlike the build template, we are going to want to leverage variable template files targeted at a specific environment. An in-depth review of how to use variable templates was covered in a previous post in the YAML Pipelines series. 

 

The abbreviated version: across all Azure pipelines there will be variables scoped to a specific environment, items such as the service connection name, a subscription ID, or a shared key vault. These variables can be stored in a variable template file in our YAML template repository and loaded as part of the individual deployment stage; a sketch of such a file appears below. 
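A minimal sketch of a dev-scoped variable file (for example variables/azure_dev_variables.yml), using the variable names referenced by the build stage and deployment job in this post. All values are placeholders, and which variables live in the environment file versus azure_global_variables.yml is up to you; everything is shown in one place here for brevity:

variables:
  azureServiceConnectionName: 'sc-azure-dev'                     # placeholder service connection name
  azureSubscriptionID: '00000000-0000-0000-0000-000000000000'    # placeholder subscription ID
  resourceGroupAbrv: 'rg'                                        # abbreviation used when composing resource group names
  dataFactoryAbrv: 'adf'                                         # abbreviation used when composing Data Factory names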

 

Deployment Tasks 

Same idea as the build tasks: these steps need to be scoped to the individual task level to optimize reuse for processes outside of Data Factory. 

azpwsh_file_execute_task.yml 

 

 

parameters:
  - name: azureSubscriptionName
    type: string
  - name: scriptPath
    type: string
  - name: ScriptArguments
    type: string
    default: ''
  - name: errorActionPreference
    type: string
    default: 'stop'
  - name: FailOnStandardError
    type: boolean
    default: false
  - name: azurePowerShellVersion
    type: string
    default: 'OtherVersion'   # AzurePowerShell@5 expects 'LatestVersion' or 'OtherVersion'; 'OtherVersion' pairs with preferredAzurePowerShellVersion below
  - name: preferredAzurePowerShellVersion
    type: string
    default: '3.1.0'
  - name: pwsh
    type: boolean
    default: false
  - name: workingDirectory
    type: string
  - name: displayName
    type: string
    default: 'Running Custom Azure PowerShell script from file'

steps:
  - task: AzurePowerShell@5
    displayName: ${{ parameters.displayName }}
    inputs:
      scriptType: 'FilePath'
      ConnectedServiceNameARM: ${{ parameters.azureSubscriptionName }}
      scriptPath: ${{ parameters.scriptPath }}
      ScriptArguments: ${{ parameters.ScriptArguments }}
      errorActionPreference: ${{ parameters.errorActionPreference }}
      FailOnStandardError: ${{ parameters.FailOnStandardError }}
      azurePowerShellVersion: ${{ parameters.azurePowerShellVersion }}
      preferredAzurePowerShellVersion: ${{ parameters.preferredAzurePowerShellVersion }}
      pwsh: ${{ parameters.pwsh }}
      workingDirectory: ${{ parameters.workingDirectory }}

 

 

The art in this task is making it generic enough that Data Factory can use it to execute the PrePostDeploymentScript.ps1 that disables and re-enables triggers, while also ensuring the task can execute any PowerShell script we provide it. 

 

ado_ARM_deployment_task.yml 

 

 

parameters:
  - name: deploymentScope
    type: string
    default: 'Resource Group'
  - name: azureResourceManagerConnection
    type: string
    default: ''
  - name: action
    type: string
    default: 'Create Or Update Resource Group'
  - name: resourceGroupName
    type: string
    default: ''
  - name: location
    type: string
    default: eastus
  - name: csmFile
    type: string
    default: ''
  - name: overrideParameters
    type: string
    default: ''
  - name: csmParametersFile
    type: string
    default: ''
  - name: deploymentMode
    type: string
    default: 'Incremental'

steps:
  - task: AzureResourceManagerTemplateDeployment@3
    inputs:
      deploymentScope: ${{ parameters.deploymentScope }}
      azureResourceManagerConnection: ${{ parameters.azureResourceManagerConnection }}
      action: ${{ parameters.action }}
      resourceGroupName: ${{ parameters.resourceGroupName }}
      location: ${{ parameters.location }}
      csmFile: '$(Agent.BuildDirectory)/${{ parameters.csmFile }}'
      csmParametersFile: '$(Agent.BuildDirectory)/${{ parameters.csmParametersFile }}'
      overrideParameters: ${{ parameters.overrideParameters }}
      deploymentMode: ${{ parameters.deploymentMode }}

 

 

This ARM template deployment task should be able to handle any ARM deployment in your environment. It is not tied to Data Factory, as it accepts the ARM template, parameters file, and any override parameters. Additionally, for those unaware, AzureResourceManagerTemplateDeployment@3 supports Bicep deployments if Azure CLI 2.20.0 or later is available on the agent.

 

Deployment Job

Now that the tasks have been created, we need to orchestrate them in a job. Our job will need to load the environment-specific variables required for our deployment. 

adf_deploy_env_job.yml 

 

 

parameters:
  - name: environmentName
    type: string
  - name: serviceName
    type: string
  - name: regionAbrv
    type: string
  - name: location
    type: string
    default: 'eastus'
  - name: templateFile
    type: string
  - name: templateParametersFile
    type: string
  - name: overrideParameters
    type: string
    default: ''
  - name: artifactName
    type: string
    default: 'ADFTemplates'
  - name: stopStartTriggersScriptName
    type: string
    default: 'PrePostDeploymentScript.ps1'
  - name: workingDirectory
    type: string
    default: '../'

jobs:
  - deployment: '${{ parameters.serviceName }}_infrastructure_${{ parameters.environmentName }}_${{ parameters.regionAbrv }}'
    environment: ${{ parameters.environmentName }}
    variables:
      - template: ../variables/azure_${{ parameters.environmentName }}_variables.yml
      - template: ../variables/azure_global_variables.yml
      - name: deploymentName
        value: '${{ parameters.serviceName }}_infrastructure_${{ parameters.environmentName }}_${{ parameters.regionAbrv }}'
      - name: resourceGroupName
        value: '${{ variables.resourceGroupAbrv }}-${{ parameters.serviceName }}-${{ parameters.environmentName }}-${{ parameters.regionAbrv }}'
      - name: dataFactoryName
        value: '${{ variables.dataFactoryAbrv }}-${{ parameters.serviceName }}-${{ parameters.environmentName }}-${{ parameters.regionAbrv }}'
      - name: powerShellScriptPath
        value: '../${{ parameters.artifactName }}/${{ parameters.stopStartTriggersScriptName }}'
      - name: ARMTemplatePath
        value: '${{ parameters.artifactName }}/${{ parameters.templateFile }}'
    strategy:
      runOnce:
        deploy:
          steps:
            - template: ../tasks/azpwsh_file_execute_task.yml
              parameters:
                azureSubscriptionName: ${{ variables.azureServiceConnectionName }}
                scriptPath: ${{ variables.powerShellScriptPath }}
                ScriptArguments: '-armTemplate "${{ variables.ARMTemplatePath }}" -ResourceGroupName ${{ variables.resourceGroupName }} -DataFactoryName ${{ variables.dataFactoryName }} -predeployment $true -deleteDeployment $false'
                displayName: 'Stop ADF Triggers'
                workingDirectory: ${{ parameters.workingDirectory }}
            - template: ../tasks/ado_ARM_deployment_task.yml
              parameters:
                azureResourceManagerConnection: ${{ variables.azureServiceConnectionName }}
                resourceGroupName: ${{ variables.resourceGroupName }}
                location: ${{ parameters.location }}
                csmFile: ${{ variables.ARMTemplatePath }}
                csmParametersFile: '${{ parameters.artifactName }}/parameters/${{ parameters.environmentName }}.${{ parameters.regionAbrv }}.${{ parameters.templateParametersFile }}.json'
                overrideParameters: ${{ parameters.overrideParameters }}
            - template: ../tasks/azpwsh_file_execute_task.yml
              parameters:
                azureSubscriptionName: ${{ variables.azureServiceConnectionName }}
                scriptPath: ${{ variables.powerShellScriptPath }}
                ScriptArguments: '-armTemplate "${{ variables.ARMTemplatePath }}" -ResourceGroupName ${{ variables.resourceGroupName }} -DataFactoryName ${{ variables.dataFactoryName }} -predeployment $false -deleteDeployment $true'
                displayName: 'Start ADF Triggers'
                workingDirectory: ${{ parameters.workingDirectory }}

 

 

You may notice that the parameters for this job are a combination of the parameters required by each task and those required to load the variable template file for the specified environment. 

 

Deployment Stage

The deploy stage template should call the deploy job template. To help consolidate, one thing I like to do in the stage template is make it flexible enough to deploy to one or many environments. To achieve this, we pass in an Azure DevOps object parameter containing the list of environments and regions we want to deploy to. There is a more detailed article on how to go about this. 

One note I want to call out: I have included an option to load a job template that deploys Azure Data Factory via linked ARM templates. This will be covered in a follow-up post; for now, that section can be ignored or omitted. 

 

 

adf_deploy_stage.yml 

parameters:
  - name: environmentObjects
    type: object
    default:
      - environmentName: 'dev'
        regionAbrvs: ['cus']
  - name: environmentName
    type: string
    default: ''
  - name: templateParametersFile
    type: string
    default: 'parameters'
  - name: serviceName
    type: string
    default: ''
  - name: linkedTemplates
    type: boolean
    default: false

stages:
  - ${{ each environmentObject in parameters.environmentObjects }}:
    - ${{ each regionAbrv in environmentObject.regionAbrvs }}:
      - stage: '${{ parameters.serviceName }}_${{ environmentObject.environmentName }}_${{ regionAbrv }}_adf_deploy'
        variables:
          - name: templateFile
            ${{ if eq(parameters.linkedTemplates, false) }}:
              value: 'ARMTemplateForFactory.json'
            ${{ else }}:
              value: 'linkedTemplates/ArmTemplate_master.json'
        jobs:
          - ${{ if eq(parameters.linkedTemplates, false) }}:
            - template: ../jobs/adf_deploy_env_job.yml
              parameters:
                environmentName: ${{ environmentObject.environmentName }}
                templateFile: ${{ variables.templateFile }}
                templateParametersFile: ${{ parameters.templateParametersFile }}
                serviceName: ${{ parameters.serviceName }}
                regionAbrv: ${{ regionAbrv }}
          - ${{ else }}:
            - template: ../jobs/adf_linked_template_deploy_env_job.yml
              parameters:
                environmentName: ${{ environmentObject.environmentName }}
                templateFile: ${{ variables.templateFile }}
                templateParametersFile: ${{ parameters.templateParametersFile }}
                serviceName: ${{ parameters.serviceName }}
                regionAbrv: ${{ regionAbrv }}

 

 

Pipeline 

At this point we have two stage templates (build and deploy) that load all the necessary jobs and tasks we require. The pipeline that consumes them will be stored in a YAML folder in the repository that Data Factory is connected to; in this case, it is stored in my adf_pipelines_yaml_ci_cd repo. 

 

adf_pipelines_template.yml 

 

 

parameters:
  - name: environmentObjects
    type: object
    default:
      - environmentName: 'dev'
        regionAbrvs: ['eus']
        locations: ['eastus']
      - environmentName: 'tst'
        regionAbrvs: ['eus']
        locations: ['eastus']
  - name: serviceName
    type: string
    default: 'adfdemo'

stages:
  - template: stages/adf_build_stage.yml@templates
    parameters:
      serviceName: ${{ parameters.serviceName }}
  - ${{ if eq(variables['Build.SourceBranch'], 'refs/heads/main') }}:
    - template: stages/adf_deploy_stage.yml@templates
      parameters:
        environmentObjects: ${{ parameters.environmentObjects }}
        serviceName: ${{ parameters.serviceName }}

 

 


The astute reader will notice that I am using trunk-based development. This means every PR runs the CI, which generates a build artifact and confirms the Data Factory template is still valid, while any commit to the main branch triggers a deployment. The `Build.SourceBranch` variable determines whether the deployment stage is loaded. 
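For completeness, a minimal sketch of the trigger configuration in the root pipeline that supports this flow; the branch filters shown are assumptions:

trigger:
  branches:
    include:
      - main        # commits to main run CI and, because Build.SourceBranch matches, the deploy stages load

pr:
  branches:
    include:
      - main        # pull requests targeting main run CI only; the deploy stages are not loaded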

 

Conclusion

That's it for part 4. At this stage we've covered how to break down our Azure DevOps pipeline for Data Factory into reusable components. We've also created a template in which we define our deployment environments once and scale them out as many times as needed. Finally, we introduced a methodology for YAML templates that accomplishes the following:

  • Write once reuse across projects. 
  • Individual components can be reused. 
  • Limited manual intervention. 
  • Easily updated. 
  • Centralized definition. 

If you are interested in this topic, feel free to read more on YAML Pipelines or on CI/CD for Data Factory.
