Introducing Snakemake for Azure Batch

This post was republished via RSS; it originally appeared at Microsoft Tech Community - Latest Blogs.

Introduction:  

 

Snakemake is a workflow management system used to create reproducible and scalable data analyses. Workflows are described in a human-readable, Python-based language. Each workflow is composed of many interdependent tasks, describing the required software and the processes that manipulate the data. Snakemake workflows can be seamlessly scaled to server, cluster, grid, and cloud environments without any modification of the workflow definition, allowing them to be automatically deployed to any execution environment. 
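For example, a minimal Snakefile might look like the following sketch (the rule, file, and command names here are illustrative, not from a real pipeline):

```python
# Snakefile -- a minimal illustrative workflow (names are hypothetical)

# The target rule: the final output we want produced
rule all:
    input:
        "results/summary.txt"

# Count the lines of the input data set and write the result
rule count_lines:
    input:
        "data/mydata.txt"
    output:
        "results/summary.txt"
    shell:
        "wc -l {input} > {output}"
```

Snakemake infers the dependency graph from the input and output declarations, so each rule can run as an independent, parallelizable task.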

 

Until recently, native Azure Batch integration was lacking. As of Snakemake 7.28.0, there is now native Azure Batch support.  

 

How does it work?  

 

The new Snakemake Azure integration allows the execution of workflows in the Azure cloud, given an Azure Batch account and an Azure Storage account.

Snakemake will do the following: 

 

  • collect workflow dependencies (config files, Snakefiles, etc.) and upload them to Azure Blob Storage
  • create an Azure Batch compute pool, which automatically pulls the supplied base Docker image 
  • autoscale the Batch nodes, based on the number of remaining tasks to run over a five-minute interval 
  • create an Azure Batch job  
  • create an Azure Batch task for each Snakemake job to run. For each task:
    • workflow dependencies are downloaded 
    • input data is staged in   
    • software is installed via Conda 
    • output data is staged out
  • monitor tasks for completion 
  • upon shutdown, delete the Azure Batch pool and jobs 

Setup:  

 

First, install the Azure CLI. Then install Snakemake and its Azure-related dependencies via Conda: 

 

conda create -c bioconda -c conda-forge -n snakemake snakemake \ 
msrest azure-batch azure-storage-blob azure-mgmt-batch azure-identity 
conda activate snakemake 

 

Data in Azure Storage 

 

Using this executor typically requires you to start with large data files already in Azure Blob Storage and then interact with them via Azure Batch. An easy way to do this is to use the azcopy command-line client. For example, here is how we might upload a file to storage using it: 

 

azcopy copy mydata.txt "https://$account.blob.core.windows.net/snakemake-bucket/1/mydata.txt" 

 

Execution 

 

Before you execute, you will need to set up the credentials that allow the Batch nodes to read from and write to Blob storage. For the AzBlob storage provider in Snakemake, this is done through environment variables. 

 

Set the required environment variables: 

export AZ_BLOB_PREFIX=<Azure_Blob_name> 
export AZ_BATCH_ACCOUNT_URL="<AZ_BATCH_ACCOUNT_URL>" 
export AZ_BATCH_ACCOUNT_KEY="<AZ_BATCH_ACCOUNT_KEY>" 
export AZ_BLOB_ACCOUNT_URL="<AZ_BLOB_ACCOUNT_URL_with_SAS>" 
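One way to produce a Blob account URL with an embedded SAS token is via the Azure CLI. The following is a sketch: the storage account name is a placeholder, and you may want narrower permissions and a shorter expiry in practice:

```shell
# Generate an account-level SAS token (account name is hypothetical)
SAS=$(az storage account generate-sas \
    --account-name mystorageaccount \
    --services b \
    --resource-types sco \
    --permissions rwdlac \
    --expiry 2024-12-31T00:00Z \
    --https-only \
    --output tsv)

# The AzBlob provider expects the account URL with the SAS token appended
export AZ_BLOB_ACCOUNT_URL="https://mystorageaccount.blob.core.windows.net/?${SAS}"
```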

Now we can run Snakemake using: 

 

snakemake \ 
--default-remote-prefix $AZ_BLOB_PREFIX \ 
--use-conda \ 
--default-remote-provider AzBlob \ 
--envvars AZ_BLOB_ACCOUNT_URL \ 
--az-batch \ 
--container-image snakemake/snakemake \ 
--az-batch-account-url $AZ_BATCH_ACCOUNT_URL 

After completion, all results, including logs, can be found in the blob container prefix specified by --default-remote-prefix. 
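For example, the results could be pulled back down with azcopy; this is a sketch, and the container and prefix names here are illustrative:

```shell
# Download everything under the workflow's remote prefix (names are hypothetical)
azcopy copy \
    "https://$account.blob.core.windows.net/snakemake-bucket/results/*" \
    ./results --recursive
```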

For more details regarding the Snakemake configuration for Azure Batch, as well as a more detailed tutorial, refer to the Snakemake documentation. 

 

Conclusion:  

 

Support for Snakemake on Azure Batch expands the range of workflow engines natively supported by Azure and empowers researchers to easily deploy their workflows to the cloud, allowing the field to scale, collaborate, and share work with ease.

 

Acknowledgments: 

 

We would like to acknowledge Jake VanCampen of the Earle A. Chiles Research Institute and Johannes Köster, development lead of Snakemake, for their contributions to the native support of Azure Batch. 
