Update 1: Evaluating Genomics Pipelines on Azure: Intel-based Virtual Machines

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

Authors: Venkat S Malladi, Jer-Ming Chia PhD, Priyanka Sebastian, Keith, Michael J Mcmanus, Paolo Narvaez

Background

The Genome Analysis Toolkit (GATK), developed and maintained by the Broad Institute of MIT and Harvard, is a commonly used set of tools for processing genomics data. Along with the toolkit, the Broad Institute publishes Best Practices pipelines, a collection of workflows for different high-throughput sequencing data analyses that serve as reference implementations.

To understand how the choice of Virtual Machine (VM) family affects the cost and performance profiles of GATK pipelines, the Solutions team at Intel, supported by the Azure High-Performance Computing team and the Biomedical Platform team in Microsoft Health Futures, profiled a GATK pipeline on different Intel-based VM families on Azure. We previously helped users reduce cost by 30%, from $13 to less than $9 per sample, simply by setting the workflow to run exclusively on VMs with the latest generation Intel processors; see here. For more details on this pipeline, see here.

The pipeline consists of 28 tasks, each with distinct compute requirements. Six of these tasks are distributed as multiple jobs, which on Azure means multiple VMs. For example, SortSampleBam is executed on a single VM, while ApplyBQSR is distributed across 19 VMs.

Table 1 below provides a summary of the tasks in the pipeline, the number of VMs each task is distributed across, and the minimum number of vCPUs, DRAM, and disk required per VM for that task. Summing up all the VMs across all the tasks, there are a total of 181 VMs automatically orchestrated throughout the execution of the pipeline.

As stated before, optimal cost efficiency for workloads on Azure requires (1) efficient orchestration of these tasks; (2) ensuring each task is allocated the right-size VM; and (3) choosing the best VM series.

Table 1

GATK Pipelines on Azure

As we did before, we used Cromwell on Azure, which is built and maintained by Microsoft. Cromwell on Azure is an open-source project that configures all Azure resources needed to run Cromwell workflows on Microsoft Azure and implements the GA4GH TES backend for orchestrating tasks that run in Azure Batch.

Figure 1, Ref: https://github.com/microsoft/CromwellOnAzure

With advances in Cromwell on Azure that enable vm_size in the runtime, as well as the availability of new generation Intel architecture, we decided to revisit our earlier benchmarking to see if we could suggest additional improvements. Three different configurations for the Germline Variant Calling pipeline, each using a different set of VM series, were benchmarked to determine the most performant and the most cost-effective VM series for this workload.

In the “Ddv5” configuration, the least expensive Ddv5 VM (in terms of $/hour) that met each task's requirements was selected for that task through vm_size in the runtime parameters. Likewise, for “Ddv4”, the smallest Ddv4 VM that met the compute needs of each task was chosen. In the “Lowest Cost” configuration, the least expensive VM in terms of $/hour was selected at runtime for each task, limited to the Ddv5 VM series or the F16s VM. The Intel CPU generation for each of the configurations is shown in Table 2.
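The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the code Cromwell on Azure uses: the VmSize class, the cheapest_vm helper, and the hourly prices are all hypothetical. The vCPU, memory, and disk values follow the Ddv5 and Fsv2 entries under Configuration Details, and the VM names follow Azure's naming for those series.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSize:
    name: str
    vcpus: int
    memory_gb: int
    disk_gb: int
    usd_per_hour: float  # placeholder price, NOT a real Azure quote


# Sizes mirror the Configuration Details section; prices are made up
# purely so the example has something to minimize.
CATALOG = [
    VmSize("Standard_D2d_v5", 2, 8, 75, 0.10),
    VmSize("Standard_D4d_v5", 4, 16, 150, 0.20),
    VmSize("Standard_D8d_v5", 8, 32, 300, 0.40),
    VmSize("Standard_D16d_v5", 16, 64, 600, 0.80),
    VmSize("Standard_F16s_v2", 16, 32, 128, 0.70),
]


def cheapest_vm(min_vcpus, min_memory_gb, min_disk_gb, catalog=CATALOG):
    """Return the lowest $/hour VM that meets every minimum, or None."""
    candidates = [
        vm for vm in catalog
        if vm.vcpus >= min_vcpus
        and vm.memory_gb >= min_memory_gb
        and vm.disk_gb >= min_disk_gb
    ]
    return min(candidates, key=lambda vm: vm.usd_per_hour, default=None)


# A hypothetical task that needs 16 vCPUs, 32 GB of DRAM, and 100 GB of disk.
choice = cheapest_vm(16, 32, 100)
print(choice.name if choice else "no VM satisfies the request")
```

With these placeholder prices, the 16-vCPU request resolves to the F-series VM, while a request that also needs more than 128 GB of disk falls back to the largest Ddv5 size. Real results depend on current Azure pricing in the chosen region.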

Table 2

Germline Variant Calling: Price/Performance on Azure

Figure 2 shows the end-to-end runtime for the GATK Best Practices Pipeline for Germline Variant Calling on each of the three configurations tested. The Ddv5 configuration had the fastest runtime at 15.3 hours. The Lowest Cost configuration, where the least expensive VM is chosen for each task, was 1.06X slower than the Ddv5 configuration, which runs entirely on VMs with the newer generation Intel processors. The Ddv4 configuration, which runs on older generation CPUs than the Lowest Cost configuration, had a runtime of 17.6 hours, comparable to the runtime we previously measured for this configuration. The Ddv5 configuration is 13% faster than the Ddv4 configuration.
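The runtime figures quoted above can be checked with a little arithmetic. The sketch below assumes that "13% faster" means the reduction in wall-clock time relative to Ddv4, and that the 1.06X slowdown of the Lowest Cost configuration is measured against the 15.3-hour Ddv5 run. Both are readings of the text, not statements from the authors, and the function name is illustrative.

```python
ddv5_hours = 15.3            # from the text
ddv4_hours = 17.6            # from the text
lowest_cost_slowdown = 1.06  # from the text, assumed relative to Ddv5


def percent_less_time(fast_hours, slow_hours):
    """Reduction in wall-clock time, as a percentage of the slower run."""
    return (slow_hours - fast_hours) / slow_hours * 100


ddv5_vs_ddv4 = percent_less_time(ddv5_hours, ddv4_hours)
lowest_cost_hours = ddv5_hours * lowest_cost_slowdown

print(f"Ddv5 vs Ddv4: {ddv5_vs_ddv4:.1f}% less wall-clock time")
print(f"Implied Lowest Cost runtime: {lowest_cost_hours:.1f} hours")
```

Under these assumptions the Lowest Cost configuration would finish in roughly 16.2 hours, between the Ddv5 and Ddv4 runs.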

Figure 2

While the germline variant calling pipeline performed better on Azure VMs with newer generation Intel CPUs, the cost of running the pipeline may hold higher priority for users without time-sensitive workloads. The Lowest Cost configuration chooses VMs to optimize for the lowest total cost while remaining relatively performant.

Figure 3 shows the cost to run the entire pipeline on each of the three configurations tested, broken out by cost per individual task. Perhaps counter-intuitively, the Ddv5 configuration, which has the newest generation CPUs at a similar $/hour VM price to the Ddv4, is less expensive than the Ddv4 configuration. Users can reduce cost by 22%, from $8.15 to $6.36, simply by setting the workflow to run on the Ddv5 VM series instead of the Ddv4 series.

Figure 3

The biggest cost savings come from SamToFastqAndBwaMemAndMba. With the Lowest Cost configuration, this task runs on Fsv2 machines. As shown in Figure 4, this one task otherwise costs $2.83; when it runs on Fsv2, the cost is reduced to $1.74.
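As a quick check on the cost figures quoted above, the following sketch computes the absolute and relative savings for the whole pipeline ($8.15 to $6.36) and for the single alignment task ($2.83 to $1.74). The function name is illustrative; the dollar amounts are the ones given in the text.

```python
def savings(before_usd, after_usd):
    """Return (dollars saved, percent saved) when cost drops from before to after."""
    saved = before_usd - after_usd
    return saved, saved / before_usd * 100


pipeline_saved, pipeline_pct = savings(8.15, 6.36)  # Ddv4 -> Ddv5, whole pipeline
task_saved, task_pct = savings(2.83, 1.74)          # alignment task moved to Fsv2

print(f"Pipeline: ${pipeline_saved:.2f} saved ({pipeline_pct:.0f}%)")
print(f"Alignment task: ${task_saved:.2f} saved ({task_pct:.1f}%)")
```

The pipeline-level result matches the 22% quoted in the text. The task-level result, about 38.5%, is derived here rather than reported by the authors.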

Figure 5 shows the performance in core hours consumed (A) and max shard runtime (B) for the five most expensive tasks in the pipeline. Figure 5A shows the sum of all core hours consumed across all VMs for a given task, while Figure 5B shows the maximum VM runtime of all VMs for that task. It is clear from both dimensions that the Ddv5 VM family is notably more performant than the Ddv4 family. Overall, Ddv5 has a faster max shard completion time and consumes fewer core hours per task than Ddv4.
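To make the two metrics concrete, here is a minimal sketch of how they could be computed from per-VM records. It assumes that the core hours for one VM are its vCPU count multiplied by its runtime. The shard records are hypothetical placeholders, not measurements from this study.

```python
from collections import defaultdict

# (task, vCPUs on the VM, VM runtime in hours): hypothetical shard records
shards = [
    ("ApplyBQSR", 2, 0.50),
    ("ApplyBQSR", 2, 0.75),
    ("ApplyBQSR", 2, 0.25),
    ("SortSampleBam", 4, 1.50),
]


def summarize(records):
    """Return (core hours per task, max shard runtime per task)."""
    core_hours = defaultdict(float)
    max_runtime = defaultdict(float)
    for task, vcpus, hours in records:
        core_hours[task] += vcpus * hours                  # Figure 5A-style sum
        max_runtime[task] = max(max_runtime[task], hours)  # Figure 5B-style max
    return dict(core_hours), dict(max_runtime)


core_hours, max_runtime = summarize(shards)
for task in core_hours:
    print(f"{task}: {core_hours[task]:.2f} core hours, "
          f"slowest shard {max_runtime[task]:.2f} h")
```

Core hours track what a task costs, while the max shard runtime tracks how long the workflow waits for that task. This is why the two panels can rank VM families differently.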

Figure 4

Figure 5

Figures 4 and 5 also break down the cost and performance for SamToFastqAndBwaMemAndMba (BWA) on the Fsv2 VM series. This CPU-bound task has the best cost/performance profile when running on the compute-intensive VM series among the latest generation Intel CPUs available on Azure.

Conclusion

GATK Best Practices Pipeline for Germline Variant Calling is a critical genomics analytics workload and is broadly used in the fields of precision medicine and drug discovery.

Notably, running this pipeline on the Ddv5 Azure VM series with 3rd Generation Intel® Xeon® Scalable Processors results in both better performance and lower total cost when compared to running on the Ddv4 series. Further cost savings can be achieved by specifically selecting the Fsv2 VM series for running the CPU-bound BWA-based tasks and using the Lowest Cost configuration.

This work illustrates that when a VM's performance improvement is proportionally greater than its difference in price (in terms of $/core-hour), choosing the optimal VM for each task's needs yields a solution that is optimal in both cost and time.
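The reasoning can be made explicit with a simple cost model, assumed here for illustration: the cost of a task is its $/core-hour price multiplied by the core hours it consumes. Under that model, a newer VM is cheaper overall whenever the product of the price ratio and the core-hour ratio is below 1. The ratios below are hypothetical, not measured values.

```python
def relative_cost(price_ratio, core_hour_ratio):
    """Task cost on the new VM divided by task cost on the old VM.

    price_ratio: new $/core-hour divided by old $/core-hour
    core_hour_ratio: new core hours divided by old core hours
    """
    return price_ratio * core_hour_ratio


# Hypothetical case: 5% higher price per core-hour, 20% fewer core hours.
ratio = relative_cost(price_ratio=1.05, core_hour_ratio=0.80)
print(f"New VM costs {ratio:.2f}x the old VM for this task")
print("cheaper" if ratio < 1 else "not cheaper")
```

In this made-up case the task costs 0.84x as much on the newer VM, even though each core-hour is more expensive.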

Directly attributable to this work, Microsoft has contributed modifications to the Task Execution Service (TES) schema of the Global Alliance for Genomics and Health project to accept additional runtime attributes for each task. This will enable Cromwell on Azure to specify VM sizes for each task in a workflow and provide users the ability to achieve optimal performance and cost efficiency for their critical genomics workloads on Azure.
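As a rough sketch of what a per-task VM size request might look like, the Python snippet below builds a TES-style task document. It rests on several assumptions: that the additional runtime attributes travel in a backend_parameters map under resources, as in the TES 1.1 schema, and that the key is vm_size, the name used earlier in this post. The task name, container image, and command are hypothetical placeholders, and the payload is not taken from the Cromwell on Azure source.

```python
import json


def tes_task(name, image, command, vm_size):
    """Build a minimal TES-style task dict that requests a specific VM size."""
    return {
        "name": name,
        "executors": [{"image": image, "command": command}],
        "resources": {
            # Assumed location and key for the per-task VM size request.
            "backend_parameters": {"vm_size": vm_size},
        },
    }


payload = tes_task(
    name="bwa-alignment",                # hypothetical task name
    image="example.org/bwa:latest",      # hypothetical container image
    command=["bwa", "mem", "-h"],        # placeholder command
    vm_size="Standard_F16s_v2",
)
print(json.dumps(payload, indent=2))
```

A workflow author would normally not write this document by hand. In this sketch it stands in for what the workflow engine would send to the TES backend once vm_size is set in a task's runtime section.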

Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more at www.intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software, or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Configuration Details

CLX: Intel Ddv4; Test By: Intel as of September 19, 2022; CSP/Region: Azure uswest2; Virtual Machine Family: Ddv4; #vCPUs: {2, 4, 8, 16}; Number of Instances or VMs: 181; Iterations and result choice: Three; Median; CPU: 8272CL; Memory Capacity / Instance (GB): {8, 16, 32, 64}; Storage per instance (GB): {75, 150, 300, 600}; Network BW per instance (Mbps) (read/write): {1000, 2000, 4000, 8000}; Storage BW per VM (read/write) (Mbps): {120, 242, 485, 968}; OS: Linux; Workload and version: https://console.cloud.google.com/gcr/images/broad-gotc-prod/US/genomes-in-the-cloud; Libraries: GATK 4.0.10.1, GKL 0.8.6, Cromwell 70, Samtools 1.3.1; WL-specific details: Cromwell with Azure Batch v 2.5.0; https://github.com/microsoft/CromwellOnAzure

ICX: Intel Ddv5; Test By: Intel as of September 19, 2022; CSP/Region: Azure uswest2; Virtual Machine Family: Ddv5; #vCPUs: {2, 4, 8, 16}; Number of Instances or VMs: 181; Iterations and result choice: Three; Median; CPU: 8370C CPU; Memory Capacity / Instance (GB): {8, 16, 32, 64}; Storage per instance (GB): {75, 150, 300, 600}; Network BW per instance (Mbps) (read/write): {12500}; Storage BW per VM (read/write) (Mbps): {125, 250, 500, 1000}; OS: Linux; Workload and version: https://console.cloud.google.com/gcr/images/broad-gotc-prod/US/genomes-in-the-cloud; Libraries: GATK 4.0.10.1, GKL 0.8.6, Samtools 1.3.1; WL-specific details: Cromwell with Azure Batch; https://github.com/microsoft/CromwellOnAzure

SKX: Intel Fsv2; Test By: Intel as of September 19, 2022; CSP/Region: Azure uswest2; Virtual Machine Family: Fsv2; #vCPUs: {16}; Number of Instances or VMs: 24; Iterations and result choice: Three; Median; CPU: 8272CL CPU; Memory Capacity / Instance (GB): {32}; Storage per instance (GB): {128}; Network BW per instance (Mbps) (read/write): {12500}; Storage BW per VM (read/write) (Mbps): {380-800}; OS: Linux; Workload and version: https://console.cloud.google.com/gcr/images/broad-gotc-prod/US/genomes-in-the-cloud; Libraries: GATK 4.0.10.1, GKL 0.8.6, Cromwell 73, Samtools 1.3.1; WL-specific details: Cromwell with Azure Batch v 3.1.0; https://github.com/microsoft/CromwellOnAzure