How to identify the recommended VM for your HPC workloads


Azure offers access to 7 distinct VM categories, including Compute Optimized, Memory Optimized, and Accelerated Compute, comprising over 50 different families such as Fsv2-series, Edv4-series, and ND A100 v4-series. Each family has a variety of SKUs available, resulting in a wide range of choices. Determining the optimal VM for your workload can be a challenging task without clear guidance on how to filter out unsuitable VM categories and families.

 

This article presents a concise overview of the key factors to consider when selecting the appropriate SKU for your application. It outlines a systematic methodology for filtering out unsuitable VM categories, then narrowing down the options by evaluating VM families and SKUs.

 

Identify your workload requirements

The first step in identifying the recommended VM for your HPC workload is to understand your workload requirements. This includes the number of cores, memory, storage, and network bandwidth that your workload requires. You can do this by analyzing your existing workload or by estimating the requirements based on the application that you plan to run.
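If the workload already runs on an existing Linux host, a quick way to capture a baseline is to record that machine's resources. Below is a minimal sketch in Python; it assumes the third-party psutil package is installed (pip install psutil), and the storage figure only covers the root volume.

```python
import psutil

def capture_baseline():
    """Record the core count, memory, and disk footprint of the current host."""
    logical = psutil.cpu_count(logical=True)             # logical cores (incl. SMT)
    physical = psutil.cpu_count(logical=False) or logical  # physical cores, if known
    mem_gb = psutil.virtual_memory().total / 2**30       # installed RAM in GiB
    disk_gb = psutil.disk_usage("/").total / 2**30       # root volume size in GiB
    return {
        "logical_cores": logical,
        "physical_cores": physical,
        "memory_gb": round(mem_gb, 1),
        "memory_per_core_gb": round(mem_gb / logical, 1),
        "disk_gb": round(disk_gb, 1),
    }

if __name__ == "__main__":
    print(capture_baseline())
```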

 

While the recommendations outlined above may seem straightforward, it's not uncommon to encounter projects where the necessary information is unavailable or unclear. In such cases, it may be necessary to perform additional research or consult with experts in the field to obtain the information needed. It's important to take the time to gather all the necessary information before making a decision to avoid potential issues down the line.

 

With this information, the first question to answer is:

 

  • Does my application require GPU computation or only CPU?

When a GPU is required, the search should be narrowed down to the Accelerated Compute VM category across the various N families available on Azure. These families offer several generations of NVIDIA and AMD GPUs suitable for a range of scenarios, including 3D visualization, number crunching, and machine learning training and inference.

 

In many cases, the specific GPU required for a particular workload is already known. The primary task is then to identify the family that supports that GPU and determine how many cards are needed in your SKU to meet your workload's requirements.
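As a quick illustration, once the GPU model is known, a small lookup table can encode the mapping to a family. The entries below are examples current at the time of writing, not an exhaustive or official list; always confirm against the Azure documentation.

```python
# Illustrative GPU-model-to-family lookup; verify against the Azure docs.
GPU_TO_FAMILY = {
    "NVIDIA A100": "ND A100 v4-series",
    "NVIDIA V100": "NCv3-series / NDv2-series",
    "NVIDIA T4": "NCasT4_v3-series",
}

def family_for_gpu(gpu_model):
    return GPU_TO_FAMILY.get(gpu_model, "unknown -- check the Azure docs")

print(family_for_gpu("NVIDIA A100"))  # -> ND A100 v4-series
```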

When a CPU-based VM is required, the analysis continues with the following questions.

 

  • What RAM-to-core ratio does your application require?

Determining the appropriate memory-to-core ratio is a critical decision parameter, as Azure VMs can have varying ratios depending on their category. Typical memory-to-core ratios include 2GB/core, 4GB/core, and 8GB/core. Lower ratios, such as 1GB/core, may be available on lower-performing VMs, such as those in the B-series, while higher ratios, such as 58GB/core, can be found on constrained sizes within the HX-series.
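To make the filtering concrete, the sketch below screens a hand-typed sample of SKUs by their memory-to-core ratio. The list is illustrative, not a live catalog; in practice you could pull real sizes with `az vm list-sizes` and apply the same filter.

```python
# Hand-typed sample of SKUs for illustration only.
CANDIDATE_SKUS = [
    {"name": "Standard_F16s_v2", "vcpus": 16, "memory_gb": 32},   # 2 GB/core
    {"name": "Standard_D16s_v5", "vcpus": 16, "memory_gb": 64},   # 4 GB/core
    {"name": "Standard_E16s_v5", "vcpus": 16, "memory_gb": 128},  # 8 GB/core
]

def filter_by_ratio(skus, min_gb_per_core):
    """Keep only SKUs whose memory-to-core ratio meets the requirement."""
    return [s for s in skus if s["memory_gb"] / s["vcpus"] >= min_gb_per_core]

print(filter_by_ratio(CANDIDATE_SKUS, min_gb_per_core=4))
# -> D16s_v5 and E16s_v5 qualify; F16s_v2 (2 GB/core) is filtered out
```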

 

  • Which processor type do you need?

In some enterprise solutions, it may be necessary to use a specific processor brand, such as Intel, to be covered under support contracts. In this scenario, the search can be narrowed down to only those VMs that support Intel processors. However, in situations where there is no particular recommendation regarding the processor type, both Intel and AMD processors can be considered.

 

In such cases, it may be necessary to test different families with varying processor configurations to understand how they impact the execution time of your application. By comparing the performance of different VMs with different processor types, you can identify the best option for your specific workload.

 

  • How does your application scale?

Applications can have different performance profiles depending on how they were developed. Factors such as whether they are single- or multi-threaded, whether their execution time is compute-bound or memory-bound, and whether they require high network throughput to move data in and out of a central repository or instead use local NVMe or SSD drives as scratch storage can all impact performance.

 

Understanding the ratio between RAM and cores, as well as the performance profile of your application, can help you determine the appropriate VM SKU for your workload. Once you have selected a VM, you may also need to decide whether it's better to run multiple tasks in a single, larger node or distribute them across multiple nodes.

 

Running multiple tasks on a single node can help maximize resources and reduce costs, but you need to ensure that the node has sufficient resources to handle all of the tasks. If the tasks require a high level of CPU or memory resources, or if they are I/O intensive, running them on a single node may not be the most efficient solution. In such cases, it may be better to distribute the tasks across multiple nodes to avoid hitting the limits of the VM.
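A rough way to reason about this is to compute how many tasks fit on one node before CPU or memory becomes the limit. The helper below is a sketch with placeholder numbers; measure your application's actual per-task footprint first.

```python
def max_tasks_per_node(node_vcpus, node_mem_gb, task_vcpus, task_mem_gb):
    """How many tasks fit on one node before CPU or memory becomes the limit?"""
    by_cpu = node_vcpus // task_vcpus
    by_mem = int(node_mem_gb // task_mem_gb)
    return min(by_cpu, by_mem)

# Example: a 32-vCPU / 128 GB node with tasks needing 2 vCPUs and 16 GB each.
# CPU alone would allow 16 tasks, but memory caps it at 8: memory is the bottleneck.
print(max_tasks_per_node(32, 128, task_vcpus=2, task_mem_gb=16))  # -> 8
```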

 

Understand the different VM types

As noted earlier in this article, Azure presents a vast array of VM configurations spanning diverse categories and families. Nonetheless, evaluating each one of them can be a laborious and resource-intensive endeavor. To avoid this, it is crucial to grasp the distinctions between them at a broad level to develop a structured approach that narrows down the search to a manageable set of options.

 

As a general guideline, if you're running large models on tens or hundreds of VMs that require low-latency network communication, the H family of VMs may be the best option. For memory-bound scenarios, the HB family would be preferable over the HC family. If the amount of RAM on the HB family is insufficient, the HX family may be the right size.

On the other hand, if you're running embarrassingly parallel simulations with hundreds of nodes doing individual calculations without intra-node communication, the latest generations of D and E series VMs could provide a good balance between price and performance. If Hyper-Threading is not an option, you should evaluate the H series with a higher number of tasks per node if your application scales well.
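This guideline can be summarized as a simple decision helper. The rules below are a deliberate simplification of the paragraph above for illustration, not an official Azure mapping.

```python
def shortlist_families(low_latency_mpi, memory_bound, needs_high_mem_per_core,
                       embarrassingly_parallel):
    """Return candidate VM families for a CPU-based HPC workload."""
    if low_latency_mpi:
        if needs_high_mem_per_core:
            return ["HX"]                    # HB's RAM would be insufficient
        return ["HB"] if memory_bound else ["HC"]
    if embarrassingly_parallel:
        return ["Dv5", "Ev5"]                # price/performance balance
    return ["Dv5", "Ev5", "Fv2"]             # no strong signal; test broadly

print(shortlist_families(False, False, False, True))  # -> ['Dv5', 'Ev5']
```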

 

Define your performance benchmarks

Based on the information provided above, you may already have a good understanding of which SKUs could potentially deliver optimal performance for your application. However, the most reliable way to accurately evaluate performance is by conducting performance benchmarks. These benchmarks can help you determine how your application behaves across different processor types, RAM/core ratios, and scalability options. How can it be done?

 

Imagine the following example: a customer is running a genomics application on-premises and she is interested in moving it to the cloud. She knows the hardware where the application currently runs on-premises, but she doesn't know which Azure VM could provide the best results in terms of execution time, performance, and cost.

 

After going through the previous questions, this is the input information available:

  • CPU-only application.
  • A ratio of 2GB/core is used on-premises; it is not clear whether this is the ideal value.
  • The existing infrastructure is based on Intel processors, but the software can run on AMD as well.
  • They run one or two jobs per node in their current on-premises configuration. More than two degrades the total execution time, and it's not clear where the bottleneck is.

Applying the previous general guideline mentioned:

  • The application does not run large models that require low-latency communication networks, so the H series is initially discarded.
  • The suggested RAM/core ratio is aligned with the Fv2 and Dv5 families. It's not clear from the available information whether higher ratios could improve the results, so an extra family, Ev5, should be included to check this.
  • No processor preference is defined, so both Dav5 (AMD) and Dv5 (Intel), as well as Eav5 (AMD) and Ev5 (Intel), should be included.
  • The application supports some degree of scale-out. A specific test running multiple tasks on a single node should be included to better understand the limit and identify the bottleneck.

Creating a test matrix like the one below is a useful approach to systematically evaluate the performance of the identified configurations. An exhaustive test can take considerable time and resources; in this scenario, 25 tests would have to be executed to fill the full table with the execution time of the same case on all those SKUs.

Each column pairs a task count with the vCPU count of the SKU used for that test:

| Family | Processor | RAM/vCPU | 1 task / 2 vCPU | 2 / 4 | 4 / 8 | 8 / 16 | 16 / 32 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DAsv5 | AMD 3rd Generation EPYC™ 7763v | 4GB/vCPU | X | X | X | X | X |
| Ddv5 | Intel® Xeon® Platinum 8370C (Ice Lake) | 4GB/vCPU | - | - | - | - | X |
| Eav5 | AMD 3rd Generation EPYC™ 7763v | 8GB/vCPU | - | - | - | - | X |
| Esv5 | Intel® Xeon® Platinum 8370C (Ice Lake) | 8GB/vCPU | - | - | - | - | X |
| Fv2 | Intel® Xeon® Platinum 8168 | 2GB/vCPU | - | - | - | - | X |
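For reference, a few lines of Python can enumerate the matrix above; the family names and (tasks, vCPUs) pairs mirror the table.

```python
import itertools

FAMILIES = ["DAsv5", "Ddv5", "Eav5", "Esv5", "Fv2"]
# (task count, vCPU count) pairs from the table: one task per two vCPUs.
SCALE_POINTS = [(1, 2), (2, 4), (4, 8), (8, 16), (16, 32)]

tests = [
    {"family": family, "tasks": tasks, "vcpus": vcpus}
    for family, (tasks, vcpus) in itertools.product(FAMILIES, SCALE_POINTS)
]
print(len(tests))  # -> 25 runs to populate the full table
```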

 

By executing at least one complete row and one complete column, it's possible to get a general understanding of both how the application scales out and how it behaves across different processors and RAM/core ratios. This can help reduce the number of tests needed to find the right VM type, family, and SKU, saving time and resources.

 

In this case, the row cases allow us to understand how the application behaves in a scale-out scenario with a different RAM/core ratio than on-premises. In the opposite direction, the column cases allow us to understand how the application behaves across different processors and RAM/core ratios in a scale-out situation. It's possible to select the first column instead of the last one if the scale-out results from the first test are not optimal.

 

Execute the benchmark

The next step in the process is to select a representative example of the customer's application that can be used to test the different hardware profiles in the test matrix. The results obtained from this representative example can then be extrapolated to the entire dataset.

For an efficient and straightforward way to conduct testing, it is recommended to create a custom image with all the required software installed and build a Bicep template or AZ CLI script to create the necessary number of virtual machines (VMs). Once the VMs are set up, you can launch the test execution using a Custom Script Extension, which can help reduce overhead when conducting a large number of tests.
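As a sketch of what that automation could look like, the snippet below drives the Azure CLI from Python to create one VM of the SKU under test from a custom image, then launches a benchmark script through the Linux Custom Script Extension. The resource group, image, and script names are placeholders, and it assumes the az CLI is installed and logged in.

```python
import json
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

def provision_and_benchmark(name, size, resource_group="hpc-bench-rg",
                            image="myBenchImage"):
    # Create a VM of the SKU under test from the prebuilt custom image.
    run(["az", "vm", "create",
         "--resource-group", resource_group,
         "--name", name,
         "--image", image,
         "--size", size])
    # Launch the benchmark via the Linux Custom Script Extension.
    # "bench.sh" is a placeholder for a script baked into the custom image.
    settings = json.dumps({"commandToExecute": "bash /opt/bench/bench.sh"})
    run(["az", "vm", "extension", "set",
         "--resource-group", resource_group,
         "--vm-name", name,
         "--name", "CustomScript",
         "--publisher", "Microsoft.Azure.Extensions",
         "--settings", settings])

provision_and_benchmark("bench-d16sv5", "Standard_D16s_v5")
```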

 

If your application's behavior depends on the size of the example, you should execute the test matrix once for each size identified (e.g., small, medium, large, extra-large).

 

Analyze benchmark results

After the benchmark has been completed, the table above should be populated with the timing information for each job. The results can be interpreted as follows:

  • Minimum execution time: By identifying the box with the shortest execution time, you can determine the optimal combination of VM family, SKU, and number of tasks per node to achieve the fastest possible results.
  • Cost optimization: If you multiply the data in the table by the price per minute or hour of each VM configuration, you can determine the optimal combination to achieve the results at the lowest possible cost.

It is possible that one of the combinations identified may be both the fastest and the cheapest option, in which case you have found the optimal configuration for running your jobs. However, this may not always be the case. When different combinations appear, you should decide whether to optimize for performance or cost based on your specific requirements for each scenario.
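A few lines of Python illustrate both readings of the table. The execution times and hourly prices below are made-up placeholders standing in for your benchmark results and the Azure price list.

```python
# Placeholder benchmark results: per-run execution time and hourly SKU price.
results = [
    {"sku": "Standard_F16s_v2", "minutes": 42, "price_per_hour": 0.68},
    {"sku": "Standard_D16s_v5", "minutes": 38, "price_per_hour": 0.77},
    {"sku": "Standard_E16s_v5", "minutes": 37, "price_per_hour": 1.01},
]
for r in results:
    r["cost"] = r["minutes"] / 60 * r["price_per_hour"]  # cost of one run

fastest = min(results, key=lambda r: r["minutes"])
cheapest = min(results, key=lambda r: r["cost"])
print(f"Fastest:  {fastest['sku']} ({fastest['minutes']} min)")
print(f"Cheapest: {cheapest['sku']} (${cheapest['cost']:.2f} per run)")
```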

 

Final notes

Although the process outlined in this article provides a simplified and systematic approach to identifying the appropriate SKU for your HPC workloads in common scenarios, it may not cover all the different factors that could impact your specific use case. You can consider additional questions at the beginning of the identification process to account for unique requirements. By selecting the appropriate VM families in your test matrix and answering these questions, you can more effectively identify the optimal configuration for your workload.

 

For example, if your application is heavily dependent on input/output data, factors such as network bandwidth or local NVMe/SSD drives may be more critical than the ratio of RAM to cores. Therefore, it is important to carefully evaluate the unique requirements of your workload and consider additional factors beyond those presented in this article to ensure optimal performance.

 

Good luck with your search! 
