Health checks for HPC workloads on Microsoft Azure


Introduction

 

Many HPC applications are highly parallel and tightly coupled, meaning that during an application's parallel simulation run, all parallel processes must communicate with each other frequently. These applications usually perform best when the communication between the parallel processes runs over a high-bandwidth, low-latency network such as InfiniBand. Because of this tightly coupled nature, a single VM that is not functioning optimally can impair the performance of the entire job. The purpose of these checks/tests is to help you quickly identify a non-optimal node so it can be excluded from a parallel job. If your job needs an exact number of parallel processes, a slight overprovision is good practice, in case you find a few nodes that you need to exclude.

The HB and HC SKUs were specifically designed for HPC applications. They have InfiniBand (EDR) networks, high floating-point performance, and high memory bandwidth. The tests/checks described here are designed specifically for the HB and HC SKUs. It is good practice to run these checks/tests before running a parallel job, especially a large one.

 

How to access the test/check scripts 

git clone git@github.com:Azure/azurehpc.git

Note: Scripts will be in the apps/health_checks directory.
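For example, after cloning, the check scripts can be listed with:

    ls azurehpc/apps/health_checks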

Tests/Checks

 

Check the InfiniBand network

 

This test identifies unexpected issues with the InfiniBand network. It runs a network bandwidth test on pairs of compute nodes (one process running on each compute node). A hostfile contains the list of all nodes to be tested, and the pairs of nodes are grouped in a ring. For example, if the hostfile contained 4 hosts (A, B, C, and D), the 4 node pairs tested would be (A,B), (B,C), (C,D), and (D,A).

 
A bad node can be identified by a node-pair test that fails or does not run, or that underperforms (measured network bandwidth << expected network bandwidth).
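To illustrate the ring grouping, a minimal shell sketch could look like the following (the pairing is handled for you by run_ring_osu_bw.sh; the hostlist file name here is just an example):

    # Sketch: pair the hosts in a ring -- (A,B), (B,C), ..., (last,first)
    mapfile -t hosts < hostlist
    n=${#hosts[@]}
    for ((i=0; i<n; i++)); do
        echo "${hosts[$i]} ${hosts[$(( (i+1) % n ))]}"
    done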

 

Procedure:

  1. Download the OSU micro-benchmark suite:
    1. http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.6.1.tar.gz
  2. Build/install the OSU micro-benchmark suite.
    1. module load mpi/mvapich2-2.3.1
    2. ./configure --prefix=/location/you/want/to/install CC=/opt/mvapich2-2.3.1/bin/mpicc CXX=/opt/mvapich2-2.3.1/bin/mpicxx
    3. make
    4. make install
  3. run_ring_osu_bw.sh [/full/path/to/hostlist] [/full/path/to/osu_bw] [/full/path/to/OUTPUT_DIR]
    1. The first script parameter is the full path to the hostlist, which should have a single hostname or IP address per line.
      Host1 

      Host2

      Host3
    2. The second script parameter is the full path to the osu_bw executable that you built in step 2.
    3. The third script parameter is the full path to the output directory. This is the location of the resulting output from this test.
    4. These pairwise point-to-point benchmarks run serially (each test takes under 20 seconds), so the total test time depends on how many nodes are in the hostlist file.
  4. A number of files will be created for each node pair tested. An output report called "osu_bw_report.log_PID" will also be generated in the OUTPUT_DIR directory. The second column lists the measured InfiniBand bandwidth in MB/s, sorted in ascending order, so the slowest results appear at the top of the file. Any value << 7000 MB/s should be reported and the affected nodes removed from your hostlist. If any node-pair test failed (the output file size is zero, or it contains an error), report those nodes as well and remove them from your hostlist before running your parallel job. (A consolidated command sketch for these steps follows the sample output below.)
    10.32.4.211_to_10.32.4.213_osu_bw.log_68076:4194304 7384.99 

    10.32.4.248_to_10.32.4.249_osu_bw.log_68076:4194304 7390.99

    10.32.4.142_to_10.32.4.143_osu_bw.log_68076:4194304 7394.00

    10.32.4.174_to_10.32.4.175_osu_bw.log_68076:4194304 7400.52

    10.32.4.194_to_10.32.4.195_osu_bw.log_68076:4194304 7407.01
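Putting the steps above together, the end-to-end sequence might look like the sketch below. The install prefix, hostlist location, output directory, and the path to the cloned azurehpc repository are example values to adjust for your environment, and the installed location of osu_bw can vary by OSU version, so verify it against your build. The 6500 MB/s cutoff is only an illustrative threshold.

    # Sketch: build the OSU micro-benchmarks (example prefix: $HOME/osu)
    module load mpi/mvapich2-2.3.1
    wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.6.1.tar.gz
    tar xzf osu-micro-benchmarks-5.6.1.tar.gz
    cd osu-micro-benchmarks-5.6.1
    ./configure --prefix=$HOME/osu CC=/opt/mvapich2-2.3.1/bin/mpicc CXX=/opt/mvapich2-2.3.1/bin/mpicxx
    make && make install

    # Sketch: run the ring bandwidth test (paths are examples)
    ~/azurehpc/apps/health_checks/run_ring_osu_bw.sh \
        $HOME/hostlist \
        $HOME/osu/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw \
        $HOME/osu_out

    # Sketch: flag bandwidths well below the expected ~7000 MB/s
    awk '$2 < 6500' $HOME/osu_out/osu_bw_report.log_*
    # Sketch: list node-pair logs that are empty (failed tests)
    find $HOME/osu_out -name '*_osu_bw.log_*' -size 0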

  

Check the memory on all the compute nodes

 

This test helps identify problematic memory DIMMs (for example, DIMMs that are failing or underperforming). The STREAM benchmark, which measures memory bandwidth, is run on each compute node in parallel. Bad memory on a compute node is identified by the STREAM benchmark failing or by the measured memory bandwidth being << the expected memory bandwidth.

 

Procedure:

  1. Get the STREAM source code from www.cs.virginia.edu/stream/
  2. Build STREAM with the Intel compiler:
    1. icc -o stream.intel stream.c -DSTATIC -DSTREAM_ARRAY_SIZE=3200000000 -mcmodel=large -shared-intel -Ofast -qopenmp 
  3. WCOLL=/full/path/to/hostlist pdsh /path/to/run_stream_bw.sh [/full/path/to/intel/compilervars.sh] [/full/path/to/stream] [/full/path/to/OUTPUT_DIR]
    1. The first script parameter "/full/path/to/intel/compilervars.sh" is the location of the "compilervars.sh" script in the Intel compiler installation, which will be sourced to set up the correct Intel compiler environment.
    2. The second parameter “/full/path/to/stream” is the full path to the stream executable, which was built in step 2.
    3. The third parameter “/full/path/to/OUTPUT_DIR” is the full path to the directory location where the resulting output from running this test will be deposited.
  4. A test summary report can be generated by running this script: 
    1. report_stream.sh [/full/path/to/OUTPUT_DIR] 
  5. The STREAM test report "stream_report.log_PID" lists the STREAM benchmark result for each node in ascending order (the slowest results are at the top of the file). The second column gives the node memory bandwidth in MB/s. For HB, any memory bandwidth << ~220 GB/s (and for HC, << ~180 GB/s) should be reported and the node removed from your hostlist. Any node on which this test fails should also be reported and removed from your hostlist. (A consolidated command sketch for these steps follows the sample output below.)

    cgbbv300c00009L/stream.out_27138:Triad: 231227.8 0.084653 0.083035 0.086457 

    cgbbv300c00009B/stream.out_27363:Triad: 233946.3 0.084680 0.082070 0.095031

    cgbbv300c0000BR/stream.out_28519:Triad: 234140.8 0.083516 0.082002 0.084803

    cgbbv300c00009O/stream.out_26951:Triad: 234578.7 0.082362 0.081849 0.083965

    cgbbv300c00009U/stream.out_27276:Triad: 234736.0 0.083303 0.081794 0.086764
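As with the InfiniBand check, the memory-check steps can be strung together as in the sketch below. The Intel compiler location, hostlist, output directory, and azurehpc path are example values; the 200000 MB/s cutoff is only an illustrative threshold for HB nodes (expected ~220 GB/s). The paths assume a filesystem shared by all compute nodes.

    # Sketch: build STREAM with the Intel compiler (compiler location is an example)
    source /opt/intel/bin/compilervars.sh intel64
    icc -o $HOME/stream.intel stream.c -DSTATIC -DSTREAM_ARRAY_SIZE=3200000000 -mcmodel=large -shared-intel -Ofast -qopenmp

    # Sketch: run STREAM on every node listed in the hostlist (paths are examples)
    WCOLL=$HOME/hostlist pdsh ~/azurehpc/apps/health_checks/run_stream_bw.sh \
        /opt/intel/bin/compilervars.sh $HOME/stream.intel $HOME/stream_out

    # Sketch: summarize and flag slow HB nodes (second column is Triad bandwidth in MB/s)
    ~/azurehpc/apps/health_checks/report_stream.sh $HOME/stream_out
    awk '$2 < 200000' $HOME/stream_out/stream_report.log_*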

 

Summary

 

A single parallel HPC workload may require many compute nodes for the job to complete in a reasonable time. If one of those compute nodes is configured incorrectly or performs below par, it can degrade the performance of the entire parallel job. The checks/tests described here help identify such problems, and we strongly recommend running them before any large parallel job to ensure your nodes are configured correctly.
