Introduction
HB and HC SKUs were specifically designed for HPC applications. They have InfiniBand (EDR) networks, high floating-point performance, and high memory bandwidth. The tests/checks described here are specifically designed for HB and HC SKUs, and it is good practice to run them prior to running a parallel job (especially a large parallel job).
How to access the test/check scripts
git clone git@github.com:Azure/azurehpc.git
Note: Scripts will be in the apps/health_checks directory.
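If SSH keys are not set up for GitHub, the same repository can also be cloned over HTTPS and the health check scripts accessed directly, for example:
git clone https://github.com/Azure/azurehpc.git
cd azurehpc/apps/health_checks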
Tests/Checks
Check the InfiniBand network
This test identifies unexpected issues with the InfiniBand network. It runs a network bandwidth test on pairs of compute nodes (one process running on each compute node). A hostfile contains the list of all nodes to be tested, and the nodes are paired in a ring. For example, if the hostfile contained 4 hosts (A, B, C, and D), the 4 node pairs tested would be (A,B), (B,C), (C,D), and (D,A).
A bad node can be identified by a node-pair test failing, not running, or underperforming (measured network bandwidth << the expected network bandwidth).
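For illustration, here is a minimal bash sketch (not the actual run_ring_osu_bw.sh script) that prints the ring of node pairs that would be tested for a given hostlist; the hosts.txt default and the output format are placeholders:
#!/bin/bash
# Print the ring of node pairs, e.g. (A,B) (B,C) (C,D) (D,A) for a 4-host list.
hostfile=${1:-hosts.txt}        # hostlist file, one hostname or IP address per line
mapfile -t hosts < "$hostfile"
n=${#hosts[@]}
for ((i=0; i<n; i++)); do
    j=$(( (i+1) % n ))          # wrap around so the last host pairs with the first
    echo "pair: ${hosts[i]} <-> ${hosts[j]}"
done
run_ring_osu_bw.sh launches a two-process osu_bw test on each such pair.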
Procedure:
- Download the OSU micro-benchmarks suite from the MVAPICH project site (http://mvapich.cse.ohio-state.edu/benchmarks/).
- Build and install the OSU micro-benchmarks suite:
- module load mpi/mvapich2-2.3.1
- ./configure --prefix=/location/you/want/to/install CC=/opt/mvapich2-2.3.1/bin/mpicc CXX=/opt/mvapich2-2.3.1/bin/mpicxx
- make
- make install
- Run the run_ring_osu_bw.sh script:
run_ring_osu_bw.sh [/full/path/to/hostlist] [/full/path/to/osu_bw] [/full/path/to/OUTPUT_DIR]
- The first script parameter is the full path to the hostlist, which should have a single hostname or IP address per line.
Host1
Host2
Host3
- The second script parameter is the full path to the osu_bw executable that you built in step 2.
- The third script parameter is the full path to the output directory. This is the location of the resulting output from this test.
- These pairwise point-to-point benchmarks run serially (each test takes under 20 seconds), so the total test time depends on how many nodes are in the hostlist file.
- A number of files will be created for each node pair tested, and an output report called “osu_bw_report.log_PID” will be generated in the OUTPUT_DIR directory. The second column lists the InfiniBand bandwidth in MB/s, sorted in ascending order, so the slowest results appear at the top of the file. Any value << 7000 MB/s should be reported and the affected nodes removed from your hostlist (see the sketch after the sample output below). If any of the node-pair tests failed (the output file size is zero, or it contains an error), report those nodes and remove them from your hostlist before running your parallel job.
10.32.4.211_to_10.32.4.213_osu_bw.log_68076:4194304 7384.99
10.32.4.248_to_10.32.4.249_osu_bw.log_68076:4194304 7390.99
10.32.4.142_to_10.32.4.143_osu_bw.log_68076:4194304 7394.00
10.32.4.174_to_10.32.4.175_osu_bw.log_68076:4194304 7400.52
10.32.4.194_to_10.32.4.195_osu_bw.log_68076:4194304 7407.01
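The report can be scanned for under-performing pairs. Below is a minimal sketch, assuming the report format shown above (bandwidth in MB/s as the last field); the report file name and the 6000 MB/s cut-off are illustrative and should be adjusted to your run:
# Flag node pairs whose measured bandwidth is well below the ~7000 MB/s expected here.
report=osu_bw_report.log_12345   # placeholder name; the PID suffix will differ
threshold=6000                   # illustrative cut-off in MB/s
awk -v t="$threshold" '$NF+0 < t {print "suspect pair:", $1, "->", $NF, "MB/s"}' "$report"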
Check all the compute nodes' memory
This test helps identify problematic memory DIMMs (for example, DIMMs that are failing or underperforming). The STREAM benchmark, which measures the memory bandwidth on each compute node, is used for this test and is run on all compute nodes in parallel. Bad memory on a compute node is identified by the STREAM benchmark failing or by the measured memory bandwidth being << the expected memory bandwidth.
Procedure:
- Get the STREAM source code from www.cs.virginia.edu/stream/
- Build STREAM with the Intel C compiler (with STREAM_ARRAY_SIZE=3200000000, each of the three arrays is 3.2 billion doubles, roughly 25.6 GB, which is why -mcmodel=large is required):
icc -o stream.intel stream.c -DSTATIC -DSTREAM_ARRAY_SIZE=3200000000 -mcmodel=large -shared-intel -Ofast -qopenmp
- Run the run_stream_bw.sh script on all the compute nodes in parallel, for example with pdsh (WCOLL points to the hostlist file):
WCOLL=hostlist pdsh /path/to/run_stream_bw.sh [/full/path/to/intel/compilervars.sh] [/full/path/to/stream] [/full/path/to/OUTPUT_DIR]
- The first script parameter “/full/path/to/intel/compilervars.sh” is the location of the “compilervars.sh” script in the Intel compiler environment, which will be sourced to set up the correct Intel compiler environment.
- The second parameter “/full/path/to/stream” is the full path to the stream executable, which was built in step 2.
- The third parameter “/full/path/to/OUTPUT_DIR” is the full path to the directory location where the resulting output from running this test will be deposited.
- A test summary report can be generated by running this script:
report_stream.sh [/full/path/to/OUTPUT_DIR]
- The stream test report “stream_report.log_PID” lists the STREAM benchmark result for each node in ascending order (the slowest results are at the top of the file). The second column gives the node memory bandwidth in MB/s. For HB, any memory bandwidth << ~220 GB/s (and for HC, << 180 GB/s) should be reported and the node removed from your hostlist. Any node on which this test fails should also be reported and removed from your hostlist (a sketch of this pruning step follows the sample output below).
cgbbv300c00009L/stream.out_27138:Triad: 231227.8 0.084653 0.083035 0.086457
cgbbv300c00009B/stream.out_27363:Triad: 233946.3 0.084680 0.082070 0.095031
cgbbv300c0000BR/stream.out_28519:Triad: 234140.8 0.083516 0.082002 0.084803
cgbbv300c00009O/stream.out_26951:Triad: 234578.7 0.082362 0.081849 0.083965
cgbbv300c00009U/stream.out_27276:Triad: 234736.0 0.083303 0.081794 0.086764
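The pruning step described above can be sketched as follows, assuming the stream report format shown in the sample output (node name before the first “/”, Triad bandwidth in MB/s in the second column) and that your hostlist uses the same node names; the file names and the cut-off value are illustrative:
# Drop nodes whose Triad bandwidth falls below a cut-off and write a cleaned hostlist.
report=stream_report.log_12345    # placeholder name; the PID suffix will differ
threshold=200000                  # illustrative cut-off in MB/s; use ~220 GB/s for HB, ~180 GB/s for HC
awk -v t="$threshold" '$2+0 < t {split($1, a, "/"); print a[1]}' "$report" > bad_nodes.txt
grep -v -x -f bad_nodes.txt hostlist > hostlist.clean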
Summary
A single parallel HPC workload may require many compute nodes for the job to complete in a reasonable time. If one of those compute nodes is configured incorrectly or performs below expectation, it can degrade the performance of the whole parallel job. The checks/tests described here help identify such problems, and we strongly recommend running them before any large parallel job to ensure your nodes are configured correctly.