HPC Performance and Scalability Results with Azure HBv2 VMs

This post originally appeared in the Microsoft Tech Community blog.

(Article contributed by Jon Shelly and Evan Burness, Azure)

 

Just in time for SC’19, Azure this week launched the new HBv2 Virtual Machines for High-Performance Computing (HPC) into Preview.

These VMs feature a wealth of new technology, including:

  • AMD EPYC 7742 CPUs (Rome)
  • 2.45 GHz Base clock / 3.3 GHz Boost clock
  • 480 MB L3 cache, 480 GB RAM
  • 340 GB/s of Memory Bandwidth
  • 200 Gbps HDR InfiniBand (SRIOV) with Adaptive Routing
  • 900 GB SSD (NVMeDirect)

 

Below are initial performance characterizations across a variety of configurations, covering both microbenchmarks and the commonly used HPC applications for which the HB family of VMs is optimized.

 

Microbenchmarks

 

MPI Latency

OSU Micro-Benchmarks (5.6.2) – osu_latency, measured with HPC-X, Intel MPI, MVAPICH2, and OpenMPI (latency in microseconds)

 

| Message Size (bytes) | HPC-X | Intel MPI | MVAPICH2 | OpenMPI |
|---------------------:|------:|----------:|---------:|--------:|
| 0    | 1.62 | 1.95 | 1.85 | 1.61 |
| 1    | 1.62 | 1.95 | 1.90 | 1.61 |
| 2    | 1.61 | 1.95 | 1.90 | 1.61 |
| 4    | 1.62 | 1.95 | 1.90 | 1.61 |
| 8    | 1.61 | 1.96 | 1.90 | 1.61 |
| 16   | 1.62 | 1.96 | 1.93 | 1.62 |
| 32   | 1.77 | 1.97 | 1.94 | 1.77 |
| 64   | 1.83 | 2.03 | 2.08 | 1.82 |
| 128  | 1.90 | 2.09 | 2.29 | 1.90 |
| 256  | 2.44 | 2.65 | 2.78 | 2.44 |
| 512  | 2.53 | 2.71 | 2.84 | 2.53 |
| 1024 | 2.63 | 2.84 | 2.93 | 2.62 |
| 2048 | 2.92 | 3.09 | 3.13 | 2.92 |
| 4096 | 3.72 | 3.89 | 4.07 | 3.74 |

 

 

 

MPI Bandwidth (2 QPs)

OSU Micro-Benchmarks (5.6.2) – osu_mbw_mr with 2 processes per node (ppn = 2), MPI = HPC-X (bandwidth in MB/s)

 

| Message Size (bytes) | Peak BW (MB/s) | Average BW (MB/s) |
|---------------------:|---------------:|------------------:|
| 4096    | 15920.65 | 15902.67 |
| 8192    | 23045.57 | 23036.88 |
| 16384   | 23270.14 | 23260.04 |
| 32768   | 23376.91 | 23372.90 |
| 65536   | 23423.49 | 23423.23 |
| 131072  | 23445.05 | 23443.60 |
| 262144  | 23463.94 | 23463.93 |
| 524288  | 23470.70 | 23470.55 |
| 1048576 | 23474.30 | 23474.08 |
| 2097152 | 23475.77 | 23475.73 |
| 4194304 | 23476.61 | 23476.61 |
| 8388608 | 23477.06 | 23477.05 |
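As a sanity check, the measured peak can be compared against the nominal 200 Gbps HDR InfiniBand line rate from the spec list above. A minimal sketch, assuming OSU's MB/s means 10^6 bytes per second (so 200 Gbps corresponds to 25,000 MB/s):

```python
# Nominal HDR InfiniBand line rate: 200 Gbps = 200,000 Mbps / 8 = 25,000 MB/s,
# using decimal (10^6-byte) megabytes as OSU reports them.
LINE_RATE_MBPS = 200_000 / 8


def line_rate_fraction(measured_mbps: float) -> float:
    """Return measured bandwidth as a fraction of the nominal line rate."""
    return measured_mbps / LINE_RATE_MBPS


# Peak bandwidth at the largest message size (8 MB) from the table above:
frac = line_rate_fraction(23477.06)
print(f"{frac:.1%}")  # prints "93.9%"
```

At large message sizes the two queue pairs together sustain roughly 94% of nominal line rate, which is about as close to wire speed as a verbs-level benchmark gets once protocol overheads are accounted for.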

 

 

Application Benchmarks

 

App: Siemens Star-CCM+

Version: 14.06.004

Model: LeMans 100M Coupled Solver

Configuration Details: 116 MPI ranks were run in each HBv2 VM (4 ranks on each of 29 NUMA domains) in order to leave nominal resources for Linux background processes. Adaptive Routing was enabled, DCT (Dynamic Connected Transport) was used as the transport layer, and HPC-X version 2.50 was used for MPI. The Azure CentOS HPC 7.6 image from https://github.com/Azure/azhpc-images was used.

| VMs | Cores | PPN | SETime (s) | SpeedUp | Parallel Eff. (%) |
|----:|------:|----:|-----------:|--------:|------------------:|
| 1   | 116   | 116 | 258.92 | 116      | 100   |
| 2   | 232   | 116 | 129.56 | 231.82   | 99.9  |
| 4   | 464   | 116 | 62.01  | 484.35   | 104.4 |
| 16  | 1856  | 116 | 16.46  | 1824.71  | 98.3  |
| 32  | 3712  | 116 | 8.4    | 3575.56  | 96.3  |
| 64  | 7424  | 116 | 4.8    | 6257.23  | 84.3  |
| 128 | 14848 | 116 | 2.5    | 12013.89 | 80.9  |
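The SpeedUp and Parallel Eff. columns follow directly from the elapsed times: speedup is normalized so the single-VM run (116 ranks) counts as 116, and efficiency is speedup divided by total cores. A minimal sketch of that arithmetic (function names are mine):

```python
def speedup(t_base: float, t_n: float, base_cores: int = 116) -> float:
    """Speedup normalized so the 1-VM baseline counts as base_cores."""
    return base_cores * t_base / t_n


def parallel_efficiency(sp: float, cores: int) -> float:
    """Parallel efficiency in percent: speedup relative to cores used."""
    return 100.0 * sp / cores


# Reproduce the 128-VM row from the table above:
sp = speedup(258.92, 2.5)             # ~12013.89
eff = parallel_efficiency(sp, 14848)  # ~80.9
print(round(sp, 2), round(eff, 1))
```

The same two formulas reproduce every row of the table from the SETime column alone, which is a handy consistency check when re-running the benchmark.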

 

[Chart: Star-CCM+ scaling results from the table above]

 

Summary: Star-CCM+ scaled at 81% efficiency to nearly 15,000 MPI ranks, delivering an application speedup of more than 12,000x. This compares favorably to Azure’s previous best of more than 11,500 MPI ranks, which was itself a world record for MPI scalability on the public cloud.

 

 

 

 

App: ANSYS Fluent

Version: 14.06.004

Model: External Flow over a Formula-1 Race Car (f1_racecar_140m)

Configuration Details: 60 MPI ranks were run in each HBv2 VM (2 of the 4 cores per NUMA domain) in order to leave nominal resources for Linux background processes and to provide ~6 GB/s of memory bandwidth per core. Adaptive Routing was enabled, DCT (Dynamic Connected Transport) was used as the transport layer, and HPC-X version 2.50 was used for MPI. The Azure CentOS HPC 7.6 image from https://github.com/Azure/azhpc-images was used.

 

| VMs | HBv2 Solver Rating | HBv2 Speedup | Linear (Ideal) Speedup |
|----:|-------------------:|-------------:|-----------------------:|
| 1   | 68.5   | 1      | 1   |
| 2   | 134.5  | 1.96   | 2   |
| 4   | 275.9  | 4.03   | 4   |
| 8   | 557.8  | 8.14   | 8   |
| 16  | 1122.1 | 16.38  | 16  |
| 32  | 2385.1 | 34.82  | 32  |
| 64  | 4601.9 | 67.18  | 64  |
| 128 | 9846.2 | 143.74 | 128 |
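Because Fluent’s Solver Rating is a throughput metric (higher is better), the speedup column is simply each rating divided by the single-VM rating; values above the VM count indicate super-linear scaling. A minimal sketch (function name is mine):

```python
def rating_speedup(rating_n: float, rating_1: float = 68.5) -> float:
    """Speedup implied by a Fluent Solver Rating relative to the 1-VM run."""
    return rating_n / rating_1


# Reproduce the 128-VM row: 9846.2 / 68.5
sp = rating_speedup(9846.2)
print(round(sp, 2))  # prints 143.74, well above the ideal 128
```

The super-linear behavior at high VM counts is consistent with the working set increasingly fitting into the aggregate 480 MB of L3 cache per VM as the per-rank partition shrinks.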

 

 

Summary: HBv2 VMs scale super-linearly (112% efficiency) up to the largest number of VMs measured (128). The Fluent Solver Rating at this top-end scale is 83% higher than the current leading submission for this model in ANSYS’ public benchmark database (https://bit.ly/2OdAExM).

 

 

Impact of Adaptive Routing

 

App: Siemens Star-CCM+

Version: 14.06.004

Model: LeMans 100M Coupled Solver

Configuration Details: Star-CCM+ performance was compared on an “apples to apples” basis, with Adaptive Routing disabled versus enabled as the only variable.

 

[Chart: Star-CCM+ scaling with Adaptive Routing disabled vs. enabled]

 

Summary: Adaptive Routing, designed to drive higher sustained application scalability for large MPI jobs, delivered a 17% scaling-efficiency improvement over an identical job run with the feature disabled. This translates to faster time to solution and more efficient use of application licenses.
