Adaptive Routing (AR) allows Azure Virtual Machines (VMs) running EDR and HDR InfiniBand to automatically detect network congestion and avoid it by dynamically selecting better-performing network paths. As a result, AR offers improved latency and bandwidth on the InfiniBand network, which in turn drives higher performance and scaling efficiency.
As of March 1, 2020, AR is enabled on the following VM families:
- HBv2
- NDrv2
Over the next several weeks, AR will also roll out to the following VM families:
- HB
- HC
In this post, we discuss AR configuration in Azure HPC clusters and its implications for MPI libraries and communication runtimes based on InfiniBand.
AR and Legacy MPIs/Communication Runtimes
Adaptive Routing allows network packets to use different network routes, which can result in out-of-order packet arrivals. Certain protocols/optimizations in MPI libraries and communication runtimes assume in-order arrival of network packets. Although this ordering is not guaranteed by the InfiniBand specification from the InfiniBand Trade Association, Mellanox's InfiniBand implementations in recent years have nonetheless delivered packets in order.
With the arrival of AR capability on modern Mellanox InfiniBand switches, however, in-order packet arrival is no longer guaranteed, as the switches may reorder packets to provide optimal network paths and flows.
As a result, protocols/optimizations that rely on the assumption of in-order packet arrival may produce errors not seen previously. Such protocols/optimizations include the rdma-exchange protocol in IntelMPI (2018 and earlier), RDMA-Fast-Path in MVAPICH2, and the eager-rdma protocol in OpenMPI (openib btl). Note that this issue arises only when these protocols are used for messages larger than one InfiniBand MTU.
Fear not, though. In Azure HPC, we consider all legacy MPIs and runtimes, and we configure AR in such a way that they remain well supported.
AR and Service Levels
Adaptive Routing is enabled per Service Level (SL). The SL is specified during the InfiniBand Queue Pair (QP) initialization phase of MPI libraries and communication runtimes. A preferred SL can be specified through environment parameters exposed by MPI libraries (e.g., UCX_IB_SL=1, which instructs the UCX runtime to use Service Level 1).
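As a minimal sketch, a UCX-based MPI job could request a non-default SL by exporting the parameter before launch. The application name and process count below are placeholders; depending on your launcher, you may also need to explicitly forward the variable to the remote ranks (see the per-library examples later in this post).

```bash
# Minimal sketch: ask the UCX runtime to use InfiniBand Service Level 1.
# The application name and process count are placeholders for your own job.
export UCX_IB_SL=1
mpirun -np 16 ./my_mpi_app
```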
Azure HPC AR Configuration
In Azure HPC clusters, Adaptive Routing is enabled on all SLs except SL=0 (i.e., the default SL). This way, all legacy MPIs and communication runtimes work well on Azure HPC clusters without any modification. MPI libraries and communication runtimes that are optimized for Adaptive Routing can take advantage of AR by specifying a non-default SL (e.g., SL=1).
The following are the environment parameters that specify a non-default SL in various MPI libraries and runtimes; using them enables Adaptive Routing. Example launch lines follow the list.
- UCX: UCX_IB_SL. For transport-specific SL configuration, use the corresponding parameter for the transport type (RC/DC/UD): UCX_RC_VERBS_SL, UCX_RC_MLX5_SL, UCX_DC_MLX5_SL, UCX_UD_VERBS_SL
- HPC-X (over UCX): refer to the UCX environment parameters above
- IntelMPI 2018: DAPL_IB_SL
- IntelMPI 2019 (over UCX): refer to the UCX environment parameters above
- MVAPICH2: MV2_DEFAULT_SERVICE_LEVEL
- OpenMPI (over OpenIB): MCA parameter btl_openib_ib_service_level
- OpenMPI (over UCX): refer to the UCX environment parameters above
- NCCL: NCCL_IB_SL
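For illustration, the launch lines below sketch how a non-default SL might be passed to some of these libraries. They assume standard launchers (OpenMPI's -x option, Intel MPI's -genv option); the application names, process counts, and the NCCL job script are placeholders, not part of any particular product.

```bash
# Sketches only: application names and process counts are placeholders.

# HPC-X / OpenMPI over UCX: forward UCX_IB_SL to all ranks with OpenMPI's -x option.
mpirun -np 16 -x UCX_IB_SL=1 ./my_mpi_app

# IntelMPI 2018: set the DAPL service level with the -genv option.
mpirun -np 16 -genv DAPL_IB_SL 1 ./my_mpi_app

# MVAPICH2: set the default service level via its environment parameter.
MV2_DEFAULT_SERVICE_LEVEL=1 mpirun -np 16 ./my_mpi_app

# OpenMPI over the openib btl: pass the MCA parameter on the command line.
mpirun -np 16 --mca btl_openib_ib_service_level 1 ./my_mpi_app

# NCCL: export the service level before starting the (placeholder) training job.
export NCCL_IB_SL=1
./my_nccl_training_job
```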