
NCCL Performance Impact with PCIe Relaxed Ordering

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.

Introduction

 

Over the past several years, neural networks have proven to be an incredibly effective tool in the field of Artificial Intelligence. As problems and tasks become more complex and creative, training a neural network model inevitably involves massive amounts of data and computational resources. Training at this scale requires large multi-GPU clusters, which brings significantly more inter-node communication.

 

The NCCL library provides topology-aware inter-GPU communication primitives that can be easily integrated into applications. To deliver optimal inter-GPU performance, GPUDirect RDMA technology is commonly used. It provides direct communication between NVIDIA GPUs in remote systems, bypassing the system CPUs and eliminating the buffer copies of data through system memory, which results in a significant performance boost [1].

 

GPUDirect RDMA in a virtualized environment requires ATS (Address Translation Services) to be enabled on the network adapter [2]. ATS extends the PCIe protocol with a Translation Agent (TA) that translates DMA addresses, while an Address Translation Cache (ATC) located in the device stores those translations, reducing the processing load on the TA and enhancing system performance [3]. Since more memory transactions are generated with ATS, PCIe Relaxed Ordering consequently plays an important role here.

 

PCIe Relaxed Ordering

 

PCI Express supports the Relaxed Ordering (RO) mechanism introduced by PCI-X. In PCI Express, Relaxed Ordering allows switches in the path between the Producer and Consumer to reorder some newly received transactions ahead of others that were previously enqueued.

 

The ordering rules that exist to support the Producer/Consumer model can cause transactions to be blocked even when the blocked transactions are completely unrelated to any Producer/Consumer transaction sequence. Consequently, in certain circumstances, a transaction with its Relaxed Ordering attribute bit set may be re-ordered ahead of other transactions [4].

 

As a PCIe feature, Relaxed Ordering allows flexibility in the transaction order over PCIe. This reduces blocking on the link and can greatly help the performance of InfiniBand networks in virtualized environments.
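Whether RO is actually enabled on a given adapter can be verified from the PCIe Device Control register, where bit 4 is the Enable Relaxed Ordering bit. As a sketch (the helper name ro_state is chosen here for illustration, and the register read itself requires root and a real device address):

```shell
# Decode the Enable Relaxed Ordering bit (bit 4) of a PCIe Device Control
# register value, as read for example with:
#   setpci -s <bus:dev.fn> CAP_EXP+8.w
# (lspci -vvv reports the same bit as "RlxdOrd+" / "RlxdOrd-" in the DevCtl line.)
ro_state() {
    # $1 is the 16-bit Device Control value in hex, e.g. "2910"
    val=$((0x$1))
    if [ $(( (val >> 4) & 1 )) -eq 1 ]; then
        echo "RlxdOrd+ (enabled)"
    else
        echo "RlxdOrd- (disabled)"
    fi
}

ro_state 2910   # bit 4 set -> enabled
ro_state 2900   # bit 4 clear -> disabled
```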

 

In this blog, we demonstrate the performance impact of PCIe Relaxed Ordering with the NCCL Allreduce benchmark across two VMs on Azure.

 

Experiment Setup

 

 

Experiment results

 

Since NCCL 2.12, the environment variable NCCL_IB_PCI_RELAXED_ORDERING has been available to enable or disable PCIe Relaxed Ordering for the IB Verbs transport directly. By default, this variable is set to automatically use Relaxed Ordering if available. Azure HPC images already ship with NCCL 2.12 or higher prebuilt, so we can easily turn it on or off to check the NCCL performance impact.

 

RO Disabled (NCCL_IB_PCI_RELAXED_ORDERING=0)

 

 

mpirun -np 16 --map-by ppr:8:node -hostfile hostfile  \
       -mca coll_hcoll_enable 0 --bind-to numa \
       -x NCCL_IB_PCI_RELAXED_ORDERING=0 \
       -x LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH \
       -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
       -x NCCL_SOCKET_IFNAME=eth0 \
       -x NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
       -x NCCL_DEBUG=WARN \
       /opt/nccl-tests/build/${TEST} -b 8 -e 8G -f 2 -g 1 -c 0

 

 

Fig.1 NCCL Allreduce performance with RO disabled for the IB Verbs transport

 

RO Enabled (NCCL_IB_PCI_RELAXED_ORDERING=1)

 

 

mpirun -np 16 --map-by ppr:8:node -hostfile hostfile  \
       -mca coll_hcoll_enable 0 --bind-to numa \
       -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
       -x LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH \
       -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
       -x NCCL_SOCKET_IFNAME=eth0 \
       -x NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
       -x NCCL_DEBUG=WARN \
       /opt/nccl-tests/build/${TEST} -b 8 -e 8G -f 2 -g 1 -c 0

 

 

 

Fig.2 NCCL Allreduce performance with RO enabled for the IB Verbs transport

 

As shown in Fig.1, the maximum in-place busbw with RO disabled is only 25 GB/s (at the 2M message size), whereas with RO enabled it delivers 188 GB/s in-place busbw, as shown in Fig.2. Enabling RO brings a more than 7X performance improvement.
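The improvement factor follows directly from the two busbw numbers reported above:

```shell
# In-place busbw from Fig.1 (RO disabled) and Fig.2 (RO enabled), in GB/s
ro_disabled=25
ro_enabled=188
# 188 / 25 = 7.52, so the improvement is better than 7X
speedup=$((ro_enabled / ro_disabled))
echo "${speedup}X"   # prints "7X"
```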

 

Other options to control PCIe Relaxed Ordering

 

For NCCL versions earlier than 2.12, the environment variable NCCL_IB_PCI_RELAXED_ORDERING is not available, but there are two ways to control PCIe Relaxed Ordering through the NCCL IB plugin. It's worth noting that both options are already integrated into Azure HPC images.

  1. Build nccl-rdma-sharp-plugins from source and add the built library to LD_LIBRARY_PATH. In the Azure HPC image, it is already located under /usr/local/nccl-rdma-sharp-plugins; add it with LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH
  2. Use the NCCL IB plugin shipped with HPC-X and load it via the module command. In the Azure HPC image, it is already located under /opt/hpcx-v2.11-gcc-MLNX_OFED_LINUX-5-ubuntu18.04-cuda11-gdrcopy2-nccl2.11-x86_64/nccl_rdma_sharp_plugin; load it with module load mpi/hpcx
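The two options above can be combined in a small launcher helper. This is only a sketch: choose_nccl_plugin is a hypothetical helper name, and the directory path is the one the Azure HPC image uses as quoted above.

```shell
# Print the command that exposes the NCCL IB plugin: prefer the standalone
# nccl-rdma-sharp-plugins build if its library directory exists, otherwise
# fall back to loading the HPC-X module. (choose_nccl_plugin is hypothetical.)
choose_nccl_plugin() {
    plugin_dir="$1"
    if [ -d "$plugin_dir" ]; then
        echo "export LD_LIBRARY_PATH=$plugin_dir:\$LD_LIBRARY_PATH"
    else
        echo "module load mpi/hpcx"
    fi
}

choose_nccl_plugin /usr/local/nccl-rdma-sharp-plugins/lib
```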

 

Summary

 

This blog demonstrates that NCCL performance can be greatly boosted by enabling PCIe Relaxed Ordering. Azure HPC images provide the user with different ways to enable PCIe Relaxed Ordering via an environment variable, either through a newer version of NCCL (2.12 onwards) or through the NCCL IB plugin.

 

References

 

[1] https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/overview.html

[2] RDG: Virtualizing GPU-Accelerated HPC & AI Workloads on OpenStack Cloud over InfiniBand Fabric

[3] Address Translation Services, Revision 1.1 (composter.com.ua)

[4] PCI Express System Architecture, Tom Shanley, Don Anderson, Ravi Budruk; MindShare, Inc.

 

#AzureHPCAI

 
