There has been growing interest in the HPC-optimized VM images we publish, due to:
- the GA of the SR-IOV enabled HPC VMs (HB, HC, HB_v2),
- the recent platform update that made NCr_v3 SR-IOV enabled, and
- the GA of NDr_v2.
While those images (CentOS-HPC 7.6, 7.7) were originally targeted at the SR-IOV enabled HPC VMs (HB, HC, HB_v2), they are conceptually just as useful on the now SR-IOV enabled GPU VMs (NCr_v3, NDr_v2). Note that the GPU VMs additionally require the Nvidia GPU drivers (installed via the VM extension or manually).
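For illustration, one way to add the driver is the NVIDIA GPU Driver Extension for Linux; the sketch below is not from the original post, and the resource group and VM names are placeholders to adapt as needed.
# Illustrative only: attach the NVIDIA GPU Driver Extension to an existing Linux GPU VM.
# <myResourceGroup> and <myVM> are placeholder names.
az vm extension set \
  --resource-group <myResourceGroup> \
  --vm-name <myVM> \
  --publisher Microsoft.HpcCompute \
  --name NvidiaGpuDriverLinux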
Typically, we find that users running traditional HPC applications on the HPC VMs prefer CentOS as their OS, while users running AI/ML applications on the GPU VMs tend to prefer Ubuntu. The CentOS-HPC VM OS images (>=7.6 for the SR-IOV enabled VMs, <=7.5 for the non-SR-IOV enabled VMs) provide a ready-to-use VM image with the appropriate drivers and MPI runtimes. Such a pre-packaged, ready-to-use experience isn't yet available for Ubuntu on Azure.
This article consolidates guidance on configuring InfiniBand (IB) for Ubuntu across both SR-IOV and non-SR-IOV enabled HPC and GPU VMs. Specifically, it focuses on getting the right drivers set up and bringing up the appropriate IB interface on the VMs. At the time of writing, the following steps apply at least to the Ubuntu 18.04 LTS image published by Canonical on the Azure Marketplace.
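As an aside not in the original steps: a quick way to tell which of the modes below applies to a given VM is to look at its PCI devices, since the SR-IOV enabled sizes expose the Mellanox NIC as a virtual function while the non-SR-IOV sizes do not.
# On an SR-IOV enabled size this should list a Mellanox ConnectX virtual function; on a non-SR-IOV size it lists nothing.
lspci | grep -i mellanox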
Non-SR-IOV enabled VMs
- Install dapl (and its dependencies rdma_cm, ibverbs), and the user-mode mlx4 library.
  sudo apt-get update
  sudo apt-get install libdapl2 libmlx4-1
- In /etc/waagent.conf, enable RDMA by uncommenting the following configuration lines (root access required; see the sketch after this list for a scripted way to do this).
  OS.EnableRDMA=y
  OS.UpdateRdmaDriver=y
- Restart the waagent service.
  sudo systemctl restart walinuxagent.service
The IB interface eth1 should come up with an RDMA IP address.
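For reference, the waagent.conf edit and the eth1 check can also be scripted; the sed expression below is an illustrative sketch (it assumes the two settings are present in the file as commented-out lines), not a command from the original post.
# Uncomment the two RDMA settings in /etc/waagent.conf, then restart the agent.
sudo sed -i -e 's/^# *OS.EnableRDMA=y/OS.EnableRDMA=y/' -e 's/^# *OS.UpdateRdmaDriver=y/OS.UpdateRdmaDriver=y/' /etc/waagent.conf
sudo systemctl restart walinuxagent.service
# eth1 should then report the RDMA IP address.
ip addr show eth1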
The IB-related kernel modules are no longer auto-loaded on Ubuntu. This is a departure from the earlier practice of building the kernel modules into the image; they are now shipped as loadable modules so that a user can instead install the Mellanox OFED driver if desired.
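As an illustrative check (not part of the original post), module availability and load state can be inspected as follows; the package that ships these modules may vary with the kernel in use.
# Confirm a module is available to the running kernel, and list the IB/RDMA modules currently loaded.
modinfo ib_ipoib | head -n 3
lsmod | grep -E '^(ib_|rdma_|mlx)'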
SR-IOV enabled VMs with inbox driver
- Load the following kernel modules (either with modprobe, or by adding them to /etc/modules so they are loaded on boot); see the sketch after this section.
  ib_uverbs
  rdma_ucm
  ib_umad
  ib_ipoib
- Reboot the VM.
  sudo reboot
The IB interface ib0 should come up with an RDMA IP address.
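As a sketch of the module step above (illustrative, not verbatim from the original post), the modules can be loaded immediately and persisted in /etc/modules, and the interface checked after the reboot:
# Load the modules now and append them to /etc/modules so they load on every boot.
for m in ib_uverbs rdma_ucm ib_umad ib_ipoib; do
  sudo modprobe $m
  echo $m | sudo tee -a /etc/modules
done
# After the reboot, ib0 should be present with the RDMA IP address.
ip addr show ib0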
SR-IOV enabled VMs with OFED driver
- The appropriate Mellanox OFED driver can be downloaded and installed as shown below.
  wget http://content.mellanox.com/ofed/MLNX_OFED-5.0-1.0.0.0/MLNX_OFED_LINUX-5.0-1.0.0.0-ubuntu18.04-x86_64.tgz
  tar zxvf MLNX_OFED_LINUX-5.0-1.0.0.0-ubuntu18.04-x86_64.tgz
  sudo ./MLNX_OFED_LINUX-5.0-1.0.0.0-ubuntu18.04-x86_64/mlnxofedinstall --add-kernel-support
- Load the new driver.
  sudo /etc/init.d/openibd restart
- Assign the RDMA IP address to the ib0 interface.
  IP=$(sudo sed '/rdmaIPv4Address=/!d;s/.*rdmaIPv4Address="\([0-9.]*\)".*/\1/' /var/lib/waagent/SharedConfig.xml)/16
  sudo ifconfig ib0 $IP
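To confirm the result of these steps, checks along the following lines can be used (illustrative; ofed_info and ibstat are installed as part of MLNX_OFED):
ofed_info -s        # installed MLNX_OFED version
ibstat              # IB port state (should show Active)
ip addr show ib0    # the RDMA IP assigned above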
The following is an optional step, applicable to all three modes above, though not strictly related to the IB configuration discussed so far. When running applications as a non-root user, set the following memory limits in /etc/security/limits.conf.
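The specific limits are not reproduced in this excerpt; as an assumption based on common RDMA practice (not values from the original post), unlimited locked memory for the account running the application is the typical setting, for example:
# /etc/security/limits.conf -- illustrative entries; <user> is a placeholder for the non-root account.
<user>  hard  memlock  unlimited
<user>  soft  memlock  unlimited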