There has been growing interest in the HPC-optimized VM images that we publish, due to:
- the GA of the SR-IOV enabled HPC VMs (HB, HC, HBv2),
- a recent platform update making NCv3 SR-IOV enabled, and
- the GA of NDv2.
While those images (CentOS-HPC 7.6, 7.7) primarily target the SR-IOV enabled HPC VMs (HB, HC, HBv2), they are conceptually just as useful on the now SR-IOV enabled GPU VMs (NCv3, NDv2). Note that the GPU VMs additionally require the NVIDIA GPU drivers, installed either through the VM extension or manually.
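As one illustration, the driver can be added to an existing Linux GPU VM through the NVIDIA GPU driver extension with the Azure CLI. This is a minimal sketch; the resource group and VM names below are placeholders:

```bash
# Install the NVIDIA GPU driver extension on an existing Linux GPU VM
# (resource group and VM name are placeholders).
az vm extension set \
  --resource-group <my-resource-group> \
  --vm-name <my-gpu-vm> \
  --publisher Microsoft.HpcCompute \
  --name NvidiaGpuDriverLinux
```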
Typically, we find that users running traditional HPC applications on the HPC VMs prefer CentOS as their OS, while users running AI/ML applications on the GPU VMs tend to prefer Ubuntu. The CentOS-HPC VM OS images (>= 7.6 for the SR-IOV enabled VMs, <= 7.5 for the non-SR-IOV enabled VMs) provide a ready-to-use VM image with the appropriate drivers and MPI runtimes.
This article consolidates guidance on configuring InfiniBand (IB) for Ubuntu across both SR-IOV and non-SR-IOV enabled HPC and GPU VMs. Specifically, it focuses on setting up the right drivers and bringing up the appropriate IB interface on the VMs. At the time of writing, the following steps apply at least to the Ubuntu 18.04 LTS image published by Canonical on the Azure Marketplace.
NOTE: This article was written in March 2020. Much has happened since then, including the GA of new H* and N* VM sizes as well as newer CentOS-HPC VM image versions. An Ubuntu-HPC VM image (for the newer SR-IOV enabled VM sizes) is also now available. See the HPC VM image documentation and the TechCommunity blog on HPC VM images for more details.
Non-SR-IOV enabled VMs
The IB interface eth1 should come up with an RDMA IP address.
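A quick way to verify this (a minimal check, assuming the default interface naming) is to inspect eth1 directly:

```bash
# eth1 should be UP and carry an IP address on the RDMA network.
ip addr show eth1
```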
The IB-related kernel modules are no longer auto-loaded on Ubuntu. This is a departure from earlier practice, where the kernel modules were built into the image; they are now shipped as loadable modules so that users can install the Mellanox OFED driver themselves.
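For reference, a Mellanox OFED install on Ubuntu typically looks like the sketch below; the `<version>` in the tarball name is a placeholder, and the actual release should be picked from Mellanox's download page to match your Ubuntu release and kernel:

```bash
# Unpack the MLNX_OFED tarball downloaded from Mellanox/NVIDIA
# (<version> is a placeholder for the release you chose).
tar -xzf MLNX_OFED_LINUX-<version>-ubuntu18.04-x86_64.tgz
cd MLNX_OFED_LINUX-<version>-ubuntu18.04-x86_64

# Install, rebuilding the kernel modules against the running kernel.
sudo ./mlnxofedinstall --add-kernel-support
```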
Support for the NetworkDirect driver stack (the vmbus RDMA driver required on the non-SR-IOV VMs) was dropped in the 5.3 kernel shipped with the 18.04-LTS 18.04.202004290 image in the Marketplace. This can prevent the IB interface from coming up, as reported here. This may be addressed with Canonical starting with kernel 5.4.
As a workaround, an older image with the 5.0 kernel (say Canonical UbuntuServer 18.04-LTS 18.04.202004080 with the 5.0.0-1036-azure kernel) still has the missing "hv_network_direct" module and works fine.
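To apply the workaround, the older image version can be pinned explicitly when creating the VM. The sketch below uses the Azure CLI; the resource group and VM names are placeholders, and Standard_H16r is just one example of a non-SR-IOV RDMA-capable size:

```bash
# Deploy the pinned older image version (the --image URN format is
# publisher:offer:sku:version).
az vm create \
  --resource-group <my-resource-group> \
  --name <my-vm> \
  --size Standard_H16r \
  --image Canonical:UbuntuServer:18.04-LTS:18.04.202004080

# Inside the VM, confirm the NetworkDirect module is present and loads.
sudo modprobe hv_network_direct
lsmod | grep hv_network_direct
```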
Ubuntu 20.04 does not exhibit this issue either.