Configuring InfiniBand for Ubuntu HPC and GPU VMs

This post has been republished via RSS; it originally appeared at: Azure Compute articles.

Interest in the HPC-optimized VM images that we publish has grown because of:

  1. GA of SR-IOV enabled HPC VMs (HB, HC, HB_v2),
  2. Recent platform update to make NCr_v3 SR-IOV enabled,
  3. GA of NDr_v2

While those images (CentOS-HPC 7.6, 7.7) were originally targeted at the SR-IOV enabled HPC VMs (HB, HC, HB_v2), they are conceptually useful for the GPU VMs that are now SR-IOV enabled (NCr_v3, NDr_v2) as well. Note that the GPU VMs additionally require the Nvidia GPU drivers, installed either through the VM extension or manually.

 

Typically, we find that users of the HPC VMs running traditional HPC applications prefer CentOS as their OS, while users of AI/ML applications running on the GPU VMs tend to prefer Ubuntu. The CentOS-HPC VM OS images (>=7.6 for the SR-IOV enabled VMs, and <=7.5 for the non-SR-IOV enabled VMs) provide a ready-to-use VM image with the appropriate drivers and MPI runtimes. Such a pre-packaged, ready-to-use experience isn't yet available for Ubuntu on Azure.

 

This article attempts to consolidate guidance on configuring InfiniBand (IB) for Ubuntu across both SR-IOV and non-SR-IOV enabled HPC and GPU VMs. Specifically, it focuses on setting up the right drivers and bringing up the appropriate IB interface on the VMs. At the time of writing, the following steps apply at least to the Ubuntu 18.04 LTS image by Canonical on the Azure Marketplace.

 

Non-SR-IOV enabled VMs

  1. Install dapl (and its dependencies rdma_cm, ibverbs) and the user-mode mlx4 library.

    sudo apt-get update
    sudo apt-get install libdapl2 libmlx4-1
  2. In /etc/waagent.conf, enable RDMA by uncommenting the following configuration lines (root access)

    OS.EnableRDMA=y
    OS.UpdateRdmaDriver=y
  3. Restart the waagent service

    sudo systemctl restart walinuxagent.service

     

The IB interface eth1 should come up with an RDMA IP address.
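
Step 2 above can also be scripted rather than edited by hand. The following is a minimal sketch, assuming the two settings are present in the file as commented-out lines (for example `# OS.EnableRDMA=y`); `enable_rdma_conf` is a hypothetical helper name, not something shipped with waagent.

```shell
#!/bin/sh
# Sketch: uncomment the two RDMA settings in a waagent.conf-style file.
# enable_rdma_conf is a hypothetical helper, not part of waagent.
enable_rdma_conf() {
    conf="$1"
    # Strip the leading "#" from the two RDMA keys only; a .bak copy is kept.
    sed -i.bak -E \
        's/^[[:space:]]*#[[:space:]]*(OS\.(EnableRDMA|UpdateRdmaDriver)=y)[[:space:]]*$/\1/' \
        "$conf"
    # Return success only if both settings are now active.
    grep -q '^OS\.EnableRDMA=y$' "$conf" &&
        grep -q '^OS\.UpdateRdmaDriver=y$' "$conf"
}
```

Run as root, the helper would be called as `enable_rdma_conf /etc/waagent.conf`, followed by the service restart from step 3. If the settings are missing from the file entirely, the function returns a non-zero status instead of adding them.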

 

The IB-related kernel modules are no longer auto-loaded on Ubuntu. This is a departure from earlier practice, where the kernel modules were built into the image. They are now available as loadable modules so that a user can install the Mellanox OFED driver.

 

SR-IOV enabled VMs with inbox driver

  1. Load the following kernel modules (either with modprobe or by adding them to /etc/modules)

    ib_uverbs
    rdma_ucm
    ib_umad
    ib_ipoib
  2. Reboot VM

    sudo reboot

     

The IB interface ib0 should come up with an RDMA IP address.
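
For the /etc/modules route in step 1, the module names need to be listed one per line. The sketch below appends each name only if it is not already there, so it can be re-run safely; `persist_modules` is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# Sketch: list kernel modules in an /etc/modules-style file, one per line.
# persist_modules is a hypothetical helper, not a standard tool.
persist_modules() {
    file="$1"
    shift
    for mod in "$@"; do
        # -x matches the whole line, -F treats the name as a fixed string.
        grep -qxF "$mod" "$file" 2>/dev/null || printf '%s\n' "$mod" >> "$file"
    done
}

# Possible usage as root (left commented out so sourcing has no side effects):
#   persist_modules /etc/modules ib_uverbs rdma_ucm ib_umad ib_ipoib
# To load the modules immediately instead of waiting for the reboot:
#   modprobe -a ib_uverbs rdma_ucm ib_umad ib_ipoib
```

After the reboot, `lsmod` should show the four modules as loaded.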

 

SR-IOV enabled VMs with OFED driver

  1. The appropriate Mellanox OFED driver can be downloaded and installed as shown below

    wget http://content.mellanox.com/ofed/MLNX_OFED-5.0-1.0.0.0/MLNX_OFED_LINUX-5.0-1.0.0.0-ubuntu18.04-x86_64.tgz
    tar zxvf MLNX_OFED_LINUX-5.0-1.0.0.0-ubuntu18.04-x86_64.tgz
    sudo ./MLNX_OFED_LINUX-5.0-1.0.0.0-ubuntu18.04-x86_64/mlnxofedinstall --add-kernel-support
  2. Load the new driver

    sudo /etc/init.d/openibd restart
  3. Assign the RDMA IP address to the ib0 interface

    IP=$(sudo sed '/rdmaIPv4Address=/!d;s/.*rdmaIPv4Address="\([0-9.]*\)".*/\1/' /var/lib/waagent/SharedConfig.xml)/16
    sudo ifconfig ib0 $IP
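
The sed expression in step 3 does two things: it deletes every line that does not contain `rdmaIPv4Address=`, and on the remaining line it replaces the whole line with just the quoted address. A small sketch wrapping the same expression in a function makes it easier to try against a copy of the file first; `rdma_ip_from_config` is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Sketch: extract the RDMA IPv4 address from a SharedConfig.xml-style file,
# using the same sed expression as step 3.
# rdma_ip_from_config is a hypothetical helper name.
rdma_ip_from_config() {
    # '/.../!d' drops lines without the attribute;
    # 's/.../\1/' keeps only the dotted address captured inside the quotes.
    sed '/rdmaIPv4Address=/!d;s/.*rdmaIPv4Address="\([0-9.]*\)".*/\1/' "$1"
}

# Possible usage as root (commented out so sourcing has no side effects):
#   ip=$(rdma_ip_from_config /var/lib/waagent/SharedConfig.xml)
#   ifconfig ib0 "$ip/16"
```

On images where ifconfig (from net-tools) is not installed, the iproute2 equivalent would likely be `ip addr add "$ip/16" dev ib0` followed by `ip link set ib0 up`.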

 

The following step is optional and applies to all three modes above, although it is not strictly part of configuring IB. When running applications as a non-root user, set the following memory limits in /etc/security/limits.conf.

    <user or group_name or *> hard memlock <memory_required_by_application_in_KB or unlimited>
    <user or group_name or *> soft memlock <memory_required_by_application_in_KB or unlimited>
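
As a concrete illustration of the template above, the sketch below appends a hard and a soft memlock entry for a given user, group, or `*`. The choice of `*` and `unlimited` in the usage comment is an assumption for the example, not a recommendation from this article, and `add_memlock_limits` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: append hard and soft memlock entries to a limits.conf-style file.
# add_memlock_limits is a hypothetical helper name.
# Arguments: FILE WHO VALUE  (WHO is a user, a group name, or '*';
#                             VALUE is a size in KB or 'unlimited')
add_memlock_limits() {
    printf '%s hard memlock %s\n%s soft memlock %s\n' "$2" "$3" "$2" "$3" >> "$1"
}

# Possible usage as root (commented out so sourcing has no side effects):
#   add_memlock_limits /etc/security/limits.conf '*' unlimited
```

The limits apply to new login sessions; after logging in again, `ulimit -l` should report the new value.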
