Running HPC and AI workloads in containers in Azure

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.

Introduction

Container technologies are no longer new in the industry. They started as a way to deploy reproducible development environments, but containers, or some of the underlying technologies used to implement them, are now common in many other fields.

I will not cover Azure Container Instances (ACI) or Azure Kubernetes Service (AKS) here. For an example of the latter, see the article NDv4 in AKS. ACI will be explained in another article.

There are currently many options for working with containers. Seasoned Linux engineers have quite likely worked with LXC; later, Docker revolutionized the deployment of development environments; more recently, alternatives like Podman have emerged and are now competing for a place in many fields.

In HPC, however, we have been working for some years with two different tools: Shifter, the first container project fully focused on supercomputers, and Singularity. I will show you how to use Singularity in HPC clusters running in Azure. I will also explain how to use Podman for running AI workloads using GPUs in Azure VMs.

Running AI workloads using GPU and containers

Running AI workloads does not strictly require GPUs, but almost all machine learning and deep learning frameworks are designed to make use of them. So, I will assume GPU compute resources are required to run any AI workload.

There are many ways to take advantage of GPU compute resources within containers. For example, you can run the whole container in privileged mode to get access to all the hardware available in the host VM. One nuance must be highlighted here: privileged mode cannot grant more permissions than those of the user running the container. This means running a container as root in privileged mode is very different from running it as a regular user with fewer privileges.

The most common way to get access to GPU resources is via nvidia-container-toolkit. This package contains a hook that follows the OCI standard (see references below) and provides direct access to GPU compute resources from within the container.

I will use a regular VM with an NVIDIA Tesla T4 GPU (NC8as_T4_v3) running RHEL 8.8. Let's get started.

These are all the steps required to run AI workloads using containers and GPU resources in a VM running in Azure:

1. Create a VM using any N-series family (for AI workloads like machine learning or deep learning, NC or ND are recommended) and a supported operating system.
2. Install the CUDA drivers and, if required, the CUDA toolkit. You can skip this step if you are using DSVM images from the Marketplace; these images come with all the required drivers preinstalled.
3. Install your preferred container runtime and engine to work with containers.
4. Install nvidia-container-toolkit.
5. Run a container using any image that includes the tools required to check GPU usage, such as the nvidia-smi command. Using a container from NGC is highly recommended to avoid additional steps.
6. Create an image with your code, or commit the changes made in a running container.
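Before going through the steps, it can help to see which of the required tools are already on the VM. The following is a minimal sketch, not part of the original procedure; the function names check_tool and preflight are hypothetical, and the script only looks for the three commands used later in this article.

```shell
# Report whether a command is on PATH: prints "found" or "missing".
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found"
    else
        echo "missing"
    fi
}

# Print one status line per tool used in the steps above.
preflight() {
    for tool in nvidia-smi podman nvidia-ctk; do
        printf '%-12s %s\n' "$tool" "$(check_tool "$tool")"
    done
}

preflight
```

Any tool reported as missing points to the step that still has to be completed.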

I will start with step 2 because I'm sure there is no need to explain how to create a new N-series VM.

Installing CUDA drivers

There is no specific restriction on which CUDA release must be installed. You are free to choose the latest version from the NVIDIA website, for example.

$ sudo wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo -O /etc/yum.repos.d/cuda-rhel8.repo
$ sudo dnf clean all
$ sudo dnf -y install nvidia-driver

Let's check that the drivers are installed correctly using the nvidia-smi command:

[root@hclv-jsaelices-nct4-rhel88 ~]# nvidia-smi
Fri Nov  3 17:41:03 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000001:00:00.0 Off |                  Off |
| N/A   51C    P0              30W /  70W |      2MiB / 16384MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
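For scripted checks, the table above is awkward to parse. nvidia-smi can also print selected fields as CSV through its --query-gpu and --format options. The sketch below assumes that output shape; the helper name driver_major and the sample line are illustrative, built from the values shown in the table above.

```shell
# Extract the driver major version from one CSV line of the form
# "name, driver_version, memory.total" read on standard input.
driver_major() {
    awk -F', ' '{ split($2, v, "."); print v[1] }'
}

# On the VM, the real data would come from:
#   nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader | driver_major
# Offline demonstration with a sample line (illustrative values):
echo "Tesla T4, 535.104.12, 16384 MiB" | driver_major
```

The major version is a convenient value to compare against the requirements of the container images you plan to run.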

Installing container runtime environment and engine

As I mentioned in the introduction, Podman will be our main tool to run containers. By default, Podman uses runc as the runtime. runc adheres to the OCI standard, so no additional steps are needed to make nvidia-container-toolkit work in our VM.

$ sudo dnf install -y podman

I won't explain here all the benefits of using Podman over Docker. I'll just mention that Podman is daemonless and a more modern implementation of the technologies required to work with containers, such as control groups, layered filesystems and namespaces, to name a few.

Let's verify that Podman was installed successfully using the podman info command:

[root@hclv-jsaelices-nct4-rhel88 ~]# podman info | grep -i ociruntime -A 19
  ociRuntime:
    name: runc
    package: runc-1.1.4-1.module+el8.8.0+18060+3f21f2cc.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.1.4
      spec: 1.0.2-dev
      go: go1.19.4
      libseccomp: 2.5.2
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_SYS_CHROOT,CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
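The two fields that matter most for the rest of this article are the OCI runtime name and the rootless flag. As a rough sketch, they can be pulled out of the podman info output with awk; the function name podman_summary is hypothetical, and the parsing assumes the YAML layout shown above.

```shell
# Read `podman info` YAML on standard input and print the OCI runtime
# name and the rootless flag as key=value lines.
podman_summary() {
    awk '
        $1 == "ociRuntime:"    { in_rt = 1; next }
        in_rt && $1 == "name:" { print "runtime=" $2; in_rt = 0 }
        $1 == "rootless:"      { print "rootless=" $2 }
    '
}

# On the VM: podman info | podman_summary
# Offline demonstration with a fragment of the output shown above:
podman_summary <<'EOF'
  ociRuntime:
    name: runc
    path: /usr/bin/runc
  security:
    rootless: false
    seccompEnabled: true
EOF
```

Note that the output above was captured as root, which is why rootless is false; running the same command as a regular user should report true.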

Installing nvidia-container-toolkit

Podman fully supports OCI hooks, and that is precisely what nvidia-container-toolkit provides. OCI hooks are custom actions performed during the lifecycle of the container. In this case, it is a prestart hook that is called when you run a container, providing access to the GPU through the drivers installed in the host VM. The CUDA repository added earlier also provides this package, so let's install it using dnf:

$ sudo dnf install -y nvidia-container-toolkit

Podman is daemonless, so there is no need to add the runtime using nvidia-ctk runtime configure. In this case, however, an additional step is required to generate the CDI (Container Device Interface) configuration file:

$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
$ nvidia-ctk cdi list
INFO[0000] Found 2 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=all
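The device names in this list are what you later pass to podman run with --device. A small guard like the one below can confirm that a given name exists before starting a container. This is only a sketch; has_cdi_device is a hypothetical helper, and the sample text is copied from the listing above.

```shell
# Succeed if the exact CDI device name given as $1 appears as a full
# line in the `nvidia-ctk cdi list` output read on standard input.
has_cdi_device() {
    grep -Fxq -- "$1"
}

# On the VM: nvidia-ctk cdi list | has_cdi_device "nvidia.com/gpu=all"
# Offline demonstration with the listing shown above:
cdi_list_sample='INFO[0000] Found 2 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=all'

if printf '%s\n' "$cdi_list_sample" | has_cdi_device "nvidia.com/gpu=all"; then
    echo "CDI device available"
else
    echo "CDI device missing: regenerate /etc/cdi/nvidia.yaml"
fi
```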

Running containers for AI workloads

Now the environment is ready to run containers for AI workloads. I will use NGC images from NVIDIA to save time and avoid creating custom ones. Keep in mind that some of them are quite big, so make sure you have enough space in your home folder.

Let's start with an Ubuntu 20.04 image with CUDA already installed on it:

[jsaelices@hclv-jsaelices-nct4-rhel88 ~]$ podman run --security-opt=label=disable --device=nvidia.com/gpu=all nvcr.io/nvidia/cuda:12.2.0-devel-ubuntu20.04

==========
== CUDA ==
==========

CUDA Version 12.2.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Here is another example, running the well-known deviceQuery tool from the CUDA samples:

[jsaelices@hclv-jsaelices-nct4-rhel88 ~]$ podman run --security-opt=label=disable --device=nvidia.com/gpu=all nvcr.io/nvidia/k8s/cuda-sample:devicequery-cuda11.7.1-ubuntu20.04
/cuda-samples/sample Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla T4"
  CUDA Driver Version / Runtime Version          12.2 / 11.7
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 15948 MBytes (16723214336 bytes)
  (040) Multiprocessors, (064) CUDA Cores/MP:    2560 CUDA Cores
  GPU Max Clock rate:                            1590 MHz (1.59 GHz)
  Memory Clock rate:                             5001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        65536 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   1 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 11.7, NumDevs = 1
Result = PASS

You can see in these examples that I'm running the containers as my regular user, without root privileges (a rootless environment), with no issues. This works because of the --security-opt=label=disable option passed to podman run, which disables SELinux labeling for the container. I took this shortcut to keep the article short. I could have used an SELinux policy created with Udica or the one that comes with NVIDIA (nvidia-container.pp), but I preferred to disable labeling for these specific samples.
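Since every example uses the same two options, a tiny wrapper can save some typing. The sketch below only prints the command line so it can be inspected first; gpu_run_cmd is a hypothetical helper, and the --rm flag is my addition to clean up the container on exit. Arguments containing spaces are not quoted in the printed line.

```shell
# Print the podman command line for a GPU container.
# $1 is the image; any remaining arguments are the command to run inside it.
gpu_run_cmd() {
    image="$1"
    shift
    printf 'podman run --rm --security-opt=label=disable --device=nvidia.com/gpu=all %s' "$image"
    if [ "$#" -gt 0 ]; then
        printf ' %s' "$@"
    fi
    printf '\n'
}

gpu_run_cmd nvcr.io/nvidia/cuda:12.2.0-devel-ubuntu20.04 nvidia-smi
```

Once the printed line looks right, it can be pasted into the shell as is.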

Now it is time to try running specific AI frameworks using Python. Let's try PyTorch:

[jsaelices@hclv-jsaelices-nct4-rhel88 ~]$ podman run --rm -ti --security-opt=label=disable --device=nvidia.com/gpu=all pytorch/pytorch
root@7cb030cc3b47:/workspace# python
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>>

As you can see, the PyTorch framework detects the GPU and should be able to run any code using GPU resources.
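The same check can be run without an interactive session, which is handy in a provisioning script. This is a sketch under the assumption that the container prints only True or False; cuda_status is a hypothetical helper that turns that output into a readable message.

```shell
# Interpret the output of:
#   python -c 'import torch; print(torch.cuda.is_available())'
cuda_status() {
    case "$1" in
        True)  echo "GPU visible to PyTorch" ;;
        False) echo "PyTorch is running on CPU only" ;;
        *)     echo "unexpected output: $1" ;;
    esac
}

# On the VM (same options as in the interactive session above):
#   result="$(podman run --rm --security-opt=label=disable --device=nvidia.com/gpu=all \
#       pytorch/pytorch python -c 'import torch; print(torch.cuda.is_available())')"
#   cuda_status "$result"
# Offline demonstration with the value returned above:
cuda_status "True"
```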

I won't create any custom image as suggested in the last step described previously. That can be a good exercise for the reader, so it is your turn to test your skills running containers and using GPU resources.

Running HPC workloads using containers

Now it is time to run HPC applications in our containers. You can also use Podman to run those; in fact, there is an adaptation of Podman for HPC, developed jointly by NERSC and Red Hat, called Podman-HPC. For this article, however, I decided to use Singularity, which is well known in the HPC field.

For this section, I will run some containers using Singularity in a cluster created with CycleCloud, using the HB120rs_v3 size for the compute nodes. For the OS, I've chosen the AlmaLinux 8.7 HPC image from Azure Marketplace.

I will install Singularity manually but this can be automated using cluster-init in CycleCloud.

Installing Singularity in the cluster

In the AlmaLinux 8.7 HPC image, the EPEL repository is enabled by default, so you can install Singularity with a single command:

[root@slurmhbv3-hpc-2 ~]# yum install -y singularity-ce
Last metadata expiration check: 1:16:36 ago on Fri 03 Nov 2023 04:38:39 PM UTC.
Dependencies resolved.
=========================================================================================================================================================
 Package                          Architecture             Version                                                     Repository                   Size
=========================================================================================================================================================
Installing:
 singularity-ce                   x86_64                   3.11.5-1.el8                                                epel                         44 M
Installing dependencies:
 conmon                           x86_64                   3:2.1.6-1.module_el8.8.0+3615+3543c705                      appstream                    56 k
 criu                             x86_64                   3.15-4.module_el8.8.0+3615+3543c705                         appstream                   517 k
 crun                             x86_64                   1.8.4-2.module_el8.8.0+3615+3543c705                        appstream                   233 k
 libnet                           x86_64                   1.1.6-15.el8                                                appstream                    67 k
 yajl                             x86_64                   2.1.0-11.el8                                                appstream                    40 k
Installing weak dependencies:
 criu-libs                        x86_64                   3.15-4.module_el8.8.0+3615+3543c705                         appstream                    37 k

Transaction Summary
=========================================================================================================================================================
Install  7 Packages

Total download size: 44 M
Installed size: 135 M
Downloading Packages:
(1/7): criu-libs-3.15-4.module_el8.8.0+3615+3543c705.x86_64.rpm                                                          1.0 MB/s |  37 kB     00:00
(2/7): conmon-2.1.6-1.module_el8.8.0+3615+3543c705.x86_64.rpm                                                            1.1 MB/s |  56 kB     00:00
(3/7): crun-1.8.4-2.module_el8.8.0+3615+3543c705.x86_64.rpm                                                              5.3 MB/s | 233 kB     00:00
(4/7): libnet-1.1.6-15.el8.x86_64.rpm                                                                                    1.5 MB/s |  67 kB     00:00
(5/7): criu-3.15-4.module_el8.8.0+3615+3543c705.x86_64.rpm                                                               4.5 MB/s | 517 kB     00:00
(6/7): yajl-2.1.0-11.el8.x86_64.rpm                                                                                      954 kB/s |  40 kB     00:00
(7/7): singularity-ce-3.11.5-1.el8.x86_64.rpm                                                                             11 MB/s |  44 MB     00:04
---------------------------------------------------------------------------------------------------------------------------------------------------------
Total                                                                                                                    7.0 MB/s |  44 MB     00:06
Extra Packages for Enterprise Linux 8 - x86_64                                                                           1.6 MB/s | 1.6 kB     00:00
Importing GPG key 0x2F86D6A1:
 Userid     : "Fedora EPEL (8) <epel@fedoraproject.org>"
 Fingerprint: 94E2 79EB 8D8F 25B2 1810 ADF1 21EA 45AB 2F86 D6A1
 From       : /etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-8
Key imported successfully
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                                                                 1/1
  Installing       : yajl-2.1.0-11.el8.x86_64                                                                                                        1/7
  Installing       : libnet-1.1.6-15.el8.x86_64                                                                                                      2/7
  Running scriptlet: libnet-1.1.6-15.el8.x86_64                                                                                                      2/7
  Installing       : criu-3.15-4.module_el8.8.0+3615+3543c705.x86_64                                                                                 3/7
  Installing       : criu-libs-3.15-4.module_el8.8.0+3615+3543c705.x86_64                                                                            4/7
  Installing       : crun-1.8.4-2.module_el8.8.0+3615+3543c705.x86_64                                                                                5/7
  Installing       : conmon-3:2.1.6-1.module_el8.8.0+3615+3543c705.x86_64                                                                            6/7
  Installing       : singularity-ce-3.11.5-1.el8.x86_64                                                                                              7/7
  Running scriptlet: singularity-ce-3.11.5-1.el8.x86_64                                                                                              7/7
  Verifying        : conmon-3:2.1.6-1.module_el8.8.0+3615+3543c705.x86_64                                                                            1/7
  Verifying        : criu-3.15-4.module_el8.8.0+3615+3543c705.x86_64                                                                                 2/7
  Verifying        : criu-libs-3.15-4.module_el8.8.0+3615+3543c705.x86_64                                                                            3/7
  Verifying        : crun-1.8.4-2.module_el8.8.0+3615+3543c705.x86_64                                                                                4/7
  Verifying        : libnet-1.1.6-15.el8.x86_64                                                                                                      5/7
  Verifying        : yajl-2.1.0-11.el8.x86_64                                                                                                        6/7
  Verifying        : singularity-ce-3.11.5-1.el8.x86_64                                                                                              7/7

Installed:
  conmon-3:2.1.6-1.module_el8.8.0+3615+3543c705.x86_64                          criu-3.15-4.module_el8.8.0+3615+3543c705.x86_64
  criu-libs-3.15-4.module_el8.8.0+3615+3543c705.x86_64                          crun-1.8.4-2.module_el8.8.0+3615+3543c705.x86_64
  libnet-1.1.6-15.el8.x86_64                                                    singularity-ce-3.11.5-1.el8.x86_64
  yajl-2.1.0-11.el8.x86_64

Complete!

I won't explain all the pros and cons of using Singularity over other container alternatives. I will just highlight some of the security features provided by Singularity and, especially, the format of the image used (Singularity Image Format, SIF) during the examples.

One of the biggest advantages of using Singularity is the size of the images: SIF is a binary format and is very compact compared to regular layered Docker images. Below is an example using the OpenFOAM image:

[jsaelices@slurmhbv3-hpc-2 .singularity]$ singularity pull docker://opencfd/openfoam-default
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
Getting image source signatures
Copying blob 855f75e343f2 done   |
Copying blob b9158799e696 done   |
Copying blob 561d59533bc7 done   |
Copying blob 96b48e52a343 done   |
Copying blob a8cede8f862e done   |
Copying blob 3153aa388d02 done   |
Copying blob 3efcde42d95a done   |
Copying config dc7161e162 done   |
Writing manifest to image destination
2023/11/03 17:59:51  info unpack layer: sha256:3153aa388d026c26a2235e1ed0163e350e451f41a8a313e1804d7e1afb857ab4
2023/11/03 17:59:51  info unpack layer: sha256:855f75e343f27a0838944f956bdf15a036a21121f249957cf121b674a693c0c9
2023/11/03 17:59:51  info unpack layer: sha256:a8cede8f862e92aa526c663d34038c1152fb56f3e7005a1bcefd29219a77fd6f
2023/11/03 17:59:54  info unpack layer: sha256:561d59533bc76812ab48aef920990af0217af17b23aaccc059a5e660a2ca55b0
2023/11/03 17:59:54  info unpack layer: sha256:b9158799e696063a99dc698caef940b9e60ca7ff9c1edd607fc4688d953a1aa6
2023/11/03 17:59:54  info unpack layer: sha256:96b48e52a343650d16be2c5ba9800b30ff677f437379cc70e05c255d1212b52e
2023/11/03 18:00:03  info unpack layer: sha256:3efcde42d95ab617eac299e62eb8800b306a0279e9368daf2141337f22bf8218
INFO:    Creating SIF file...

You can see the size is about 350 MB:

[jsaelices@slurmhbv3-hpc-2 .singularity]$ ls -lh openfoam-default_latest.sif
-rwxrwxr-x. 1 jsaelices jsaelices 349M Nov  3 18:00 openfoam-default_latest.sif

Docker uses a layered format that is substantially bigger:

[root@slurmhbv3-hpc-1 ~]# docker images
REPOSITORY                 TAG       IMAGE ID       CREATED        SIZE
opencfd/openfoam-default   latest    dc7161e16205   3 months ago   1.2GB
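To put a number on the difference, the two sizes can be compared directly. The sketch below assumes that docker images reports decimal gigabytes while ls -lh reports binary mebibytes, so both values are converted to bytes first; size_ratio is a hypothetical helper.

```shell
# Ratio between a Docker image size given in decimal GB ($1)
# and a SIF file size given in MiB ($2).
size_ratio() {
    awk -v docker_gb="$1" -v sif_mib="$2" 'BEGIN {
        docker_bytes = docker_gb * 1000 * 1000 * 1000
        sif_bytes = sif_mib * 1024 * 1024
        printf "%.1f\n", docker_bytes / sif_bytes
    }'
}

# Values taken from the two listings above: 1.2GB (Docker) and 349M (SIF).
size_ratio 1.2 349
```

With these figures the Docker image takes a bit more than three times the space of the SIF file. Keep in mind that Docker reports the unpacked size of the layers, whereas the SIF file stores a compressed filesystem image, so the comparison is about disk usage rather than content.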

Running MPI jobs with Singularity

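Singularity is fully compatible with MPI, and there are two different ways to submit an MPI job with SIF images: the hybrid method, where MPI is installed both on the host and inside the container, and the bind method, where the MPI installation of the host is bind-mounted into the container.

I will use the bind method for its simplicity, but you can use the hybrid method if binding volumes between the host and the container is not desirable.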

Let's create a simple definition file called mpi_sample.def (similar to a Dockerfile or Containerfile):

Bootstrap: docker
From: almalinux

%files
/shared/bin/mpi_test /shared/bin/mpi_test

%environment
export MPI_HOME=/opt/intel/oneapi/mpi/2021.9.0
export MPI_BIN=/opt/intel/oneapi/mpi/2021.9.0/bin
export LD_LIBRARY_PATH=/opt/intel/oneapi/mpi/2021.9.0/libfabric/lib:/opt/intel/oneapi/mpi/2021.9.0/lib/release:/opt/intel/oneapi/mpi/2021.9.0/lib:/opt/intel/oneapi/tbb/2021.9.0/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.9.0//libfabric/lib:/opt/intel/oneapi/mpi/2021.9.0//lib/release:/opt/intel/oneapi/mpi/2021.9.0//lib:/opt/intel/oneapi/mkl/2023.1.0/lib/intel64:/opt/intel/oneapi/compiler/2023.1.0/linux/lib:/opt/intel/oneapi/compiler/2023.1.0/linux/lib/x64:/opt/intel/oneapi/compiler/2023.1.0/linux/compiler/lib/intel64_lin
export MPI_INCLUDE=/opt/intel/oneapi/mpi/2021.9.0/include
export HOST=$(hostname)

%runscript
echo "Running MPI job inside Singularity: $HOST"
echo "MPI job submitted: $*"
exec echo

Here, I'm just using the AlmaLinux image from Docker Hub, copying the MPI application into it, defining some useful environment variables, and adding a few simple commands to execute when the container is run without any parameters.

Now, it is time to build the SIF image:

[root@slurmhbv3-hpc-1 jsaelices]# singularity build mympitest.sif mpi_sample.def
INFO:    Starting build...
2023/11/03 18:07:49  info unpack layer: sha256:92cbf8f6375271a4008121ff3ad96dbd0c10df3c4bc4a8951ba206dd0ffa17e2
INFO:    Copying /shared/bin/mpi_test to /shared/bin/mpi_test
INFO:    Copying /shared/bin/openmpi-test to /shared/bin/openmpi-test
INFO:    Adding environment to container
INFO:    Adding runscript
INFO:    Creating SIF file...
INFO:    Build complete: mympitest.sif

First, I'm going to execute the MPI application on its own, bind-mounting the directory where Intel MPI is installed on the host:

[jsaelices@slurmhbv3-hpc-1 ~]$ singularity exec --hostname inside-singularity --bind /opt/intel:/opt/intel mympitest.sif /shared/bin/mpi_test
Hello world: rank 0 of 1 running on inside-singularity

Let's call the app using mpiexec as we do with any other MPI job:

[jsaelices@slurmhbv3-hpc-1 ~]$ mpiexec -n 2 -hosts slurmhbv3-hpc-1 singularity exec --bind /opt/intel:/opt/intel mympitest.sif /shared/bin/mpi_test
Hello world: rank 0 of 2 running on slurmhbv3-hpc-1
Hello world: rank 1 of 2 running on slurmhbv3-hpc-1

In the next step, I will use the Slurm scheduler to submit the job. To do that, I'm creating a very simple script called singularity.sh:

#!/bin/bash
#SBATCH --job-name singularity-mpi
#SBATCH -N 2
#SBATCH -o %N-%J-%x
module load mpi/impi_2021.9.0
mpirun -n 4 -ppn 2 -iface ib0 singularity exec --bind /opt/intel:/opt/intel mympitest.sif /shared/bin/mpi_test
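The rank count in the script above follows from the node count and the ranks per node: 2 nodes with 2 ranks each give -n 4. If you change the number of nodes, the value can be derived instead of hardcoded. This is a sketch only; total_ranks is a hypothetical helper, and SLURM_JOB_NUM_NODES is the variable Slurm sets inside a job, with a fallback of 2 so the snippet also runs outside the scheduler.

```shell
# Total MPI ranks = number of nodes * ranks per node.
total_ranks() {
    echo $(( $1 * $2 ))
}

# Inside a Slurm job, SLURM_JOB_NUM_NODES holds the allocated node count.
# Outside a job it is unset, so default to the 2 nodes used in this article.
nodes="${SLURM_JOB_NUM_NODES:-2}"
ppn=2

echo "mpirun -n $(total_ranks "$nodes" "$ppn") -ppn $ppn"
```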

Let's submit the job with sbatch:

$ sbatch singularity.sh

Let's check the output file of the submitted job:

[jsaelices@slurmhbv3-hpc-1 ~]$ cat slurmhbv3-hpc-1-2-singularity-mpi
Hello world: rank 0 of 4 running on slurmhbv3-hpc-1
Hello world: rank 1 of 4 running on slurmhbv3-hpc-1
Hello world: rank 2 of 4 running on slurmhbv3-hpc-2
Hello world: rank 3 of 4 running on slurmhbv3-hpc-2
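For larger jobs, reading the output line by line does not scale, so a quick summary of ranks per host helps confirm the placement. The following sketch assumes the "Hello world" line format printed by the test application above; ranks_per_host is a hypothetical helper.

```shell
# Count how many ranks reported from each host.
# Expects lines like: "Hello world: rank 0 of 4 running on <host>"
ranks_per_host() {
    awk '/^Hello world: rank/ { count[$NF]++ }
         END { for (h in count) print h, count[h] }' | sort
}

# On the cluster: ranks_per_host < slurmhbv3-hpc-1-2-singularity-mpi
# Offline demonstration with the output shown above:
ranks_per_host <<'EOF'
Hello world: rank 0 of 4 running on slurmhbv3-hpc-1
Hello world: rank 1 of 4 running on slurmhbv3-hpc-1
Hello world: rank 2 of 4 running on slurmhbv3-hpc-2
Hello world: rank 3 of 4 running on slurmhbv3-hpc-2
EOF
```

Each host should show the value passed to -ppn, which is 2 in this job.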

This example concludes the article.

You've seen how to run containers and how to use GPUs to run AI workloads in a simple and effective way. You've also learnt how to run Singularity containers and MPI jobs easily. You can use all this material as a starting point to extend your knowledge and apply it to more complex tasks. I hope you enjoyed it.

References

Podman

Podman HPC

Nvidia Container Toolkit

Singularity containers