GPU node health checks integrated into Azure Kubernetes Service via Node Problem Detector



 

 

Introduction

Large AI model training can take months to complete on very large AI supercomputers. These AI supercomputers consist of many high-end GPUs (e.g. NVIDIA A100 or H100), all connected with InfiniBand. The Azure NDv5 has 8 H100 GPUs per node, connected to each other by NVLink 4, and each GPU has a 400 Gbps InfiniBand link that enables it to communicate with all the other GPUs on the AI supercomputer.

AI model training workloads are tightly coupled: at regular intervals all the gradients need to be updated using NCCL collective communication. If any of the GPUs or InfiniBand links fail (e.g. a dropped GPU, an InfiniBand link flap, etc.), the complete job can terminate and need to be restarted from the last checkpoint. It is therefore imperative that any unhealthy nodes or IB fabric be identified and excluded from the nodes used in the training job.

The AzureHPC node health checks repository provides a suite of recommended node health checks for all Azure specialized SKUs (including GPU SKUs). In this blog post we will show how to integrate a few of the GPU node health checks into AKS (Azure Kubernetes Service) in such a way that:

  • GPU node health checks are run at regular intervals.
  • Nodes that fail any of the GPU tests are automatically cordoned off (to prevent any jobs being scheduled on them) and optionally drained (all pods removed from the node).

We will be leveraging Node Problem Detector (NPD) to run the specific GPU node health checks, and draino to cordon/drain any nodes that fail them.

 


 

GPU node health check integration into NPD

NPD is commonly used in Kubernetes environments to run various cluster health checks and report any issues to the Kubernetes API server via events/conditions. The cluster can then take some action depending on how serious the condition is (e.g. for some permanent conditions, the node may be cordoned off and drained). We will leverage the NPD custom plugin monitor to run the GPU node health check scripts.

 

Note: GPU count, GPU NVLink, GPU XID and GPU ECC health checks are included (other GPU node health checks can also easily be added).
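For illustration, here is a minimal sketch of what one of these checks looks like as an NPD custom plugin script. NPD's custom plugin contract is simple: exit 0 means healthy, exit 1 means unhealthy, and stdout becomes the condition/event message. This is not the actual AzureHPC script, and the expected count of 8 is an assumption for an 8-GPU node such as NDmv4:

#!/bin/bash
# Hypothetical check_gpu_count.sh sketch (NPD custom plugin contract:
# exit 0 = healthy, exit 1 = unhealthy, stdout = condition message).
EXPECTED_GPU_COUNT=8   # assumption for an 8-GPU node (e.g. NDmv4)

# Count the GPUs the driver can currently see.
actual=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | wc -l)

if [ "$actual" -ne "$EXPECTED_GPU_COUNT" ]; then
  echo "Expected $EXPECTED_GPU_COUNT GPUs, found $actual"
  exit 1
fi

echo "All $EXPECTED_GPU_COUNT GPUs present"
exit 0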

 

Get the NPD GitHub repository

git clone https://github.com/kubernetes/node-problem-detector.git

 

Edit the NPD Makefile (get modified file here)

  • Build for linux_amd64 only (not ARM)

         LINUX_PLATFORMS=linux_amd64

         DOCKER_PLATFORMS=linux/amd64

  • Provide a unique tag

                TAG?=$(VERSION)_<UNIQUE NUMBER>

  • Change registry to Azure ACR

                  REGISTRY?=<YOUR ACR>.azurecr.io/k8s-staging-npd

  • Change the BASEIMAGE

                 BASEIMAGE:=nvcr.io/nvidia/pytorch:23.03-py3

 

Edit NPD Dockerfile (get modified file here)

  • Change base container

                FROM nvcr.io/nvidia/pytorch:23.03-py3 as builder-base

  • Install golang in container

                COPY go1.22.4.linux-amd64.tar.gz .

                RUN rm -rf /usr/local/go && tar -C /usr/local -xzf go1.22.4.linux-amd64.tar.gz

  • Remove unnecessary ARM packages

                #RUN clean-install util-linux bash libsystemd-dev

  • Edit entrypoint

                 ENTRYPOINT ["/node-problem-detector", "--config.custom-plugin-monitor=/config/custom-plugin-gpu-count.json"]

 

Note: You can get the golang tarball here, go1.22.4.linux-amd64.tar.gz

 

Build NPD without the SystemLogMonitor and SystemStatsMonitor. AKS runs its own NPD for complete monitoring; we only want this NPD to run the GPU node tests.

BUILD_TAGS="disable_system_log_monitor disable_system_stats_monitor" make 2>&1 | tee make.out

 

Push the container image to ACR

make push 2>&1 | tee make_push.out
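Note: the push assumes you are already authenticated to your ACR, for example with the Azure CLI:

az acr login --name <YOUR ACR>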

 

You could add all the GPU node health check plugins and scripts to the NPD container image, but it is much more flexible to use a k8s ConfigMap to inject them into the container at runtime.

 

Edit deployment/node-problem-detector-config.yaml, adding the GPU custom plugin configurations (JSON files) and the GPU health check scripts (bash scripts) to the k8s ConfigMap yaml file (get modified file here).

 

Note: You can control how frequently the tests are run via parameters in the custom plugin JSON files.
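For reference, this is roughly what one of the custom plugin JSON files looks like; invoke_interval is the parameter that controls how often the check runs. The values, source and reason strings below are illustrative, not the exact contents of the modified file:

{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "600s",
    "timeout": "60s",
    "max_output_length": 80,
    "concurrency": 1
  },
  "source": "gpu-count-custom-plugin-monitor",
  "conditions": [
    {
      "type": "GpuCount",
      "reason": "GpuCountAsExpected",
      "message": "All GPUs are present"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "GpuCount",
      "reason": "GpuCountUnexpected",
      "path": "/config/plugin/check_gpu_count.sh",
      "timeout": "60s"
    }
  ]
}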

 

Edit deployment/node-problem-detector.yaml. (get modified file here)

 

  • NPD command line

- --config.custom-plugin-monitor=/config/custom-plugin-gpu-count.json,/config/custom-plugin-gpu-nvlink.json,/config/custom-plugin-gpu-xid.json,/config/custom-plugin-gpu-ecc.json

  • Which image/container to use

                 image: <YOUR ACR>.azurecr.io/k8s-staging-npd/node-problem-detector:<YOUR TAG>

  • Container limits

                 cpu: 240m

                  memory: 2048Mi

  • Bash script permissions

                defaultMode: 0777

  • Which files to inject into the container (see the ConfigMap volume sketch after the notes below).

                 - key: kernel-monitor.json
                   path: kernel-monitor.json
                 - key: docker-monitor.json
                   path: docker-monitor.json
                 - key: custom-plugin-monitor.json
                   path: custom-plugin-monitor.json
                 - key: check_ntp.sh
                   path: plugin/check_ntp.sh
                 - key: custom-plugin-gpu-count.json
                   path: custom-plugin-gpu-count.json
                 - key: check_gpu_count.sh
                   path: plugin/check_gpu_count.sh
                 - key: custom-plugin-gpu-nvlink.json
                   path: custom-plugin-gpu-nvlink.json
                 - key: check_gpu_nvlink.sh
                   path: plugin/check_gpu_nvlink.sh
                 - key: custom-plugin-gpu-xid.json
                   path: custom-plugin-gpu-xid.json
                 - key: check_gpu_xid.sh
                   path: plugin/check_gpu_xid.sh

 

Note: I have shown how to integrate 4 GPU node health checks; other GPU health checks can easily be added.

Note: You will probably need to modify the container limits (cpu/memory) depending on how many and what GPU tests you are running.
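For orientation, this is roughly how the key/path items above are wired into the DaemonSet: the ConfigMap is mounted at /config, so the scripts end up under /config/plugin/. The names below follow the stock node-problem-detector.yaml and may differ slightly in the modified file:

        volumeMounts:
        - name: config
          mountPath: /config
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: node-problem-detector-config
          defaultMode: 0777
          items:
          - key: custom-plugin-gpu-count.json
            path: custom-plugin-gpu-count.json
          - key: check_gpu_count.sh
            path: plugin/check_gpu_count.sh
          # ... remaining keys/paths as listed above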

 

Draino set-up

The draino set-up is easy: we just need to tell draino which GPU node health check events/conditions to act on (i.e. cordon/drain the node).

 

Get the draino repository

git clone https://github.com/planetlabs/draino.git

 

Build and push draino image/container to your ACR

docker build -t <YOUR ACR>.azurecr.io/draino .
docker push <YOUR ACR>.azurecr.io/draino

 

Edit the draino manifest yaml file (get modified file here)

  • Add the correct service account permissions/rules so draino can access the k8s API

                 rules:
                 - apiGroups: ['']
                   resources: [events]
                   verbs: [create, patch, update]
                 - apiGroups: ['']
                   resources: [nodes]
                   verbs: [get, watch, list, update, patch]
                 - apiGroups: ['']
                   resources: [nodes/status]
                   verbs: [patch, watch, list, update]
                 - apiGroups: ['']
                   resources: [endpoints]
                   verbs: [get, watch, list, create, patch, update]
                 - apiGroups: ['']
                   resources: [pods]
                   verbs: [get, watch, list]
                 - apiGroups: ['']
                   resources: [pods/eviction]
                   verbs: [create]
                 - apiGroups:
                   - extensions
                   - apps
                   resources: [daemonsets]
                   verbs: [get, watch, list]

  • Draino command line (only cordon GPU nodes with these GPU conditions; the accelerator=nvidia node label must be present on your GPU nodes, see the labeling command after this list)

                 command: [/draino, --skip-drain, --node-label=accelerator=nvidia, GpuCount, GpuNvlink, GpuXid, GpuEcc]

  • Select the correct image/container

                image: <YOUR ACR>.azurecr.io/draino:latest
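As noted above, draino is restricted (via --node-label=accelerator=nvidia) to nodes carrying that label, so make sure your NDmv4 nodes are labeled accordingly. One way to do this (the agentpool=ndmv4 selector is only an example; use your own GPU nodepool name):

kubectl label nodes -l agentpool=ndmv4 accelerator=nvidia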

 

Testing NPD+Draino GPU health checks

Prerequisites

You have a working AKS cluster. In this test we will be using an NDmv4 nodepool (see here for how to deploy an NDmv4 AKS nodepool).

 

Deploy NPD+GPU health checks

kubectl apply -f rbac.yaml
kubectl apply -f node-problem-detector-config.yaml
kubectl apply -f node-problem-detector.yaml

Note: You should see the node-problem-detector DaemonSet pods running on the NDmv4 nodes.
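A quick way to confirm (the kube-system namespace and the app=node-problem-detector label are assumptions based on the stock deployment yaml; adjust if your modified file differs):

kubectl get daemonset -n kube-system -l app=node-problem-detector
kubectl get pods -n kube-system -l app=node-problem-detector -o wide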

 

Deploy the modified draino deployment with support for the GPU node health checks

kubectl apply -f manifest.yml

 

Note: You should see the draino deployment.
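For example (assuming the manifest deploys draino into kube-system under the name draino):

kubectl get deployment -n kube-system draino
kubectl logs -n kube-system deployment/draino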

 

[Screenshot: the draino deployment running in the cluster]

 

Verify that the GPU node health checks are running (check the NDmv4 node description and look at the node events/conditions).
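For example (substitute your own node name):

kubectl describe node <NDmv4 NODE NAME>
kubectl get events --field-selector involvedObject.kind=Node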

[Screenshot: NDmv4 node conditions showing GpuNvlink, GpuXid and GpuCount with a normal status]

You can see the GpuNvlink, GpuXid and GpuCount conditions reporting a normal status.

 

Now, to simulate a GPU node health check failure, we will drop one of the NDmv4 GPUs.

nvidia-smi -i 00000001:00:00.0 -pm 0
nvidia-smi drain -p 0001:00:00.0 -m 1

 

Note: nvidia-smi will now report 7 GPUs (instead of the expected 8).

 

Check the NDmv4 node events/conditions (via the node description). It shows that the GPU count test has failed and that the node has been automatically cordoned by draino (i.e. no new pods can be scheduled on this node).

 

[Screenshot: NDmv4 node conditions showing the GpuCount check failing and the node cordoned by draino]
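Since draino is run with --skip-drain in this set-up, it only cordons the node. Once the underlying GPU issue has been resolved, the node can be returned to service manually:

kubectl uncordon <NDmv4 NODE NAME>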

 

Some additional considerations

NPD is set to run periodically and can overlap with a customer's job. The timing and type of the GPU node health checks you run may affect how well the customer's job performs. One possible strategy is to perform thorough node health checks on an empty cluster from time to time, and to run only essential GPU node health checks that do not affect performance at regular intervals.

 

Conclusion

Fully automated, GPU-specific health checks integrated into AKS that

  • identify unhealthy GPU nodes
  • cordon (and optionally drain) those nodes

help to improve the reliability of large AI supercomputers running training jobs. In this blog post we showed how to integrate GPU-specific health checks into NPD and then have draino look for the specific GPU failure conditions and take action (e.g. cordon/drain the node).
