This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.
By Hugo Affaticati (Technical Program Manager – Microsoft), Annika Brundyn (Solutions Architect – NVIDIA) and Jon Shelley (Principal TPM Manager – Microsoft)
Useful resources:
NeMo Megatron from NVIDIA: NVIDIA NeMo Megatron
Container from NVIDIA: NVIDIA NGC
The steps below show how to run GPT-3 architecture models with NeMo Megatron on the NDm A100 v4-series on Azure, powered by NVIDIA A100 80GB Tensor Core GPUs and NVIDIA InfiniBand networking. NVIDIA describes NeMo Megatron as "an end-to-end framework for training & deploying LLMs with billions and trillions of parameters".
Deploy the environment:
Deploy and set up a CycleCloud cluster (Azure CycleCloud 8.2 and Slurm 2.6.5) of NDm A100 v4 virtual machines by following this script. More information about the NDm A100 v4-series is available on the Microsoft Docs product page.
Add storage while deploying:
Under Network Attached Storage on the CycleCloud portal, select the NFS type “Builtin” and set the size to 4 TB.
Start the scheduler first and then the compute nodes.
Set up the environment:
SSH into the scheduler and set the number of compute nodes in your cluster before starting the setup:
export NN=<number of nodes>
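Several later commands pass $NN straight to srun, so an unset or mistyped value only shows up as a confusing scheduler error. A minimal sketch of a sanity check follows; the function name require_node_count is hypothetical and not part of the NeMo Megatron scripts.

```shell
# Hypothetical helper: succeeds only if its argument is a positive integer.
require_node_count() {
    case "$1" in
        ''|*[!0-9]*)
            echo "NN must be a positive integer, got: '$1'" >&2
            return 1
            ;;
    esac
    if [ "$1" -eq 0 ]; then
        echo "NN must be at least 1" >&2
        return 1
    fi
    return 0
}
```

Calling require_node_count "$NN" right after the export is one way to catch the mistake early.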
Make a directory for the credentials before downloading the container. This directory must be accessible from all the nodes so it should be under the /shared directory:
sudo chmod 1777 /shared
mkdir -p /shared/tmp
Generate an API key on NVIDIA NGC under “Get API Key”, then “Generate API Key”. Paste the key in place of <ADD YOUR PASSWORD> in the following command:
echo "machine nvcr.io login \$oauthtoken password <ADD YOUR PASSWORD>" > /shared/tmp/.credentials
Copy the file with your credentials to all compute nodes:
srun -p ndmv4 -N $NN bash -c "mkdir -p ~/.config/enroot/ && cp /shared/tmp/.credentials ~/.config/enroot/"
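The credentials file holds a secret, so it may be worth creating it with restrictive permissions. The sketch below writes the same single line as the echo command above. The function name write_ngc_credentials is hypothetical, and the 600 file mode is an assumption that suits a setup where the same user reads the file on every node.

```shell
# Hypothetical helper: write an enroot credentials file for nvcr.io.
# $1 = NGC API key, $2 = destination file (for example /shared/tmp/.credentials)
write_ngc_credentials() {
    api_key=$1
    dest=$2
    if [ -z "$api_key" ] || [ -z "$dest" ]; then
        echo "usage: write_ngc_credentials <api-key> <file>" >&2
        return 1
    fi
    # The login is the literal string $oauthtoken; single quotes keep the
    # shell from expanding it.
    ( umask 077
      printf 'machine nvcr.io login $oauthtoken password %s\n' "$api_key" > "$dest" ) || return 1
    # Assumption: owner-only access is sufficient for this cluster setup.
    chmod 600 "$dest"
}
```

The srun copy step above can then be used unchanged to distribute the file.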
Get the NeMo Megatron container:
Make a directory to pull the container and extract the NeMo Megatron scripts:
mkdir -p /shared/ngc_scripts
Update the Docker root directory in the Docker daemon configuration file:
sudo vi /etc/docker/daemon.json
Add the following line after the opening curly bracket:
"data-root": "/mnt/resource_nvme/data",
Set the driver capabilities before starting the setup:
export NVIDIA_DRIVER_CAPABILITIES=compute,utility
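The daemon.json edit described above can also be scripted instead of done by hand in vi. The sketch below prints a modified copy rather than editing the file in place, so the result can be reviewed first. It assumes that the opening curly bracket sits alone on its line, that other entries follow it (hence the trailing comma), and that no data-root entry exists yet. The function name add_docker_data_root is hypothetical.

```shell
# Hypothetical helper: print a copy of a JSON file with a "data-root" entry
# inserted right after the first line that contains only an opening brace.
# $1 = path to daemon.json, $2 = new Docker data directory
add_docker_data_root() {
    awk -v dir="$2" '
        !inserted && /^[ \t]*[{][ \t]*$/ {
            print
            printf "    \"data-root\": \"%s\",\n", dir
            inserted = 1
            next
        }
        { print }
    ' "$1"
}
```

After checking the output, it can be copied over /etc/docker/daemon.json with sudo. Docker reads this file at startup, so the daemon typically needs a restart before the new directory takes effect.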
Pull the NeMo-Megatron Training container and extract the scripts to be used for training:
srun -p ndmv4 -N 1 --container-mounts=/shared/ngc_scripts:/workspace/mount_dir --container-image=nvcr.io/ea-bignlp/bignlp-training:22.06-hotfix.01-py3 bash -c "cp -r /opt/bignlp/bignlp-scripts /opt/bignlp/bignlp-hp-tool /workspace/mount_dir/"
Install the requirements and dependencies:
Update the requirements on the scheduler:
cd /shared/ngc_scripts/bignlp-scripts
pip3 install -r requirements.txt
pip3 install tensorboard
Test the cluster:
Before cluster validation (which includes NCCL testing and DCGM Diagnostics), we need to modify a few scripts to point to the correct versions of software stacks (like HPCx and PyTorch), as well as add important flags (like number of GPUs per node):
cd /shared/ngc_scripts/bignlp-scripts/csp/azure
sed -i 's/hpcx-v2.9.0-gcc-MLNX_OFED_LINUX-5.4-1.0.3.0-ubuntu18.04-x86_64/hpcx-v2.9.0-gcc-MLNX_OFED_LINUX-5.4-1.0.3.0-ubuntu20.04-x86_64/g' nccl.sh
sed -i '21,29d' nccl.sh
sed -i 's/pytorch:21.09-py3/pytorch:22.06-py3/g' build-nccl-tests.sh
sed -i 's/hpcx-v2.9.0-gcc-MLNX_OFED_LINUX-5.4-1.0.3.0-ubuntu18.04-x86_64/hpcx-v2.9.0-gcc-MLNX_OFED_LINUX-5.4-1.0.3.0-ubuntu20.04-x86_64/g' build-nccl-tests.sh
sed -i 's/-w $NODES/-w $NODES --gpus-per-node 8/g' cluster_validation.sh
sed -i 's/--parsable/--parsable --gpus-per-node 8/g' cluster_validation.sh
sed -i 's/bash -c/--gpus-per-node 8 bash -c/g' dcgmi_diag.sh
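The sed -i commands above change the scripts silently, and a substitution whose pattern does not match leaves the file untouched without any warning. That can happen if the scripts in your container differ from the ones this guide was written against. A minimal sketch of a dry run follows; the function name preview_sed is hypothetical.

```shell
# Hypothetical helper: show what a sed expression would change in a file,
# without modifying the file.
# $1 = sed expression, $2 = file
# Returns 0 if the expression would change something, non-zero otherwise.
preview_sed() {
    if sed "$1" "$2" | diff "$2" -; then
        status=0
    else
        status=$?
    fi
    case "$status" in
        1) return 0 ;;   # diff printed the lines that would change
        0) echo "no change: expression matched nothing in $2" >&2
           return 1 ;;
        *) return "$status" ;;   # diff itself failed, e.g. missing file
    esac
}
```

Running preview_sed with the same expression and file name before each sed -i shows the affected lines in diff format.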
Generate the topology file and copy it to all compute nodes:
mkdir /shared/topo
sbatch -p ndmv4 -N 1 -o /shared/topo/ndv4-topo.xml gentopo.sh
srun -p ndmv4 -N $NN bash -c "sudo mkdir -p /opt/microsoft && sudo cp /shared/topo/ndv4-topo.xml /opt/microsoft/"
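sbatch returns as soon as the job is queued, so the copy step can run before the topology file has been written. One option is to add the -W flag to sbatch so that it waits for the job to finish. Another is to poll the output file, as in the sketch below. The function name wait_for_text is hypothetical, and the closing tag used as the completion marker is an assumption; check what the last line of your generated file actually is.

```shell
# Hypothetical helper: wait until a file contains a given fixed string.
# $1 = file, $2 = string to look for, $3 = timeout in seconds (default 300)
wait_for_text() {
    file=$1
    text=$2
    remaining=${3:-300}
    while :; do
        if [ -f "$file" ] && grep -qF -- "$text" "$file"; then
            return 0
        fi
        if [ "$remaining" -le 0 ]; then
            break
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    echo "timed out waiting for '$text' in $file" >&2
    return 1
}

# Example (assumes the topology XML ends with a </system> tag):
#   wait_for_text /shared/topo/ndv4-topo.xml '</system>' 600
```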
Modify line 143 of the cluster_validation.sh file to specify the partition.
cd /shared/ngc_scripts/bignlp-scripts/csp/azure
vi cluster_validation.sh
line 143: sbatch -p ndmv4 -N 1 -W build-nccl-tests.sh > /dev/null 2> /dev/null
Run the tests. The DCGM diagnostics are expected to run for approximately 30 minutes, and the NCCL test takes about 15 more minutes on the first run. The NCCL test does not run if the DCGM test fails, but you can still start it by adding the --nccl flag at the end of the command line.
cd /shared/ngc_scripts/bignlp-scripts/csp/azure
bash cluster_validation.sh --nodes=$NN --nodelist=<NAME_OF_YOUR_NODES> --partition=ndmv4
Get the gpt3 data for NeMo Megatron:
Modify the configuration files:
vi /shared/ngc_scripts/bignlp-scripts/conf/cluster/bcm.yaml
change partition: ndmv4
and:
vi /shared/ngc_scripts/bignlp-scripts/conf/data_preparation/download_gpt3_pile.yaml
change file_numbers: "0-1"
Change the following values in /shared/ngc_scripts/bignlp-scripts/conf/config.yaml
run_data_preparation: True
run_training: False
run_conversion: False
run_finetuning: False
run_evaluation: False
bignlp_path: /shared/ngc_scripts/bignlp-scripts
data_dir: /shared/data/NeMo
container_mounts:
- /opt/microsoft:/opt/microsoft
env_vars:
NCCL_TOPO_FILE: /opt/microsoft/ndv4-topo.xml
UCX_IB_PCI_RELAXED_ORDERING: auto
NCCL_IB_PCI_RELAXED_ORDERING: 2
NCCL_IB_TIMEOUT: 22
Start downloading the data with the following command:
cd /shared/ngc_scripts/bignlp-scripts
HYDRA_FULL_ERROR=1 python3 main.py \
training=gpt3/126m \
training.run.name=data_preparation \
training.run.time_limit="10:00:00" \
training.trainer.num_nodes=$NN \
training.trainer.max_steps=200 \
training.trainer.log_every_n_steps=1 \
training.exp_manager.resume_if_exists=False \
training.model.micro_batch_size=4 \
training.model.tensor_model_parallel_size=1 \
training.model.pipeline_model_parallel_size=1 \
training.model.activations_checkpoint_num_layers=0
Run NeMo Megatron:
Modify the number of shards and their weights in each yaml file under /shared/ngc_scripts/bignlp-scripts/conf/training/gpt3. You should have only two shards with a 0.5 weight each. The file should look as follows:
- 0.5
- ${data_dir}/my-gpt3_00_text_document
- 0.5
- ${data_dir}/my-gpt3_01_text_document
Switch from data preparation to training. These flags are in config.yaml, the same file edited for data preparation:
vi /shared/ngc_scripts/bignlp-scripts/conf/config.yaml
change run_data_preparation: False
run_training: True
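Switching stages by hand is easy to get wrong when you go back and forth between data preparation and training. The sketch below flips one flag at a time. It assumes the run_* flags live in conf/config.yaml, as in the data preparation step, each on its own line in the form key: True or key: False. The function name set_stage_flag is hypothetical. Any trailing comment on the edited line is dropped.

```shell
# Hypothetical helper: set a "key: True|False" flag in a YAML file.
# $1 = file, $2 = key, $3 = True or False
set_stage_flag() {
    file=$1
    key=$2
    value=$3
    case "$value" in
        True|False) ;;
        *) echo "value must be True or False" >&2; return 1 ;;
    esac
    if ! grep -q "^[[:space:]]*${key}:" "$file"; then
        echo "key '${key}' not found in ${file}" >&2
        return 1
    fi
    tmp_file=$(mktemp) || return 1
    # Keep the indentation and key, replace everything after the colon.
    if sed "s/^\([[:space:]]*${key}:\).*/\1 ${value}/" "$file" > "$tmp_file"; then
        cat "$tmp_file" > "$file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp_file"
    return "$status"
}

# Example:
#   cfg=/shared/ngc_scripts/bignlp-scripts/conf/config.yaml
#   set_stage_flag "$cfg" run_data_preparation False
#   set_stage_flag "$cfg" run_training True
```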
Use the following command to start training. Change the arguments according to the benchmark you want to run: model, run name, number of nodes, micro batch size (MBS), tensor parallelism (TP) and pipeline parallelism (PP). This step is expected to take several hours.
cd /shared/ngc_scripts/bignlp-scripts
HYDRA_FULL_ERROR=1 python3 main.py \
training=gpt3/126m \
training.run.name=gpt3_126m-1n-tp1-pp1-mbs1 \
training.run.time_limit="10:00:00" \
training.trainer.num_nodes=$NN \
training.trainer.max_steps=200 \
training.trainer.log_every_n_steps=1 \
training.exp_manager.resume_if_exists=False \
training.model.micro_batch_size=4 \
training.model.tensor_model_parallel_size=1 \
training.model.pipeline_model_parallel_size=1 \
training.model.activations_checkpoint_num_layers=0
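The run name is only a label, but it is easy for it to drift from the actual settings: the example above is named mbs1 while the command sets micro_batch_size=4. Deriving the name from the same values that go into the command keeps the two in sync. The sketch below assumes the naming pattern model-Nn-tpX-ppY-mbsZ, inferred from the example name; the function name make_run_name is hypothetical.

```shell
# Hypothetical helper: build a run name in the pattern model-Nn-tpX-ppY-mbsZ.
# $1 = model (e.g. gpt3_126m), $2 = number of nodes, $3 = TP, $4 = PP, $5 = MBS
make_run_name() {
    if [ "$#" -ne 5 ]; then
        echo "usage: make_run_name <model> <nodes> <tp> <pp> <mbs>" >&2
        return 1
    fi
    printf '%s-%sn-tp%s-pp%s-mbs%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example:
#   training.run.name=$(make_run_name gpt3_126m "$NN" 1 1 4)
```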
Get the results:
In a separate terminal, connect to your scheduler using the following ssh command:
ssh -L 4444:localhost:4444 <USER NAME>@<IP ADDRESS>
Start TensorBoard and point to the results directory for NeMo-Megatron:
tensorboard --logdir=/shared/ngc_scripts/bignlp-scripts/results --port=4444
Open TensorBoard in your web browser to view the results:
http://localhost:4444/
Under the train_step_timing window you can find the time it takes per global step. Calculating the average of a few steps, for example 149, 169 and 189, gives a good indication of the time per step once the steady state is reached.
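The averaging described above is plain arithmetic and can be done on the command line once you have read the step times off the train_step_timing chart. This is a minimal sketch; the function name average_values is hypothetical, and the numbers in the example are placeholders, not measured results.

```shell
# Hypothetical helper: print the arithmetic mean of its numeric arguments,
# rounded to three decimal places.
average_values() {
    if [ "$#" -eq 0 ]; then
        echo "usage: average_values <number>..." >&2
        return 1
    fi
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.3f\n", sum / NR }'
}

# Example with placeholder values for steps 149, 169 and 189:
#   average_values 1.0 2.0 3.0
```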
To learn more about this milestone or the full results, please see the following links.
- Azure Scales 530B Parameter GPT-3 Model with NVIDIA NeMo Megatron
- A quick start guide to benchmarking LLM models in Azure: NVIDIA NeMo Megatron - Results