A quick start guide to benchmarking LLM models in Azure: NVIDIA NeMo Megatron – Results


By Hugo Affaticati (Technical Program Manager) and Jon Shelley (Principal TPM Manager)

 

Useful resources: 

NeMo Megatron from NVIDIA: NVIDIA NeMo Megatron

Container from NVIDIA: NVIDIA NGC

 

 

Below are the full results obtained with NVIDIA NeMo Megatron and Azure NDm A100 v4-series virtual machines (VMs), together with a discussion of the parameters. NVIDIA NeMo Megatron is an end-to-end framework for training and deploying large language models (LLMs) with millions to billions of parameters.

 

 

Full results:  

 

All the results were obtained with the 22.06-hotfix container and the BF16 data type on the GPT-3 architecture. Performance is based on the time taken per step to train the model after the steady state is reached. We performed training runs for models ranging from 126 million to 530 billion parameters using 1 to 175 NDm A100 v4 virtual machines. Each NDm A100 v4 VM is powered by eight NVIDIA A100 80GB Tensor Core GPUs and NVIDIA Mellanox HDR InfiniBand networking to scale out to multiple nodes for distributed training of LLMs. Find the detailed results below.
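As a rough illustration of that methodology, the snippet below discards the warm-up steps and averages the remaining per-step timings; the timing values and warm-up count are placeholders for illustration, not output from the NeMo container.

```python
# Minimal sketch: estimate steady-state training time per step from a list of
# per-step durations in seconds. The values below are illustrative placeholders.
step_times = [3.1, 1.4, 1.0, 0.92, 0.91, 0.90, 0.90, 0.89, 0.91, 0.90]

WARMUP_STEPS = 5  # drop the early steps taken before the run reaches steady state

steady_state = step_times[WARMUP_STEPS:]
time_per_step = sum(steady_state) / len(steady_state)
print(f"Steady-state training time per step: {time_per_step:.2f} s")
```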

 

Model    Nodes    TP    PP    MBS    Training time/step (seconds)
126M     1        1     1     4      0.9
126M     2        1     1     4      0.5
126M     4        1     1     4      0.2
126M     8        1     1     4      0.1
5B       2        2     1     2      37.4
5B       10       2     1     2      7.7
5B       20       2     1     2      4.3
20B      36       2     4     1      9.3
40B      16       4     4     1      36.1
40B      32       4     4     1      18.8
175B     32       8     8     2      79.9
175B     96       8     8     2      29.1
175B     128      8     8     2      22.8
530B     105      8     35    1      88.2
530B     140      8     35    1      67.4
530B     175      8     35    1      55.7

 

 

Influence of the parameters:  

 

Number of nodes

Increasing the number of nodes reduces the training time per global step, provided the workload uses the full capacity of the nodes. Under that condition, Azure achieves excellent scaling. As shown in the graph below (Figure 1), with the 530B model the scaling efficiency at 140 nodes is 98.1% relative to the 105-node run, and at 175 nodes it is 95.0%.
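For reference, the 98.1% and 95.0% figures can be recomputed from the 530B rows of the results table: scaling efficiency is the measured speedup divided by the ideal (linear) speedup relative to the 105-node run. A minimal sketch:

```python
# Scaling efficiency of the 530B runs, using the training time/step values above.
baseline_nodes, baseline_time = 105, 88.2  # seconds per step at 105 nodes

runs = {140: 67.4, 175: 55.7}  # nodes -> seconds per step

for nodes, step_time in runs.items():
    measured_speedup = baseline_time / step_time
    ideal_speedup = nodes / baseline_nodes
    print(f"{nodes} nodes: {measured_speedup / ideal_speedup:.1%} of linear scaling")
# 140 nodes: 98.1% of linear scaling
# 175 nodes: 95.0% of linear scaling
```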

 


Figure 1 – Speedup in training time per step for the NeMo Megatron GPT-3 architecture with the 530B model

 

 

Tensor model parallel size (TP)

Since the models for NeMo Megatron are too large to fit in the memory of a single GPU, they are split across several GPUs and nodes using tensor (intra-layer) and pipeline (inter-layer) model parallelism. Tensor model parallelism partitions the individual transformer layers across several devices. The TP and PP parameters are correlated and can be tuned for optimal performance. From the table below, we can conclude that the higher the TP, the slower the global step. While one could be tempted to decrease the value of TP, doing so forces an increase of the PP parameter, since the model must still fit across the available GPU memory.

 

Model    Nodes    TP    PP    MBS    Training time/step (seconds)
40B      32       8     4     1      21.3
40B      32       4     4     1      18.8

Figure 2 – Influence of the TP parameter on the training time per step
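The coupling between TP and PP comes from how the GPUs are carved up: each copy of the model spans TP x PP GPUs, and the remaining factor becomes the data-parallel size. The helper below is an illustrative sketch of that bookkeeping, not part of NeMo Megatron.

```python
def parallel_layout(num_nodes: int, gpus_per_node: int, tp: int, pp: int) -> dict:
    """Describe how a cluster is split for a given TP/PP choice (illustrative only)."""
    total_gpus = num_nodes * gpus_per_node
    gpus_per_replica = tp * pp  # GPUs holding one full copy of the model
    if total_gpus % gpus_per_replica:
        raise ValueError("TP * PP must divide the total number of GPUs")
    return {
        "total_gpus": total_gpus,
        "gpus_per_model_replica": gpus_per_replica,
        "data_parallel_size": total_gpus // gpus_per_replica,
    }

# 40B model on 32 NDm A100 v4 VMs (8 GPUs each), matching the two rows above:
print(parallel_layout(32, 8, tp=8, pp=4))  # 32 GPUs per replica, 8 data-parallel copies
print(parallel_layout(32, 8, tp=4, pp=4))  # 16 GPUs per replica, 16 data-parallel copies
```

Lowering TP (or PP) shrinks each replica, so the model weights and activations must still fit in the memory of the fewer GPUs that remain per copy.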

 

Pipeline model parallel size (PP)

Similarly to tensor model parallelism, we studied the influence of pipeline model parallelism (the PP parameter). Pipeline model parallelism partitions the model layers into stages and spreads those stages over multiple GPUs. From the table below, with the 126M model the training takes 1.5 times longer if we double PP and 2.5 times longer if we quadruple it.

 

Model    Nodes    TP    PP    MBS    Training time/step (seconds)
126M     4        1     1     4      0.2
126M     4        1     2     4      0.3
126M     4        1     4     4      0.5

Figure 3 – Influence of the PP parameter on the training time per step
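The 1.5x and 2.5x slowdowns quoted above follow directly from those rows; a quick check:

```python
# Slowdown of the 126M model (4 nodes, TP=1, MBS=4) relative to PP=1,
# using the training time/step values from the table above.
times = {1: 0.2, 2: 0.3, 4: 0.5}  # PP -> seconds per step

baseline = times[1]
for pp, step_time in times.items():
    print(f"PP={pp}: {step_time / baseline:.1f}x the PP=1 step time")
# PP=1: 1.0x, PP=2: 1.5x, PP=4: 2.5x
```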

 

As shown above, decreasing both the TP and PP parameters reduces the training time. From Figure 4 below, where the value of TP*PP is kept constant, one can conclude that the higher the TP, the slower the global step. It is therefore recommended to favor a lower TP when running these models. The NDm A100 v4-series VMs on Azure allow you to reach low values of both TP and PP across all models, which translates into faster global steps.

 

Model    Nodes    TP    PP    MBS    Training time/step (seconds)
20B      36       8     1     1      15.8
20B      36       2     4     1      9.3

Figure 4 – Influence of the TP and PP parameters on the training time per step with TP*PP constant
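Both 20B configurations above use the same amount of hardware, which is what makes the comparison meaningful: with 36 nodes of 8 GPUs and the same TP x PP product, each model copy spans 8 GPUs either way, so the 6.5-second gap comes from how those 8 GPUs communicate, not from how many there are. A quick check:

```python
# Both 20B rows use 36 nodes x 8 GPUs = 288 GPUs and the same TP * PP product.
total_gpus = 36 * 8
for tp, pp in [(8, 1), (2, 4)]:
    gpus_per_replica = tp * pp
    print(f"TP={tp}, PP={pp}: {gpus_per_replica} GPUs per replica, "
          f"{total_gpus // gpus_per_replica} data-parallel copies")
# Both print 8 GPUs per replica and 36 data-parallel copies; only the split
# between tensor and pipeline parallelism differs.
```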

 

Mini batch size (MBS)

Simply put, the higher the MBS, the faster the global step. The highest value of MBS is set by the memory limit of the GPU (80 GB per GPU for NDm A100 v4-series VMs on Azure).
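For readers who want to experiment with these knobs, TP, PP, and MBS are exposed as configuration options of NeMo Megatron's GPT training recipes (commonly named tensor_model_parallel_size, pipeline_model_parallel_size, and micro_batch_size, the last corresponding to the MBS discussed here). The snippet below only sketches how two runs from the results table could be described with those names; it is not a complete or verbatim NeMo configuration.

```python
# Hedged sketch: the parallelism settings behind two runs from the results table,
# written with the option names NeMo Megatron commonly uses for them.
# This is not a full NeMo config, only the knobs discussed in this post.
runs = {
    "5B on 20 nodes": {
        "tensor_model_parallel_size": 2,    # TP
        "pipeline_model_parallel_size": 1,  # PP
        "micro_batch_size": 2,              # MBS
    },
    "175B on 128 nodes": {
        "tensor_model_parallel_size": 8,
        "pipeline_model_parallel_size": 8,
        "micro_batch_size": 2,
    },
}

for name, settings in runs.items():
    print(name, settings)
```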

 

 

Recreate the results in Azure

 

To learn more about the milestone or how to recreate the results, please see the following link. 

 
