A quick start guide to benchmarking LLM models in Azure: NVIDIA NeMo Megatron – Results


By Hugo Affaticati (Technical Program Manager) and Jon Shelley (Principal TPM Manager)

 

Useful resources: 

NeMo Megatron from NVIDIA: NVIDIA NeMo Megatron

Container from NVIDIA: NVIDIA NGC

 

 

Below are the full results obtained with NVIDIA NeMo Megatron and Azure NDm A100 v4-series virtual machines (VMs), together with a discussion of the parameters. NVIDIA NeMo Megatron is an end-to-end framework for training and deploying large language models (LLMs) with millions to billions of parameters.

 

 

Full results:  

 

All the results were obtained with the 22.06-hotfix container and the BF16 data type on the GPT-3 architecture. Performance is based on the time taken per step to train the model after the steady state is reached. We performed training runs for models ranging from 126 million to 530 billion parameters using 1 to 175 NDm A100 v4 virtual machines. Each NDm A100 v4 VM is powered by eight NVIDIA A100 80GB Tensor Core GPUs and NVIDIA Mellanox HDR InfiniBand networking to scale out to multiple nodes for distributed training of LLMs. Find the detailed results below.
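As a rough illustration of that methodology, the snippet below discards the warm-up steps and averages the remaining per-step timings; the timing values and warm-up count are placeholders for illustration, not output from the NeMo container.

```python
# Minimal sketch: estimate steady-state training time per step from a list of
# per-step durations in seconds. The values below are illustrative placeholders.
step_times = [3.1, 1.4, 1.0, 0.92, 0.91, 0.90, 0.90, 0.89, 0.91, 0.90]

WARMUP_STEPS = 5  # drop the early steps taken before the run reaches steady state

steady_state = step_times[WARMUP_STEPS:]
time_per_step = sum(steady_state) / len(steady_state)
print(f"Steady-state training time per step: {time_per_step:.2f} s")
```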

 

Model    Nodes    TP    PP    MBS    Training time/step (seconds)
126M     1        1     1     4      0.9
126M     2        1     1     4      0.5
126M     4        1     1     4      0.2
126M     8        1     1     4      0.1
5B       2        2     1     2      37.4
5B       10       2     1     2      7.7
5B       20       2     1     2      4.3
20B      36       2     4     1      9.3
40B      16       4     4     1      36.1
40B      32       4     4     1      18.8
175B     32       8     8     2      79.9
175B     96       8     8     2      29.1
175B     128      8     8     2      22.8
530B     105      8     35    1      88.2
530B     140      8     35    1      67.4
530B     175      8     35    1      55.7

 

 

Influence of the parameters:  

 

Number of nodes

Increasing the number of nodes reduces the training time per global step, provided the workload uses the full capacity of the nodes. Under that condition, Azure achieves excellent scaling. As shown in the graph below (Figure 1), with the 530B model the scaling efficiency at 140 nodes is 98.1% relative to the 105-node run, and at 175 nodes it is 95.0%.
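For reference, the 98.1% and 95.0% figures can be recomputed from the 530B rows of the results table: scaling efficiency is the measured speedup divided by the ideal (linear) speedup relative to the 105-node run. A minimal sketch:

```python
# Scaling efficiency of the 530B runs, using the training time/step values above.
baseline_nodes, baseline_time = 105, 88.2  # seconds per step at 105 nodes

runs = {140: 67.4, 175: 55.7}  # nodes -> seconds per step

for nodes, step_time in runs.items():
    measured_speedup = baseline_time / step_time
    ideal_speedup = nodes / baseline_nodes
    print(f"{nodes} nodes: {measured_speedup / ideal_speedup:.1%} of linear scaling")
# 140 nodes: 98.1% of linear scaling
# 175 nodes: 95.0% of linear scaling
```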

 


Figure 1 – Speedup in training time per step for the NeMo Megatron GPT-3 architecture with the 530B model

 

 

Tensor model parallel size (TP)

Since the models for NeMo Megatron are too large to fit in the memory of a single GPU, they are split across several GPUs and nodes using tensor (intra-layer) and pipeline (inter-layer) model parallelism. Tensor model parallelism partitions the individual transformer layers across several devices. The TP and PP parameters are correlated and can be tuned for optimal performance. From the table below, we can conclude that the higher the TP, the slower the global step. While one could be tempted to decrease the value of TP, doing so forces an increase of the PP parameter, since the model must still fit across the available GPU memory.

 

Model    Nodes    TP    PP    MBS    Training time/step (seconds)
40B      32       8     4     1      21.3
40B      32       4     4     1      18.8

Figure 2 – Influence of the TP parameter on the training time per step
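The coupling between TP and PP comes from how the GPUs are carved up: each copy of the model spans TP x PP GPUs, and the remaining factor becomes the data-parallel size. The helper below is an illustrative sketch of that bookkeeping, not part of NeMo Megatron.

```python
def parallel_layout(num_nodes: int, gpus_per_node: int, tp: int, pp: int) -> dict:
    """Describe how a cluster is split for a given TP/PP choice (illustrative only)."""
    total_gpus = num_nodes * gpus_per_node
    gpus_per_replica = tp * pp  # GPUs holding one full copy of the model
    if total_gpus % gpus_per_replica:
        raise ValueError("TP * PP must divide the total number of GPUs")
    return {
        "total_gpus": total_gpus,
        "gpus_per_model_replica": gpus_per_replica,
        "data_parallel_size": total_gpus // gpus_per_replica,
    }

# 40B model on 32 NDm A100 v4 VMs (8 GPUs each), matching the two rows above:
print(parallel_layout(32, 8, tp=8, pp=4))  # 32 GPUs per replica, 8 data-parallel copies
print(parallel_layout(32, 8, tp=4, pp=4))  # 16 GPUs per replica, 16 data-parallel copies
```

Lowering TP (or PP) shrinks each replica, so the model weights and activations must still fit in the memory of the fewer GPUs that remain per copy.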

 

Pipeline model parallel size (PP)

Similarly to tensor model parallelism, we studied the influence of pipeline model parallelism (the PP parameter). Pipeline model parallelism partitions the model layers into stages and spreads those stages over multiple GPUs. From the table below, with the 126M model the training takes 1.5 times longer if we double PP and 2.5 times longer if we quadruple it.

 

Model    Nodes    TP    PP    MBS    Training time/step (seconds)
126M     4        1     1     4      0.2
126M     4        1     2     4      0.3
126M     4        1     4     4      0.5

Figure 3 – Influence of the PP parameter on the training time per step
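The 1.5x and 2.5x slowdowns quoted above follow directly from those rows; a quick check:

```python
# Slowdown of the 126M model (4 nodes, TP=1, MBS=4) relative to PP=1,
# using the training time/step values from the table above.
times = {1: 0.2, 2: 0.3, 4: 0.5}  # PP -> seconds per step

baseline = times[1]
for pp, step_time in times.items():
    print(f"PP={pp}: {step_time / baseline:.1f}x the PP=1 step time")
# PP=1: 1.0x, PP=2: 1.5x, PP=4: 2.5x
```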

 

As shown above, decreasing both the TP and PP parameters reduces the training time. From Figure 4 below, where the value of TP*PP is kept constant, one can conclude that the higher the TP, the slower the global step. It is therefore recommended to favor a lower TP when running these models. The NDm A100 v4-series VMs on Azure allow you to reach low values of both TP and PP across all models, which translates into faster global steps.

 

Model    Nodes    TP    PP    MBS    Training time/step (seconds)
20B      36       8     1     1      15.8
20B      36       2     4     1      9.3

Figure 4 – Influence of the TP and PP parameters on the training time per step with TP*PP constant
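Both 20B configurations above use the same amount of hardware, which is what makes the comparison meaningful: with 36 nodes of 8 GPUs and the same TP x PP product, each model copy spans 8 GPUs either way, so the 6.5-second gap comes from how those 8 GPUs communicate, not from how many there are. A quick check:

```python
# Both 20B rows use 36 nodes x 8 GPUs = 288 GPUs and the same TP * PP product.
total_gpus = 36 * 8
for tp, pp in [(8, 1), (2, 4)]:
    gpus_per_replica = tp * pp
    print(f"TP={tp}, PP={pp}: {gpus_per_replica} GPUs per replica, "
          f"{total_gpus // gpus_per_replica} data-parallel copies")
# Both print 8 GPUs per replica and 36 data-parallel copies; only the split
# between tensor and pipeline parallelism differs.
```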

 

Mini batch size (MBS)

Simply put, the higher the MBS, the faster the global step. The highest value of MBS is set by the memory limit of the GPU (80 GB per GPU for NDm A100 v4-series VMs on Azure).
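For readers who want to experiment with these knobs, TP, PP, and MBS are exposed as configuration options of NeMo Megatron's GPT training recipes (commonly named tensor_model_parallel_size, pipeline_model_parallel_size, and micro_batch_size, the last corresponding to the MBS discussed here). The snippet below only sketches how two runs from the results table could be described with those names; it is not a complete or verbatim NeMo configuration.

```python
# Hedged sketch: the parallelism settings behind two runs from the results table,
# written with the option names NeMo Megatron commonly uses for them.
# This is not a full NeMo config, only the knobs discussed in this post.
runs = {
    "5B on 20 nodes": {
        "tensor_model_parallel_size": 2,    # TP
        "pipeline_model_parallel_size": 1,  # PP
        "micro_batch_size": 2,              # MBS
    },
    "175B on 128 nodes": {
        "tensor_model_parallel_size": 8,
        "pipeline_model_parallel_size": 8,
        "micro_batch_size": 2,
    },
}

for name, settings in runs.items():
    print(name, settings)
```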

 

 

Recreate the results in Azure

 

To learn more about the milestone or how to recreate the results, please see the following link. 

 
