Productivity and Training Acceleration with Azure Container for PyTorch


Introduction

 

The Azure Container for PyTorch (ACPT) is a curated environment in the Azure Machine Learning (AzureML) service with the latest PyTorch version and optimization software such as DeepSpeed and ONNX Runtime for training. The image is tested and optimized for the Azure AI infrastructure and is ready for model developers to use in their training scenarios. The General Availability release of ACPT in AzureML launched with PyTorch 2.0 and Nebula for efficient, high-performance training.
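As a minimal illustration of how the ACPT curated environment can be used from the AzureML Python SDK, the sketch below submits a training job that runs on it. The workspace details, compute target, and the exact curated-environment name and version are placeholders; check the Environments page in AzureML studio for the ACPT environment available in your workspace.

```python
# Minimal sketch: submitting a training job on an ACPT curated environment.
# Workspace details, compute name, and the environment name/version below are
# placeholders; look up the current ACPT environment in AzureML studio.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="./src",                                    # folder containing train.py
    command="python train.py --epochs 5",
    environment="AzureML-ACPT-pytorch-2.0-cuda11.7@latest",  # illustrative ACPT name
    compute="gpu-cluster",                           # existing GPU compute target
)

ml_client.jobs.create_or_update(job)
```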

 

The Microsoft Ads team serves the Microsoft Search Network, which powers 37.5% of U.S. desktop searches and 7.2 billion monthly searches around the globe. In the U.S., the Microsoft Search Network has 117 million unique searchers (source: Microsoft Advertising | Search Engine Marketing (SEM) & more). The team's mission is to empower every user and advertiser in the network to achieve more. As part of this mission, the Ads team builds state-of-the-art Ads relevance quality models to deliver the most satisfying experience for both users and advertisers, including TwinBERT, which continually contributes to Ads quality improvements.

 

TwinBERT is a state-of-the-art distilled model with twin-structured BERT-like encoders that represent the query and the document respectively, and a crossing layer that combines the two embeddings. It is a highly efficient transformer-based model whose CPU inference latency is reduced to milliseconds. This allows TwinBERT to be served in large-scale systems while keeping performance comparable to BERT. Due to its high effectiveness and efficiency, TwinBERT is widely adopted across different components of the Microsoft Ads stack with significant business impact.
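For readers unfamiliar with the twin-encoder idea, the sketch below outlines the general structure described above: two BERT-like encoders produce query and document embeddings, and a crossing layer combines them into a relevance score. This is an illustrative approximation, not the actual TwinBERT implementation; the encoder checkpoint and the crossing-layer design are assumptions.

```python
# Illustrative twin-encoder sketch (not the actual TwinBERT implementation).
# The encoder checkpoint and the crossing-layer design are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class TwinEncoderSketch(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased", hidden_size: int = 768):
        super().__init__()
        self.query_encoder = AutoModel.from_pretrained(encoder_name)
        self.doc_encoder = AutoModel.from_pretrained(encoder_name)
        # A simple feed-forward crossing layer over the concatenated embeddings.
        self.crossing = nn.Sequential(
            nn.Linear(hidden_size * 2, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, query_inputs: dict, doc_inputs: dict) -> torch.Tensor:
        # Use the [CLS] token embedding from each encoder.
        q = self.query_encoder(**query_inputs).last_hidden_state[:, 0]
        d = self.doc_encoder(**doc_inputs).last_hidden_state[:, 0]
        return self.crossing(torch.cat([q, d], dim=-1))
```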

 

This blog presents a TwinBERT model trained with ACPT and demonstrates the significant improvements in developer productivity and training efficiency that resulted.

 

Developer Productivity 

 

The ACPT support for various acceleration techniques makes it effortless for users to optimize their training process and bring their projects to fruition in a timely manner. By taking advantage of ACPT, users can focus on their own research, development, and experimentation instead of worrying about the compatibility and stability of their development environment, which leads to faster model development and a quicker path to market.

 

Incorporating ACPT into our machine learning workflow has never been easier. By selecting the latest version of the ACPT Docker image and adding the libraries needed for our specific training task, we were able to quickly create a customized Docker image for our development work. The latest versions of the ONNX Runtime and DeepSpeed frameworks have significantly simplified the integration of ACPT into our workflow, requiring only minor code changes.
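As a sketch of what this customization can look like with the AzureML Python SDK, the snippet below layers task-specific dependencies on top of an ACPT base image. The image URI, conda file path, and names are illustrative placeholders; the current ACPT image tag can be found in the environment details shown in Figure 2.

```python
# Sketch: building a custom AzureML environment on top of an ACPT base image.
# The image URI, conda file path, and names are illustrative placeholders.
from azure.ai.ml.entities import Environment

custom_env = Environment(
    name="twinbert-acpt-custom",
    image="mcr.microsoft.com/azureml/curated/acpt-pytorch-2.0-cuda11.7:latest",  # assumed tag
    conda_file="./environment/extra-deps.yml",   # task-specific libraries layered on top
    description="ACPT base image plus TwinBERT training dependencies",
)

ml_client.environments.create_or_update(custom_env)  # ml_client from the earlier sketch
```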

 


Figure 1: Selecting the latest ACPT image in AzureML 

 

 


Figure 2: ACPT Image details 

 

Our best practice is to store customized Docker images in a cloud registry such as Azure Container Registry. By sharing these images via links, TwinBERT users can work in the same containerized environment without having to set up their own separate instances, letting them focus on their own tasks with the provided code and development frameworks.
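For example, once a customized image has been pushed to a shared registry, teammates can reference it directly in their own jobs. The registry path and environment name below are hypothetical.

```python
# Sketch: consuming a shared, pre-built TwinBERT image from a container registry.
# The registry path and environment name are hypothetical.
from azure.ai.ml.entities import Environment

shared_env = Environment(
    name="twinbert-shared",
    image="<myregistry>.azurecr.io/twinbert-train:latest",  # shared image pushed by the team
    description="Shared TwinBERT training environment built on ACPT",
)

ml_client.environments.create_or_update(shared_env)
```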

 

Efficiency Improvement 

 

Training state-of-the-art deep learning models such as TwinBERT can be a challenging and time-intensive task, especially when working with large amounts of data. With traditional deep learning frameworks, training a TwinBERT model can take several days, making it difficult for developers to experiment with new ideas and iterate quickly. Fortunately, by leveraging cutting-edge advancements in parallel computing and hardware acceleration through ACPT, our development cycle can be significantly accelerated, allowing us to spend more time on innovation and experimentation instead of waiting for models to train. If you are a developer looking to accelerate your deep learning workflow, ACPT is a great option.

 

We would like to highlight the training improvements made possible by the ACPT image. The table below compares acceleration techniques applied during training, and it is evident that the combination of DeepSpeed and ORT yields remarkable gains: a 25% training speedup and a 37% reduction in GPU memory usage (from 84% to 53%), enabling users to train their models faster and more efficiently. This combination also maintains comparable AUC, demonstrating that these acceleration techniques improve efficiency without sacrificing accuracy. The training configuration was V100-32G machines with batch size = 4096 for 5 epochs. Figure 3 visualizes the data summarized in Table 1.

 


Figure 3: Throughput and Memory improvements with ACPT Technologies 

 

Setting                          | Normalized Training Speedup | GPU Memory Usage (%)
PyTorch                          | 1.00                        | 84
PyTorch + DeepSpeed              | 1.09                        | 63
PyTorch + ORTModule              | 1.22                        | 65
PyTorch + DeepSpeed + ORTModule  | 1.25                        | 53

Table 1: Throughput and Memory Usage using ACPT
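The "minor code changes" mentioned earlier typically amount to wrapping the model with ORTModule and handing it to DeepSpeed. The sketch below is illustrative and assumes a standard PyTorch training loop; the model constructor, dataloader, and DeepSpeed configuration values are placeholders rather than the actual TwinBERT settings.

```python
# Illustrative sketch of combining ORTModule and DeepSpeed in a PyTorch training loop.
# build_twinbert_model, train_dataloader, and ds_config are placeholders.
import deepspeed
from onnxruntime.training.ortmodule import ORTModule

model = build_twinbert_model()     # hypothetical model constructor
model = ORTModule(model)           # enable ONNX Runtime training acceleration

ds_config = {
    "train_micro_batch_size_per_gpu": 4096,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 1},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for batch in train_dataloader:     # hypothetical dataloader yielding model inputs
    loss = model_engine(**batch)   # assumes the model returns a scalar loss
    model_engine.backward(loss)
    model_engine.step()
```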

 

Looking Forward 

 

For the Bing Ads team, working with the ACPT image has greatly simplified the training process and opened new opportunities for model development. We invite model developers to try the new ACPT image to accelerate their model training tasks. Along with the efficiencies of the latest version of PyTorch, you can also leverage ORT and DeepSpeed for training. ORT Training is available as the acceleration backend in the Hugging Face Optimum library. It is also interoperable across multiple hardware platforms (NVIDIA and AMD) and composes well with the optimizations in both torch.compile mode and DeepSpeed to deliver accelerated training performance for large-scale models. DeepSpeed has trained powerful language models such as Megatron (530B) and BLOOM and continues to bring the latest advancements in model training to its users. Large language model developers can use the ACPT image as an easy and efficient way to fine-tune and develop models on their domain-specific data.
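As a pointer for Hugging Face users, the sketch below shows the general shape of fine-tuning with the ORT backend through Optimum. The model checkpoint, dataset, and hyperparameters are placeholders, and the exact API surface may vary by Optimum version.

```python
# Hedged sketch of fine-tuning with the ORT training backend via Hugging Face Optimum.
# The checkpoint, dataset, and hyperparameters are placeholders; check the Optimum
# documentation for the API in your installed version.
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

args = ORTTrainingArguments(
    output_dir="./ort-finetune",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    optim="adamw_ort_fused",        # ORT fused optimizer (optional)
)

trainer = ORTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,    # hypothetical tokenized dataset
)
trainer.train()
```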
