Using Moneo to Characterize GPU & IB Networks for Deep Learning

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.

Contributed by: Rafael Salas

 

Moneo is Microsoft’s solution for characterizing hardware on distributed systems. To learn the basics of Moneo, please visit the previous blog post HERE.

 

We’ve recently expanded Moneo to include CPU, memory, and Ethernet metrics by default. This unlocks new optimization experiences for MPI workloads: we can observe how CPU and IB traffic interplay on a single node or across a collection of nodes to see whether the rank design should be changed. If the pre-configured metrics still don’t provide the time-series telemetry you need, we also offer custom metrics, which can be anything that can be polled on the node as a function call.
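To make the idea of a custom metric as "a function call polled on the node" concrete, here is a minimal sketch. It is not Moneo's actual interface: `poll_metrics`, `read_load_average`, and the `Metric` type are hypothetical names chosen for illustration, and the example assumes a Linux node where `/proc/loadavg` exists.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical convention: a custom metric is any zero-argument callable
# that returns a number when polled.
Metric = Callable[[], float]


def poll_metrics(
    metrics: Dict[str, Metric],
    samples: int,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.time,
) -> Dict[str, List[Tuple[float, float]]]:
    """Poll each metric `samples` times and return (timestamp, value) series."""
    series: Dict[str, List[Tuple[float, float]]] = {name: [] for name in metrics}
    for i in range(samples):
        now = clock()
        for name, fn in metrics.items():
            series[name].append((now, float(fn())))
        # Sleep between polls, but not after the last one.
        if interval_s > 0 and i < samples - 1:
            time.sleep(interval_s)
    return series


def read_load_average() -> float:
    """Example metric: 1-minute load average from /proc/loadavg (Linux only)."""
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])


if __name__ == "__main__":
    result = poll_metrics({"load1": read_load_average}, samples=3, interval_s=0.5)
    for timestamp, value in result["load1"]:
        print(f"{timestamp:.1f} load1={value}")
```

In a real deployment the collected samples would be written to the telemetry database rather than printed; the point of the sketch is only that any pollable function can become a time series.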

 

Moneo works well alongside a compute cluster managed by any job scheduler. A recommended solution in Azure is to deploy a SLURM cluster using Azure CycleCloud and enable Moneo (documentation for deployment resides HERE). SLURM offers a scheduling interface to the end user, Azure CycleCloud provides dynamic resources as needed by the SLURM scheduling environment, and Moneo takes us through the last mile by showing the underlying resource consumption of the jobs.

 

Let’s look at how managing and troubleshooting a deep learning workload might look using Moneo and SLURM in Azure. A managed environment is deployed in Azure using ND A100 v4 VMs, which are optimized for deep learning workloads. In Moneo, all the nodes emit telemetry to the database, and a Grafana-enabled web host serves the different cluster views. A neural-network training job lands on the cluster and uses NCCL collectives to synchronize the computation across the grid. You can see resource consumption cycle across compute iterations.

 

The primary cluster-wide view covers metrics such as GPU utilization, memory utilization, GPU power, and max throttle code. It is extremely useful for understanding the health of all the nodes at a glance. For example, the diagram below shows that the node circled in red is running hotter than the others. This lens can be narrowed further to show only the nodes that are working on a particular job, as defined by a host file.

outlier node.png

Fig 1: Cluster-wide view showing an outlier node
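The same "narrow to a job, then look for the odd node out" reasoning can be sketched in code. This is an illustration only, not how Moneo's dashboards are implemented: `parse_host_file` and `hot_outliers` are hypothetical helpers, the host file is assumed to list one hostname per line (optionally followed by fields such as `slots=8`), and the outlier rule is a standard median-absolute-deviation test swapped in for the visual comparison.

```python
from statistics import median
from typing import Dict, List, Optional, Set


def parse_host_file(text: str) -> Set[str]:
    """Return hostnames from a host file (assumed: one host per line, '#' comments)."""
    hosts: Set[str] = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.add(line.split()[0])  # ignore trailing fields like 'slots=8'
    return hosts


def hot_outliers(
    temps_by_node: Dict[str, float],
    hosts: Optional[Set[str]] = None,
    threshold: float = 3.5,
) -> List[str]:
    """Flag nodes whose temperature is unusually far ABOVE the group median.

    Uses the modified z-score 0.6745 * (x - median) / MAD; only the hot side
    is reported. If `hosts` is given, only those nodes are considered.
    """
    selected = {n: t for n, t in temps_by_node.items() if hosts is None or n in hosts}
    if len(selected) < 3:
        return []  # too few nodes to call anything an outlier
    med = median(selected.values())
    mad = median(abs(t - med) for t in selected.values())
    if mad == 0:
        return []  # all nodes (nearly) identical; nothing to flag
    return sorted(n for n, t in selected.items() if 0.6745 * (t - med) / mad > threshold)


if __name__ == "__main__":
    temps = {"n1": 60, "n2": 61, "n3": 59, "n4": 62, "n5": 80, "n6": 95}
    job_hosts = parse_host_file("n1\nn2 slots=8\nn3\nn4\nn5\n")
    print(hot_outliers(temps, job_hosts))
```

The median-based rule is used here because a single very hot node would inflate a mean and standard deviation and partly hide itself.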

 

The detailed view of Moneo helps narrow in on the specifics. For example, in the diagram below we can see that the outlier node above, while running hot, is not throttling. Moneo helps give a good understanding of how well the system is performing.

 

Detailed node view.png

Fig 2: Outlier node running hotter than others
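For readers who want to interpret a throttle code programmatically, the sketch below decodes a bitmask into named reasons. The bit values follow NVML's clocks-throttle-reasons constants; whether Moneo's "max throttle code" metric uses exactly this encoding is an assumption, and `decode_throttle` and `is_thermally_throttled` are hypothetical helper names.

```python
from typing import List

# Bit values as defined by NVML's nvmlClocksThrottleReason* constants.
# ASSUMPTION: the throttle code being decoded uses this same bitmask layout.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}

# Bits that explicitly indicate a temperature-driven slowdown.
# (hw_slowdown, 0x008, can be thermal or power related, so it is left out.)
THERMAL_MASK = 0x020 | 0x040


def decode_throttle(code: int) -> List[str]:
    """Return the names of all throttle reasons set in `code`, lowest bit first."""
    return [name for bit, name in THROTTLE_REASONS.items() if code & bit]


def is_thermally_throttled(code: int) -> bool:
    """True if any explicitly thermal throttle bit is set."""
    return bool(code & THERMAL_MASK)


if __name__ == "__main__":
    for code in (0x000, 0x004, 0x024):
        print(hex(code), decode_throttle(code), is_thermally_throttled(code))
```

A node that runs hot but reports a code with no thermal bits set matches the situation described above: warm, but not yet being slowed down because of temperature.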

 

Using Moneo, you can easily narrow down scenarios like a zombie process, i.e., the job has ended but a process is still running on a node. The primary cluster-wide view can help narrow down the specifics.

 

zombie1.png

Fig 3: Zombie process

 

Looking at the detailed view below, we can easily narrow down the node that still has a process running. This is especially useful when running a workload on hundreds of nodes.

 

zombie2.png
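The check performed visually above can be approximated in a few lines. This is a rough sketch under stated assumptions, not part of Moneo: `find_zombie_nodes` is a hypothetical helper, the input is assumed to be recent GPU utilization samples (in percent) per node plus the set of nodes the scheduler currently reports as running a job, and the thresholds are arbitrary illustrative defaults.

```python
from typing import Dict, List, Sequence, Set


def find_zombie_nodes(
    gpu_util_by_node: Dict[str, Sequence[float]],
    job_nodes: Set[str],
    busy_threshold: float = 5.0,
    min_busy_samples: int = 3,
) -> List[str]:
    """Return nodes that look busy even though no job is scheduled on them.

    A node is a suspect when it is NOT in `job_nodes` and its last
    `min_busy_samples` utilization samples are all above `busy_threshold`.
    """
    suspects = []
    for node, samples in gpu_util_by_node.items():
        if node in job_nodes:
            continue  # activity is expected on nodes with a running job
        recent = list(samples)[-min_busy_samples:]
        if len(recent) == min_busy_samples and all(u > busy_threshold for u in recent):
            suspects.append(node)
    return sorted(suspects)


if __name__ == "__main__":
    util = {
        "n1": [0, 0, 0, 0],       # idle, no job
        "n2": [90, 85, 88, 92],   # busy, but no job scheduled: suspect
        "n3": [95, 96, 97, 98],   # busy, job running: fine
    }
    print(find_zombie_nodes(util, job_nodes={"n3"}))
```

Requiring several consecutive busy samples avoids flagging a node for a single brief spike, which matters when scanning hundreds of nodes.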

 

 

To learn more about Moneo and see it in action, come visit our booth at SC (2433).

 

#AzureHPCAI #MakeAIYourReality

 
