This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.
Azure CycleCloud is an enterprise-friendly tool for orchestrating and managing High Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, and automatically size the infrastructure to run jobs efficiently at any scale. Through CycleCloud, users can create different types of file systems and mount them to the compute cluster nodes to support HPC workloads. Azure CycleCloud is targeted at HPC administrators and users who want to deploy an HPC environment with a specific scheduler in mind -- commonly used schedulers such as Slurm, PBS Professional, Grid Engine, and LSF are supported out of the box.
Coming to SC23? Visit us at the Microsoft booth to see a live demo!
8.5 Feature highlights
Node Health Checks and Overprovisioning
One of the main design goals of Azure CycleCloud is removing the complexity of orchestrating large scale, dynamic HPC environments. As compute sizes continue to grow, and customer workloads get more scalable, it’s important to ensure that all the virtual machines (VMs) deployed as part of the cluster are usable by jobs. While it’s possible to do this today with job prologues and custom scheduler integration, running these checks after the nodes have registered with the scheduler is usually too late and introduces even more delay before jobs can start running.
In CycleCloud 8.5, we’re introducing more robust node health checks that are run by the nodes before they’re ever registered with the scheduler. As nodes spin up, health checks will run to ensure network interfaces are configured correctly, InfiniBand connectivity is correct, and GPUs are healthy before the nodes are added to the scheduler. This can also happen as part of the overprovisioning process, so it’s entirely transparent to the end user when failures occur. If nodes fail their health checks, a diagnostic report is sent back to CycleCloud to obviate the need for keeping bad nodes around for debugging.
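The gating flow described above (run checks first, register only on success, send a report on failure) can be sketched in a few lines of Python. This is an illustrative sketch only, not CycleCloud's implementation: `run_health_checks`, `admit_node`, `HealthReport`, and the `register` and `report_failure` callbacks are all hypothetical names, and the real checks for network interfaces, InfiniBand, and GPUs are stood in for by plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A check returns (passed, detail). All names in this sketch are hypothetical.
Check = Callable[[], Tuple[bool, str]]


@dataclass
class HealthReport:
    healthy: bool
    results: Dict[str, Tuple[bool, str]]

    def failures(self) -> List[str]:
        """Names of the checks that did not pass."""
        return [name for name, (ok, _) in self.results.items() if not ok]


def run_health_checks(checks: Dict[str, Check]) -> HealthReport:
    """Run every check, even after a failure, so the report is complete."""
    results: Dict[str, Tuple[bool, str]] = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception as exc:  # a crashing check counts as a failure
            results[name] = (False, f"check raised {exc!r}")
    healthy = all(ok for ok, _ in results.values())
    return HealthReport(healthy, results)


def admit_node(checks: Dict[str, Check],
               register: Callable[[], None],
               report_failure: Callable[[HealthReport], None]) -> bool:
    """Register the node only if all checks pass; otherwise send the report."""
    report = run_health_checks(checks)
    if report.healthy:
        register()
        return True
    report_failure(report)
    return False
```

The point of the ordering is that `register` is never called for a node with a failing check, so the scheduler never sees it, while the report carries enough detail that the node itself does not need to be kept around.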
Confidential Computing support
Many of our HPC and AI customers on Azure have sensitive data in their workloads, ranging from PII to proprietary company data. Azure Confidential Computing encrypts data in memory in hardware-based trusted execution environments and processes it only after the cloud environment is verified. It helps prevent data access by cloud providers, administrators, and users.
With CycleCloud 8.5, customers can now utilize Confidential Computing VMs to extend protection of that information not only for data at rest or in transit, but also for data in use.
By combining Azure CycleCloud to create and manage your HPC clusters with Azure Confidential Computing to protect your sensitive and regulated data while it’s being processed, you can create a secure and scalable HPC environment. Learn more about Azure Confidential Computing here.
Note: This release supports only Platform Managed Keys, not Customer Managed Keys.
Confidential Computing is currently supported only on standalone VMs; VMSS support is still in preview and will be available in Q1 2024.
CLI Python bundling
One common issue customers face while upgrading CycleCloud is that the CLI uses the operating system's Python, which can cause problems on older operating systems, or when a newer operating system is paired with an older version of the CLI. Starting with CycleCloud 8.5, Python will be bundled with the CLI install.
Support for mounting Azure Managed Lustre
Lustre is one of the leading parallel filesystems used in HPC and AI clusters for large-scale MPI workloads. Azure Managed Lustre makes it easy for customers to deploy and manage their Lustre filesystems and integrates with Azure Blob storage for more flexible data placement, data processing, and cost management. You can use Azure Managed Lustre to accelerate HPC jobs. With storage capable of reaching hundreds of GBps, you can scale up your compute clusters and finish jobs in a fraction of the time compared to existing IP storage services.
Soon you will be able to mount your Azure Managed Lustre filesystem in your clusters just like NFS mounts. CycleCloud has prebuilt HPC images with Lustre client packages for Ubuntu 18.04, 20.04, 22.04, and Alma 8.7. You can also download the Lustre client packages from packages.microsoft.com for your desired Linux distribution and kernel version.
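For readers who have not mounted Lustre by hand, the following Python sketch builds the standard Lustre client mount command. It does not show how CycleCloud configures the mount, which is not described here. The function name `lustre_mount_command` is hypothetical, and the default filesystem name `lustrefs`, the `noatime,flock` options, and the example IP address and mount point are assumptions to adjust for your own deployment.

```python
import shlex
from typing import List


def lustre_mount_command(mgs_ip: str, mount_point: str,
                         fs_name: str = "lustrefs",
                         options: str = "noatime,flock") -> List[str]:
    """Build (but do not run) the argv for mounting a Lustre filesystem.

    mgs_ip is the address of the Lustre management service; fs_name and
    options are assumed defaults, not values taken from CycleCloud.
    """
    if not mount_point.startswith("/"):
        raise ValueError("mount_point must be an absolute path")
    source = f"{mgs_ip}@tcp:/{fs_name}"
    return ["mount", "-t", "lustre", "-o", options, source, mount_point]


if __name__ == "__main__":
    # Print the command for inspection; running it requires root privileges
    # and an installed Lustre client.
    print(shlex.join(lustre_mount_command("10.0.0.4", "/mnt/lustre")))
```

Returning an argument list rather than a shell string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.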
Entra ID Support for GUI authentication
CycleCloud already has support for built-in authentication as well as LDAP and Active Directory. Soon customers will also be able to authenticate to the WebUI against Azure Entra ID. This will enable customers to take advantage of the secure access, seamless user experience, and unified identity management that Entra has to offer while logging in to CycleCloud.
Bundled packages in Microsoft package repositories
Until now, the open source scheduler packages used by Azure CycleCloud have been hosted on GitHub as artifacts published with each release. For customers running in locked-down environments, opening up internet access to GitHub can be an issue. These packages will soon be hosted in the official Microsoft package repositories to allow for easier access from more restricted environments.
Try it out
To get started with Azure CycleCloud, you can follow these steps: