This post has been republished via RSS; it originally appeared at: Azure Compute Blog articles.
Recently, the Azure HPC team and some customers have observed that a bug in older versions of the public Linux kernel can be triggered in ways that cause poor or variable performance for HPC applications running on Azure H-series VMs.
Specifically, the bug is in the rhashtable shrink logic and is present in Linux kernel versions 3.10.0-1062.12.1.el7 and below. These kernel versions are the defaults in Linux distributions commonly used for HPC, such as CentOS 7.6 and CentOS 7.7.
As described in this post, this bug causes endless “insert_work” invocations because of repeated calls to rht_deferred_worker(). Thankfully, this kernel bug is fixed in kernel versions 3.10.0-1127.el7 and above.
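The version boundaries above can be turned into a quick check. The following is a minimal sketch, assuming RHEL/CentOS-style release strings such as those printed by `uname -r`; the helper names `release_key` and `classify` are hypothetical. Kernels that fall between the last known-affected and first known-fixed releases are reported as "unknown", because this post does not say which side they fall on.

```python
import platform
import re

# Boundaries taken from the text above.
LAST_AFFECTED = "3.10.0-1062.12.1.el7"
FIRST_FIXED = "3.10.0-1127.el7"


def release_key(release: str) -> tuple:
    """Turn '3.10.0-1062.12.1.el7.x86_64' into a comparable tuple of ints."""
    # Keep only the part before the distro tag, e.g. '3.10.0-1062.12.1'.
    numeric = release.split(".el", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", numeric))


def classify(release: str) -> str:
    """Return 'affected', 'fixed', or 'unknown' for a kernel release string."""
    key = release_key(release)
    if key <= release_key(LAST_AFFECTED):
        return "affected"
    if key >= release_key(FIRST_FIXED):
        return "fixed"
    # Releases between the two boundaries are not covered by this post.
    return "unknown"


if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r`
    print(running, "->", classify(running))
```

Run on a VM, the script prints the running kernel release and its classification.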
More details about this bug can be found at the following resources:
When this bug is encountered, one of the kernel worker threads will consume nearly 100% of a CPU core as shown below:
If you trace the kernel thread, you will see continuous invocations of rht_deferred_worker().
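One way to spot the symptom, a kernel worker thread pinned near 100% of a core, is to filter process listings for busy `kworker` threads. The sketch below parses the output of `ps -eo pid,comm,pcpu`; the function names, the 90% threshold, and the sample listing are all illustrative assumptions, not output captured from an affected VM. Note that `ps` reports CPU usage averaged over the thread's lifetime, so `top` may be a better view of instantaneous load.

```python
import subprocess


def busy_kworkers(ps_output: str, threshold: float = 90.0) -> list:
    """Return (pid, name, cpu%) for kworker threads at or above the threshold."""
    hits = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) < 3:
            continue
        # First field is the PID, last is %CPU; the name may contain spaces.
        pid, name, cpu = parts[0], " ".join(parts[1:-1]), parts[-1]
        if name.startswith("kworker") and float(cpu) >= threshold:
            hits.append((int(pid), name, float(cpu)))
    return hits


def scan_live(threshold: float = 90.0) -> list:
    """Run ps on the local machine and filter its output."""
    result = subprocess.run(
        ["ps", "-eo", "pid,comm,pcpu"],
        capture_output=True, text=True, check=True,
    )
    return busy_kworkers(result.stdout, threshold)


# Made-up listing for illustration only.
SAMPLE = """\
  PID COMMAND         %CPU
    1 systemd          0.0
 1234 kworker/3:1     99.7
 2345 kworker/0:0      0.1
 3456 simpleFoam      98.9
"""

if __name__ == "__main__":
    print(busy_kworkers(SAMPLE))
```

A non-empty result on an otherwise idle VM is a hint to trace that thread, not proof that this particular bug is the cause.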
Impact of this bug on HPC Application Performance:
The performance impact is severe, especially when all CPU cores are in use. As you can see in the following tests on four HC44rs VMs running the CentOS 7.7 HPC image (kernel version 3.10.0-1062.12.1.el7), OpenFOAM performance varies widely from run to run. In contrast, performance is consistent with the CentOS 7.7 HPC image running kernel version 3.10.0-1127.el7, as well as with the CentOS 8.1 HPC image.
Similar behavior can be seen in the MPI RandomAccess results shown below. The CentOS 7.7 HPC image running kernel version 3.10.0-1062.12.1.el7 shows large performance variations, whereas performance is stable and consistent on the CentOS 7.7 HPC image with kernel 3.10.0-1127.el7 and on the CentOS 8.1 HPC image.
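To put a number on run-to-run variation when repeating a benchmark yourself, one common choice is the coefficient of variation (standard deviation divided by mean). This is a minimal sketch of that metric, not the method used to produce the results in this post; the function name is hypothetical.

```python
import statistics


def coefficient_of_variation(run_times: list) -> float:
    """Sample standard deviation divided by the mean of repeated run times."""
    if len(run_times) < 2:
        raise ValueError("need at least two runs to measure variation")
    return statistics.stdev(run_times) / statistics.mean(run_times)
```

Pass it the elapsed times of repeated runs of the same case on the same VMs. Values close to zero indicate consistent performance, while noticeably larger values on an affected kernel would match the variability described above.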
- If you are currently using CentOS 7.6/7.7 HPC images with kernel version 3.10.0-1062.12.1.el7 or below, use one of the following methods to resolve the issue.
1. Update Kernel: The kernel bug fix is included in 3.10.0-1127.el7.x86_64, and this update is available for both CentOS 7.6 and CentOS 7.7. Please do the following to update your kernel:
2. Switch to CentOS 8.1 HPC Image: You can switch to the CentOS 8.1 HPC image, which already includes the kernel bug fix.
3. New CentOS 7.6/7.7 HPC Image: Updates of Azure’s optimized CentOS 7.6/7.7 HPC images are being prepared and will be available soon in the Azure Marketplace.
- CentOS 7.7 HPC Image: OpenLogic:CentOS-HPC:7.7:7.7.2020062600, OpenLogic:CentOS-HPC:7_7-gen2:7.7.2020062601 or newer versions.
- CentOS 7.6 HPC Image: OpenLogic:CentOS-HPC:7.6:7.6.2020062900, OpenLogic:CentOS-HPC:7_6gen2:7.6.2020062901 or newer versions.
- If you are currently using CentOS 8.1 HPC images:
No action is required; you are not impacted by this kernel bug.
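The fixed image versions listed under option 3 can also be checked programmatically, for example when auditing deployment templates. The sketch below assumes the usual `publisher:offer:sku:version` image URN layout and uses only the SKUs and minimum versions named in this post; `image_has_fix` and `version_key` are hypothetical helper names. It returns `None` whenever the URN is outside what this post covers, including the `latest` alias.

```python
# Minimum fixed image version per SKU, as listed in this post.
MIN_FIXED_VERSION = {
    "7.7": "7.7.2020062600",
    "7_7-gen2": "7.7.2020062601",
    "7.6": "7.6.2020062900",
    "7_6gen2": "7.6.2020062901",
}


def version_key(version: str) -> tuple:
    """Turn '7.7.2020062600' into (7, 7, 2020062600) for comparison."""
    return tuple(int(part) for part in version.split("."))


def image_has_fix(urn: str):
    """Return True/False for known SKUs, or None if the URN is not covered."""
    publisher, offer, sku, version = urn.split(":")
    if (publisher, offer) != ("OpenLogic", "CentOS-HPC"):
        return None
    minimum = MIN_FIXED_VERSION.get(sku)
    if minimum is None or version == "latest":
        # Unknown SKU, or an alias that cannot be compared numerically.
        return None
    return version_key(version) >= version_key(minimum)
```

For example, `image_has_fix("OpenLogic:CentOS-HPC:7.7:7.7.2020062600")` returns `True`, while an older version string for the same SKU returns `False`.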