Automate BeeOND Filesystem on Azure CycleCloud Slurm Cluster



 

OVERVIEW

Azure CycleCloud (CC) is an enterprise-friendly tool for orchestrating and managing High-Performance Computing (HPC) environments on Azure. With CycleCloud, users can provision infrastructure for HPC systems, deploy familiar HPC schedulers, deploy/mount filesystems and automatically scale the infrastructure to run jobs efficiently at any scale.  

BeeOND ("BeeGFS On Demand", pronounced like the word "beyond") is a per-job BeeGFS parallel filesystem that aggregates the local NVMe/SSDs of Azure VMs into a single filesystem for the duration of the job (NOTE: this is not persistent storage and only exists as long as the job is running...data needs to be staged into and out of BeeOND).  This provides the performance benefit of a fast, shared job scratch without additional cost as the VM local NVMe/SSD is included in the cost of the VM.  The BeeOND filesystem will utilize the Infiniband fabric of our H-series VMs to provide the highest bandwidth (up to 200Gbps HDR) and lowest latency compared to any other storage option.

 

This blog will describe how to automate a BeeOND filesystem on an Azure CycleCloud orchestrated Slurm cluster.  It will demonstrate how to install and configure BeeOND on the compute nodes using a Cloud-Init script in CycleCloud.  Starting and stopping the BeeOND filesystem for each job is handled by the provided Slurm Prolog and Epilog scripts.  By the end you will be able to add a BeeOND filesystem to your Slurm cluster (NOTE: creating the Slurm cluster itself is outside the scope of this post) with minimal interaction from the users running jobs.

 

 

REQUIREMENTS/VERSIONS:

- CycleCloud server (My CC version is 8.2.2-1902)

- Slurm cluster (My Slurm version is 20.11.7-1 and my CC Slurm release version is 2.6.2)

- Azure H-series Compute VMs (My Compute VMs are HB120rs_v3, each with 2x 900GiB ephemeral NVMe drives)

- My Compute OS is OpenLogic:CentOS-HPC:7_9-gen2:7.9.2022040101

 

 

TL;DR:

- Use this Cloud-Init script in CC to install/configure the compute partition for BeeOND (NOTE: the script is tailored to HBv3 with 2 NVMe drives)

- Provision (i.e. start) the BeeOND filesystem using this Slurm Prolog script

- De-provision (i.e. stop) the BeeOND filesystem using this Slurm Epilog script

 

 

SOLUTION:

  1.  Ensure you have a working CC environment and Slurm cluster
  2.  Copy the cloud-init script from my Git repo and add it to the CC Slurm cluster:

     

2.1:  Select the appropriate cluster from your CC cluster list

2.2:  Click "Edit" to update the cluster settings

2.3:  In the popup window, click "Cloud-init" from the menu

2.4:  Select the partition to which you want to add the BeeOND filesystem (NOTE: the default partition name is hpc but may differ in your cluster)

2.5:  Copy the Cloud-init script from the Git repo and paste it here (a sketch of its core steps is shown below)

2.6:  Save the settings
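
For reference, here is a rough sketch of what the cloud-init does on an HBv3 node. The script in the Git repo is the authoritative version; the device names, BeeGFS repo URL and package name below are assumptions to verify against it and the BeeGFS documentation.

#!/bin/bash
# Cloud-init sketch for BeeOND on HBv3 (2 local NVMe drives) -- illustrative only

# Aggregate the two ephemeral NVMe drives into a single RAID0 device
mdadm --create /dev/md10 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
mkfs.xfs /dev/md10
mkdir -p /mnt/nvme /mnt/beeond
mount /dev/md10 /mnt/nvme

# Install BeeOND and the BeeGFS services it depends on
# (repo URL is illustrative -- use the BeeGFS release that matches your OS)
wget -O /etc/yum.repos.d/beegfs.repo https://www.beegfs.io/release/beegfs_7.3/dists/beegfs-rhel7.repo
yum install -y beeond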

 

3.  SSH to your Slurm scheduler node and re-scale the compute nodes so that Slurm picks up the new settings:

 

 

sudo /opt/cycle/slurm/cyclecloud_slurm.sh remove_nodes
sudo /opt/cycle/slurm/cyclecloud_slurm.sh scale

 

 

 

 

4.  Download the Prolog and Epilog scripts and add them to slurm.conf:

 

 

# Download Prolog/Epilog scripts to /sched
sudo wget -O /sched/slurm_prolog.sh https://raw.githubusercontent.com/themorey/cyclecloud-scripts/main/slurm-beeond/slurm-prolog-beeond.sh
sudo wget -O /sched/slurm_epilog.sh https://raw.githubusercontent.com/themorey/cyclecloud-scripts/main/slurm-beeond/slurm-epilog-beeond.sh
sudo chmod +x /sched/slurm_prolog.sh /sched/slurm_epilog.sh

# add Prolog/Epilog configs to slurm.conf
cat <<EOF | sudo tee -a /sched/slurm.conf
Prolog=/sched/slurm_prolog.sh
Epilog=/sched/slurm_epilog.sh
EOF

# force cluster nodes to re-read the slurm.conf
sudo scontrol reconfig
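
The linked prolog/epilog scripts handle node coordination and logging; conceptually, the prolog writes a nodefile for the job and runs beeond start against the local NVMe RAID, and the epilog runs beeond stop. Below is a minimal sketch of that idea, assuming the nodefile is kept on the shared /sched mount and that only the first node in the allocation drives the start/stop; the scripts in the repo are the ones to actually use.

#!/bin/bash
# slurm_prolog.sh (sketch only -- see the Git repo for the real script)
NODEFILE=/sched/beeond-nodefile-$SLURM_JOB_ID

# Build the list of nodes allocated to this job
scontrol show hostnames "$(squeue -h -j "$SLURM_JOB_ID" -o %N)" > "$NODEFILE"

# Only the first node in the allocation starts the filesystem
[ "$(hostname -s)" = "$(head -n1 "$NODEFILE")" ] || exit 0

# Aggregate each node's /mnt/nvme (local NVMe RAID0) into /mnt/beeond
beeond start -n "$NODEFILE" -d /mnt/nvme -c /mnt/beeond -P

#!/bin/bash
# slurm_epilog.sh (sketch only -- see the Git repo for the real script)
NODEFILE=/sched/beeond-nodefile-$SLURM_JOB_ID

# Only the first node in the allocation stops the filesystem
[ "$(hostname -s)" = "$(head -n1 "$NODEFILE")" ] || exit 0

# Unmount and tear down the per-job filesystem on all nodes
beeond stop -n "$NODEFILE" -L -d -P -c
rm -f "$NODEFILE"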

 

 

 

 

VALIDATE SETUP:

1.  Download the test job from GitHub:

 

 

wget https://raw.githubusercontent.com/themorey/cyclecloud-scripts/main/slurm-beeond/beeond-test.sbatch

 

 

#!/bin/bash
#SBATCH --job-name=beeond
#SBATCH -N 2
#SBATCH -n 100
#SBATCH -p hpc

logdir="/sched/log"
logfile=$logdir/slurm_beeond.log

#echo "$DATE creating Slurm Job $SLURM_JOB_ID nodefile and starting Beeond" >> $logfile 2>&1
#scontrol show hostnames $SLURM_JOB_NODELIST > nodefile-$SLURM_JOB_ID
#beeond start -n /shared/home/$SLURM_JOB_USER/nodefile-$SLURM_JOB_ID -d /mnt/nvme -c /mnt/beeond -P >> $logfile 2>&1

echo "#####################################################################################"
echo "df -h:   "
df -h
echo "#####################################################################################"
echo "#####################################################################################"
echo ""
echo "beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=storage:   "
beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=storage
echo "#####################################################################################"
echo "#####################################################################################"
echo ""
echo "beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=metadata:   "
beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=metadata
echo "#####################################################################################"
echo "#####################################################################################"
echo ""
echo "beegfs-ctl --mount=/mnt/beeond --getentryinfo:   "
beegfs-ctl --mount=/mnt/beeond --getentryinfo /mnt/beeond
echo "#####################################################################################"
echo "#####################################################################################"
echo ""
echo "beegfs-net:   "
beegfs-net
#beeond stop -n /shared/home/$SLURM_JOB_USER/nodefile-$SLURM_JOB_ID -L -d -P -c >> $logfile 2>&1

 

2.  Submit the job:  sbatch beeond-test.sbatch

 

3.  When the job completes you will have an output file in your home directory named slurm-<jobid>.out (e.g. slurm-2.out if the job ID is 2).  A sample job output looks like this:

#####################################################################################
df -h:
Filesystem         Size  Used Avail Use% Mounted on
devtmpfs           221G     0  221G   0% /dev
tmpfs              221G     0  221G   0% /dev/shm
tmpfs              221G   18M  221G   1% /run
tmpfs              221G     0  221G   0% /sys/fs/cgroup
/dev/sda2           30G   20G  9.6G  67% /
/dev/sda1          494M  119M  375M  25% /boot
/dev/sda15         495M   12M  484M   3% /boot/efi
/dev/sdb1          473G   73M  449G   1% /mnt/resource
/dev/md10          1.8T   69M  1.8T   1% /mnt/nvme
10.40.0.5:/sched    30G   33M   30G   1% /sched
10.40.0.5:/shared  100G   34M  100G   1% /shared
tmpfs               45G     0   45G   0% /run/user/20002
beegfs_ondemand    3.5T  103M  3.5T   1% /mnt/beeond
#####################################################################################
#####################################################################################

beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=storage:
jm-hpc-pg0-1 [ID: 1]
jm-hpc-pg0-3 [ID: 2]
#####################################################################################
#####################################################################################

beegfs-ctl --mount=/mnt/beeond --listnodes --nodetype=metadata:
jm-hpc-pg0-1 [ID: 1]
#####################################################################################
#####################################################################################

beegfs-ctl --mount=/mnt/beeond --getentryinfo:
Entry type: directory
EntryID: root
Metadata node: jm-hpc-pg0-1 [ID: 1]
Stripe pattern details:
+ Type: RAID0
+ Chunksize: 512K
+ Number of storage targets: desired: 4
+ Storage Pool: 1 (Default)
#####################################################################################
#####################################################################################

beegfs-net:

mgmt_nodes
=============
jm-hpc-pg0-1 [ID: 1]
   Connections: TCP: 1 (172.17.0.1:9008);

meta_nodes
=============
jm-hpc-pg0-1 [ID: 1]
   Connections: RDMA: 1 (172.16.1.66:9005);

storage_nodes
=============
jm-hpc-pg0-1 [ID: 1]
   Connections: RDMA: 1 (172.16.1.66:9003);
jm-hpc-pg0-3 [ID: 2]
   Connections: RDMA: 1 (172.16.1.76:9003);

 

 

CONCLUSION

Creating a fast parallel filesystem on Azure does not have to be difficult or expensive.  This blog has shown how the installation and configuration of a BeeOND filesystem can be automated for a Slurm cluster (it will also work with other cluster types if the prolog/epilog logic is adapted).  Because this is a non-persistent, per-job scratch, data should live on persistent storage (e.g. NFS, Blob) and be staged to and from the BeeOND mount (/mnt/beeond in these setup scripts) as part of the job script.

 

 

LEARN MORE

Learn more about Azure CycleCloud

Read more about Azure HPC + AI

Take the Azure HPC learning path
