This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

Lustre on Azure

Azure has Lv2 VM instances which feature NVMe disks that can be used for a Lustre filesystem. This is a cost-effective way to provision a high performance filesystem on Azure. The disks are internal to the physical host and do not have the same SLA as premium disk storage but, when coupled with HSM, can provide a fast, on-demand, high performance filesystem.

This guide outlines setting up a Lustre filesystem and PBS cluster with both AzureHPC (scripts that automate deployment using the Azure CLI) and Azure CycleCloud, running the IOR filesystem benchmark, using the HSM capabilities for archival and backup to Azure BLOB Storage, and viewing metrics in Log Analytics.

Provisioning with AzureHPC

First download AzureHPC:

git clone https://github.com/Azure/azurehpc.git

Next, set up the environment for your shell:

source azurehpc/install.sh

Note: this install.sh file should be "sourced" in each bash session where you want to run the azhpc-* commands (alternatively, add the source command to your ~/.bashrc).
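
For the ~/.bashrc route, a guarded version avoids an error in every new shell if the clone is moved or deleted. This is only a sketch: AZHPC_HOME and load_azhpc are hypothetical names, and the default path assumes the repository was cloned into your home directory.

```shell
# Sketch for ~/.bashrc. AZHPC_HOME is a hypothetical variable name; point it
# at wherever you cloned the azurehpc repository.
AZHPC_HOME="${AZHPC_HOME:-$HOME/azurehpc}"

# Source install.sh only if it exists, so a missing clone does not break
# every new shell.
load_azhpc() {
    if [ -f "$AZHPC_HOME/install.sh" ]; then
        . "$AZHPC_HOME/install.sh"
    else
        echo "azurehpc not found in $AZHPC_HOME" >&2
        return 1
    fi
}
```

Add a call to load_azhpc after the function definition to run it in every session, or call it by hand when you need the azhpc-* commands.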

The AzureHPC project has an example with a Lustre filesystem and a PBS cluster. To clone this you can run:

azhpc-init \
    -c $azhpc_dir/examples/lustre_combined \
    -d <new-directory-name>

The example has the following variables that must be set in the config file:

Variable	Description
resource_group	The resource group for the project
storage_account	The storage account for HSM
storage_key	The storage key for HSM
storage_container	The container to use for HSM
log_analytics_lfs_name	The name to use in log analytics
log_analytics_workspace	The log analytics workspace id
log_analytics_key	The log analytics key

Note: Macros exist to get the storage_key using sakey.<storage-account-name>, log_analytics_workspace using laworkspace.<resource-group>.<workspace-name> and log_analytics_key using lakey.<resource-group>.<workspace-name>.
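
As an illustration of how the macros fit together, the sketch below assembles the variables as a comma-separated list of KEY=VALUE pairs. The resource names are placeholders, azhpc_vars is a hypothetical helper, and passing the list to azhpc-init with a -v option is an assumption that should be checked against azhpc-init -h. Editing the config file directly works just as well.

```shell
# Join KEY=VALUE pairs with commas. Passing the result to azhpc-init with -v
# is an assumption; check azhpc-init -h before relying on it.
azhpc_vars() {
    local IFS=,
    echo "$*"
}

# Placeholder names only; none of these resources come from the article.
vars=$(azhpc_vars \
    "resource_group=my-rg" \
    "storage_account=mystorageacct" \
    "storage_key=sakey.mystorageacct" \
    "storage_container=hsm" \
    "log_analytics_lfs_name=lfs" \
    "log_analytics_workspace=laworkspace.my-rg.my-workspace" \
    "log_analytics_key=lakey.my-rg.my-workspace")
echo "$vars"
```

Using the macros keeps the storage key and Log Analytics key out of the config file, since they are looked up by name rather than pasted in.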

Other values in the config file can also be changed, such as the VM SKU or the number of instances to use. This example has a headnode (D16_v3), two compute nodes (D32_v3) and four Lustre nodes (L32_v2). There is also an azurehpc web tool that can be used to view a config file: either click Open and load the file locally or pass a URL, e.g. the lustre_combined example.

Once the config file is set up you can run:

azhpc-build

The progress will be displayed as it runs, e.g.

paul@nuc:~/Microsoft/azurehpc_projects/lustre_test$ azhpc-build 
You have 2 updates available. Consider updating your CLI installation.
Thu  5 Dec 10:45:13 GMT 2019 : Azure account: AzureCAT-TD HPC (f5a67d06-2d09-4090-91cc-e3298907a021)
Thu  5 Dec 10:45:13 GMT 2019 : creating temp dir - azhpc_install_config
Thu  5 Dec 10:45:13 GMT 2019 : creating ssh keys for hpcadmin
Generating public/private rsa key pair.
Your identification has been saved in hpcadmin_id_rsa.
Your public key has been saved in hpcadmin_id_rsa.pub.
The key fingerprint is:
SHA256:sM+Wb0bByl4EoxrLV6TdkLEADSP/Mj0w94xIopH034M paul@nuc
The key's randomart image is:
+---[RSA 2048]----+
| .. ++. .o       |
|...o ...*.       |
|o ..= o=.*       |
| o ooB=*o =      |
|.  .+E*=So .     |
|    +o.++.o      |
|     . .=o       |
|       ...o      |
|         o.      |
+----[SHA256]-----+
Thu  5 Dec 10:45:13 GMT 2019 : creating resource group
Location    Name
----------  -------------------------
westeurope  paul-azurehpc-lustre-test
Thu  5 Dec 10:45:16 GMT 2019 : creating network

Thu  5 Dec 10:45:23 GMT 2019 : creating subnet compute
AddressPrefix    Name     PrivateEndpointNetworkPolicies    PrivateLinkServiceNetworkPolicies    ProvisioningState    ResourceGroup
---------------  -------  --------------------------------  -----------------------------------  -------------------  -------------------------
10.2.0.0/22      compute  Enabled                           Enabled                              Succeeded            paul-azurehpc-lustre-test
Thu  5 Dec 10:45:29 GMT 2019 : creating subnet storage
AddressPrefix    Name     PrivateEndpointNetworkPolicies    PrivateLinkServiceNetworkPolicies    ProvisioningState    ResourceGroup
---------------  -------  --------------------------------  -----------------------------------  -------------------  -------------------------
10.2.4.0/24      storage  Enabled                           Enabled                              Succeeded            paul-azurehpc-lustre-test
Thu  5 Dec 10:45:35 GMT 2019 : creating vmss: compute
Thu  5 Dec 10:45:40 GMT 2019 : creating vm: headnode
Thu  5 Dec 10:45:46 GMT 2019 : creating vmss: lustre
Thu  5 Dec 10:45:52 GMT 2019 : waiting for compute to be created
Thu  5 Dec 10:47:24 GMT 2019 : waiting for headnode to be created
Thu  5 Dec 10:47:26 GMT 2019 : waiting for lustre to be created
Thu  5 Dec 10:48:28 GMT 2019 : getting public ip for headnode
Thu  5 Dec 10:48:29 GMT 2019 : building hostlists
Thu  5 Dec 10:48:33 GMT 2019 : building install scripts
rsync azhpc_install_config to headnode0d5c95.westeurope.cloudapp.azure.com
Thu  5 Dec 10:48:42 GMT 2019 : running the install scripts
Step 0 : install_node_setup.sh (jumpbox_script)
    duration: 20 seconds
Step 1 : disable-selinux.sh (jumpbox_script)
    duration: 1 seconds
Step 2 : nfsserver.sh (jumpbox_script)
    duration: 32 seconds
Step 3 : nfsclient.sh (jumpbox_script)
    duration: 28 seconds
Step 4 : localuser.sh (jumpbox_script)
    duration: 2 seconds
Step 5 : create_raid0.sh (jumpbox_script)
    duration: 21 seconds
Step 6 : lfsrepo.sh (jumpbox_script)
    duration: 1 seconds
Step 7 : lfspkgs.sh (jumpbox_script)
    duration: 221 seconds
Step 8 : lfsmaster.sh (jumpbox_script)
    duration: 25 seconds
Step 9 : lfsoss.sh (jumpbox_script)
    duration: 5 seconds
Step 10 : lfshsm.sh (jumpbox_script)
    duration: 134 seconds
Step 11 : lfsclient.sh (jumpbox_script)
    duration: 117 seconds
Step 12 : lfsimport.sh (jumpbox_script)
    duration: 12 seconds
Step 13 : lfsloganalytics.sh (jumpbox_script)
    duration: 2 seconds
Step 14 : pbsdownload.sh (jumpbox_script)
    duration: 1 seconds
Step 15 : pbsserver.sh (jumpbox_script)
    duration: 61 seconds
Step 16 : pbsclient.sh (jumpbox_script)
    duration: 13 seconds
Step 17 : addmpich.sh (jumpbox_script)
    duration: 4 seconds
Thu  5 Dec 11:00:23 GMT 2019 : cluster ready

Once complete you can connect to the headnode with:

azhpc-connect -u hpcuser headnode

Provisioning with Azure CycleCloud

This section walks you through setting up a Lustre Filesystem and an autoscaling PBSPro cluster where the Lustre client is set up. This uses an Azure CycleCloud project which is available here.

Installing the Lustre Project and Templates

To follow these instructions you will need Azure CycleCloud set up, and you will need to run the commands from a machine with git and the Azure CycleCloud CLI installed.

First checkout the cyclecloud-lfs repository:

git clone https://github.com/edwardsp/cyclecloud-lfs

This repository contains the Azure CycleCloud project and templates. There is an lfs template for the Lustre filesystem and a pbspro-lfs template which is a modified version of the official pbspro template (from here). The pbspro-lfs template is included in the github project to test the Lustre filesystem. Instructions for adding the Lustre client to another template can be found here.

The commands below will upload the project and import the templates to Azure CycleCloud.

cd cyclecloud-lfs
cyclecloud project upload <container>
cyclecloud import_template -f templates/lfs.txt
cyclecloud import_template -f templates/pbspro-lfs.txt

Note: replace <container> with the Azure CycleCloud "locker" to use. You can list your lockers by running cyclecloud locker list.

Once these commands have been run you will be able to see the new templates in your Azure CycleCloud web interface:

Creating the Lustre Cluster

First create the lfs cluster and choose a name:

Note: This name will later be used in the PBS cluster to reference this filesystem.

Click Next to move to the Required Settings. Here you can choose the region and VM types. Choose only an L_v2 instance type. Going beyond L32_v2 is not recommended because the network throughput does not scale linearly beyond this size. All NVMe disks in the virtual machine will be combined in a RAID 0 for the OST.

Choose the Base OS in Advanced Settings. This will determine which version of Lustre to use. The scripts are set up to use the Whamcloud repository for Lustre, where RPMs for Lustre 2.10 are only available up to CentOS 7.6 and RPMs for Lustre 2.12 are available for CentOS 7.7.

Note: Both the server and client Lustre versions need to match.
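
The pairing described above can be written down as a small lookup, which is handy when scripting image choices for both clusters. This is a sketch that encodes only what this section states; lustre_version_for_centos is a hypothetical helper name, and releases the section does not mention are reported as unknown rather than guessed.

```shell
# Map a CentOS 7 minor release to a Lustre release that the text above says
# has RPMs available: 2.10 up to CentOS 7.6, 2.12 for CentOS 7.7.
lustre_version_for_centos() {
    case "$1" in
        7.[0-6]) echo "2.10" ;;
        7.7)     echo "2.12" ;;
        *)       echo "unknown"; return 1 ;;
    esac
}

lustre_version_for_centos 7.6   # prints 2.10
lustre_version_for_centos 7.7   # prints 2.12
```

Running the same lookup for the server image and the client image is a quick way to confirm that both sides will end up on the same Lustre version.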

In Lustre Settings you can choose the Lustre version and number of Additional OSS nodes. The number of OSS nodes is chosen here and cannot be modified without recreating the filesystem.

To use HSM you must enable the checkbox and provide details for a Storage Account, Storage Key and Storage Container. All files in the container that is selected will be imported into Lustre when the filesystem is started.

Note: This only populates the metadata and files are downloaded on-demand as they are accessed. Alternatively, they can be restored using the lfs hsm_restore command.

To use Log Analytics you must enable the checkbox and provide details for the Name, Log Analytics Workspace and Log Analytics Key. The Name is the log name to use for the metrics.

Now click Save and start the cluster.

Creating the PBS Cluster

To test the Lustre filesystem we will create a pbspro-lfs cluster. Name the cluster, select the region, SKUs, autoscale settings and choose a subnet with access to the Lustre cluster. In the Advanced Settings make sure you know which version of CentOS you are using. At the time of writing Cycle CentOS 7 is version 7.6, but you may want to explicitly set the version with a custom image as the Azure CycleCloud version may be updated.

In the Lustre Settings there is a dropdown menu showing the Lustre clusters that are available. Make sure the Lustre Version is correct for the OS that is chosen and matches the Lustre cluster. Finally, choose the path where Lustre will be mounted on all the clients and click Save.

Once the Lustre Cluster is running you can start this cluster.

Lustre Performance

We will use ior to test the performance. Either the AzureHPC or CycleCloud version could be used although commands will change slightly depending on the image and OS version being used. The commands below relate to the lustre_combined AzureHPC example.

First, connect to the headnode:

azhpc-connect -u hpcuser headnode

We will be compiling ior and so this requires the MPI compiler on the headnode:

sudo yum -y install mpich-devel

Now, download and compile ior:

module load mpi/mpich-3.0-x86_64
wget https://github.com/hpc/ior/releases/download/3.2.1/ior-3.2.1.tar.gz
tar zxvf ior-3.2.1.tar.gz
cd ior-3.2.1
./configure --prefix=$HOME/ior
make
make install

Move to the lustre filesystem:

cd /lustre

Create a PBS job file, e.g. run_ior.pbs:

#!/bin/bash

source /etc/profile
module load mpi/mpich-3.0-x86_64

cd $PBS_O_WORKDIR

NP=$(wc -l <$PBS_NODEFILE)
NODES=$(sort -u $PBS_NODEFILE | wc -l)
PPN=$(($NP / $NODES))

TIMESTAMP=$(date +"%Y-%m-%d_%H-%M-%S")

mpirun -np $NP -machinefile $PBS_NODEFILE \
    $HOME/ior/bin/ior -a POSIX -v -z -i 1 -m -d 1 -B -e -F -r -w -t 32m -b 4G \
    -o $PWD/test.$TIMESTAMP \
    | tee ior-${NODES}x${PPN}.$TIMESTAMP.log
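
To see what the NP, NODES and PPN lines in the job script compute, the sketch below runs the same arithmetic on a stand-in node file. The host names are made up, and the file layout (one line per MPI process) is what PBS is assumed to provide in $PBS_NODEFILE when mpiprocs is set.

```shell
# Build a sample node file: one line per MPI process.
# The host names are invented for illustration.
nodefile=$(mktemp)
for host in compute0001 compute0002; do
    for i in 1 2 3 4; do
        echo "$host"
    done
done > "$nodefile"

# Same arithmetic as the job script.
NP=$(wc -l < "$nodefile")               # total MPI processes
NODES=$(sort -u "$nodefile" | wc -l)    # distinct hosts
PPN=$((NP / NODES))                     # processes per node

echo "NP=$NP NODES=$NODES PPN=$PPN"
rm -f "$nodefile"
```

With eight lines across two hosts this gives 8 processes, 2 nodes and 4 processes per node, which is how the log file name ior-${NODES}x${PPN} gets its values.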

An ior benchmark can be submitted as follows:

client_nodes=2
procs_per_node=32
qsub -lselect=${client_nodes}:ncpus=${procs_per_node}:mpiprocs=${procs_per_node},place=scatter:excl run_ior.pbs
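
It helps to know how much data a run will write before submitting it. The sketch below works this out from the ior options in the job script, assuming the default of one segment per process; the function names are hypothetical helpers.

```shell
# Total data written by one run: with -F (file per process) and -b 4G,
# every MPI process writes its own 4 GiB file.
ior_total_gib() {
    local nodes="$1" ppn="$2" block_gib="${3:-4}"
    echo $((nodes * ppn * block_gib))
}

# Number of transfers each process issues: block size / transfer size
# (-b 4G and -t 32m in the job script, both expressed in MiB here).
ior_transfers_per_proc() {
    local block_mib="${1:-4096}" transfer_mib="${2:-32}"
    echo $((block_mib / transfer_mib))
}

ior_total_gib 2 32          # 2 nodes x 32 processes x 4 GiB, prints 256
ior_transfers_per_proc      # 4096 MiB / 32 MiB, prints 128
```

For the submission above, that is 256 GiB written and then read back, so make sure the filesystem has at least that much free space.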

Here are results testing the bandwidth of the Lustre filesystem scaling from 1 to 16 OSS VMs.

In each run the number of client VMs matched the number of OSS VMs, and 32 processes were run on each client VM. Each client VM is a D32_v3 which has an expected bandwidth of 16000 Mbps (see here) and each OSS VM is an L32_v2 which has an expected bandwidth of 12800 Mbps (see here). This should mean that a single client is able to saturate the bandwidth of one OSS. The max network is the expected bandwidth of one OSS multiplied by the number of OSS VMs.
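
The max network figure can be reproduced with a few lines of arithmetic. The sketch below uses the two bandwidth numbers quoted above; the conversion to MB/s (8 bits per byte, decimal megabytes) is an assumption about the units you want to compare against, and max_network_mbytes is a hypothetical helper name.

```shell
# Expected network bandwidth per VM, in Mbps, as quoted in the text.
OSS_MBPS=12800       # L32_v2
CLIENT_MBPS=16000    # D32_v3

# Aggregate ceiling in MB/s for a given number of OSS VMs.
max_network_mbytes() {
    local oss_count="$1"
    echo $((oss_count * OSS_MBPS / 8))
}

for n in 1 2 4 8 16; do
    echo "$n OSS: $(max_network_mbytes "$n") MB/s"
done

# One client has more network bandwidth than one OSS, so with equal
# numbers of clients and OSS VMs the OSS side is the expected limit.
if [ "$CLIENT_MBPS" -ge "$OSS_MBPS" ]; then
    echo "client bandwidth is not the bottleneck"
fi
```

This gives a ceiling of 1600 MB/s for a single OSS and 25600 MB/s for sixteen.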

Using HSM

The AzureHPC examples and the Azure CycleCloud templates set up HSM on Lustre and import the storage container when the filesystem is created. Only the metadata is read, so files are downloaded on demand as they are accessed. Apart from these on-demand downloads, archival is not automatic: the other HSM commands must be run explicitly.

The copytool for Azure is available here. This copytool supports users, groups and UNIX file permissions, which are added as metadata to the files stored in Azure BLOB storage.

HSM Commands

The HSM actions are available with the lfs command. All the commands below will work with multiple files as arguments.

Archive

The lfs hsm_archive command will copy the file to Azure BLOB storage. Example usage:

sudo lfs hsm_archive myfile

Release

The lfs hsm_release command will release an archived file from the Lustre file system. It will no longer take up space in Lustre but it will still appear to be in the filesystem and when opened it will be re-downloaded. Example usage:

sudo lfs hsm_release myfile

Remove

The lfs hsm_remove command will delete an archived file from the archive.

State

The lfs hsm_state command shows the state of the file in the filesystem. This is output for a file that isn't archived:

$ sudo lfs hsm_state myfile 
myfile: (0x00000000)

This is output for a file that is archived:

$ sudo lfs hsm_state myfile 
myfile: (0x00000009) exists archived, archive_id:1

This is output for a file that is archived and released (i.e. in storage but not taking up space in the filesystem):

$ sudo lfs hsm_state myfile 
myfile: (0x0000000d) released exists archived, archive_id:1
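
The hexadecimal value in brackets is a bit field, and the words after it are the decoded flags. The sketch below decodes a value by hand; the bit assignments are taken from my reading of the HSM state flags in the Lustre user header and should be treated as an assumption, and decode_hsm_flags is a hypothetical helper.

```shell
# Decode the flag word printed by lfs hsm_state.
# Bit values are assumed from the Lustre HSM state flags.
decode_hsm_flags() {
    local flags names
    flags=$(( $1 ))
    names=""
    if [ $((flags & 0x04)) -ne 0 ]; then names="$names released"; fi
    if [ $((flags & 0x01)) -ne 0 ]; then names="$names exists"; fi
    if [ $((flags & 0x02)) -ne 0 ]; then names="$names dirty"; fi
    if [ $((flags & 0x08)) -ne 0 ]; then names="$names archived"; fi
    if [ $((flags & 0x10)) -ne 0 ]; then names="$names never_release"; fi
    if [ $((flags & 0x20)) -ne 0 ]; then names="$names never_archive"; fi
    if [ $((flags & 0x40)) -ne 0 ]; then names="$names lost_from_hsm"; fi
    names="${names# }"          # drop the leading space
    echo "${names:-none}"
}

decode_hsm_flags 0x0000000d     # prints: released exists archived
decode_hsm_flags 0x00000000     # prints: none
```

Reading the value this way makes it easy to tell a released file (bit 0x04 set) from one that is archived but still held in Lustre.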

Action

The lfs hsm_action command displays the current HSM request for a given file. This is most useful when checking the progress on files being archived or restored. When there is no ongoing or pending HSM request it will display NOOP for the file.

Rehydrating the whole filesystem from BLOB storage

In certain cases you may want to restore all the released (or imported) files into the filesystem. This would be better in cases where you know all the files will be required and so the application will not wait while each file is retrieved separately. This can be started with the following command:

cd <lustre_root>
find . -type f -print0 | xargs -r0 -L 50 sudo lfs hsm_restore
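
The command above requests a restore for every file, including files that are already present in Lustre. If you would rather submit requests only for released files, one option is to filter the lfs hsm_state output first. This is a sketch: released_files is a hypothetical helper, and it assumes the output format shown in the State section, with "released" as the first flag after the hexadecimal value.

```shell
# Read lfs hsm_state output on stdin and print the names of files whose
# state starts with "released", i.e. files that actually need restoring.
released_files() {
    sed -n 's/^\(.*\): (0x[0-9a-fA-F]*) released.*$/\1/p'
}

# Sample lines in the format shown in the State section.
printf '%s\n' \
    'a.dat: (0x0000000d) released exists archived, archive_id:1' \
    'b.dat: (0x00000000)' \
    | released_files            # prints: a.dat

# On a real filesystem the filter would sit between the two lfs calls
# (file names containing newlines are not handled):
#   find . -type f -print0 | xargs -r0 sudo lfs hsm_state \
#       | released_files | xargs -r -d '\n' sudo lfs hsm_restore
```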

The progress of the files can be checked with sudo lfs hsm_action and to find how many files are left to be restored the following command can be used:

cd <lustre_root>
find . -type f -print0 \
    | xargs -r0 -L 50 sudo lfs hsm_action \
    | grep -v NOOP \
    | wc -l
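
The NOOP filter only makes sense on lfs hsm_action output, which reports NOOP for files with no pending request. The sketch below shows the counting step on its own with sample input; count_pending is a hypothetical helper, and the RESTORE line is illustrative rather than verbatim lfs output.

```shell
# Count lines of lfs hsm_action output that report an active or pending
# request, i.e. anything other than NOOP. grep exits non-zero when the
# count is 0, so "|| true" keeps the function usable under "set -e".
count_pending() {
    grep -vc NOOP || true
}

# Illustrative input: two idle files and one still being restored.
printf '%s\n' \
    'a.dat: NOOP' \
    'b.dat: RESTORE waiting' \
    'c.dat: NOOP' \
    | count_pending             # prints: 1
```

Wrapping the full pipeline in watch, or in a loop that sleeps until the count reaches 0, gives a simple way to wait for rehydration to finish before starting a job.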

Viewing Lustre Metrics in Log Analytics

Each Lustre VM will log the following metrics every sixty seconds if Log Analytics is enabled:

Load average
Kilobytes free
Network bytes sent
Network bytes received

The data can be viewed in the portal by selecting Monitor and then Logs. Here is an example query:

<log-name>_CL
| summarize max(loadavg_d),max(bytessend_d),max(bytesrecv_d) by bin(TimeGenerated,1m), hostname_s
| render timechart

Note: replace <log-name> with the name you chose.