Lustre on Azure


 

Azure has Lv2 VM instances which feature NVMe disks that can be used for a Lustre filesystem. This is a cost-effective way to provision a high performance filesystem on Azure. The disks are internal to the physical host and do not have the same SLA as premium disk storage but, when coupled with HSM, they can provide a fast, on-demand, high performance filesystem.

 

This guide outlines setting up a Lustre filesystem and a PBS cluster with both AzureHPC (scripts that automate deployment using the Azure CLI) and Azure CycleCloud, running the IOR filesystem benchmark, using the HSM capabilities for archival and backup to Azure BLOB Storage, and viewing metrics in Log Analytics.

 

Provisioning with AzureHPC

 

First download AzureHPC:

git clone https://github.com/Azure/azurehpc.git

 

Next, set up the environment for your shell:

source azurehpc/install.sh

 

Note: this install.sh file should be "sourced" in each bash session where you want to run the azhpc-* commands (alternatively, add it to your ~/.bashrc).
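
For example, to source it automatically in every new shell (a one-line sketch; adjust the path to wherever you cloned the repository):

echo "source ~/azurehpc/install.sh" >> ~/.bashrc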

 

The AzureHPC project has an example with a Lustre filesystem and a PBS cluster. To clone this you can run:

azhpc-init \
    -c $azhpc_dir/examples/lustre_combined \
    -d <new-directory-name>

 

The example has the following variables that must be set in the config file:

 

  • resource_group: The resource group for the project
  • storage_account: The storage account for HSM
  • storage_key: The storage key for HSM
  • storage_container: The container to use for HSM
  • log_analytics_lfs_name: The name to use in Log Analytics
  • log_analytics_workspace: The Log Analytics workspace id
  • log_analytics_key: The Log Analytics key

 

Note: Macros exist to get the storage_key using sakey.<storage-account-name>, log_analytics_workspace using laworkspace.<resource-group>.<workspace-name> and log_analytics_key using lakey.<resource-group>.<workspace-name>.
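As an illustration, the variables section of the config file might look like the sketch below (all names and values here are placeholders; see the lustre_combined example in the repository for the exact layout):

"variables": {
    "resource_group": "my-lustre-rg",
    "storage_account": "mylustresa",
    "storage_key": "sakey.mylustresa",
    "storage_container": "lustre-hsm",
    "log_analytics_lfs_name": "lustre",
    "log_analytics_workspace": "laworkspace.my-lustre-rg.my-workspace",
    "log_analytics_key": "lakey.my-lustre-rg.my-workspace"
}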

Other variables set the VM SKU and the number of instances to use. This example has a headnode (D16_v3), two compute nodes (D32_v3) and four Lustre nodes (L32_v2). There is also an azurehpc web tool that can be used to view a config file - either click Open and load a file locally or pass a URL, e.g. the lustre_combined example.

 

[Screenshot: the azurehpc web tool displaying the lustre_combined config]

 

Once the config file is setup you can run:

 

azhpc-build

 

The progress will be displayed as it runs, e.g.

 

paul@nuc:~/Microsoft/azurehpc_projects/lustre_test$ azhpc-build 
You have 2 updates available. Consider updating your CLI installation.
Thu  5 Dec 10:45:13 GMT 2019 : Azure account: AzureCAT-TD HPC (f5a67d06-2d09-4090-91cc-e3298907a021)
Thu  5 Dec 10:45:13 GMT 2019 : creating temp dir - azhpc_install_config
Thu  5 Dec 10:45:13 GMT 2019 : creating ssh keys for hpcadmin
Generating public/private rsa key pair.
Your identification has been saved in hpcadmin_id_rsa.
Your public key has been saved in hpcadmin_id_rsa.pub.
The key fingerprint is:
SHA256:sM+Wb0bByl4EoxrLV6TdkLEADSP/Mj0w94xIopH034M paul@nuc
The key's randomart image is:
+---[RSA 2048]----+
| .. ++. .o       |
|...o ...*.       |
|o ..= o=.*       |
| o ooB=*o =      |
|.  .+E*=So .     |
|    +o.++.o      |
|     . .=o       |
|       ...o      |
|         o.      |
+----[SHA256]-----+
Thu  5 Dec 10:45:13 GMT 2019 : creating resource group
Location    Name
----------  -------------------------
westeurope  paul-azurehpc-lustre-test
Thu  5 Dec 10:45:16 GMT 2019 : creating network

Thu  5 Dec 10:45:23 GMT 2019 : creating subnet compute
AddressPrefix    Name     PrivateEndpointNetworkPolicies    PrivateLinkServiceNetworkPolicies    ProvisioningState    ResourceGroup
---------------  -------  --------------------------------  -----------------------------------  -------------------  -------------------------
10.2.0.0/22      compute  Enabled                           Enabled                              Succeeded            paul-azurehpc-lustre-test
Thu  5 Dec 10:45:29 GMT 2019 : creating subnet storage
AddressPrefix    Name     PrivateEndpointNetworkPolicies    PrivateLinkServiceNetworkPolicies    ProvisioningState    ResourceGroup
---------------  -------  --------------------------------  -----------------------------------  -------------------  -------------------------
10.2.4.0/24      storage  Enabled                           Enabled                              Succeeded            paul-azurehpc-lustre-test
Thu  5 Dec 10:45:35 GMT 2019 : creating vmss: compute
Thu  5 Dec 10:45:40 GMT 2019 : creating vm: headnode
Thu  5 Dec 10:45:46 GMT 2019 : creating vmss: lustre
Thu  5 Dec 10:45:52 GMT 2019 : waiting for compute to be created
Thu  5 Dec 10:47:24 GMT 2019 : waiting for headnode to be created
Thu  5 Dec 10:47:26 GMT 2019 : waiting for lustre to be created
Thu  5 Dec 10:48:28 GMT 2019 : getting public ip for headnode
Thu  5 Dec 10:48:29 GMT 2019 : building hostlists
Thu  5 Dec 10:48:33 GMT 2019 : building install scripts
rsync azhpc_install_config to headnode0d5c95.westeurope.cloudapp.azure.com
Thu  5 Dec 10:48:42 GMT 2019 : running the install scripts
Step 0 : install_node_setup.sh (jumpbox_script)
    duration: 20 seconds
Step 1 : disable-selinux.sh (jumpbox_script)
    duration: 1 seconds
Step 2 : nfsserver.sh (jumpbox_script)
    duration: 32 seconds
Step 3 : nfsclient.sh (jumpbox_script)
    duration: 28 seconds
Step 4 : localuser.sh (jumpbox_script)
    duration: 2 seconds
Step 5 : create_raid0.sh (jumpbox_script)
    duration: 21 seconds
Step 6 : lfsrepo.sh (jumpbox_script)
    duration: 1 seconds
Step 7 : lfspkgs.sh (jumpbox_script)
    duration: 221 seconds
Step 8 : lfsmaster.sh (jumpbox_script)
    duration: 25 seconds
Step 9 : lfsoss.sh (jumpbox_script)
    duration: 5 seconds
Step 10 : lfshsm.sh (jumpbox_script)
    duration: 134 seconds
Step 11 : lfsclient.sh (jumpbox_script)
    duration: 117 seconds
Step 12 : lfsimport.sh (jumpbox_script)
    duration: 12 seconds
Step 13 : lfsloganalytics.sh (jumpbox_script)
    duration: 2 seconds
Step 14 : pbsdownload.sh (jumpbox_script)
    duration: 1 seconds
Step 15 : pbsserver.sh (jumpbox_script)
    duration: 61 seconds
Step 16 : pbsclient.sh (jumpbox_script)
    duration: 13 seconds
Step 17 : addmpich.sh (jumpbox_script)
    duration: 4 seconds
Thu  5 Dec 11:00:23 GMT 2019 : cluster ready

 

Once complete you can connect to the headnode with:

 

azhpc-connect -u hpcuser headnode

 

Provisioning with Azure CycleCloud

 

This section walks you through setting up a Lustre filesystem and an autoscaling PBSPro cluster with the Lustre client installed. This uses an Azure CycleCloud project which is available here.

 

Installing the Lustre Project and Templates

 

To follow these instructions you will need Azure CycleCloud set up, and you should run the commands from a machine with git and the Azure CycleCloud CLI installed.
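
If the CLI has not been configured yet, point it at your Azure CycleCloud server first; it will prompt for the server URL and credentials:

cyclecloud initialize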

 

First checkout the cyclecloud-lfs repository:

 

git clone https://github.com/edwardsp/cyclecloud-lfs

 

This repository contains the Azure CycleCloud project and templates. There is an lfs template for the Lustre filesystem and a pbspro-lfs template, which is a modified version of the official pbspro template (from here). The pbspro-lfs template is included in the GitHub project to test the Lustre filesystem. Instructions for adding the Lustre client to another template can be found here.

 

The commands below will upload the project and import the templates to Azure CycleCloud.

 

cd cyclecloud-lfs
cyclecloud project upload <container>
cyclecloud import_template -f templates/lfs.txt
cyclecloud import_template -f templates/pbspro-lfs.txt

 

Note: replace <container> with the Azure CycleCloud "locker" to use. You can list your lockers by running cyclecloud locker list.

 

Once these commands have been run you will be able to see the new templates in your Azure CycleCloud web interface:

 

[Screenshot: the lfs and pbspro-lfs templates listed in the Azure CycleCloud web interface]

 

Creating the Lustre Cluster

 

First create the lfs cluster and choose a name:

 

[Screenshot: the About page when creating the lfs cluster]

 

Note: This name will later be used in the PBS cluster to reference this filesystem.

 

Click Next to move to the Required Settings. Here you can choose the region and VM types. Only choose an L_v2 instance type, and it is not recommended to go beyond L32_v2 - the network throughput does not scale linearly beyond this size. All NVMe disks in the virtual machine will be combined in a RAID 0 for the OST.
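
For reference, the project's setup scripts assemble the NVMe devices into a single RAID 0 array along these lines (a simplified sketch; the actual script may use different device names and options):

# combine every NVMe namespace into one RAID 0 device for the OST
ndisks=$(ls /dev/nvme*n1 | wc -l)
sudo mdadm --create /dev/md0 --level=0 --raid-devices=$ndisks /dev/nvme*n1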

 

[Screenshot: Required Settings for the lfs cluster]

 

Choose the Base OS in Advanced Settings. This determines which version of Lustre to use. The scripts are set up to use the Whamcloud repository for Lustre, where RPMs for Lustre 2.10 are only available up to CentOS 7.6 and Lustre 2.12 is available for CentOS 7.7.

 

Note: Both the server and client Lustre versions need to match.

 
[Screenshot: Advanced Settings showing the Base OS selection]

 

In Lustre Settings you can choose the Lustre version and number of Additional OSS nodes. The number of OSS nodes is chosen here and cannot be modified without recreating the filesystem.

To use HSM you must enable the checkbox and provide details for a Storage Account, Storage Key and Storage Container. All files in the container that is selected will be imported into Lustre when the filesystem is started.

 

Note: This only populates the metadata; files are downloaded on-demand as they are accessed. Alternatively, they can be restored using the lfs hsm_restore command.
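
For example, to pre-fetch a single imported file rather than waiting for on-demand access (myfile is a placeholder):

sudo lfs hsm_restore myfile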

 

To use Log Analytics you must enable the checkbox and provide details for the Name, Log Analytics Workspace and Log Analytics Key. The Name is the log name to use for the metrics.

 

[Screenshot: Lustre Settings with the HSM and Log Analytics options]

 

Now click Save and start the cluster.

 

Creating the PBS Cluster

 

To test the Lustre filesystem we will create a pbspro-lfs cluster. Name the cluster, select the region, SKUs, autoscale settings and choose a subnet with access to the Lustre cluster. In the Advanced Settings make sure you know which version of CentOS you are using. At the time of writing Cycle CentOS 7 is version 7.6, but you may want to explicitly set the version with a custom image as the Azure CycleCloud image may be updated.

 

In the Lustre Settings there will be a dropdown menu showing the Lustre clusters that are available for you to choose. Make sure the Lustre Version is correct for the OS that is chosen and that it matches the Lustre cluster. Finally, choose the path where Lustre will be mounted on all the clients and click Save.

 

[Screenshot: Lustre Settings for the pbspro-lfs cluster]

 

Once the Lustre Cluster is running you can start this cluster.

 

Lustre Performance

 

We will use ior to test the performance. Either the AzureHPC or CycleCloud version can be used, although the commands will change slightly depending on the image and OS version being used. The commands below relate to the lustre_combined AzureHPC example.

 

First, connect to the headnode:

 

azhpc-connect -u hpcuser headnode

 

We will be compiling ior, so the MPI compiler is required on the headnode:

 

sudo yum -y install mpich-devel

 

Now, download and compile ior:

 

module load mpi/mpich-3.0-x86_64
wget https://github.com/hpc/ior/releases/download/3.2.1/ior-3.2.1.tar.gz
tar zxvf ior-3.2.1.tar.gz
cd ior-3.2.1
./configure --prefix=$HOME/ior
make
make install

 

Move to the lustre filesystem:

 

cd /lustre

 

Create a PBS job file, e.g. run_ior.pbs:

 

#!/bin/bash

source /etc/profile
module load mpi/mpich-3.0-x86_64

cd $PBS_O_WORKDIR

# derive rank counts from the PBS nodefile: total ranks, unique nodes, ranks per node
NP=$(wc -l <$PBS_NODEFILE)
NODES=$(sort -u $PBS_NODEFILE | wc -l)
PPN=$(($NP / $NODES))

TIMESTAMP=$(date +"%Y-%m-%d_%H-%M-%S")

# file-per-process IOR run: 4 GiB per rank in 32 MiB transfers, write then read
mpirun -np $NP -machinefile $PBS_NODEFILE \
    $HOME/ior/bin/ior -a POSIX -v -z -i 1 -m -d 1 -B -e -F -r -w -t 32m -b 4G \
    -o $PWD/test.$TIMESTAMP \
    | tee ior-${NODES}x${PPN}.$TIMESTAMP.log

 

An ior benchmark can be submitted as follows:

 

client_nodes=2
procs_per_node=32
qsub -lselect=${client_nodes}:ncpus=${procs_per_node}:mpiprocs=${procs_per_node},place=scatter:excl run_ior.pbs

 

Here are results testing the bandwidth of the Lustre filesystem scaling from 1 to 16 OSS VMs.

 

[Chart: IOR bandwidth results scaling from 1 to 16 OSS VMs]

 

In each run the same number of client VMs were used as there are OSS VMs, and 32 processes were run on each client VM. Each client VM is a D32_v3, which has an expected bandwidth of 16000 Mbps (see here), and each OSS VM is an L32_v2, which has an expected bandwidth of 12800 Mbps (see here). This means a single client should be able to saturate the bandwidth of one OSS. The max network is the expected bandwidth of the OSS multiplied by the number of OSS VMs.
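
As a worked example, four OSS VMs give a maximum expected network bandwidth of 4 x 12800 Mbps = 51200 Mbps, or roughly 6.4 GB/s.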

 

Using HSM

 

The AzureHPC examples and the Azure CycleCloud templates set up HSM on Lustre and will import the storage container when the filesystem is created. Only the metadata is read, so files will be downloaded on-demand as they are accessed. However, apart from on-demand downloads, archival is not automatic and must be driven by the HSM commands described below.

 

The copytool for Azure is available here. This copytool supports users, groups and UNIX file permissions, which are added as metadata to the files stored in Azure BLOB storage.

 

HSM Commands

 

The HSM actions are available with the lfs command. All the commands below will work with multiple files as arguments.

 

Archive

 

The lfs hsm_archive command will copy the file to Azure BLOB storage. Example usage:

 

sudo lfs hsm_archive myfile
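
To archive many files at once, the same find/xargs pattern used later for the bulk restore works here too (a sketch; the directory is a placeholder):

cd /lustre/results
find . -type f -print0 | xargs -r0 -L 50 sudo lfs hsm_archive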

 

Release

 

The lfs hsm_release command will release an archived file from the Lustre file system. It will no longer take up space in Lustre but it will still appear to be in the filesystem and when opened it will be re-downloaded. Example usage:

 

sudo lfs hsm_release myfile

 

Remove

 

The lfs hsm_remove command will delete an archived file from the archive.
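
Example usage (myfile is a placeholder):

sudo lfs hsm_remove myfile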

 

State

 

The lfs hsm_state command shows the state of the file in the filesystem. This is output for a file that isn't archived:

 

$ sudo lfs hsm_state myfile 
myfile: (0x00000000)

 

This is output for a file that is archived:

 

$ sudo lfs hsm_state myfile 
myfile: (0x0000000d) exists archived, archive_id:1

 

This is output for a file that is archived and released (i.e. in storage but not taking up space in the filesystem):

 

$ sudo lfs hsm_state myfile 
myfile: (0x0000000d) released exists archived, archive_id:1

 

Action

 

The lfs hsm_action command displays the current HSM request for a given file. This is most useful when checking the progress on files being archived or restored. When there is no ongoing or pending HSM request it will display NOOP for the file.
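
Example usage (myfile is a placeholder):

sudo lfs hsm_action myfile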

 

Rehydrating the whole filesystem from BLOB storage

 

In certain cases you may want to restore all the released (or imported) files into the filesystem. This is preferable when you know all the files will be required, so the application does not have to wait while each file is retrieved on demand. This can be started with the following command:

 

cd <lustre_root>
find . -type f -print0 | xargs -r0 -L 50 sudo lfs hsm_restore

 

The progress of the files can be checked with sudo lfs hsm_action, and the following command reports how many files are still waiting to be restored:

 

cd <lustre_root>
find . -type f -print0 \
    | xargs -r0 -L 50 sudo lfs hsm_action \
    | grep -v NOOP \
    | wc -l

 

Viewing Lustre Metrics in Log Analytics

 

Each Lustre VM will log the following metrics every sixty seconds if Log Analytics is enabled:

  • Load average
  • Kilobytes free
  • Network bytes sent
  • Network bytes received

 

The data can be viewed in the portal by selecting Monitor and then Logs. Here is an example query:

 

<log-name>_CL
| summarize max(loadavg_d),max(bytessend_d),max(bytesrecv_d) by bin(TimeGenerated,1m), hostname_s
| render timechart

 

Note: substitute <log-name> for the name you chose.
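
Similar queries can chart the other metrics, for example free space per host. The sketch below assumes the free-space metric is logged in a column called kbfree_d; the exact column name depends on what the collection script sends:

<log-name>_CL
| summarize min(kbfree_d) by bin(TimeGenerated,1m), hostname_s
| render timechart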

 
[Screenshot: Log Analytics timechart of the Lustre metrics]

 
