Running long HPC jobs on Azure with checkpointing using LAMMPS, NAMD and Gromacs

Some HPC simulations run for a long time (days or even weeks) and span multiple nodes. When running these long jobs, the failure rate and availability of the underlying infrastructure can become an issue. One solution for recovering from these failures is checkpointing during the simulation, which provides a restart point in case the simulation is terminated for any reason.

When running compute jobs in Azure, most of the issues known from bare-metal compute nodes remain: hardware can and will fail, memory can have issues, and the InfiniBand network may encounter problems. On top of that, due to security and functional updates, the (virtualization) software stack may not stay static for the duration of the job.

A way to limit the impact of these potential failures is to save intermediate results in such a way that the calculation can be restarted from that point. The creation of such an intermediate result is called creating a checkpoint.

Checkpointing can be implemented at different levels: within the application and at the OS level. If available, checkpointing within the application is usually the most stable option, and many applications support checkpointing natively. The examples described here are LAMMPS, NAMD and Gromacs.

One significant downside of checkpointing is the amount of I/O added to the job during a checkpoint, and the resulting job delay while this data is written. Therefore, a good balance has to be found between the time between checkpoints and the risk of job failure.

Summary

The actual overhead of checkpointing depends on the size of the model in memory and the resulting file size that has to be written to disk. For smaller models this file size can be in the range of hundreds of megabytes, and a single checkpoint per hour may not even be visible in the simulation speed. For larger models it will be in the range of gigabytes, and a reasonably fast I/O system will keep the overhead within a <5% range. Depending on the total runtime of the simulation, it makes sense to implement checkpointing on an hourly up to daily interval.

Cluster setup

In the examples below, an Azure CycleCloud managed IBM LSF cluster is used. All jobs are submitted from a regular user account with a shared home directory hosted on Azure NetApp Files. The base configuration of this cluster was set up using AzureHPC (https://github.com/Azure/azurehpc).

 

In this setup the user's home directory is used for all job storage; this includes job files, input files, output files and checkpoint files. All applications were compiled and installed using the EasyBuild (http://easybuilders.github.io/easybuild/) framework.
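
As an illustration, installing the NAMD build used later in this post with EasyBuild could look like the following sketch. It assumes an EasyBuild module is available on the cluster and that the matching easyconfig exists in the local EasyBuild repository:

# Hypothetical EasyBuild invocation; the easyconfig name mirrors the module
# NAMD/2.13-foss-2018b-mpi that is loaded in the NAMD jobscript below.
module load EasyBuild
eb NAMD-2.13-foss-2018b-mpi.eb --robot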

Even though this post focuses on node failures and preventive measures at the job level, another potential failure point is the LSF master node of the cluster. Under normal circumstances a failure of the master node will not influence a running job. That being said, one of the options is to run a set of master nodes in a high availability (HA) setup.

LAMMPS

This molecular dynamics application is developed at Sandia National Laboratories (https://lammps.sandia.gov/) and is open source. It supports MPI and can run large, long-lasting jobs. LAMMPS supports checkpointing natively; it is enabled with the “restart” command in the input file.

For this example, a (simple) Lennard-Jones based input file is used:

# 3d Lennard-Jones melt

variable        x index 1

variable        y index 1

variable        z index 1

variable        xx equal 400*$x

variable        yy equal 400*$y

variable        zz equal 400*$z

units           lj

atom_style      atomic

lattice         fcc 0.8442

region          box block 0 ${xx} 0 ${yy} 0 ${zz}

create_box      1 box

create_atoms    1 box

mass            1 1.0

velocity        all create 1.44 87287 loop geom

pair_style      lj/cut 2.5

pair_coeff      1 1 1.0 1.0 2.5

neighbor        0.3 bin

neigh_modify    delay 0 every 20 check no

fix             1 all nve

restart         1000 lj.restart

run             100000 every 1000 NULL

 

The jobscript creates a working directory based on the job-id, copies in the executable and the input file, and starts mpirun with the number of cores assigned to the job:

#!/bin/bash

 

mkdir $LSB_JOBID

cd $LSB_JOBID

 

cp ~/apps/lammps-7Aug19-mpi/src/lmp_mpi .

cp ../in.lj .

 

module load foss/2018b

mpirun -n $LSB_DJOB_NUMPROC ./lmp_mpi -in in.lj

 

The job is submitted using LSF’s bsub command, requesting 4 HC nodes for a total of 176 cores:

bsub -q hc-mpi -n 176 -R "span[ptile=44]" -o ./%J.out ./lammpsjob.sh

 

Once the nodes are acquired and started up, the simulation kicks off. The job-id that was given is 25, so we can now look at the job directory to follow the progress. Due to the restart option in the input file, a restart file for all 256 million atoms is written every 1000 simulation steps. In this case this results in a 21GB file being created roughly every half hour.

-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 12:20 lj.restart.1000

-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 12:51 lj.restart.2000

-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 13:22 lj.restart.3000

-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 13:53 lj.restart.4000

-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 14:24 lj.restart.5000

-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 14:56 lj.restart.6000

-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 15:26 lj.restart.7000

 

Now let’s assume that the job exited due to some failure.

To restart the simulation, a modified input file has to be created, mostly because the read_restart and create_box commands conflict: both try to create the simulation box with the atoms. A working restart input file is the following:

# 3d Lennard-Jones melt

read_restart    lj.restart.*

velocity        all create 1.44 87287 loop geom

pair_style      lj/cut 2.5

pair_coeff      1 1 1.0 1.0 2.5

neighbor        0.3 bin

neigh_modify    delay 0 every 20 check no

fix             1 all nve

restart         1000 lj.restart

run             100000 upto every 1000 NULL

 

First, read_restart is defined; it will look for the latest checkpoint file and use it to recreate the atoms. Next, the run command is modified to include upto; this tells the run not to simulate an additional 100000 steps on top of the restart state, but to only run up to the initially defined endpoint.
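
To double-check on the command line which file the wildcard will resolve to, a quick sketch (assuming the lj.restart.<step> naming used above):

# List the restart files in natural numeric order; the last entry is the
# file with the highest timestep, which read_restart lj.restart.* will load.
ls -v lj.restart.* | tail -n 1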

Since we are already in a working directory with all the required information, we also need to change the job submission script to look something like this:

#!/bin/bash

cd ~/jobs/25

module load foss/2018b

mpirun -n $LSB_DJOB_NUMPROC ./lmp_mpi -in in.lj.restart

 

And we can submit this again using LSF’s bsub:

bsub -q hc-mpi -n 176 -R "span[ptile=44]" -o ./%J.out ./restart-lammpsjob.sh

 

This will start a new job that takes the latest checkpoint file and continues the simulation from that checkpoint.

Another way of implementing checkpointing, which uses significantly less storage, is rotating between 2 restart files. In the version above, quite some data is written every half hour to store the checkpoint files. This can be optimized by using the following restart entry in the input file:

restart         1000 lj.restart.1 lj.restart.2

 

When restarting from this state, check which of the two files was written last and reference it explicitly in the read_restart entry of the input file. Also do not forget to add the upto keyword. After resubmitting the simulation, it will continue from the last checkpoint.
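
A minimal helper sketch for this two-file scheme (the file names follow the restart entry above) that picks whichever restart file was written most recently:

#!/bin/bash
# Determine the most recently written of the two rotating restart files.
LATEST=$(ls -t lj.restart.1 lj.restart.2 2>/dev/null | head -n 1)
echo "Latest checkpoint file: ${LATEST}"
# Reference ${LATEST} explicitly in the read_restart line of the restart
# input file, and keep the 'upto' keyword on the run command.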

To quantify the checkpoint overhead, multiple jobs were submitted, each scheduled to simulate 2000 steps, lasting approximately 1 hour and using 176 cores. One job writes a single checkpoint (at 1400 steps), one writes two checkpoints (at 800-step intervals), and one has checkpointing disabled.

Checkpoints per run | Runtime  | Steps | Lost time | Overhead | Extrapolated for 24h job
0                   | 3469 sec |       | 0 sec     | 0%       | 0%
1                   | 3630 sec | 2000  | 161 sec   | 4.6%     | 0.18%
2                   | 3651 sec | 2000  | 182 sec   | 5.2%     | 0.21%

 

The checkpoint overhead for this model is quite small, with ~5% impact for an hourly checkpoint. The checkpoint file size is 21GB, which is written in full during each checkpoint.
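
As a rough sanity check on these numbers (the write throughput below is an assumption, not a measured value): writing a 21GB restart file at roughly 150 MB/s sustained throughput takes on the order of 140 seconds, which is in line with the ~161 seconds of lost time per checkpoint measured above.

# Back-of-the-envelope estimate; 150 MB/s is an assumed sustained write rate.
awk -v size_gb=21 -v mb_per_s=150 'BEGIN { printf "Estimated time per checkpoint: ~%.0f sec\n", size_gb * 1024 / mb_per_s }'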

NAMD

A second example of a molecular dynamics application is NAMD, written and maintained by the University of Illinois (https://www.ks.uiuc.edu/Research/namd/). The application is open source, although registration is required to download it.

NAMD also supports checkpointing natively. For this demonstration the STMV benchmark is used, which models about 1 million atoms.

To enable checkpointing the following lines need to be added to the stmv.namd input file:

outputName          ./output

outputEnergies      20

outputTiming        20

numsteps            21600

restartName         ./checkpoint

restartFreq         100

restartSave         yes

 

And that input file can be used by the following jobscript:

#!/bin/bash

 

mkdir $LSB_JOBID

cd $LSB_JOBID

 

cp ../stmv/* .

 

env

 

module avail

module load NAMD/2.13-foss-2018b-mpi

module list

mpirun -n $LSB_DJOB_NUMPROC namd2 stmv.namd

 

The job is submitted with the following bsub command:

bsub -q hc-mpi -n 176 -R "span[ptile=44]" -o ./%J.out ./namd-mpi-job

 

Once started, this job will create checkpoint files in the working directory:

-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:58 checkpoint.9500.coor

-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:58 checkpoint.9500.vel

-rw-r--r--.  1 ccadmin ccadmin  261 Jan 15 17:58 checkpoint.9500.xsc

-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:58 checkpoint.9600.coor

-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:58 checkpoint.9600.vel

-rw-r--r--.  1 ccadmin ccadmin  261 Jan 15 17:58 checkpoint.9600.xsc

-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:58 checkpoint.9700.coor

-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:58 checkpoint.9700.vel

-rw-r--r--.  1 ccadmin ccadmin  261 Jan 15 17:58 checkpoint.9700.xsc

-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:59 checkpoint.9800.coor

-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:59 checkpoint.9800.vel

-rw-r--r--.  1 ccadmin ccadmin  261 Jan 15 17:58 checkpoint.9800.xsc

-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:59 checkpoint.9900.coor

-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:59 checkpoint.9900.vel

-rw-r--r--.  1 ccadmin ccadmin  258 Jan 15 17:59 checkpoint.9900.xsc

 

If at this point the job fails, the checkpoint files can be used to restart the job. For this, the input file needs to be adjusted. The latest checkpoint files available are for timestep 9900. Two changes are needed in the input file: first, tell the simulation to restart at timestep 9900 and second, read and initialize the simulation from the checkpoint.9900 files (a helper sketch for finding the latest checkpoint step follows the input file below).

outputName          ./output

outputEnergies      20

outputTiming        20

firsttimestep       9900

numsteps            21600

restartName         ./checkpoint

restartFreq         100

restartSave         yes

reinitatoms        ./checkpoint.9900
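
To find the latest checkpoint step automatically before editing the input file, something like the following can be used (a sketch, assuming the checkpoint.<step>.coor/.vel/.xsc naming shown above):

#!/bin/bash
# Extract the step numbers from the checkpoint files and take the highest one.
LATEST=$(ls checkpoint.*.coor | sed 's/^checkpoint\.\([0-9]*\)\.coor$/\1/' | sort -n | tail -n 1)
echo "Set firsttimestep to ${LATEST} and reinitatoms to ./checkpoint.${LATEST}"
# Before restarting, verify that checkpoint.${LATEST}.vel and
# checkpoint.${LATEST}.xsc also exist, so that the checkpoint is complete.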

 

After resubmitting the job with bsub, the simulation continues from the checkpoint and stops at the originally intended step; see the example output:

TCL: Reinitializing atom data from files with basename ./checkpoint.9900

Info: Reading from binary file ./checkpoint.9900.coor

Info: Reading from binary file ./checkpoint.9900.vel

[…]

TIMING: 21600  CPU: 1882.08, 0.159911/step  Wall: 1882.08, 0.159911/step, 0 hours remaining, 1192.902344 MB of memory in use.

ETITLE:      TS           BOND          ANGLE          DIHED          IMPRP               ELECT            VDW       BOUNDARY           MISC        KINETIC               TOTAL           TEMP      POTENTIAL         TOTAL3        TEMPAVG            PRESSURE      GPRESSURE         VOLUME       PRESSAVG      GPRESSAVG

 

ENERGY:   21600    366977.1834    278844.5518     81843.4582      5043.3226       -4523413.8827    385319.1671         0.0000         0.0000    942875.0190       -2462511.1806       296.5584  -3405386.1995  -2454036.3329       296.7428             15.0219        19.7994  10190037.8159        -1.4609        -1.0415

 

WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 21600

WRITING COORDINATES TO RESTART FILE AT STEP 21600

FINISHED WRITING RESTART COORDINATES

WRITING VELOCITIES TO RESTART FILE AT STEP 21600

FINISHED WRITING RESTART VELOCITIES

WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 21600

WRITING COORDINATES TO OUTPUT FILE AT STEP 21600

WRITING VELOCITIES TO OUTPUT FILE AT STEP 21600

====================================================

 

WallClock: 1894.207520  CPUTime: 1894.207520  Memory: 1192.902344 MB

[Partition 0][Node 0] End of program

 

To quantify the checkpoint overhead, multiple jobs were submitted, each lasting approximately 1 hour and using 176 cores. One job writes a single checkpoint at 40000 steps, one writes two checkpoints at 20000-step intervals, and one has checkpointing disabled.

Checkpoints per run | Runtime  | Steps | Lost time | Overhead | Extrapolated for 24h job
0                   | 3300 sec | 72000 | 0 sec     | 0%       | 0%
1                   | 3395 sec | 72000 | 95 sec    | 2.8%     | 0.11%
2                   | 3417 sec | 72000 | 117 sec   | 3.5%     | 0.13%

The checkpoint overhead in this model is ~3%, with a combined checkpoint file size of 50MB.

Gromacs

Another molecular dynamics package is Gromacs. Originally started at the University of Groningen in the Netherlands, it now has its home at Uppsala University in Sweden. It is open source and can be found at http://www.gromacs.org.

The dataset used for this work is the open benchmark benchPEP (https://www.mpibpc.mpg.de/grubmueller/bench). This dataset has ~12 million atoms and is roughly 293 MB in size.

Gromacs supports checkpointing natively and to enable it, use the following option on the mdrun command line:

mpirun gmx_mpi mdrun -s benchPEP.tpr -nsteps -1 -maxh 1.0 -cpt 10

 

The -cpt 10 option will initiate a checkpoint every 10 minutes. This is shown in the log files as:

step  320: timed with pme grid 320 320 320, coulomb cutoff 1.200: 32565.0 M-cycles

step  480: timed with pme grid 288 288 288, coulomb cutoff 1.302: 34005.8 M-cycles

step  640: timed with pme grid 256 256 256, coulomb cutoff 1.465: 43496.5 M-cycles

step  800: timed with pme grid 280 280 280, coulomb cutoff 1.339: 35397.0 M-cycles

step  960: timed with pme grid 288 288 288, coulomb cutoff 1.302: 33611.2 M-cycles

step 1120: timed with pme grid 300 300 300, coulomb cutoff 1.250: 32671.0 M-cycles

step 1280: timed with pme grid 320 320 320, coulomb cutoff 1.200: 31965.1 M-cycles

              optimal pme grid 320 320 320, coulomb cutoff 1.200

Writing checkpoint, step 3200 at Thu Mar 19 08:24:14 2020

 

After killing a single MPI process on one of the machines, the job is stopped:

--------------------------------------------------------------------------

mpirun noticed that process rank 71 with PID 17851 on node ip-0a000214 exited on signal 9 (Killed).

--------------------------------------------------------------------------

 

This leaves us with a state.cpt file that was written during the last checkpoint. To restart from this state file, the following mdrun command can be used:

mpirun gmx_mpi mdrun -s benchPEP.tpr -cpi state.cpt


When starting this, the computation continues from the checkpoint:

step 3520: timed with pme grid 320 320 320, coulomb cutoff 1.200: 31738.8 M-cycles

step 3680: timed with pme grid 288 288 288, coulomb cutoff 1.310: 33970.4 M-cycles

step 3840: timed with pme grid 256 256 256, coulomb cutoff 1.474: 44232.6 M-cycles

step 4000: timed with pme grid 280 280 280, coulomb cutoff 1.348: 36289.3 M-cycles

step 4160: timed with pme grid 288 288 288, coulomb cutoff 1.310: 34398.6 M-cycles

step 4320: timed with pme grid 300 300 300, coulomb cutoff 1.258: 32844.2 M-cycles

step 4480: timed with pme grid 320 320 320, coulomb cutoff 1.200: 32005.9 M-cycles

              optimal pme grid 320 320 320, coulomb cutoff 1.200

Writing checkpoint, step 8080 at Thu Mar 19 09:03:05 2020

 

And the new checkpoint can be seen:

-rw-rw-r--.  1 ccadmin ccadmin 287M Mar 19 09:03 state.cpt

-rw-rw-r--.  1 ccadmin ccadmin 287M Mar 19 09:03 state_prev.cpt
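
In the LSF setup used here, such a restart can also be wrapped in a jobscript analogous to the LAMMPS and NAMD examples. The following is a sketch; the job directory handling and the Gromacs module name are assumptions that depend on the local installation:

#!/bin/bash
# Hypothetical restart jobscript: re-enter the original job directory (passed
# as the first argument) and continue from the last Gromacs checkpoint.
cd ~/jobs/"$1"
module load GROMACS        # assumption: use the module name from the local EasyBuild install
# Keep checkpointing enabled on the restarted run as well (-cpt 10).
mpirun -n $LSB_DJOB_NUMPROC gmx_mpi mdrun -s benchPEP.tpr -cpi state.cpt -cpt 10

It would be submitted with bsub in the same way as the restart jobs shown earlier, with the original job-id passed as an argument.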

 

To verify the checkpoint overhead, multiple jobs were submitted, each lasting 1 hour (using -maxh 1.0) and using 176 cores. One job writes a single checkpoint after 45 minutes, one writes two checkpoints at 20-minute intervals, and one has checkpointing disabled.

Checkpoints per run | Steps | Lost steps | Overhead | Extrapolated for 24h job
0                   | 38000 | 0          | 0%       | 0%
1                   | 38000 | 0          | 0%       | 0%
2                   | 36640 | 1360       | 3.6%     | 0.16%

 

As can be seen, the overhead of a single checkpoint per hour is again very small and not even visible in the application run times. This is mainly due to the relatively small checkpoint file size of 287MB for this model.

Checkpointing interval

As seen from the above results, there is some overhead involved in checkpointing, which varies per application and with the size of the model that has to be written to disk. Thus, a trade-off has to be made between how often to checkpoint and what overhead is acceptable versus the risk of losing computation time due to a failed job.

As a general guideline, I would advise to keep checkpointing disabled for jobs running 1 hour or less. For longer running jobs, it is advisable to enable checkpointing at an interval of 1 to 24 hours, ideally resulting in a checkpoint every 5-10% of the job's expected runtime. For example, if a job is expected to run for 4 days (96 hours), enable a checkpoint every 5-10 hours; if a job is expected to run for 10 days, enable a checkpoint every 12-24 hours. For jobs running even longer, more fine-grained checkpointing is worth considering.
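
As a purely arithmetic illustration of the 5-10% rule of thumb (the 96-hour runtime is just the example from above):

# Suggest a checkpoint interval of 5-10% of the expected job runtime.
awk -v runtime_h=96 'BEGIN { printf "Checkpoint roughly every %.1f to %.1f hours\n", 0.05 * runtime_h, 0.10 * runtime_h }'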

That being said, with less than 5% overhead per hourly checkpoint, it may well be worth it for you to enable checkpointing.
