This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.
Some HPC simulations run for a long time, up to days or weeks, across multiple nodes. For these long jobs, the failure rate and availability of the underlying infrastructure may become an issue. One solution for recovering from such failures is checkpointing during the simulation, to provide a starting point in case the simulation is terminated for any reason.
When running compute jobs in Azure, most issues known from running on bare-metal compute nodes remain. Hardware can and will fail, memory can develop errors, and the InfiniBand network may encounter problems. On top of that, due to security and functional updates, the (virtualization) software stack may not stay static for the duration of the job.
A way to limit the impact of these potential failures is to save intermediate results in such a way that the calculation can be restarted from that result point. Creating such an intermediate result is called creating a checkpoint.
Checkpointing can be implemented at different levels: within the application or at the OS level. Where available, checkpointing within the application is the most stable option, and many applications do support checkpointing natively. Examples of such applications (LAMMPS, NAMD and Gromacs) are described below.
One significant downside of checkpointing is the amount of I/O that is added to the job during a checkpoint, and the resulting job delay while this data is written. Therefore, a good balance has to be struck between the time between checkpoints and the risk of job failure.
Summary
The actual overhead of checkpointing depends on the size of the model in memory and the resulting file size that has to be written to disk. For smaller models this file size can be in the range of hundreds of megabytes, and a single checkpoint per hour may not even be visible in the simulation speed. For larger models the files will be in the range of gigabytes, and a reasonably fast I/O system will keep the overhead below 5%. Depending on the total runtime of the simulation, it makes sense to implement checkpointing on an hourly up to daily interval.
Cluster setup
In the examples below, an Azure CycleCloud managed IBM LSF cluster is used. All jobs are submitted from a regular user account with a shared home directory hosted on Azure NetApp Files. The base configuration of this cluster was set up using AzureHPC (https://github.com/Azure/azurehpc).
In this setup the user's home directory is used for all job storage: job files, input files, output files and checkpoint files. All applications were compiled and installed using the EasyBuild framework (http://easybuilders.github.io/easybuild/).
Even though this article focuses on node failures and job-level measures against them, another potential failure point is the LSF master node of the cluster. Under normal circumstances a failure of the master node will not affect a running job. That being said, one of the options is a high availability (HA) setup with a set of master nodes.
LAMMPS
This molecular dynamics application was written at Sandia National Laboratories (https://lammps.sandia.gov/) and is open source. It supports MPI and can run large, long-lasting jobs. LAMMPS supports checkpointing natively; it can be enabled by using the “restart” command in the input file.
For this example, a (simple) Lennard-Jones based input file is used:
# 3d Lennard-Jones melt
variable x index 1
variable y index 1
variable z index 1
variable xx equal 400*$x
variable yy equal 400*$y
variable zz equal 400*$z
units lj
atom_style atomic
lattice fcc 0.8442
region box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box 1 box
create_atoms 1 box
mass 1 1.0
velocity all create 1.44 87287 loop geom
pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5
neighbor 0.3 bin
neigh_modify delay 0 every 20 check no
fix 1 all nve
restart 1000 lj.restart
run 100000 every 1000 NULL
The jobscript creates a working directory based on the job ID, copies in the executable and the input file, and starts an mpirun with the number of cores assigned to the job:
#!/bin/bash
mkdir $LSB_JOBID
cd $LSB_JOBID
cp ~/apps/lammps-7Aug19-mpi/src/lmp_mpi .
cp ../in.lj .
module load foss/2018b
mpirun -n $LSB_DJOB_NUMPROC ./lmp_mpi -in in.lj
The job is submitted using LSF’s bsub command and 4 HC nodes are requested, using a total of 176 cores:
bsub -q hc-mpi -n 176 -R "span[ptile=44]" -o ./%J.out ./lammpsjob.sh
Once the nodes are acquired and started up, the simulation kicks off. The job ID that was assigned is 25, so we can now look at the job directory to follow the progress. Due to the restart option in the input file, every 1000 simulation steps a checkpoint file is written for all 256 million atoms. In this case this results in a 21 GB file being created roughly every half hour.
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 12:20 lj.restart.1000
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 12:51 lj.restart.2000
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 13:22 lj.restart.3000
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 13:53 lj.restart.4000
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 14:24 lj.restart.5000
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 14:56 lj.restart.6000
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 15:26 lj.restart.7000
Now let’s assume that the job exited due to some failure.
To restart the simulation, a modified input file has to be made, mostly because the read_restart and create_box commands conflict: both try to create the simulation box with its atoms. A working restart input file is the following:
# 3d Lennard-Jones melt
read_restart lj.restart.*
velocity all create 1.44 87287 loop geom
pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5
neighbor 0.3 bin
neigh_modify delay 0 every 20 check no
fix 1 all nve
restart 1000 lj.restart
run 100000 upto every 1000 NULL
First, read_restart is defined; it will look for the latest checkpoint file and use it to recreate the atoms. Next, the run command is modified to include upto; this tells the run not to simulate an additional 100000 steps on top of the restart state, but to only run up to the initially defined endpoint.
Since the working directory already contains all the required information, we also need to change the job submission script, to look something like this:
#!/bin/bash
cd ~/jobs/25
module load foss/2018b
mpirun -n $LSB_DJOB_NUMPROC ./lmp_mpi -in in.lj.restart
And we can submit this again using LSF’s bsub:
bsub -q hc-mpi -n 176 -R "span[ptile=44]" -o ./%J.out ./restart-lammpsjob.sh
This will start a new job that takes the latest checkpoint file and continues the simulation from that checkpoint.
Another way of implementing checkpointing, which uses significantly less storage, is to alternate between 2 restart files. In the version above, a considerable amount of data is written every half hour to keep all checkpoint files. This can be optimized by using the following restart entry in the input file:
restart 1000 lj.restart.1 lj.restart.2
When restarting from this state, check which of the two files was written last and reference that file specifically in the read_restart entry of the input file. Also do not forget to add the upto keyword. After resubmitting the simulation, it will continue from the last checkpoint.
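Selecting the most recently written of the two files can be scripted instead of checked by hand; a minimal sketch (the file names match the restart entry above):

```shell
#!/bin/bash
# Pick the newer of the two alternating restart files. ls -t sorts by
# modification time, newest first; 2>/dev/null hides a missing file in
# case only one checkpoint has been written so far.
LATEST=$(ls -t lj.restart.1 lj.restart.2 2>/dev/null | head -n 1)
echo "read_restart ${LATEST}"
```

The printed read_restart line (together with the upto keyword on the run command) then goes into the restart input file.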
To quantify the checkpoint overhead, multiple jobs were submitted, each scheduled to simulate 2000 steps, lasting approximately 1 hour and using 176 cores. One of the jobs writes a single checkpoint at step 1400, one writes 2 checkpoints (every 800 steps), and one has no checkpointing enabled.
| Checkpoints/run | Runtime | Steps | Lost time | Overhead | Extrapolated for 24h job |
|---|---|---|---|---|---|
| 0 | 3469 sec | | 0 | 0% | 0% |
| 1 | 3630 sec | 2000 | 161 sec | 4.6% | 0.18% |
| 2 | 3651 sec | 2000 | 182 sec | 5.2% | 0.21% |
The checkpoint overhead for this model is quite small, with ~5% impact for hourly checkpoints. The combined checkpoint file size, written during each checkpoint, is 21 GB.
NAMD
A second example of a molecular dynamics application is NAMD, written and maintained at the University of Illinois (https://www.ks.uiuc.edu/Research/namd/). This application is also open source, although registration is required to download it.
NAMD also supports checkpointing natively. For this demonstration the STMV benchmark is used, which models about 1 million atoms.
To enable checkpointing the following lines need to be added to the stmv.namd input file:
outputName ./output
outputEnergies 20
outputTiming 20
numsteps 21600
restartName ./checkpoint
restartFreq 100
restartSave yes
And that input file can be used by the following jobscript:
#!/bin/bash
mkdir $LSB_JOBID
cd $LSB_JOBID
cp ../stmv/* .
env
module avail
module load NAMD/2.13-foss-2018b-mpi
module list
mpirun -n $LSB_DJOB_NUMPROC namd2 stmv.namd
The job is submitted with the following LSF submit script:
bsub -q hc-mpi -n 176 -R "span[ptile=44]" -o ./%J.out ./namd-mpi-job
Once started, this job will create checkpoint files in the working directory:
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:58 checkpoint.9500.coor
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:58 checkpoint.9500.vel
-rw-r--r--. 1 ccadmin ccadmin 261 Jan 15 17:58 checkpoint.9500.xsc
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:58 checkpoint.9600.coor
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:58 checkpoint.9600.vel
-rw-r--r--. 1 ccadmin ccadmin 261 Jan 15 17:58 checkpoint.9600.xsc
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:58 checkpoint.9700.coor
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:58 checkpoint.9700.vel
-rw-r--r--. 1 ccadmin ccadmin 261 Jan 15 17:58 checkpoint.9700.xsc
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:59 checkpoint.9800.coor
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:59 checkpoint.9800.vel
-rw-r--r--. 1 ccadmin ccadmin 261 Jan 15 17:58 checkpoint.9800.xsc
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:59 checkpoint.9900.coor
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:59 checkpoint.9900.vel
-rw-r--r--. 1 ccadmin ccadmin 258 Jan 15 17:59 checkpoint.9900.xsc
If at this point the job fails, the checkpoint files can be used to restart it. For this, the input file needs to be adjusted. Looking at the latest checkpoint files available, the last saved timestep is 9900. Two changes to the input file are needed: first, tell the simulation to restart at timestep 9900, and second, read and initialize the simulation from the checkpoint.9900 files.
outputName ./output
outputEnergies 20
outputTiming 20
firsttimestep 9900
numsteps 21600
restartName ./checkpoint
restartFreq 100
restartSave yes
reinitatoms ./checkpoint.9900
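Finding the latest checkpoint step by hand gets error-prone once many checkpoints exist; the two restart lines can instead be generated with a short script (a sketch; the file naming matches the listing above):

```shell
#!/bin/bash
# Extract the step numbers from the checkpoint .coor files, sort them
# numerically (so e.g. 10000 beats 9900) and keep the highest; then print
# the two lines that have to be added to the NAMD input file for a restart.
STEP=$(ls checkpoint.*.coor | sed -n 's/^checkpoint\.\([0-9]*\)\.coor$/\1/p' | sort -n | tail -n 1)
echo "firsttimestep ${STEP}"
echo "reinitatoms ./checkpoint.${STEP}"
```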
After resubmitting the job with bsub, the simulation continues from the checkpoint and stops at the originally intended step, as the example output shows:
TCL: Reinitializing atom data from files with basename ./checkpoint.9900
Info: Reading from binary file ./checkpoint.9900.coor
Info: Reading from binary file ./checkpoint.9900.vel
[…]
TIMING: 21600 CPU: 1882.08, 0.159911/step Wall: 1882.08, 0.159911/step, 0 hours remaining, 1192.902344 MB of memory in use.
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
ENERGY: 21600 366977.1834 278844.5518 81843.4582 5043.3226 -4523413.8827 385319.1671 0.0000 0.0000 942875.0190 -2462511.1806 296.5584 -3405386.1995 -2454036.3329 296.7428 15.0219 19.7994 10190037.8159 -1.4609 -1.0415
WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 21600
WRITING COORDINATES TO RESTART FILE AT STEP 21600
FINISHED WRITING RESTART COORDINATES
WRITING VELOCITIES TO RESTART FILE AT STEP 21600
FINISHED WRITING RESTART VELOCITIES
WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 21600
WRITING COORDINATES TO OUTPUT FILE AT STEP 21600
WRITING VELOCITIES TO OUTPUT FILE AT STEP 21600
====================================================
WallClock: 1894.207520 CPUTime: 1894.207520 Memory: 1192.902344 MB
[Partition 0][Node 0] End of program
To quantify the checkpoint overhead, multiple jobs were submitted, each lasting approximately 1 hour and using 176 cores. One of the jobs writes a single checkpoint at 40000 steps, one writes 2 checkpoints, each at 20000 steps, and one has no checkpointing enabled.
| Checkpoints/run | Runtime | Steps | Lost time | Overhead | Extrapolated for 24h job |
|---|---|---|---|---|---|
| 0 | 3300 sec | 72000 | 0 | 0% | 0% |
| 1 | 3395 sec | 72000 | 95 sec | 2.8% | 0.11% |
| 2 | 3417 sec | 72000 | 117 sec | 3.5% | 0.13% |
The checkpoint overhead in this model is ~3%, with a combined checkpoint file size of 50 MB.
Gromacs
Another molecular dynamics package is Gromacs. Originally started at the University of Groningen in the Netherlands, it now has its home at Uppsala University in Sweden. This is another open source package; Gromacs can be found at http://www.gromacs.org.
The dataset used for this work is the open benchmark benchPEP (https://www.mpibpc.mpg.de/grubmueller/bench). This dataset has ~12 million atoms and is roughly 293 MB in size.
Gromacs supports checkpointing natively and to enable it, use the following option on the mdrun command line:
mpirun gmx_mpi mdrun -s benchPEP.tpr -nsteps -1 -maxh 1.0 -cpt 10
The -cpt 10 option initiates a checkpoint every 10 minutes. This shows up in the log files as:
step 320: timed with pme grid 320 320 320, coulomb cutoff 1.200: 32565.0 M-cycles
step 480: timed with pme grid 288 288 288, coulomb cutoff 1.302: 34005.8 M-cycles
step 640: timed with pme grid 256 256 256, coulomb cutoff 1.465: 43496.5 M-cycles
step 800: timed with pme grid 280 280 280, coulomb cutoff 1.339: 35397.0 M-cycles
step 960: timed with pme grid 288 288 288, coulomb cutoff 1.302: 33611.2 M-cycles
step 1120: timed with pme grid 300 300 300, coulomb cutoff 1.250: 32671.0 M-cycles
step 1280: timed with pme grid 320 320 320, coulomb cutoff 1.200: 31965.1 M-cycles
optimal pme grid 320 320 320, coulomb cutoff 1.200
Writing checkpoint, step 3200 at Thu Mar 19 08:24:14 2020
After killing a single process on one of the machines, the job is stopped:
--------------------------------------------------------------------------
mpirun noticed that process rank 71 with PID 17851 on node ip-0a000214 exited on signal 9 (Killed).
--------------------------------------------------------------------------
This leaves us with a state.cpt file that was written during the last checkpoint. To restart from this state file, the following mdrun command can be used:
mpirun gmx_mpi mdrun -s benchPEP.tpr -cpi state.cpt
When starting this, the computation continues from the checkpoint:
step 3520: timed with pme grid 320 320 320, coulomb cutoff 1.200: 31738.8 M-cycles
step 3680: timed with pme grid 288 288 288, coulomb cutoff 1.310: 33970.4 M-cycles
step 3840: timed with pme grid 256 256 256, coulomb cutoff 1.474: 44232.6 M-cycles
step 4000: timed with pme grid 280 280 280, coulomb cutoff 1.348: 36289.3 M-cycles
step 4160: timed with pme grid 288 288 288, coulomb cutoff 1.310: 34398.6 M-cycles
step 4320: timed with pme grid 300 300 300, coulomb cutoff 1.258: 32844.2 M-cycles
step 4480: timed with pme grid 320 320 320, coulomb cutoff 1.200: 32005.9 M-cycles
optimal pme grid 320 320 320, coulomb cutoff 1.200
Writing checkpoint, step 8080 at Thu Mar 19 09:03:05 2020
And the new checkpoint can be seen:
-rw-rw-r--. 1 ccadmin ccadmin 287M Mar 19 09:03 state.cpt
-rw-rw-r--. 1 ccadmin ccadmin 287M Mar 19 09:03 state_prev.cpt
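Because the restart only differs in the -cpi flag, the jobscript can be written to handle both a first start and a restart. A minimal sketch that only builds and prints the command line, so the branch logic stays visible (in a real jobscript the final echo would be replaced by executing the command):

```shell
#!/bin/bash
# Append -cpi only when a checkpoint from a previous run exists, so the
# same jobscript can be resubmitted unchanged after a failure.
CMD="mpirun gmx_mpi mdrun -s benchPEP.tpr -maxh 1.0 -cpt 10"
if [ -f state.cpt ]; then
    CMD="$CMD -cpi state.cpt"
fi
echo "$CMD"
```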
To verify the checkpoint overhead, multiple jobs were submitted, each lasting 1 hour (using -maxh 1.0) and using 176 cores. One of the jobs writes a single checkpoint after 45 minutes, one writes 2 checkpoints at 20-minute intervals, and one has no checkpointing enabled.
| Checkpoints/run | Steps | Lost steps | Overhead | Extrapolated for 24h job |
|---|---|---|---|---|
| 0 | 38000 | 0 | 0% | 0% |
| 1 | 38000 | 0 | 0% | 0% |
| 2 | 36640 | 1360 | 3.6% | 0.16% |
As can be seen, the overhead of a single checkpoint per hour is again very small; it is not even visible in the application runtimes. This is mainly due to the model and its modest checkpoint file size of 287 MB.
Checkpointing interval
As the results above show, checkpointing involves some overhead, which varies per application and with the size of the model that has to be written to disk. A trade-off therefore has to be made between how often to checkpoint (and what overhead is acceptable) and the risk of losing computation time due to a lost job.
As a general guideline, I would advise keeping checkpointing disabled for any job running 1 hour or less. For longer jobs, it is advisable to enable checkpointing at an interval of 1-24 hours, ideally resulting in a checkpoint every 5-10% of the job's expected runtime. E.g. if a job is expected to run for 4 days (96 hours), enable a checkpoint every 5-10 hours; and if a job is expected to run for 10 days, enable a checkpoint every 12-24 hours. For even longer running jobs, more fine-grained checkpointing may be worthwhile.
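The 5-10% rule translates directly into an interval for a given expected runtime; a quick sketch for the 10-day example:

```shell
#!/bin/bash
# Checkpoint interval from the 5-10% guideline: aim for a checkpoint
# every 5% to 10% of the expected total job runtime.
runtime_hours=240   # example: a 10-day job
low=$((runtime_hours * 5 / 100))
high=$((runtime_hours * 10 / 100))
echo "checkpoint every ${low}-${high} hours"
```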
That being said, with less than 5% overhead per hourly checkpoint, it may well be worth it for you to enable checkpointing.