Moving HPC applications to object storage: a first exercise


Hugo Meiland (hugo.meiland@microsoft.com)

Gareth O’Brien (o.gareth@microsoft.com)

 

In this blog post we demonstrate using object storage as the direct back end for running HPC applications on Azure. We expect this to be usable for read-compute-write application flows, where the reads and writes can cover either the full or a partial dataset.

POSIX is important for HPC

Traditionally, HPC applications and workflows rely heavily on POSIX file systems for their data I/O requirements. Due to this strong dependency, storage solutions are an integral part of most on-premises HPC systems. Besides general-purpose NFS-based volumes, more specialized storage technologies like Lustre provide POSIX data access designed to scale with the computational capabilities.

When implementing a traditional HPC architecture in a cloud environment, we see significant optimizations, such as autoscaling the number of online virtual machines (VMs) based on the actual workload. For storage this is more challenging, since capacity is a critical dimension and so is I/O performance. Hence, storage costs are calculated not on actual usage, but on the reserved capacity of the POSIX storage system.

Alternative storage backend

For object storage, this is different. The cost model is based on the actual volume of stored data plus a small overhead for IOPS. The available bandwidth is high and scales per storage account, not with volume size. For reference, a single blob storage account by default already provides roughly twice the bandwidth that a 100 TB volume on Azure NetApp Files or Azure Files can deliver today.

However, the API for object storage is completely different from the POSIX API, so the amount of work needed to migrate HPC applications to object storage can be significant. To better understand this migration path, we investigated the modernization of a Reverse Time Migration (RTM) workflow, an algorithm heavily used in the Energy vertical for seismic imaging. It demands significant compute, large storage, and large-scale resource orchestration. This particular implementation performs its checkpointing (part of the algorithm) in memory and therefore does not need external storage for that.

The demonstration application is built using the Devito framework, which allows just-in-time compilation of the compute-intensive elements for different CPU and GPU architectures. The high-level programming was done in Python, and for the I/O a library was used that implements support for different data types and storage options.

The orchestration of the RTM runs was done on clusters with PBS and Slurm, though more cloud-native schedulers like Azure Batch can also be used.

Libraries for I/O

Fortunately, most application developers use libraries to abstract the I/O. Well-known examples are HDF5 and NetCDF. By adopting these libraries, modernization at the I/O level can be introduced into many applications at once. One actively developed library that is modernizing I/O is Zarr. Zarr supports not only traditional POSIX backends but also object store backends, and its most mature implementation is in Python. The supported object store backends include S3-compatible stores as well as Azure object storage, i.e. Blob Storage.
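What makes this backend flexibility possible is that Zarr treats a store as, in essence, a mapping from chunk keys to byte strings. The sketch below is a stdlib-only illustration of that idea (the class and function names are hypothetical, not the real Zarr API): the same writer code works against an in-memory dict or a directory on a POSIX file system, and a blob container would simply be yet another mapping.

```python
import os
import tempfile
from collections.abc import MutableMapping

class ToyDirectoryStore(MutableMapping):
    """Toy key->bytes store backed by files in a directory (the POSIX case)."""
    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)
    def __getitem__(self, key):
        with open(os.path.join(self.path, key), "rb") as f:
            return f.read()
    def __setitem__(self, key, value):
        with open(os.path.join(self.path, key), "wb") as f:
            f.write(value)
    def __delitem__(self, key):
        os.remove(os.path.join(self.path, key))
    def __iter__(self):
        return iter(sorted(os.listdir(self.path)))
    def __len__(self):
        return len(os.listdir(self.path))

def write_chunks(store, chunks):
    # The writer only sees a mapping; it cannot tell whether the bytes
    # land on a local disk, in memory, or in a blob container.
    for key, data in chunks.items():
        store[key] = data

# Two array chunks, keyed by their chunk index (as Zarr does)
chunks = {"0.0.0": b"\x00" * 16, "0.0.1": b"\x01" * 16}

memory_store = {}                          # stands in for an object store
write_chunks(memory_store, chunks)

posix_store = ToyDirectoryStore(tempfile.mkdtemp())
write_chunks(posix_store, chunks)

assert memory_store["0.0.1"] == posix_store["0.0.1"]
```

This store abstraction is why switching an application from a POSIX directory to Azure Blob is a matter of constructing a different store object, not rewriting the I/O code.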

Migrating from SEG-Y to HDF5 and Zarr

One of the advantages of building an application on a "proper" I/O library is the integration with the rest of the code: NumPy arrays are used to manipulate the data, and in the original SEG-Y implementation quite some code was required to map the data from the NumPy array into the storage model. This path was initially chosen to mimic legacy applications and demonstrate large-scale orchestration, with a view to upgrading the storage to Zarr. Following large-scale tests of the "legacy-like" code, the data structure was changed to Zarr, which allowed the data structures to be copied/mapped in a single line. And since the HDF5 library exposes the same interface, switching over to it did not require any code changes.

 

```python
# Code snippet to allow for object store for data volumes (vo) with metadata
# headers (h). Assumes numpy as np, os, shape_full, shape_header, header and
# the model array v are defined earlier in the script.
if backend == 'hdf5':
    import h5py
    f = h5py.File("data/mytestfile.hdf5", "w")
    volume = np.empty(shape_full, dtype=np.float32)
    vo = f.create_dataset("volume", data=volume)
    h = f.create_dataset("header", data=header)
elif backend == 'zarr':
    import zarr
    store = zarr.DirectoryStore('models/smooth_model.zarr')
    root = zarr.group(store=store, overwrite=True)
    vo = root.zeros('volume', shape=shape_full, dtype=np.float32)
    h = root.zeros('header', shape=shape_header, dtype=np.float32)
elif backend == 'zarr_uncompressed':
    import zarr
    store = zarr.DirectoryStore('models/smooth_model.zarr_uncompressed')
    root = zarr.group(store=store, overwrite=True)
    # compressor=None disables compression
    vo = root.zeros('volume', shape=shape_full, compressor=None, dtype=np.float32)
    h = root.zeros('header', shape=shape_header, compressor=None, dtype=np.float32)
elif backend == 'zarr_blob':
    import zarr
    import azure.storage.blob
    container_client = azure.storage.blob.ContainerClient(
        account_url="https://{}.blob.core.windows.net".format(
            os.getenv("AZURE_STORAGE_ACCOUNT_NAME")),
        container_name='msft-seismic-zarr',
        credential=os.getenv("AZURE_STORAGE_ACCESS_KEY"))
    store = zarr.ABSStore(client=container_client, prefix='vel_model_layered_sm')
    root = zarr.group(store=store, overwrite=True)
    vo = root.zeros('volume', shape=shape_full, dtype=np.float32)
    h = root.zeros('header', shape=shape_header, dtype=np.float32)

# Write the model volume and header into the chosen backend
vo[:, :, :] = v
h[:] = header
```

 

Performance

Since the goal is highly performant code, we looked at the performance impact of these I/O libraries and storage backends in different parts of the workflow. The first step of the workflow produces the initial model volume in the format required by the application. This is a very write-intensive step with almost no compute. In total, five I/O library settings were used, four of them on POSIX: the industry-standard SEG-Y file, HDF5, Zarr uncompressed, and Zarr with Blosc compression. The final test used Zarr with Blosc compression directly on Azure Blob.

Azure NetApp Files (ANF) was used for the NFS storage, with a Standard 4 TB volume that delivers 64 MB/s. The performance was kept relatively low on purpose, to show the impact of changes at the library level.

| Library | SEG-Y | HDF5 | Zarr (none) | Zarr (Blosc) | Zarr (Blosc) on Blob |
|---|---|---|---|---|---|
| Volume size | 23 GB | 21 GB | 20 GB | 136 MB | 136 MB |
| Volume create runtime (sec) | 910 | 885 | 591 | 340 | 337 |

 

As can be seen above, the change from SEG-Y to HDF5 gives only a modest improvement. Moving to Zarr, the improvement is significant at almost 35%. One possible explanation is that Zarr uses multiple files for storage instead of a single one, so the I/O can be parallelized. Zarr also supports compression of the data to further improve the I/O. Since the created volume was very homogeneous, the impact of enabling compression was large, with a speedup of 63%. The final test was done directly on Azure Blob; its performance was on par with compressed Zarr on POSIX.
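The dramatic size reduction in the table (roughly 20 GB down to 136 MB) is plausible precisely because the synthetic model volume is highly homogeneous. Zarr uses Blosc by default, which is not in the Python standard library, so the sketch below uses zlib purely to illustrate how well a homogeneous chunk compresses; the exact ratio is illustrative, not the paper's measurement:

```python
import zlib

# A homogeneous chunk: one float32 value (1.0) repeated, loosely resembling
# a smooth synthetic velocity model. 4 MB uncompressed.
chunk = b"\x00\x00\x80\x3f" * (1024 * 1024)

compressed = zlib.compress(chunk)
ratio = len(chunk) / len(compressed)
print(f"{len(chunk)} -> {len(compressed)} bytes, ratio ~{ratio:.0f}x")
```

Real field data is far less uniform, so compression gains there would be smaller, but the mechanism is the same: fewer bytes written means the write-bound model-creation step finishes faster.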

The second test was based on the imaging step of the workflow. For this step only the Zarr library was tested, with different backends and at different scales (10, 50, and 100 VMs). The I/O pattern in this code is to read a part of the volume plus a 100 MB file with shot data (additional application input). The output is a 3D cube of a few hundred MB, but it can grow with the problem size.
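Reading "a part of the volume" stays cheap on an object store because a chunked format only fetches the objects that overlap the requested slice. The helper below is a hypothetical, stdlib-only sketch of that chunk-selection logic (Zarr performs the equivalent internally):

```python
from itertools import product

def chunks_for_slice(starts, stops, chunk_shape):
    """Return the chunk indices overlapped by the slice [start, stop) per axis."""
    ranges = [
        range(start // size, (stop - 1) // size + 1)
        for start, stop, size in zip(starts, stops, chunk_shape)
    ]
    return list(product(*ranges))

# A 400x400x400 volume stored in 100^3 chunks (64 chunks in total):
# a 120-sample-thick slab only touches 2 of the 4 chunk layers on axis 0.
touched = chunks_for_slice((0, 0, 0), (120, 400, 400), (100, 100, 100))
print(len(touched))  # 32 chunk objects fetched instead of 64
```

Each touched chunk maps to one blob GET, so a partial read downloads only a fraction of the volume, which is consistent with I/O not being the bottleneck in the imaging runs below.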

| Library | 10 VM Zarr (none) | 10 VM Zarr (Blosc) | 10 VM Zarr Blob | 50 VM Zarr Blob | 100 VM Zarr Blob |
|---|---|---|---|---|---|
| Average single-shot runtime (sec) | 1339 | 1338 | 1346 | 1345 | 1337 |

 

As can be seen in the first three tests, the average runtime of the imaging step is now on par between uncompressed POSIX, compressed POSIX, and compressed Blob. The I/O is no longer the bottleneck in this implementation of the RTM, and even scaling up to 50 or 100 VMs in parallel does not influence the runtimes. For this application and I/O pattern at significant scale, Zarr on Blob is clearly a valid option.

Costs

Comparing the cost of moving from POSIX to Blob is not easy, since the pricing models differ. Blob storage costs are based solely on the capacity actually used, with a small overhead for IOPS, whereas the POSIX file system is priced on reserved capacity, where capacity is directly related to performance. All costs referenced below are based on pay-as-you-go list prices for the South Central US region as of Nov 1st, 2022.

The scenario below shows the impact of running on Azure Blob compared to ANF. When comparing the storage costs for a single run, the difference is huge: Blob is just 1.5% of the ANF cost.

 

| | 1x RTM run, 100 GB used | 40x RTM runs, 4 TB used |
|---|---|---|
| ANF | $602 (Standard) | $1608 (Ultra) |
| Blob Hot | $9 | $360 |

 

However, a single run leaves the ANF capacity underutilized. When the reserved ANF capacity is fully used, 40 runs can be done. To complete these 40 runs within a month, ~100 VMs are needed; for that, the Standard ANF bandwidth is on the low side, so we increase bandwidth by moving to the Ultra tier. In this scenario the Blob costs are still significantly lower, at 22% of the ANF costs.

While the differences above are significant, it helps to put them in perspective against the computational costs of the RTM runs. The table below shows that in the best case (40 runs) the storage costs are less than 1% of the compute costs, while for a single run they are still less than 10%.

 

| | 1x RTM run, 100 GB used | 40x RTM runs, 4 TB used |
|---|---|---|
| ANF | $602 (Standard) | $1608 (Ultra) |
| Blob Hot | $9 | $360 |
| Compute (4k shots) | $6177 | $247,080 |

 

Conclusions

With this example we have shown that for certain I/O scenarios, Blob storage is becoming a valid alternative to traditional POSIX storage. For applications that have already abstracted their I/O using libraries like HDF5, the required changes are minimal. For other applications, the investment in abstracting the I/O is worthwhile, since it opens up several avenues for improving it: multi-file layouts, compression, and the choice between POSIX and Blob.

The capability of using Blob storage directly will allow for integration into, e.g., Microsoft Energy Data Services, the Microsoft implementation of the OSDU Data Platform. Through that, it will be possible to run these workloads in a completely cloud-native manner, without the requirement to adhere to POSIX I/O standards.

While storage costs are an important part to optimize, they should also be seen in the big picture: in the worst-case example above (a single RTM run on ANF) the storage costs were still less than 10% of the compute costs. The same arguments apply to other workloads in the Energy sector, like reservoir simulation, where the ratio of compute to I/O may make Blob storage even more cost effective.

 

#AzureHPCAI
