This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.
Overview
Java applications often face startup delays due to runtime initialization and class loading. In the cloud-native era, applications start and stop more frequently and must scale out to accommodate dynamic traffic, which makes this issue even more prominent. CRaC (Coordinated Restore at Checkpoint) addresses this challenge by allowing an application to be checkpointed and later restored, avoiding the lengthy startup after the first initialization. In an experiment with the Spring PetClinic project, we observed a 7x improvement in startup speed after enabling CRaC on Azure Kubernetes Service.
In the final section, we will discuss CRaC's limitations and potential future developments. We welcome your feedback, which will help us continue improving and optimizing Java on Azure. Feel free to share your thoughts in the comments section at the end of this article.
Next, we will walk through how to:
1. Package and containerize a Java application locally.
2. Deploy it to Azure Kubernetes Service (AKS).
3. Utilize CRaC to create a checkpoint.
4. Create a new application to restore from the checkpoint.
5. Compare the startup performance between the original and restored applications.
Packaging a Java Application
Before we deploy our Java application to AKS, we need to package it and create a container image. Follow these steps to clone and package the application:
1. Clone the Repository and build the Application:
For this example, we will use the popular Spring PetClinic project, which can be found on GitHub.
Note that this repo is a fork of the official Spring PetClinic project. The only modification is the addition of the Spring CRaC dependencies.
For more details, please refer to https://docs.spring.io/spring-framework/reference/integration/checkpoint-restore.html
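The clone-and-build step might look like the following sketch. The repository URL is a placeholder because the fork's address is not given here, and building with the Maven wrapper while skipping tests is an assumption rather than the article's exact command.

```shell
# Placeholder: substitute the URL of the PetClinic fork that adds the CRaC dependency.
REPO_URL="https://github.com/<owner>/spring-petclinic.git"

# Clone the repository, enter it, and build the executable jar with the Maven wrapper.
clone_and_build() {
  git clone "$REPO_URL" spring-petclinic &&
  cd spring-petclinic &&
  ./mvnw clean package -DskipTests
}
```

Calling `clone_and_build` leaves the shell inside the project directory, with the jar under `target/`.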
2. Create a Dockerfile:
Create a Dockerfile to define how your application will be containerized. This example uses the Azul Zulu JVM, which offers good support for CRaC. The Java startup parameters also specify the location where the checkpoint image will be stored.
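A minimal sketch of such a Dockerfile is written out below. The base image tag, the jar location, and the `/crac/checkpoint` directory are assumptions chosen for illustration, so pick the tag that matches the Java version your build targets. The `-XX:CRaCCheckpointTo` flag is the JVM option that sets where the checkpoint image is written.

```shell
# Write a minimal Dockerfile (image tag and paths are assumptions, not the original file).
cat > Dockerfile <<'EOF'
# Azul Zulu build of OpenJDK with CRaC support
FROM azul/zulu-openjdk:17-jdk-crac-latest
COPY target/*.jar /app/app.jar
# -XX:CRaCCheckpointTo sets the directory where the checkpoint image is written
ENTRYPOINT ["java", "-XX:CRaCCheckpointTo=/crac/checkpoint", "-jar", "/app/app.jar"]
EOF
```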
3. Build the Docker Image:
Use Docker to build the image from the directory that contains the Dockerfile, for example with `docker build -t <image-name> .`
Creating a Deployment on Azure Kubernetes Service
With the application containerized, we can now deploy it on AKS. Follow these steps:
1. Create an AKS Cluster:
If you don't have an AKS cluster, create one using the Azure CLI:
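As a rough sketch, cluster creation with the Azure CLI could be wrapped as follows. The resource group, region, cluster name, and node count are hypothetical values; replace them with your own.

```shell
# Hypothetical names and region; replace them with your own.
RESOURCE_GROUP="petclinic-rg"
LOCATION="eastus"
CLUSTER_NAME="petclinic-aks"

# Create the resource group and cluster, then merge the cluster
# credentials into the local kubeconfig so kubectl can reach it.
create_cluster() {
  az group create --name "$RESOURCE_GROUP" --location "$LOCATION" &&
  az aks create --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" \
    --node-count 2 --generate-ssh-keys &&
  az aks get-credentials --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME"
}
```

Run `create_cluster` after signing in with `az login`.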
2. Push the Docker Image to Azure Container Registry (ACR):
If you are using **Azure Container Registry**, tag the image and push it to ACR:
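The tag-and-push step might look like this sketch. The registry name and image tag are hypothetical; the `<name>.azurecr.io` form is the standard ACR login server address.

```shell
ACR_NAME="myregistry"              # hypothetical registry name
LOCAL_IMAGE="petclinic-crac:v1"    # hypothetical tag used for the local build
REMOTE_IMAGE="$ACR_NAME.azurecr.io/$LOCAL_IMAGE"

# Log in to the registry, retag the local image, and push it.
push_image() {
  az acr login --name "$ACR_NAME" &&
  docker tag "$LOCAL_IMAGE" "$REMOTE_IMAGE" &&
  docker push "$REMOTE_IMAGE"
}
```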
3. Create an image pull secret for your ACR
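One way to create the secret is sketched below. The secret name and registry address are hypothetical, and the credentials are assumed to be supplied through the `ACR_USERNAME` and `ACR_PASSWORD` environment variables (for example, a service principal with pull rights on the registry).

```shell
ACR_SERVER="myregistry.azurecr.io"   # hypothetical registry login server
SECRET_NAME="acr-pull-secret"        # hypothetical secret name

# ACR_USERNAME and ACR_PASSWORD are expected to be set in the environment.
create_pull_secret() {
  kubectl create secret docker-registry "$SECRET_NAME" \
    --docker-server="$ACR_SERVER" \
    --docker-username="$ACR_USERNAME" \
    --docker-password="$ACR_PASSWORD"
}
```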
4. Create an Azure Files share to mount into the deployment
Because restore speed depends heavily on disk performance, it is highly recommended to use Azure Storage in the same region as the cluster.
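The sketch below creates a storage account and file share, then stores the account credentials in a Kubernetes secret so a pod can mount the share. All names, the region, and the `Standard_LRS` SKU are assumptions for illustration.

```shell
# Hypothetical names; the storage account name must be globally unique.
RESOURCE_GROUP="petclinic-rg"
LOCATION="eastus"
STORAGE_ACCOUNT="petcliniccrac"
SHARE_NAME="crac-checkpoint"

# Create the account and share, then expose the account name and key
# to the cluster as a secret that an azureFile volume can reference.
create_checkpoint_share() {
  az storage account create --name "$STORAGE_ACCOUNT" --resource-group "$RESOURCE_GROUP" \
    --location "$LOCATION" --sku Standard_LRS &&
  STORAGE_KEY=$(az storage account keys list --resource-group "$RESOURCE_GROUP" \
    --account-name "$STORAGE_ACCOUNT" --query "[0].value" --output tsv) &&
  az storage share create --name "$SHARE_NAME" \
    --account-name "$STORAGE_ACCOUNT" --account-key "$STORAGE_KEY" &&
  kubectl create secret generic azure-secret \
    --from-literal=azurestorageaccountname="$STORAGE_ACCOUNT" \
    --from-literal=azurestorageaccountkey="$STORAGE_KEY"
}
```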
5. Create a Kubernetes Deployment:
Create a deployment YAML file (`deployment.yaml`) for your application:
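A minimal sketch of such a file is generated below. The deployment name, image, secret names, share name, and mount path are all illustrative assumptions. The added Linux capabilities are also an assumption: CRaC relies on CRIU underneath, which usually needs extra privileges inside a container, so adjust them to what your cluster allows.

```shell
# Names, image, and paths below are illustrative assumptions.
cat > deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: petclinic
spec:
  replicas: 1
  selector:
    matchLabels:
      app: petclinic
  template:
    metadata:
      labels:
        app: petclinic
    spec:
      imagePullSecrets:
        - name: acr-pull-secret
      containers:
        - name: petclinic
          image: myregistry.azurecr.io/petclinic-crac:v1
          ports:
            - containerPort: 8080
          securityContext:
            capabilities:
              # Assumption: capabilities CRIU typically needs to checkpoint a process
              add: ["CHECKPOINT_RESTORE", "SYS_PTRACE"]
          volumeMounts:
            - name: crac-storage
              mountPath: /crac
      volumes:
        - name: crac-storage
          azureFile:
            secretName: azure-secret
            shareName: crac-checkpoint
            readOnly: false
EOF
```

The volume is mounted at `/crac` so that a checkpoint directory such as `/crac/checkpoint` lands on the file share rather than in the container's own filesystem.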
6. Deploy to AKS:
Apply the deployment to your AKS cluster with `kubectl apply -f deployment.yaml`.
7. Check the startup logs and duration:
In our runs, startup typically took a little over 8 seconds.
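To pull the duration out of the logs, a small helper like the one below can be used. It relies on Spring Boot's `Started <App> in N seconds` log line; the `Restored` variant is included on the assumption that recent Spring Boot versions log a similar line after a CRaC restore, which is worth checking against your own logs.

```shell
# Read logs on stdin and print the duration from the last
# "Started <App> in N seconds" (or "Restored <App> in N seconds") line.
startup_seconds() {
  sed -n -E 's/.*(Started|Restored) .* in ([0-9.]+) seconds.*/\2/p' | tail -n 1
}
```

For example, `kubectl logs deployment/petclinic | startup_seconds`, where `petclinic` is a hypothetical deployment name.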
Creating a Checkpoint with CRaC
With the application running, the next step is to create a checkpoint using CRaC.
1. Create the Checkpoint:
Once the application reaches the desired state (e.g., after it has fully initialized), issue a checkpoint command. CRaC captures the application's state, which can later be restored for fast startups. The checkpoint image is stored on the external volume backed by the Azure Storage file share created earlier.
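One common way to issue the checkpoint is the `JDK.checkpoint` diagnostic command, sent with `jcmd` from inside the container. The sketch below assumes a deployment named `petclinic` and an application jar at `/app/app.jar`; both are hypothetical and should match your own image.

```shell
DEPLOYMENT="petclinic"    # hypothetical deployment name
JAR_PATH="/app/app.jar"   # jar path assumed in the container image

# Ask the running JVM to write a checkpoint to its -XX:CRaCCheckpointTo directory.
create_checkpoint() {
  kubectl exec "deployment/$DEPLOYMENT" -- jcmd "$JAR_PATH" JDK.checkpoint
}
```

Here `jcmd` selects the target JVM by matching the jar path instead of a process ID. The JVM normally exits once the checkpoint has been written, so expect the container to stop afterwards.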
Restoring from the Checkpoint
Now that we have created a checkpoint, we can update the deployment to restore from it for fast startups.
1. Update the deployment to restore from the checkpoint image in AKS:
Modify your deployment YAML to use the restore command when starting the container:
Apply the changes with `kubectl apply -f deployment.yaml`.
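As an alternative to editing and re-applying the file, the same change can be sketched as a JSON patch that overrides the container command. The deployment name and checkpoint directory are hypothetical and must match the values used when the checkpoint was created. `-XX:CRaCRestoreFrom` is the JVM option that restores from a checkpoint directory.

```shell
DEPLOYMENT="petclinic"              # hypothetical deployment name
CHECKPOINT_DIR="/crac/checkpoint"   # must match the -XX:CRaCCheckpointTo path

# Override the container command so the JVM restores instead of starting fresh.
use_restore_command() {
  kubectl patch deployment "$DEPLOYMENT" --type=json -p "[
    {\"op\": \"add\",
     \"path\": \"/spec/template/spec/containers/0/command\",
     \"value\": [\"java\", \"-XX:CRaCRestoreFrom=$CHECKPOINT_DIR\"]}
  ]"
}
```

Setting `command` replaces the image's entrypoint, and changing the pod template triggers a new rollout.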
2. Check startup time
This time, the startup took just over one second!
Performance Comparison
The final step is to compare the startup times of the original and restored versions of the application.
1. Measure Startup Time:
For both the original and restored applications, measure the time from container start to application readiness. Compared to the original startup, which took over 8 seconds, restoring from the checkpoint reduced the startup time to just over 1 second, roughly a 7x improvement. What's more, this significant boost only requires adding the CRaC dependency, without any additional code modifications.
2. Compare Results:
The CRaC-enabled application should start significantly faster because it restores from a pre-initialized checkpoint. To get the most from this, create the checkpoint only after giving your Java application sufficient time to warm up.
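The comparison itself is a simple ratio of the two durations. The helper below is a small sketch that prints it; both arguments are startup times in seconds.

```shell
# Print how many times faster the restored start is: original / restored.
speedup() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1fx\n", before / after }'
}
```

Call it as `speedup <original-seconds> <restored-seconds>` with the two values read from your own logs.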
Conclusion
In this post, we walked through how to leverage CRaC to accelerate the startup of a Java application running on Azure Kubernetes Service. By checkpointing a fully initialized application and restoring it later, we can drastically reduce startup times, improving performance for both cold and warm starts in containerized environments. CRaC is a promising technology, especially in environments where fast application startup is critical, such as serverless platforms or microservices architectures.
For comparison, Spring Native is another way to improve startup performance. It enables developers to compile Spring applications into native binaries using GraalVM, offering extremely fast startup and low memory usage, which is ideal for short-lived, stateless services. CRaC maintains full JVM capabilities, while Spring Native may require code adjustments and has longer build times.
However, as a relatively new technology, CRaC has its own limitations. Spring Boot, Quarkus, and Micronaut currently support CRaC, but many third-party frameworks and libraries still need to be adapted for CRaC compatibility. Additionally, CRaC requires the application to close all open file handles before the checkpoint is captured. You may refer to https://github.com/CRaC/docs/blob/master/fd-policies.md for more details. CRaC also demands that the environment at the time of checkpoint creation closely matches the environment during restore.
We will continue to closely monitor these limitations and work alongside the community to improve its broader applicability.
We would also love to hear your thoughts on this technology. Your feedback will help us improve how Java runs on Azure. Feel free to share your thoughts in the comments section at the end of this article.