Exploring an Automated Testing Strategy for Infrastructure as Code

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.

In the cloud computing world, infrastructure-as-code (IaC) refers to the practice of managing and provisioning infrastructure through code. As someone with a development background, I prefer this to manually provisioning resources using the Azure portal or Azure CLI. My language of choice these days is Bicep, a domain-specific language (DSL) that uses declarative syntax to define and deploy Azure resources. There are other options available, of course, Terraform being a popular one. Using a combination of input parameters, configuration files, etc., it is possible to develop highly reusable and customizable templates that can be used to deploy resources in Azure in a consistent and reproducible manner.

With any code, the first thing you are often advised to think about is how you are going to test it. Of course, real life is never ideal and we are all guilty of writing the code first. We are just prototyping or developing a proof of concept, so basic manual testing is sufficient -- we tell ourselves. In my experience, however, it is a slippery slope, and before you know it, the code that you were just doodling on starts taking on a life of its own. This remains true even as I have moved from application development to infrastructure development. As I started working on small POCs for different projects that deployed Azure resources with Bicep, it quickly became apparent that I needed a better, automated way to test my code. "Oh, it was just working yesterday before I added a fix for ..." is simply unacceptable.

So when we started working on a new accelerator project intended to help customers kickstart their Azure deployments for HPC workflows using Azure Batch, it was obvious that we needed a robust testing strategy in place. This blog post is a summary of the approach we took. By bringing together GitHub Actions and CTest -- a testing framework that is part of the CMake build system -- we have a solution for testing both the deployment of the resources and confirming that the resources work as expected.

Testing plan

First, a quick overview of the code that we want to test. We are only painting in broad strokes here; for those interested in the details of the accelerator project, which we call bacc (derived from Batch Accelerator), please refer to the GitHub repository.

bacc is primarily a Bicep module that deploys Azure resources such as Azure Batch accounts, Azure Storage, etc., based on configuration customizations described in a simplified JSON. The code includes several example deployments that use this module to demonstrate different real-world use cases. Each example deployment also comes with one or more demo applications to demonstrate how to use the deployed resources. For example, the azfinsim-linux example deploys a set of Azure Batch pools and then runs the AzFinSim application on those pools. The AzFinSim application is a simple Monte Carlo simulation application that is used to simulate financial instruments -- a simple example of a typical HPC workload in the financial services industry. The azfinsim-windows example is similar to azfinsim-linux except that it deploys Windows pools instead of Linux pools. bacc also includes a Python-based command line tool that can be used to interact with the deployed resources and run the demo applications. Tasks such as validating a deployment, resizing pools, submitting jobs, etc. are also supported by this tool.

So when it comes to testing, the test plan is fairly straightforward:

For each example deployment, we want to deploy the resources and confirm that the deployment was successful.
Next, we want to validate that each deployment is consistent with the configuration provided, i.e., if the configuration requested two batch pools, we want to verify that the deployment has two batch pools; if access from public networks is disabled, we want to confirm that the resources cannot be accessed from a public network, etc.
On a successfully validated deployment, we next want to run the demo application intended for that deployment and confirm that it executes successfully.
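The three steps above form a pipeline in which each phase only makes sense if the previous one passed. A minimal Python sketch of that ordering follows; the phase names and the `run_test_plan` helper are hypothetical and only illustrate the idea, they are not part of bacc.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PhaseResult:
    name: str
    passed: bool


def run_test_plan(phases: List[Tuple[str, Callable[[], bool]]]) -> List[PhaseResult]:
    """Run phases in order and stop at the first failure.

    Later phases are skipped because validating or running a demo
    on a failed deployment would only produce noise.
    """
    results = []
    for name, phase in phases:
        passed = bool(phase())
        results.append(PhaseResult(name, passed))
        if not passed:
            break
    return results


if __name__ == "__main__":
    # Stand-in callables; a real plan would call Azure here.
    plan = [
        ("deploy", lambda: True),
        ("validate", lambda: True),
        ("run-demo", lambda: True),
    ]
    for result in run_test_plan(plan):
        print(result.name, "passed" if result.passed else "failed")
```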

Since bacc is hosted on GitHub, we can leverage GitHub Actions to automate the testing. GitHub Actions is a CI/CD service that is tightly integrated with GitHub. It allows you to define workflows that can be triggered based on events such as a push to a repository, a pull request etc. The workflows are defined using YAML files that are stored in the repository. For bacc, these are found in the .github/workflows directory.

We decided to use the term test suite to represent the combination of an example deployment and the demo application that is intended to be run on that deployment. For example, the azfinsim-linux test suite is the combination of the azfinsim-linux example deployment and the `azfinsim` demo application. Each test suite is independent of the others. The az-deploy.yaml workflow is designed to deploy and test an individual test suite, while the ci-deploy-n-test.yaml workflow executes the az-deploy.yaml workflow for each test suite.

az-deploy.yaml

Let's take a closer look at the az-deploy.yaml workflow. The workflow is triggered either manually or when called by another workflow (namely, ci-deploy-n-test.yaml). Having the manual trigger option is convenient when manually re-running tests, either for debugging or simply for creating a new deployment for demos, etc. The input parameters are intentionally simple: the deployment resource group name and location, the name of the test suite, and two checkboxes that control whether to deploy resources and whether to skip cleaning up resources when finished.

Step: deployment

One of the first significant steps in the workflow is to set up some variables based on the input parameters. These are things like values for deployment parameters. To create a deployment in Azure, we use the Azure/login and Azure/arm-deploy GitHub Actions. For authentication, we use OpenID Connect with federated credentials as described here.

Once the deployment step is complete, we extract information about the deployment, such as certain resource names, resource IDs, etc., which the deployment provides as output parameters. These are then used in the subsequent steps.
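To illustrate the extraction step: ARM deployment outputs are reported as a JSON object in which each output name maps to an entry with a `type` and a `value`. A small Python sketch that flattens such an object into plain name/value pairs might look like this. The output names used in the example (`batchAccountName`, `poolCount`) are made up for illustration and are not necessarily the ones bacc emits.

```python
import json
from typing import Any, Dict


def extract_outputs(outputs_json: str) -> Dict[str, Any]:
    """Flatten ARM-style deployment outputs into {name: value}.

    Each output entry looks like {"type": "...", "value": ...};
    later steps usually only need the value.
    """
    outputs = json.loads(outputs_json)
    return {name: entry["value"] for name, entry in outputs.items()}


if __name__ == "__main__":
    # Hypothetical outputs, shaped like the `outputs` section of a deployment.
    sample = """
    {
        "batchAccountName": {"type": "String", "value": "mybatchaccount"},
        "poolCount": {"type": "Int", "value": 2}
    }
    """
    for name, value in extract_outputs(sample).items():
        print(f"{name}={value}")
```

Flattening the outputs once keeps the later steps free of any knowledge about the ARM output envelope.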

Step: validating and testing

Once the deployment is complete, we want to run a set of tests to verify that the deployment was successful. A little rumination on the testing requirements and it quickly becomes apparent that we may need different sets of tests for different test suites. For example, AzFinSim application tests are only relevant for azfinsim-* test suites. We started off by putting this logic in a simple shell script, but that became unwieldy very quickly. To make organizing tests and defining the logic for enabling and disabling tests easier, we decided to use CTest. CTest is a testing framework that is part of the CMake build system. CMake is typically used to build C/C++ applications. However, it can also be used when there is no build involved. The CMake language lets us define input parameters (referred to as CMake configuration variables) and use those to enable or disable tests, customize test execution, etc. CTest can then be used to run the test plan generated using CMake. CMake/CTest also integrates seamlessly with CDash -- a dashboard for viewing and inspecting test results, which can be very handy for tracking and diagnosing test failures.

The CMakeLists.txt and related CMake files are included in the tests directory of the repository. The variables.cmake file defines the configuration variables that are used to pass information about the Azure deployment, which in turn affects which tests are run. In addition to variables that help locate the Azure deployment, such as subscription ID, resource group, etc., we also expose several SB_SUPPORTS_* variables which one can use to enable or disable tests based on the configuration. For example, if Azure Container Registry (ACR) is not supported, then the SB_SUPPORTS_ACR variable will be set to FALSE and tests that require ACR will be disabled. If public network access to the resources is not supported, then the SB_SUPPORTS_NETWORK_ACCESS variable will be set to FALSE; tests that require public network access will be disabled, and tests that confirm that public network access is blocked will be enabled in their stead.
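As a rough illustration of how deployment information can be handed to CMake, the Python sketch below turns a set of feature flags into `-D` cache definitions for the `cmake` command line. The `SB_SUPPORTS_*` prefix comes from the project; the `SB_RESOURCE_GROUP_NAME` variable name and the `cmake_definitions` helper are assumptions made for this example.

```python
from typing import Dict, List


def cmake_definitions(resource_group: str, supports: Dict[str, bool]) -> List[str]:
    """Build `-D<var>:<type>=<value>` arguments for a cmake invocation.

    `supports` maps a feature name (e.g. "acr") to whether the
    deployment supports it; each becomes an SB_SUPPORTS_* boolean.
    """
    # SB_RESOURCE_GROUP_NAME is a hypothetical variable name.
    definitions = [f"-DSB_RESOURCE_GROUP_NAME:STRING={resource_group}"]
    for feature, enabled in sorted(supports.items()):
        value = "TRUE" if enabled else "FALSE"
        definitions.append(f"-DSB_SUPPORTS_{feature.upper()}:BOOL={value}")
    return definitions


if __name__ == "__main__":
    args = cmake_definitions(
        "my-test-rg",
        {"acr": False, "dockerhub": True, "network_access": True},
    )
    # The resulting list can be appended to a `cmake` command line.
    print(" ".join(args))
```

Sorting the feature names keeps the generated command line stable between runs, which makes logs easier to compare.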

The tests we've added so far cover a wide gamut of use cases. Our intent was to cover every usage scenario we have documented in tutorials. A short list of some interesting tests:

validate-* tests validate the deployment and ensure that it matches the configuration that we used for the deployment. The tests parse the configuration JSON files to determine expected values and then query Azure to confirm that the deployment matches the expected values. For example, if the configuration specified 2 pools, then the `validate-pools` test will confirm that the deployment has 2 pools with the correct names and SKUs as requested in the configuration files.
azfinsim-* tests test the AzFinSim application by submitting jobs, waiting until they complete, and verifying that the jobs completed successfully. This helps us catch any issues with the deployment that may cause the application to fail. The tests cover different modalities, including using container images from DockerHub, if SB_SUPPORTS_DOCKERHUB is set to TRUE, or using container images from ACR, if SB_SUPPORTS_ACR is set to TRUE. In the case of ACR tests, we also need to ensure that we build and push the container image for the application before we can run jobs. This is handled by a test itself. By using dependencies between tests (supported by CMake/CTest through test fixtures), we can ensure that the image is built and pushed before the tests that use it are run.
linux-jumpbox-dashboard test is intended for a deployment where none of the deployed resources are accessible from the public internet; instead, one has to access them via a jumpbox virtual machine that is on the same virtual network as the resources. This is a common scenario for hub-n-spoke deployments. The test submits a script to be executed on the jumpbox and then waits for the script to complete. The script is responsible for running the requested test suite on the jumpbox and returning the result. The test passes or fails based on the result returned by the script.
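To make the validation idea concrete, here is a hedged Python sketch of the comparison at the heart of a `validate-pools` style check. It assumes the expected pools (from the configuration) and the deployed pools (from a query against Azure) have already been reduced to simple name-to-SKU mappings; the `validate_pools` function and the pool names and SKUs in the example are illustrative and not taken from bacc.

```python
from typing import Dict, List


def validate_pools(expected: Dict[str, str], deployed: Dict[str, str]) -> List[str]:
    """Compare expected and deployed pools; both map pool name -> VM SKU.

    Returns a list of human-readable mismatches. An empty list
    means the deployment matches the configuration.
    """
    errors = []
    for name, sku in expected.items():
        if name not in deployed:
            errors.append(f"missing pool: {name}")
        elif deployed[name] != sku:
            errors.append(
                f"pool {name}: expected SKU {sku}, found {deployed[name]}"
            )
    for name in deployed:
        if name not in expected:
            errors.append(f"unexpected pool: {name}")
    return errors


if __name__ == "__main__":
    # Hypothetical pool names and SKUs.
    expected = {"linux": "Standard_D2s_v3", "windows": "Standard_D2s_v3"}
    deployed = {"linux": "Standard_D2s_v3"}
    problems = validate_pools(expected, deployed)
    # A test wrapper would exit non-zero when problems is not empty.
    print(problems or "deployment matches configuration")
```

Collecting all mismatches before failing, rather than stopping at the first one, tends to make a failed run easier to diagnose from the dashboard.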

The CTest project is set up to upload testing results to CDash for each test suite. The dashboard is kept private and only accessible to the project maintainers. Here's a snapshot of the latest run at the time of writing this blog post:

Each test suite appears as a separate row. The following image shows the details for one of the test suites, where you can see the tests that were run and their results.

Step: cleanup

The final step in the workflow is to clean up the resources that were deployed. This deletes all resource groups that were created by the deployment. Unless explicitly skipped by passing the corresponding input parameter, the cleanup step is always executed regardless of the results of previous steps. This ensures that we don't leave any resources behind.
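The "always clean up unless told otherwise" behavior is the same pattern as a `try`/`finally` block in ordinary code. The Python sketch below shows the control flow only; the `run_workflow` helper and its callables are hypothetical stand-ins. In GitHub Actions, a step can be made to run regardless of earlier failures with an `if: always()` condition, though whether az-deploy.yaml uses exactly that mechanism is an assumption here.

```python
from typing import Callable


def run_workflow(
    deploy: Callable[[], None],
    test: Callable[[], None],
    cleanup: Callable[[], None],
    skip_cleanup: bool = False,
) -> bool:
    """Deploy and test, then clean up no matter what happened.

    Returns True when deploy and test both finished without raising.
    """
    succeeded = False
    try:
        deploy()
        test()
        succeeded = True
    except Exception as error:
        print(f"workflow failed: {error}")
    finally:
        # Runs on success and on failure, unless explicitly skipped.
        if not skip_cleanup:
            cleanup()
    return succeeded


if __name__ == "__main__":
    def failing_test() -> None:
        raise RuntimeError("job did not complete")

    run_workflow(
        deploy=lambda: print("deploying"),
        test=failing_test,
        cleanup=lambda: print("deleting resource groups"),
    )
```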

ci-deploy-n-test.yaml

The ci-deploy-n-test.yaml workflow simply calls the az-deploy.yaml workflow for each test suite using a test matrix. It is triggered manually or automatically every time the `main` branch is updated.
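A test matrix is essentially a list of parameter sets, one per test suite, each of which is handed to the called workflow. As a sketch, the Python below builds such a list in the `include` shape that GitHub Actions matrices accept. The `build_matrix` helper, the `ci-` resource group naming scheme, and the location value are assumptions for illustration; the actual inputs are defined in the repository's workflow files.

```python
import json
from typing import Dict, List


def build_matrix(suites: List[str], location: str) -> Dict[str, List[Dict[str, str]]]:
    """Create one matrix entry per test suite.

    Each suite gets its own resource group so that suites stay
    independent and can be deployed and cleaned up separately.
    """
    return {
        "include": [
            {
                "suite": suite,
                "resource_group": f"ci-{suite}",  # hypothetical naming scheme
                "location": location,
            }
            for suite in suites
        ]
    }


if __name__ == "__main__":
    matrix = build_matrix(["azfinsim-linux", "azfinsim-windows"], "eastus")
    print(json.dumps(matrix, indent=2))
```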

Putting it all together

Having such a testing strategy in place for IaC code that is available for public consumption is a must. It helps us confirm that the code works as expected and also helps us catch any issues that may have been introduced by changes to the code or changes to the underlying Azure services. We hope this post gets you started thinking about how you can test your IaC code. There is no one-size-fits-all solution, and you will need to adapt the approach to your specific needs.