Automate AKS Deployment and Chaos Engineering with Terraform and GitHub Actions

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

Azure Chaos Studio is a fully managed chaos engineering platform that helps you identify and mitigate potential issues in your applications before they impact customers. It enables you to intentionally introduce faults and disruptions to test the resilience and robustness of your systems. By using Chaos Studio, you can uncover hard-to-find problems in your applications, from late-stage development through production, and plan mitigations to improve overall system reliability.


The GitHub Actions workflows presented here demonstrate a comprehensive approach to automating the deployment and management of an Azure Kubernetes Service (AKS) cluster with Terraform, as well as deploying Chaos Mesh experiments and the Azure Vote service into the AKS cluster. The workflows integrate directly with GitHub, so updates and deployments run automatically on code changes or manual triggers. Together, GitHub Actions, Azure, and Kubernetes form a robust, automated pipeline for maintaining and testing the resilience of applications deployed in the AKS environment.

Automating AKS with Terraform

To automate the deployment and management of an Azure Kubernetes Service (AKS) cluster, I utilized Terraform with the AKS module provided by Azure. This module simplifies the process by abstracting many of the complex configurations needed to set up and manage an AKS cluster.


In the Terraform configuration, I specified the AKS module with the latest version at the time, ensuring compatibility with the latest features and updates. The configuration began by defining essential parameters, such as the resource group name, Kubernetes version, and admin username. Automatic patch upgrades were enabled to ensure the cluster remains updated with the latest patches.

The cluster was configured to use virtual machine scale sets for agent nodes, with a specific node size and a range of nodes to accommodate varying workloads. Custom Linux OS configurations were applied to the agent nodes, enhancing their performance and security settings.


To enhance security, the API server was restricted to authorized IP ranges, including both public and private IP addresses of a bastion host and additional CIDR ranges. Integration with Azure Container Registry (ACR) was facilitated by attaching the ACR ID to the AKS cluster, enabling seamless container management.


Advanced features such as Azure Policy, auto-scaling, and HTTP application routing were enabled to improve cluster governance, scalability, and traffic management. User-assigned managed identities were employed for secure access control, and key management services (KMS) were enabled to secure sensitive data using Azure Key Vault.


Network settings were carefully configured, including DNS service IP, service CIDR, network plugin, and policy settings, ensuring robust network management and security. Role-based access control (RBAC) was enabled and managed through Azure Active Directory (AAD) to streamline user and group management.
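In the exhibit below, net_profile_dns_service_ip and net_profile_service_cidr are left blank; when you do set them, AKS requires the DNS service IP to fall inside the service CIDR. A minimal sketch, with hypothetical values, for sanity-checking a pair before running terraform apply:

```python
import ipaddress

def valid_service_network(service_cidr: str, dns_service_ip: str) -> bool:
    """Return True if dns_service_ip is a usable address inside service_cidr.

    AKS requires the DNS service IP to lie within the service CIDR; the
    network and broadcast addresses are excluded here since they are never
    assignable to a service.
    """
    net = ipaddress.ip_network(service_cidr)
    ip = ipaddress.ip_address(dns_service_ip)
    return ip in net and ip not in (net.network_address, net.broadcast_address)

print(valid_service_network("10.0.0.0/16", "10.0.0.10"))     # inside the range
print(valid_service_network("10.0.0.0/16", "192.168.0.10"))  # outside the range
```

A check like this is cheap to run in CI before Terraform ever touches Azure, turning a slow apply-time validation error into an immediate failure.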


Additional features such as log analytics, maintenance windows, and secret rotation were configured to enhance cluster monitoring, maintenance, and security. Tags and labels were added to agent nodes for better organization and resource management.

By defining these configurations in Terraform, the AKS deployment process was automated, making it reproducible and manageable through code. This approach not only reduced manual intervention but also ensured consistency and reliability in the AKS infrastructure.


Note: The code provided below is for exhibit purposes only and may be outdated at the time of writing. This code was used solely in a demo environment to illustrate the automation of an Azure Kubernetes Service (AKS) cluster/Chaos Mesh using the AKS module in Terraform. While the configuration showcases a comprehensive setup, including security, scalability, and management features, it is essential to review and update the code according to the latest Azure and Terraform best practices and versions when implementing it in a production environment. The exhibit is intended to serve as an educational example and may require modifications to align with current standards and specific use cases.




```hcl
module "aks" {
  source  = "Azure/aks/azurerm"
  version = "7.4.0"

  prefix                    = random_id.aks.hex
  resource_group_name       =
  kubernetes_version        = "1.27" # don't specify the patch version!
  admin_username            = "azureuser"
  automatic_channel_upgrade = "patch"

  agents_availability_zones = ["1"]
  agents_count              = null
  agents_max_count          = var.agents_max_count
  agents_max_pods           = 75
  agents_min_count          = var.agents_min_count
  agents_size               = "Standard_D2s_v3"
  agents_pool_name          = "testnodepool"
  agents_type               = "VirtualMachineScaleSets"

  agents_pool_linux_os_configs = [
    {
      transparent_huge_page_enabled = "always"
      sysctl_configs = [
        {
          fs_aio_max_nr               = 65536
          fs_file_max                 = 100000
          fs_inotify_max_user_watches = 1000000
        }
      ]
    }
  ]

  api_server_authorized_ip_ranges = concat(
    [
      "${azurerm_linux_virtual_machine.bastion.public_ip_address}/32",
      "${azurerm_linux_virtual_machine.bastion.private_ip_address}/32",
      "REDACTED"
    ],
    var.chaos_studio_cidr_ranges
  )

  attached_acr_id_map = { example = }

  azure_policy_enabled             = true
  auto_scaler_profile_enabled      = true
  auto_scaler_profile_expander     = "least-waste"
  enable_auto_scaling              = true
  http_application_routing_enabled = true

  identity_ids  = []
  identity_type = "UserAssigned"

  ingress_application_gateway_enabled = false
  #ingress_application_gateway_id          =
  #ingress_application_gateway_subnet_cidr = ""

  key_vault_secrets_provider_enabled = true
  kms_enabled                        = true
  kms_key_vault_key_id               = "https://${}${}/${azurerm_key_vault_key.aks_key.version}"
  local_account_disabled             = false

  log_analytics_workspace_enabled      = true
  cluster_log_analytics_workspace_name = random_id.aks.hex
  microsoft_defender_enabled           = false

  maintenance_window = {
    allowed = [
      {
        day   = "Sunday",
        hours = [22, 23]
      },
    ]
    not_allowed = [
      {
        start = "2024-01-01T20:00:00Z",
        end   = "2024-01-01T21:00:00Z"
      },
    ]
  }

  net_profile_dns_service_ip = ""
  net_profile_service_cidr   = ""
  network_plugin             = "azure"
  network_policy             = "azure"
  os_disk_size_gb            = 60

  private_cluster_enabled       = false
  public_network_access_enabled = true

  rbac_aad                          = true
  rbac_aad_managed                  = true
  role_based_access_control_enabled = true

  secret_rotation_enabled             = true
  sku_tier                            = "Standard"
  storage_profile_blob_driver_enabled = true
  storage_profile_enabled             = true
  temporary_name_for_rotation         = "a${random_string.aks_temporary_name_for_rotation.result}"
  vnet_subnet_id                      =

  rbac_aad_admin_group_object_ids = [azuread_group.aks_admins.object_id]

  agents_labels = {
    "Agent" : "agentLabel"
  }
  agents_tags = {
    "Agent" : "agentTag"
  }

  depends_on = [
    azurerm_subnet.aks,
  ]
}
```




Automating AKS with GitHub Actions

The provided GitHub Actions workflow automates the deployment of the AKS cluster using Terraform. It is triggered on two conditions: when changes are pushed to the main branch within the terraform directory, or manually through a workflow_dispatch event. The manual trigger lets users specify the desired Terraform operation (plan, apply, or destroy) through an input parameter, so they can review changes, apply the infrastructure configuration, or tear it down as needed.


The workflow defines a single job named 'Terraform' that runs on the latest Ubuntu environment. It sets up necessary environment variables using secrets for secure authentication with Azure. The steps include checking out the repository, setting up the specified version of Terraform, and initializing Terraform with backend configuration sourced from environment variables. The workflow then validates the Terraform configuration to ensure correctness. Depending on the trigger, it proceeds to execute the appropriate Terraform command: plan to review the changes, apply to deploy the infrastructure, or destroy to remove it. This automation streamlines the management of the AKS cluster, ensuring consistent and reproducible deployments.




```yaml
on:
  push:
    branches: [main]
    paths:
      - 'terraform/**'
  workflow_dispatch:
    inputs:
      terraform_operation:
        description: "Terraform operation: plan, apply, destroy"
        required: true
        default: "plan"
        type: choice
        options:
          - plan
          - apply
          - destroy

name: Deploy AKS Cluster

jobs:
  terraform:
    name: 'Terraform'
    runs-on: ubuntu-latest
    env:
      ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
      ARM_CLIENT_SECRET: ${{ secrets.ARM_CLIENT_SECRET }}
      ARM_SUBSCRIPTION_ID: ${{ secrets.ARM_SUBSCRIPTION_ID }}
      ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
      GITHUB_TOKEN: ${{ secrets.GH_TOKEN }}
      TF_VERSION: 1.6.1
    defaults:
      run:
        shell: bash
        working-directory: ./terraform
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Init
        id: init
        run: |
          set -a
          source ../.env.backend
          terraform init \
            -backend-config="resource_group_name=$TF_VAR_state_resource_group_name" \
            -backend-config="storage_account_name=$TF_VAR_state_storage_account_name"

      - name: Terraform Validate
        id: validate
        run: terraform validate -no-color

      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color
        if: "${{ github.event_name == 'workflow_dispatch' && github.event.inputs.terraform_operation == 'plan' || github.event_name == 'push' }}"

      - name: Terraform Apply
        id: apply
        run: terraform apply -auto-approve
        if: "${{ github.event_name == 'workflow_dispatch' && github.event.inputs.terraform_operation == 'apply' }}"

      - name: Terraform Destroy
        id: destroy
        run: terraform destroy --auto-approve
        if: "${{ github.event.inputs.terraform_operation == 'destroy' }}"
```
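The init step sources `../.env.backend` and expects it to define the two backend variables referenced above. A sketch of what that file might look like — the variable names come from the workflow, but the values here are hypothetical:

```shell
# .env.backend (sketch — values are illustrative placeholders)
# Sourced with `set -a` so every assignment is exported to terraform init.
TF_VAR_state_resource_group_name="rg-terraform-state"
TF_VAR_state_storage_account_name="sttfstate001"
```

Keeping the backend names in a dotfile rather than hardcoding them in the workflow makes it easier to point the same pipeline at a different state storage account per environment.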




Automating Chaos Studio with Terraform

The provided Terraform code defines resources for deploying Chaos Mesh. First, it creates a new Kubernetes namespace named "chaos-testing" using the kubernetes_namespace resource. This namespace isolates the Chaos Mesh components from other workloads in the cluster, enhancing organization and security by confining the chaos engineering experiments to a dedicated area.


Next, the code uses the helm_release resource to install Chaos Mesh via Helm, a package manager for Kubernetes. The Helm chart for Chaos Mesh is specified from its official repository, with version 2.6 explicitly chosen. The installation occurs within the previously defined "chaos-testing" namespace. The set blocks within the helm_release resource customize the installation by configuring the chaosDaemon to use containerd as the runtime and specifying the socket path for the container runtime. This setup ensures that Chaos Mesh integrates correctly with the underlying container runtime, enabling effective chaos engineering experiments to test the resilience and robustness of applications running in the Kubernetes cluster.




```hcl
resource "kubernetes_namespace" "chaos_testing" {
  metadata {
    name = "chaos-testing"
  }
}

resource "helm_release" "chaos_mesh" {
  name       = "chaos-mesh"
  repository = ""
  chart      = "chaos-mesh"
  namespace  = kubernetes_namespace.chaos_testing.metadata[0].name
  version    = "2.6" # specify the version of the Chaos Mesh chart you want to deploy

  set {
    name  = "chaosDaemon.runtime"
    value = "containerd"
  }

  set {
    name  = "chaosDaemon.socketPath"
    value = "/run/containerd/containerd.sock"
  }
}
```
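The two set blocks above are equivalent to this Helm values fragment (a sketch; the same settings could also be passed through the values argument of helm_release, or with -f values.yaml on the Helm CLI):

```yaml
chaosDaemon:
  runtime: containerd
  socketPath: /run/containerd/containerd.sock
```

AKS nodes run containerd, so the Chaos Mesh daemon must be pointed at the containerd socket rather than a Docker socket; with the wrong runtime setting, pod-level faults silently fail to inject.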




Automating Chaos Studio with GitHub Actions

The GitHub Actions workflow below deploys and manages Chaos Mesh experiments and the Azure Vote service within the AKS cluster. It can be triggered by three kinds of events: a push to the main branch, a published release, or a manual workflow_dispatch. The manual trigger lets users choose between three operations: deploying the vote service, uninstalling the vote service, or deploying the chaos experiments.


The workflow defines three separate jobs corresponding to these operations, each running on a self-hosted runner. The deploy_vote_service job checks out the repository, logs into Azure using provided credentials, and sets up the Kubernetes configuration to interact with the AKS cluster. It then creates a namespace and deploys the Azure Vote service. The uninstall_vote_service job follows similar steps but focuses on removing the Azure Vote service from the cluster. The deploy_chaos_experiments job is more complex, involving the setup of the AKS configuration, deployment of chaos experiments, and management of necessary role assignments in Azure AD. It iterates over a set of predefined chaos experiment configurations, applies them, and ensures appropriate permissions are set for the experiments to interact with the AKS cluster. This structured approach ensures a consistent and automated deployment process for both the Azure Vote service and Chaos Mesh experiments.




```yaml
on:
  push:
    branches:
      - main
  release:
    types: [published]
  workflow_dispatch:
    inputs:
      chaos_experiments_operation:
        description: 'Operation: Deploy Experiments for Chaos Mesh'
        required: true
        default: 'deploy_vote_service'
        type: choice
        options:
          - deploy_vote_service
          - uninstall_vote_service
          - deploy_chaos_experiments

name: Deploy Chaos Mesh Experiments & Vote Service

jobs:
  deploy_vote_service:
    runs-on: self-hosted
    if: ${{ github.event.inputs.chaos_experiments_operation == 'deploy_vote_service' }}
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Azure Login
        uses: azure/login@v1
        with:
          creds: '{"clientId":"${{ secrets.ARM_CLIENT_ID }}","clientSecret":"${{ secrets.ARM_CLIENT_SECRET }}","subscriptionId":"${{ secrets.ARM_SUBSCRIPTION_ID }}","tenantId":"${{ secrets.ARM_TENANT_ID }}"}'

      - name: kubeconfig
        run: |
          az aks get-credentials --resource-group ${{ secrets.AKS_RESOURCE_GROUP }} --name ${{ secrets.AKS_NAME }} --overwrite-existing
          kubelogin convert-kubeconfig -l azurecli

      - name: Create Namespace
        run: |
          kubectl get namespace azure-vote || kubectl create namespace azure-vote

      - name: Install Azure Vote Service
        run: |
          kubectl apply -f ./app/azure-vote.yaml -n azure-vote
          kubectl get service azure-vote-front -n azure-vote

  uninstall_vote_service:
    runs-on: self-hosted
    if: ${{ github.event.inputs.chaos_experiments_operation == 'uninstall_vote_service' }}
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Azure Login
        uses: azure/login@v1
        with:
          creds: '{"clientId":"${{ secrets.ARM_CLIENT_ID }}","clientSecret":"${{ secrets.ARM_CLIENT_SECRET }}","subscriptionId":"${{ secrets.ARM_SUBSCRIPTION_ID }}","tenantId":"${{ secrets.ARM_TENANT_ID }}"}'

      - name: kubeconfig
        run: |
          az aks get-credentials --resource-group ${{ secrets.AKS_RESOURCE_GROUP }} --name ${{ secrets.AKS_NAME }} --overwrite-existing
          kubelogin convert-kubeconfig -l azurecli

      - name: Uninstall Azure Vote Service
        run: |
          kubectl delete -f ./app/azure-vote.yaml -n azure-vote

  deploy_chaos_experiments:
    runs-on: self-hosted
    if: ${{ github.event_name == 'push' || (github.event_name == 'workflow_dispatch' && github.event.inputs.chaos_experiments_operation == 'deploy_chaos_experiments') }}
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Azure Login
        uses: azure/login@v1
        with:
          creds: '{"clientId":"${{ secrets.ARM_CLIENT_ID }}","clientSecret":"${{ secrets.ARM_CLIENT_SECRET }}","subscriptionId":"${{ secrets.ARM_SUBSCRIPTION_ID }}","tenantId":"${{ secrets.ARM_TENANT_ID }}"}'

      - name: Deploy Chaos Experiment AKS Targets
        run: |
          for file in ${{ github.workspace }}/json/*.json; do
            sed -i 's/SUBSCRIPTION_ID_PLACEHOLDER/${{ secrets.ARM_SUBSCRIPTION_ID }}/g' "$file"
            sed -i 's/RESOURCE_GROUP_PLACEHOLDER/${{ secrets.AKS_RESOURCE_GROUP }}/g' "$file"
            sed -i 's/AKS_NAME_PLACEHOLDER/${{ secrets.AKS_NAME }}/g' "$file"
          done

          # Create the chaos target
          az rest --method put --uri "${{ secrets.AKS_RESOURCE_ID }}/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh?api-version=${{ secrets.API_VERSION }}" --headers 'Content-Type=application/json' --body "{\"properties\":{}}"

          headers='{"Content-Type":"application/json"}'

          # Create the chaos experiments
          experimentNames=("PodChaos-2.1" "DNSChaos-2.1" "HTTPChaos-2.1" "KernelChaos-2.1" "TimeChaos-2.1" "IOChaos-2.1" "StressChaos-2.1" "NetworkChaos-2.1")
          for experimentName in "${experimentNames[@]}"; do
            echo "Creating capability ${experimentName}"
            az rest --method put --uri "${{ secrets.AKS_RESOURCE_ID }}/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh/capabilities/${experimentName}?api-version=${{ secrets.API_VERSION }}" --headers "$headers" --body "{\"properties\":{}}"

            echo "Creating experiment ${experimentName}"
            response=$(az rest --method put --uri "${{ secrets.ARM_SUBSCRIPTION_ID }}/resourceGroups/${{ secrets.AKS_RESOURCE_GROUP }}/providers/Microsoft.Chaos/experiments/${experimentName}?api-version=${{ secrets.API_VERSION }}" --headers "$headers" --body @"${{ github.workspace }}/json/${experimentName}.json")
            echo "Response: $response"
          done

      - name: Get Principal IDs
        id: get_principal_ids
        run: |
          # Define the experiment names
          experimentNames=("PODCHAOS-2.1" "DNSCHAOS-2.1" "HTTPCHAOS-2.1" "KERNELCHAOS-2.1" "TIMECHAOS-2.1" "IOCHAOS-2.1" "STRESSCHAOS-2.1" "NETWORKCHAOS-2.1")
          principal_ids=""
          for experiment_name in "${experimentNames[@]}"; do
            echo "Processing experiment: $experiment_name"
            api_url="${{ secrets.ARM_SUBSCRIPTION_ID }}/resourceGroups/${{ secrets.AKS_RESOURCE_GROUP }}/providers/Microsoft.Chaos/experiments/$experiment_name?api-version=2024-01-01"
            echo "API URL: $api_url"
            experiment_response=$(az rest --method get --uri "$api_url")
            echo "Response for $experiment_name: $experiment_response"
            principal_id=$(echo "$experiment_response" | jq -r '.identity.principalId')
            echo "Principal ID for $experiment_name: $principal_id"
            principal_ids="$principal_ids$principal_id,"
          done
          principal_ids="${principal_ids%,}" # Remove trailing comma
          echo "principal_ids=$principal_ids" >> $GITHUB_ENV
          # ::set-output is deprecated on current runners; write to $GITHUB_OUTPUT instead
          echo "principal_ids=$principal_ids" >> $GITHUB_OUTPUT

      - name: Add Principals to AD Group and Assign AKS Cluster Admin Role
        run: |
          IFS=',' read -ra IDS <<< "${{ steps.get_principal_ids.outputs.principal_ids }}"
          for id in "${IDS[@]}"; do
            # Check if the principal is already a member of the AD group
            group_member_check=$(az ad group member check --group "${{ secrets.AKS_AD_GROUP }}" --member-id "$id" --query 'value' -o tsv)
            if [ "$group_member_check" == "false" ]; then
              az ad group member add --group "${{ secrets.AKS_AD_GROUP }}" --member-id "$id"
            else
              echo "Principal $id is already a member of the AD group."
            fi

            # Check if the principal already has the AKS Cluster Admin role
            role_assignment_check=$(az role assignment list --assignee "$id" --role "Azure Kubernetes Service Cluster Admin Role" --scope "/subscriptions/${{ secrets.ARM_SUBSCRIPTION_ID }}/resourceGroups/${{ secrets.AKS_RESOURCE_GROUP }}/providers/Microsoft.ContainerService/managedClusters/${{ secrets.AKS_NAME }}" --query 'length(@)' -o tsv)
            if [ "$role_assignment_check" -eq 0 ]; then
              # Assign AKS Cluster Admin role
              az role assignment create \
                --assignee-object-id "$id" \
                --role "Azure Kubernetes Service Cluster Admin Role" \
                --scope "/subscriptions/${{ secrets.ARM_SUBSCRIPTION_ID }}/resourceGroups/${{ secrets.AKS_RESOURCE_GROUP }}/providers/Microsoft.ContainerService/managedClusters/${{ secrets.AKS_NAME }}"
            else
              echo "Principal $id already has the AKS Cluster Admin role assigned."
            fi
          done
```





Automating Chaos Studio JSON Templates with GitHub Actions and Terraform

The JSON configuration provided (also see the Azure Chaos Studio fault and action library) defines a detailed chaos experiment intended for deployment within the AKS cluster. The configuration is stored in a root GitHub folder named json and is consumed by the GitHub Actions workflows that orchestrate the Chaos Mesh experiments. Keeping these JSON configurations in a dedicated folder lets the workflows reference and apply them during deployment, ensuring a structured and maintainable approach to chaos testing.


The JSON file specifies the location of the experiment (eastus) and sets up a system-assigned identity for the resources. Within the properties section, the experiment steps are outlined, beginning with "Step 1." This step includes a single branch ("Branch 1") that defines a continuous action targeting all pods within the "azure-vote" namespace. The action is configured to simulate pod failures for a duration of five minutes, utilizing a specific Chaos Mesh capability (podChaos/2.1). The JSON configuration also defines a selector ("Selector1") that identifies the specific AKS cluster targeted by the experiment. This setup ensures that the chaos experiment is precisely targeted and executed within the intended cluster, helping to test the resilience and fault tolerance of the applications running in the "azure-vote" namespace.


By integrating these JSON configurations into the GitHub Action workflows, the automation process becomes seamless. The workflows dynamically replace placeholder values (SUBSCRIPTION_ID_PLACEHOLDER, RESOURCE_GROUP_PLACEHOLDER, and AKS_NAME_PLACEHOLDER) with actual values during execution. This dynamic replacement allows for flexibility and reusability of the JSON configurations across different environments and clusters. The structured approach of keeping these configurations in a dedicated folder and calling them within the GitHub Action workflows ensures a streamlined and efficient process for deploying and managing chaos experiments, ultimately contributing to the robustness and reliability of the AKS-deployed applications.




```json
{
  "location": "eastus",
  "identity": {
    "type": "SystemAssigned"
  },
  "properties": {
    "steps": [
      {
        "name": "Step 1",
        "branches": [
          {
            "name": "Branch 1",
            "actions": [
              {
                "type": "continuous",
                "selectorId": "Selector1",
                "duration": "PT5M",
                "parameters": [
                  {
                    "key": "jsonSpec",
                    "value": "{\"action\":\"pod-failure\",\"mode\":\"all\",\"selector\":{\"namespaces\":[\"azure-vote\"]}}"
                  }
                ],
                "name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.1"
              }
            ]
          }
        ]
      }
    ],
    "selectors": [
      {
        "id": "Selector1",
        "type": "List",
        "targets": [
          {
            "type": "ChaosTarget",
            "id": "/subscriptions/SUBSCRIPTION_ID_PLACEHOLDER/resourceGroups/RESOURCE_GROUP_PLACEHOLDER/providers/Microsoft.ContainerService/managedClusters/AKS_NAME_PLACEHOLDER/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh"
          }
        ]
      }
    ]
  }
}
```
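A small helper can mirror the workflow's sed substitutions and sanity-check the escaped jsonSpec before an experiment is submitted. This is a hedged sketch — the function and the sample values are illustrative, not part of the original pipeline:

```python
import json

# The target resource ID from the template above, with the workflow's placeholders.
TARGET_ID_TEMPLATE = (
    "/subscriptions/SUBSCRIPTION_ID_PLACEHOLDER"
    "/resourceGroups/RESOURCE_GROUP_PLACEHOLDER"
    "/providers/Microsoft.ContainerService/managedClusters/AKS_NAME_PLACEHOLDER"
    "/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh"
)

def render(template: str, subscription_id: str, resource_group: str, aks_name: str) -> str:
    """Apply the same placeholder substitutions the workflow performs with sed."""
    return (
        template.replace("SUBSCRIPTION_ID_PLACEHOLDER", subscription_id)
        .replace("RESOURCE_GROUP_PLACEHOLDER", resource_group)
        .replace("AKS_NAME_PLACEHOLDER", aks_name)
    )

# The jsonSpec value is a JSON document embedded (escaped) inside a JSON string;
# building it with json.dumps avoids hand-escaping mistakes.
json_spec = json.dumps(
    {"action": "pod-failure", "mode": "all", "selector": {"namespaces": ["azure-vote"]}}
)

rendered = render(
    TARGET_ID_TEMPLATE,
    "00000000-0000-0000-0000-000000000000",  # hypothetical subscription ID
    "rg-chaos-demo",                         # hypothetical resource group
    "aks-chaos-demo",                        # hypothetical cluster name
)
print(rendered)
print(json.loads(json_spec)["action"])  # round-trips back to the inner document
```

Validating that the rendered IDs contain no leftover placeholders, and that the inner jsonSpec still parses as JSON, catches template mistakes before the az rest calls fail at deploy time.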





We covered several aspects of automating and managing AKS (Azure Kubernetes Service) clusters and chaos engineering experiments using Terraform and GitHub Actions. We started by detailing the Terraform code used to deploy an AKS cluster, highlighting the configuration of various components such as agent nodes, network settings, security policies, and integrations with Azure services. This automation not only ensures a consistent deployment process but also leverages the power of infrastructure as code to manage complex cloud resources efficiently.


We then explored a GitHub Actions workflow designed to automate the deployment and management of Chaos Mesh experiments and the Azure Vote service. This workflow uses triggers based on code changes and manual inputs to execute specific tasks, such as deploying, uninstalling, or running chaos experiments within the AKS cluster. By integrating Azure credentials and Kubernetes configurations, the workflow streamlines the process of setting up and managing these experiments, ensuring that they are applied accurately and securely.


Additionally, we delved into the JSON configurations used for chaos experiments, stored in a dedicated GitHub folder and referenced within the GitHub Action workflows. These configurations define detailed chaos experiment steps and selectors, targeting specific resources within the AKS cluster to simulate various fault scenarios. By organizing these configurations and automating their deployment, we enhance the resilience and fault tolerance of applications running in the cloud.


Together, these discussions illustrate a robust approach to managing cloud infrastructure and testing application resilience through automation and chaos engineering. Utilizing Terraform for infrastructure deployment and GitHub Actions for orchestration and management allows for a streamlined, efficient, and consistent process, ultimately contributing to more reliable and resilient cloud-native applications.


