This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.
The Azure Linux container host is Microsoft’s Linux distribution, tailor-made for cloud-native environments and highly optimized for Azure Kubernetes Service (AKS). Microsoft Threat Protection (MTP) Kubernetes Compute Platform, the single largest application on AKS, recently migrated to Azure Linux nodes on AKS and saw numerous advantages. This blog covers those advantages and explains how we migrated using the combined functionality of Cluster Autoscaler, Node Problem Detector, and Draino.
“Our transition to Azure Linux was initially prompted by the goal of minimizing the system's vulnerability to threats. However, we quickly observed enhancements in performance upon initial boot-up, which, in turn, enabled us to decrease our 'warm standby' reserve capacity thanks to how quickly new infrastructure could be brought online. The reduced attack surface is important to us, as our platform provides compute infrastructure for many products in the Microsoft Security portfolio.” - Joshua Johnson, Group Engineering Manager in Microsoft Security
Azure Linux advantages:
- Security: Azure Linux's minimalist design contains only the necessary packages required to run container workloads. This design philosophy not only diminishes the attack surface but also makes it more manageable to maintain a secure node pool environment.
- Performance: Being very lightweight, Azure Linux boasts rapid boot times and lower memory consumption. This has improved our overall cluster performance across hundreds of AKS clusters.
- Cost Savings: Because Azure Linux needs fewer resources thanks to its smaller footprint, provisioning Azure Linux node pools has allowed us to be more cost-efficient.
- Compatibility and Tooling: We were happy to learn that we could migrate to Azure Linux without letting go of familiar tools. Azure Linux seamlessly integrates with common AKS tools, as well as an array of partner tooling and software that our service uses to monitor the health of our clusters. Further, Azure Linux is aligned with Kubernetes versions, features, and update paths that we’re already accustomed to.
In essence, migrating our nodes to Azure Linux gave us a more secure, efficient, and AKS-optimized Linux OS for our clusters, all without sacrificing essential features or compatibility.
Migrating node pools
The traditional path to migrate node pools starts with creating new node pools, then cordoning and draining the existing node pools, and finally deleting them. This method can be very manual and time-consuming, and it can require tedious coordination for services that are resource constrained. By using the combined functionality of Cluster Autoscaler, Node Problem Detector, and Draino, we were able to gradually migrate workloads to the Azure Linux node pool as we scaled down our existing node pools.
Using Cluster Autoscaler to migrate from one node pool to another has a few limitations:
- While user node pools can be scaled down to zero, system node pools must always have at least one node. It is possible, though, to convert a system node pool into a user node pool (and then scale it to zero).
- Cluster Autoscaler will not automatically remove the existing node pool from the cluster. Nevertheless, node pools that were scaled down to zero do not consume quota or generate costs.
- During the migration process, the cluster will briefly have a higher node count than before as new nodes start and work is drained from the existing node pool to the new one.
The migration itself proceeds through the following steps:
- Tainting Existing Node Pools: The existing node pools are marked with a specific taint. This taint signals that no new workloads should be scheduled on these node pools and that their nodes are ready to be migrated.
- Setting Node Condition: The Node Problem Detector (NPD) is configured to watch for the specific taint applied in the previous step. Upon detecting the taint, NPD sets a permanent condition on the affected nodes, indicating that they are ready for the migration process.
- Node Drainage: Draino monitors for the condition set by NPD and responds by draining the tainted nodes. Draining ensures that all pods currently running on the old nodes are evicted and rescheduled elsewhere in the cluster.
- Node Deletion and Replacement: Once a node has been fully drained, Cluster Autoscaler marks it for deletion. To maintain capacity across the cluster, Cluster Autoscaler then provisions new nodes in the newly created Azure Linux node pools.
- Workload Redistribution: As new Azure Linux nodes become available, the Kubernetes scheduler places the evicted pods on them, so the workloads previously running on the existing nodes shift to the Azure Linux nodes automatically. Cluster Autoscaler keeps adjusting the node count to maintain performance and resource utilization.
Necessary Tooling & Prerequisites
Below is a list of tools and prerequisites necessary for completing the examples in this guide. Ensure these are installed and configured before proceeding:
- Docker - A platform for developing, shipping, and running applications inside containers.
- Azure CLI - A command-line tool for interacting with and managing Azure resources.
- kubectl - A command-line interface for running commands against Kubernetes clusters.
- Helm - A package manager for Kubernetes, simplifying deployment and management of applications.
- yq - A portable command-line YAML, JSON and XML processor that allows you to query and update data structures.
Defining essential environment variables
These variables will be consistently referenced throughout this document/example.
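As a minimal sketch, the variables could look like the block below. Every name and value here is a hypothetical placeholder chosen for illustration (only the `NodePoolRemoved` taint name comes from this post), so substitute your own subscription, resource group, cluster, and node pool names.

```shell
# Hypothetical placeholder values; replace each one with your own.
export SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
export RESOURCE_GROUP="my-resource-group"
export CLUSTER_NAME="my-aks-cluster"
export OLD_NODE_POOL="nodepool1"               # existing pool to migrate away from
export NEW_NODE_POOL="azlinux1"                # new Azure Linux pool to create
export MIGRATION_TAINT_KEY="NodePoolRemoved"   # taint key that NPD will watch for
```

Keeping the taint key in a variable makes it easier to use the same value when tainting nodes and when configuring NPD.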
Configuring Azure CLI with provided environment variables
Ensure the environment variables defined earlier are set before executing these commands:
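One way to do this, sketched under the assumption that `SUBSCRIPTION_ID`, `RESOURCE_GROUP`, and `CLUSTER_NAME` are the variable names you defined, is shown below. The function name `configure_azure_cli` is hypothetical; the commands are wrapped in a function so that nothing runs until you call it.

```shell
configure_azure_cli() {
  # Select the subscription that hosts the cluster, then merge the
  # cluster credentials into the local kubeconfig for kubectl and Helm.
  az account set --subscription "${SUBSCRIPTION_ID:?set SUBSCRIPTION_ID first}" &&
  az aks get-credentials \
    --resource-group "${RESOURCE_GROUP:?set RESOURCE_GROUP first}" \
    --name "${CLUSTER_NAME:?set CLUSTER_NAME first}"
}
```

Run `configure_azure_cli` once per shell session; the `${VAR:?}` guards stop the call early if a variable is missing.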
Identifying node pools for migration
Before migration, review the node pools within the cluster. Use the following command to list them in a table format for easier selection:
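A listing along these lines should work, assuming the `RESOURCE_GROUP` and `CLUSTER_NAME` variables are set. The function name `list_node_pools` and the choice of columns in the `--query` expression are illustrative assumptions, not a required format.

```shell
list_node_pools() {
  # Show one row per node pool with the fields most relevant to a migration.
  az aks nodepool list \
    --resource-group "${RESOURCE_GROUP:?set RESOURCE_GROUP first}" \
    --cluster-name "${CLUSTER_NAME:?set CLUSTER_NAME first}" \
    --query "[].{Name:name, Mode:mode, OsSku:osSku, VmSize:vmSize, Count:count}" \
    --output table
}
```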
Here’s what you can expect to see after running the command above:
“Cloning” a node pool for further migration/deletion
In this instance, we’ll copy a node pool configuration, updating only its name and osSku to create a fresh node pool:
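The sketch below shows one possible way to do this with the Azure CLI alone. It copies only the VM size, mode, and autoscaler limits, and it assumes the source pool already has Cluster Autoscaler enabled (otherwise `minCount` and `maxCount` are empty). The function names `pool_field` and `clone_node_pool_as_azure_linux` are hypothetical, and starting the new pool at the copied minimum count is a choice made for this example. Any other settings your pool relies on (labels, taints, zones, and so on) would need to be carried over as well.

```shell
# Print one property of an existing node pool (field names follow the Azure CLI JSON output).
pool_field() {
  az aks nodepool show \
    --resource-group "${RESOURCE_GROUP:?set RESOURCE_GROUP first}" \
    --cluster-name "${CLUSTER_NAME:?set CLUSTER_NAME first}" \
    --name "$1" --query "$2" --output tsv
}

clone_node_pool_as_azure_linux() {
  src="${1:?source node pool name required}"
  dst="${2:?new node pool name required}"
  # Read the settings to carry over from the existing pool.
  vm_size=$(pool_field "$src" vmSize)
  mode=$(pool_field "$src" mode)
  min_count=$(pool_field "$src" minCount)
  max_count=$(pool_field "$src" maxCount)
  # Create the new pool with the same settings, changing only the name and OS SKU.
  az aks nodepool add \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$dst" \
    --os-sku AzureLinux \
    --mode "$mode" \
    --node-vm-size "$vm_size" \
    --node-count "$min_count" \
    --enable-cluster-autoscaler \
    --min-count "$min_count" \
    --max-count "$max_count"
}
```

For example, `clone_node_pool_as_azure_linux nodepool1 azlinux1` would create a pool named `azlinux1` modeled on `nodepool1`.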
Configuring the Node Problem Detector (NPD)
This section outlines the steps to set up NPD for our goals. NPD will specifically monitor the NodePoolRemoved taint and, once it is detected, assign a condition to the affected nodes. Draino will then use this condition to identify and process the nodes for safe removal, ensuring node pool maintenance aligns with cluster health and reliability requirements.
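To illustrate the idea, here is a minimal sketch of what a taint check for an NPD custom plug-in could look like. It is not the plug-in script used in this post. It assumes that `NODE_NAME` is available inside the NPD pod, that the pod's service account is allowed to read its own Node object, and that `curl` is present in the image. The function names are hypothetical. The return codes follow NPD's custom plug-in convention: 0 means OK, 1 means a problem was found, and 2 means unknown.

```shell
TAINT_KEY="${MIGRATION_TAINT_KEY:-NodePoolRemoved}"

# Succeeds when the Node JSON on stdin contains a taint with the given key.
node_has_taint() {
  grep -q "\"key\": *\"$1\""
}

check_node_taint() {
  sa=/var/run/secrets/kubernetes.io/serviceaccount
  # Fetch this node's object from the API server using the pod's service account.
  node_json=$(curl -sS --fail --cacert "$sa/ca.crt" \
    -H "Authorization: Bearer $(cat "$sa/token")" \
    "https://kubernetes.default.svc/api/v1/nodes/${NODE_NAME:?}") || {
    echo "Could not read node from the API server"
    return 2
  }
  if printf '%s' "$node_json" | node_has_taint "$TAINT_KEY"; then
    echo "Node carries taint $TAINT_KEY"
    return 1
  fi
  echo "Node is not tainted with $TAINT_KEY"
  return 0
}
```

In a real plug-in script, the last line would call `check_node_taint` and exit with its return code, so that a rule of type `permanent` can turn the result into a node condition.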
Customizing NPD container image (to include curl)
Save the contents below to a file named ‘Dockerfile’ to build a custom NPD image that includes ‘curl’:
NPD Helm Chart custom values (npd-values.yaml)
Values specific to our example, which add the custom NPD plug-in called ‘NodeTaintMonitor’ (please )
Installing NPD via Helm Chart
To install the NPD Helm Chart, we will follow the instructions provided by the project on GitHub.
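A sketch of the installation is shown below, assuming the chart is pulled from the `deliveryhero` Helm repository referenced by the NPD project and that your custom values live in `npd-values.yaml`. The release name, the `kube-system` namespace, and the function name `install_npd` are assumptions made for this example.

```shell
install_npd() {
  values_file="${1:-npd-values.yaml}"
  # Register the chart repository, refresh the index, then install or upgrade the release.
  helm repo add deliveryhero https://charts.deliveryhero.io/ &&
  helm repo update &&
  helm upgrade --install node-problem-detector deliveryhero/node-problem-detector \
    --namespace kube-system \
    --values "$values_file"
}
```

Using `helm upgrade --install` lets you rerun the same command after editing the values file.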
Setting up Draino
Draino is no longer being actively maintained. To ensure security and stability, it’s recommended to seek an updated version that includes all the necessary security patches.
Tainting the old node pools
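With NPD and Draino in place, the old node pool can be tainted to start the migration. The sketch below assumes the taint key `NodePoolRemoved` with a `NoSchedule` effect and selects nodes by the `agentpool` label that AKS applies to them. The function name `taint_node_pool` and the taint value `true` are hypothetical choices for this example.

```shell
taint_node_pool() {
  pool="${1:?usage: taint_node_pool <node-pool-name>}"
  # Taint every node currently in the pool so no new pods are scheduled there.
  kubectl taint nodes \
    --selector "agentpool=$pool" \
    "${MIGRATION_TAINT_KEY:-NodePoolRemoved}=true:NoSchedule" \
    --overwrite
}
```

For example, `taint_node_pool nodepool1` taints the nodes that exist in `nodepool1` at that moment; nodes added to the pool later would not carry the taint unless it is applied again.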
We welcome you to try out this demo and let us know how it goes! Questions and feedback can be posted on Azure Linux’s GitHub issues page.