Microsoft recently announced the public preview of Confidential Containers on Azure Kubernetes Service (AKS). This new offering builds upon our existing virtual machine (VM) isolated containers on AKS offering for sandboxing Kubernetes workloads. While VM isolation of containers provides customers with more robust pod sandboxing, the confidential container offering enables running workloads within encrypted VMs inside a hardware-based Trusted Execution Environment (TEE). Only the workload and the minimal software stack running inside the TEE comprise the Trusted Computing Base (TCB), which can be fully measured, reproduced, and attested by customers. This means that Microsoft, cloud administrators, and many infrastructure components a standard workload has historically been required to trust, such as the host operating system (OS) and hypervisor, are no longer within the TCB. In concert with attestation, keys for the protected workload can be securely released only to trusted workloads, allowing customers to run highly sensitive workloads remotely in the cloud.
In this blog post, we will introduce the technical stack that drives the confidential containers on AKS offering.
Confidential Containers on AKS Stack Overview
Normal Kubernetes workloads run inside ‘runc’ isolated containers: standard Linux processes isolated by namespaces and cgroups that share the same operating system kernel. When using Azure Kubernetes Service, these workloads run on AKS ‘node pools’: Kubernetes nodes that are VMs running on the Microsoft Hypervisor on Azure hardware. There are a few choices of VM node OS in AKS, but all of the following advancements are delivered through Azure Linux – Microsoft’s open-source Linux distribution built to power Azure services. Azure Linux is optimized for Azure cloud services and offers a variety of features and benefits for services that run Linux workloads on Azure.
The existing preview offering for VM-isolated container workloads, built on the Azure Linux OS, isolates workloads inside nested utility VMs (UVMs). The aim of this offering is to provide performance similar to regular runc containers, but with stronger workload isolation, since each workload runs inside its own nested utility VM.
The confidential container stack builds on this VM isolation concept by utilizing AMD Secure Encrypted Virtualization-Secure Nested Paging (SEV-SNP) technology to achieve workload confidentiality. Recently, Azure introduced a new confidential child-capable VM series, which allows the VM to create AMD SEV-SNP protected child VMs. This VM size is currently enabled through AKS. By leveraging these VM sizes in our AKS offering, and by minimizing the components inside these confidential VMs through Kata Containers and Confidential Containers, we can deliver an experience that enables container workloads to run efficiently and confidentially inside AMD SEV-SNP child VMs.
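From a user’s point of view, requesting a confidential pod looks like requesting any other Kubernetes pod, just with a Kata-based runtime class selected. Below is a minimal sketch using client-go; the runtime class name and the container image are placeholders invented for this illustration, and may differ from what your AKS cluster exposes.

```go
// Sketch: requesting a confidential pod by selecting a Kata-based runtime
// class. The runtime class name and image below are placeholders.
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder name for the confidential Kata runtime class on the node.
	runtimeClass := "kata-cc"
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "confidential-demo"},
		Spec: corev1.PodSpec{
			RuntimeClassName: &runtimeClass,
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "registry.example/app:1.0", // placeholder image
			}},
		},
	}
	if _, err := client.CoreV1().Pods("default").Create(
		context.Background(), pod, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("confidential pod requested")
}
```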
The illustration below provides an architectural overview of the key components in our confidential containers technology stack from the AKS VM’s view. We will dive into each component in more detail in the upcoming sections.
Figure 1: Components of the AKS Confidential Containers computing stack.
Nested Confidential Virtualization Stack
To start our journey through the overall technical stack, let’s first look at how we achieve nested virtualization of confidential child VMs.
- Microsoft Hypervisor: At the heart of any virtualization stack is the hypervisor. In our scenario, we needed a highly performant hypervisor that also supported AMD SEV-SNP VMs. The Microsoft Hypervisor already has SEV-SNP support and is used at hyperscale in production, so it was a natural choice to use in our nested virtualization stack.
- Azure Linux bootloader: To load the hypervisor and Linux kernel into host VM memory efficiently and securely, we developed a bootloader that is responsible for loading the nested hypervisor into memory and then booting Linux.
- Azure Linux Host Kernel: Next, we needed to make sure our Linux kernel can interact with the Microsoft Hypervisor. This meant building an enhanced Linux kernel that supports managing nested confidential VMs through the Microsoft Hypervisor and AMD SEV-SNP, and adding it to our Azure Linux distribution so it is readily available to the AKS node image composer.
- Cloud Hypervisor: To manage these nested confidential child VMs, a specialized Virtual Machine Monitor (VMM) is needed. We added an enhanced cloud-hypervisor to the Azure Linux distribution that coordinates with the enlightened Azure Linux host kernel to create nested SEV-SNP VMs using the Microsoft Hypervisor.
- Azure Linux UVM Kernel: Finally, let’s look at the guest child VM. Since we expect scenarios where many UVMs will need to spin up and tear down during the lifetime of the host VM, each UVM needs to boot quickly. Therefore, we created an Azure Linux UVM-optimized kernel to minimize boot time and reduce the attack surface.
We integrated all of these components into an enlightened Azure Linux AKS Container Host platform that can spin up nested confidential UVMs.
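To make the boot flow concrete, here is a hedged sketch of launching a UVM with cloud-hypervisor from Go. The kernel and initramfs paths are placeholders, and the additional options that request an AMD SEV-SNP guest on the Microsoft Hypervisor are deliberately omitted, since they are specific to the enlightened Azure Linux build.

```go
// Sketch: launching a minimal child VM with cloud-hypervisor. Paths are
// placeholders; SEV-SNP specific options are omitted on purpose.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("cloud-hypervisor",
		"--kernel", "/opt/uvm/vmlinux", // Azure Linux UVM-optimized kernel (placeholder path)
		"--initramfs", "/opt/uvm/initrd.img", // minimal UVM user space (placeholder path)
		"--cmdline", "console=ttyS0 reboot=k",
		"--cpus", "boot=1",
		"--memory", "size=512M",
	)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```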
Secure Container Runtime: Kata Containers + Confidential Containers
With the nested confidential child VM stack introduced, the next step is to run containerized workloads within these confidential child VMs. Building on our previous success with the open-source Kata Containers project, we chose to join the Confidential Containers community and build upon its work.
Confidential Containers is an open-source community working to enable cloud-native confidential computing by leveraging Trusted Execution Environments (TEEs) to protect containers and data. The community has also invested in the Kata Containers project as the basis for its initial work, due to the VM isolation properties that Kata Containers provides and that many TEE technologies require.
Looking closer, the first key component is the Kata Shim: a binary that containerd calls as a plugin when a VM-isolated or confidential pod is requested by an end user through the node’s kubelet. We introduced support for our nested virtualization stack into the Kata Shim so it can use cloud-hypervisor and the Microsoft Hypervisor on Azure Linux nodes to spin up either a VM-isolated UVM or an AMD SEV-SNP encrypted confidential UVM, as requested.
The other key component is the Kata Agent, which runs inside the UVM and manages the pod. The Kata Shim communicates with the Kata Agent over ttrpc; this channel is how kubelet commands are transmitted and workload responses are received.
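For a feel of what that channel looks like, here is a hedged sketch of a ttrpc call from a host-side component to an agent. The socket path, service name, and method are hypothetical stand-ins, not the actual Kata agent protocol; the real shim reaches the agent over a vsock-style transport with protobuf-defined requests.

```go
// Sketch: a ttrpc client call, standing in for the shim-to-agent channel.
// The socket path and service/method names are hypothetical.
package main

import (
	"context"
	"log"
	"net"

	"github.com/containerd/ttrpc"
	"google.golang.org/protobuf/types/known/emptypb"
)

func main() {
	// A unix socket stands in for the vsock-style transport into the UVM.
	conn, err := net.Dial("unix", "/run/demo/agent.sock") // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	client := ttrpc.NewClient(conn)
	defer client.Close()

	// Hypothetical service and method names, for illustration only.
	var resp emptypb.Empty
	if err := client.Call(context.Background(),
		"demo.AgentService", "Check", &emptypb.Empty{}, &resp); err != nil {
		log.Fatal(err)
	}
	log.Println("agent is reachable")
}
```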
The Kata Containers project also provides code to assemble the UVM itself, which consists of a minimal kernel and a user space environment specifically crafted to run the Kata Agent and container workloads. From a confidential container standpoint, the entirety of the software stack inside the UVM forms the TCB. Our UVM composition is fully open source and can be measured and reproduced by customers. When AKS builds the UVM image, Microsoft signs it, and that signature can be verified through remote attestation.
We have open-sourced our work and upstreamed much of it to the Kata Containers project. We added Azure Linux support to the Kata Containers project, enabling UVMs to be created with Azure Linux’s distribution packages, and we worked with the upstream maintainers to improve the community’s CI test coverage. Microsoft is also sponsoring the Kata Containers CI testing resources within Azure. And the team is not done here: we are currently preparing the patches that add our Kata Shim changes for invoking Azure Linux’s nested virtualization stack to launch nested confidential UVMs.
We just described the container runtime used to deploy and manage confidential containers on AKS via nested confidential VMs and Kata Containers. However, hardware-based encryption alone is not sufficient to achieve confidentiality: unprotected I/O and control channels into the TEE need to be systematically secured, otherwise data can be exfiltrated from the TCB.
We will cover two central contributions to the security design that provide the confidentiality guarantees of this solution: Security Policy and Container Image Snapshotter.
Security Policy
As mentioned earlier, the Kata Agent executes within the UVM environment, i.e., inside the hardware-based TEE, and is therefore part of the TCB. The agent provides a set of ttrpc APIs allowing the system components outside of the TEE to create and manage CVM-based Kubernetes pods. With this design, the underlying pod implementation remains transparent to the Kubernetes stack. From a confidentiality standpoint, the shim-to-agent communication represents a control channel crossing the TCB boundary, so the agent must protect itself from untrusted API calls. This self-protection is implemented using a Security Policy that can be fully attested and that is specified by the owners of the confidential pods. Each confidential pod can be annotated with a policy document that contains the rules and data corresponding to that pod, expressed in the industry-standard Rego policy language. The policy is enforced inside the UVM using the Open Policy Agent (OPA), a Graduated project of the Cloud Native Computing Foundation (CNCF).
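To illustrate the enforcement model, here is a minimal sketch that evaluates a toy Rego rule with OPA’s Go API. The rule name mirrors the agent API it guards, but the policy body and input shape are invented for this example; real generated policies cover every agent API with much richer request data.

```go
// Sketch: deny-by-default Rego enforcement, in the spirit of the agent's
// policy checks. The policy body and input shape are invented.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/open-policy-agent/opa/rego"
)

const module = `
package agent_policy
import rego.v1

# Deny by default; allow container creation only for the expected image.
default CreateContainerRequest := false

CreateContainerRequest := true if input.image == "registry.example/app:1.0"
`

func allowed(image string) bool {
	r := rego.New(
		rego.Query("data.agent_policy.CreateContainerRequest"),
		rego.Module("policy.rego", module),
		rego.Input(map[string]interface{}{"image": image}),
	)
	rs, err := r.Eval(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	return rs.Allowed()
}

func main() {
	fmt.Println(allowed("registry.example/app:1.0"))  // true: request permitted
	fmt.Println(allowed("registry.example/evil:1.0")) // false: request rejected
}
```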
We implemented the genpolicy tool, which lives in the Kata Containers tree. Customers can use the tool to generate the policy document from their standard Kubernetes pod manifest and to attach the document to that manifest as an annotation. The generated policy document describes all the calls to the agent’s ttrpc API that are expected while creating and managing the respective pod.
Container creation is rejected by the agent’s policy enforcement when a command line, storage mount, execution security context, or environment variable that violates the given rules is detected.
We will present the security policy feature in more detail in an upcoming blog post. In that post, we will describe in particular how the policy document is sent to the Kata Agent during early pod creation, how the document becomes part of the TEE measurement, and how trust is established in the policy document through attestation.
Container Image Snapshotter
A property of the baseline Confidential Containers stack is that container images are pulled from within the UVM. From a resource utilization standpoint, these container image layers would be cached within the UVM’s memory-backed local filesystem, raising concerns about UVM memory limits. Additionally, from a confidentiality standpoint, pulling container images within the UVM significantly increases the size of the software inside the TCB. To address both concerns, we introduced a new tardev-snapshotter to the Confidential Containers stack that pulls the container image layers for pods on the container host, outside the TCB, so that the layers can now be shared between pods.
The tardev-snapshotter is implemented as a containerd plugin that pulls and manages the container image layers when a Kata pod, confidential or non-confidential, is created. Each container layer is exposed as a read-only virtio block device to the respective UVM(s). We protect the integrity of those block devices using the Linux kernel’s dm-verity technology. For each container image layer, we include the expected root hash of the dm-verity hash tree in the policy document, and the Kata Agent enforces this policy at runtime.
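For readers curious about the plumbing: containerd talks to out-of-process snapshotters over a gRPC ‘proxy plugin’ socket. The sketch below wires up such a service in Go using containerd’s snapshotservice helper; the socket path is arbitrary, and a stock native snapshotter stands in for tardev-snapshotter’s real logic of exposing layers as dm-verity protected block devices.

```go
// Sketch: serving a snapshotter to containerd as a proxy plugin. A native
// snapshotter stands in for tardev-snapshotter's block device logic.
package main

import (
	"log"
	"net"

	snapshotsapi "github.com/containerd/containerd/api/services/snapshots/v1"
	"github.com/containerd/containerd/contrib/snapshotservice"
	"github.com/containerd/containerd/snapshots/native"
	"google.golang.org/grpc"
)

func main() {
	// Stand-in snapshotter; the real plugin would expose each pulled layer
	// as a read-only, dm-verity protected virtio block device.
	sn, err := native.NewSnapshotter("/var/lib/demo-snapshotter")
	if err != nil {
		log.Fatal(err)
	}

	srv := grpc.NewServer()
	snapshotsapi.RegisterSnapshotsServer(srv, snapshotservice.FromSnapshotter(sn))

	// containerd is pointed at this socket via a proxy_plugins config entry.
	l, err := net.Listen("unix", "/run/demo-snapshotter.sock")
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(srv.Serve(l))
}
```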
When the container inside the pod is started, the shim requests that cloud-hypervisor map the corresponding layers into the UVM. Each mounted block device represents a container image layer, and each layer is a tar file. The agent uses our new tarfs kernel module to mount the dm-verity target block devices as ‘tarfs’ filesystem mounts, which ultimately provide the container filesystem.
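As a rough sketch of that final step, the mount inside the UVM could look like the following. The filesystem type string, device path, and mount point are all assumptions made for illustration; the real agent derives them from the policy-verified dm-verity devices.

```go
// Sketch: mounting a verified layer via the tarfs module. The fstype name,
// device path, and mount point are assumptions for illustration.
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	dev := "/dev/mapper/layer0"         // assumed dm-verity mapped device
	target := "/run/demo/layers/layer0" // assumed mount point

	if err := os.MkdirAll(target, 0o755); err != nil {
		log.Fatal(err)
	}
	// The dm-verity target exposes the layer tar file as a read-only block
	// device; tarfs then interprets the tar structure as a filesystem.
	if err := unix.Mount(dev, target, "tar", unix.MS_RDONLY, ""); err != nil {
		log.Fatalf("mounting tarfs layer: %v", err)
	}
	log.Println("layer mounted; an overlay of such layers forms the rootfs")
}
```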
The container image snapshotter feature goes hand in hand with security policy. The genpolicy tool downloads the container image layers for each of the containers specified by the input Kubernetes pod manifest and calculates the dm-verity root hash value for each of the layers. This way, each mapped container image layer becomes part of the TCB measurement.
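Conceptually, the policy data carries one expected root hash per layer, and a measured device is admitted only if its hash appears there. The sketch below shows that matching logic with invented field names and hash values.

```go
// Sketch: matching a measured dm-verity root hash against policy data.
// Struct shape and hash values are invented for illustration.
package main

import "fmt"

// policyLayer mirrors the idea that genpolicy records one expected
// dm-verity root hash per container image layer.
type policyLayer struct {
	digest   string // content digest of the layer tarball
	rootHash string // expected dm-verity root hash of its block device
}

func layerAllowed(measured string, allowed []policyLayer) bool {
	for _, l := range allowed {
		if l.rootHash == measured {
			return true
		}
	}
	return false
}

func main() {
	policy := []policyLayer{
		{digest: "sha256:aaaa", rootHash: "4a5b"}, // invented values
	}
	fmt.Println(layerAllowed("4a5b", policy)) // true: layer admitted
	fmt.Println(layerAllowed("ffff", policy)) // false: container rejected
}
```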
We will present the snapshotter feature in more detail in an upcoming blog post, so stay tuned.
Building Attestation Services on the Confidential Containers Stack
While this blog post is mainly focused on the Azure Linux nodes that drive confidential containers, we want to give a brief overview of the technology that can be built on top of them. An integral part of any confidential computing stack is the ability to build attestation services, and our offering is no exception. Customers can use attestation to verify that their workloads run as TEE-isolated UVMs and to check that the software stack, including the containers inside the UVM, is the stack they expect. While our confidential container stack allows any generic attestation client and service to be built, our partner teams have built an attestation client that uses the existing Microsoft Azure Attestation service. The Azure attestation client reads the signed AMD SEV-SNP attestation report along with the UVM signatures and measurements, and drives remote attestation with the Azure attestation service, allowing for scenarios such as secure key release or mounting encrypted filesystems.
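To make the flow tangible, here is a high-level sketch of a secure key release sequence. Every function is a stub standing in for a real component: the hardware report ioctl, the attestation service round trip, and the key service; none of this reflects the actual Azure attestation client’s API.

```go
// Sketch: the shape of a secure key release flow. All functions are stubs
// standing in for real hardware and service interactions.
package main

import (
	"errors"
	"fmt"
)

// fetchSNPReport stands in for reading the signed AMD SEV-SNP attestation
// report from the guest (in reality, ioctls against a guest device).
func fetchSNPReport(nonce []byte) ([]byte, error) {
	return append([]byte("snp-report:"), nonce...), nil // stubbed
}

// remoteAttest stands in for sending the report plus UVM signatures and
// measurements to an attestation service and receiving a signed token.
func remoteAttest(report []byte) (string, error) {
	if len(report) == 0 {
		return "", errors.New("empty report")
	}
	return "attestation-token", nil // stubbed
}

// releaseKey stands in for a key service that releases a secret only to a
// caller presenting a valid attestation token.
func releaseKey(token string) ([]byte, error) {
	if token != "attestation-token" {
		return nil, errors.New("attestation failed; key withheld")
	}
	return []byte("filesystem-encryption-key"), nil // stubbed
}

func main() {
	report, err := fetchSNPReport([]byte("fresh-nonce"))
	if err != nil {
		panic(err)
	}
	token, err := remoteAttest(report)
	if err != nil {
		panic(err)
	}
	key, err := releaseKey(token)
	if err != nil {
		panic(err)
	}
	fmt.Printf("released %d-byte key; ready to mount the encrypted filesystem\n", len(key))
}
```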
Future Work
With our confidential containers on AKS offering now in preview and available to try, we are invested in learning how Azure customers will use confidential containers to enable confidential workloads on AKS.
The Azure Linux team plans to expand our TEE support with Intel TDX and confidential GPUs, invest in ways to reduce any performance gap between runc pods and confidential pods, and enable container streaming support in our snapshotter, all while continuing to design and collaborate in the open with the upstream Kata Containers and Confidential Containers communities.