Everything you want to know about ephemeral OS disks and Azure Kubernetes Service (AKS)


This article provides an in-depth analysis of the available configuration settings for ephemeral OS disks in Azure Kubernetes Service (AKS). With ephemeral OS disks, you see lower read/write latency on the OS disk of AKS agent nodes, since the disk is locally attached. You also get faster cluster operations, such as scale or upgrade, thanks to faster re-imaging and boot times. You can use the Bicep modules in this GitHub repository to deploy an AKS cluster and repeat the tests described in this article.

 

Ephemeral OS disks for Azure VMs

Ephemeral OS disks are created on the local virtual machine (VM) storage and not saved to remote Azure Storage, as is the case with managed OS disks. For more information on the performance of a managed disk, see Disk allocation and performance. Ephemeral OS disks work well for stateless workloads, where applications are tolerant of individual VM failures but are more affected by VM deployment time or the reimaging of individual VM instances. With ephemeral OS disks, you get lower read/write latency to the OS disk and faster VM reimage. The key features of ephemeral disks are the following:

 

  • Ideal for stateless applications and workloads.
  • Supported by the Azure Marketplace, custom images, and Azure Compute Gallery.
  • Ability to fast reset or reimage virtual machines and scale set instances to the original boot state.
  • Lower latency, similar to a temporary disk.
  • Ephemeral OS disks are free; you incur no storage cost for OS disks.
  • Available in all Azure regions.

 

The following table summarizes the main differences between persistent and ephemeral OS disks:

 

 

|  | Persistent OS Disk | Ephemeral OS Disk |
| --- | --- | --- |
| Size limit for OS disk | 2 TiB | Cache size or temp size for the VM size, or 2040 GiB, whichever is smaller. For the cache or temp size in GiB, see DS, ES, M, FS, and GS |
| VM sizes supported | All | VM sizes that support Premium storage, such as DSv1, DSv2, DSv3, Esv3, Fs, FsV2, GS, M, Mdsv2, Bs, Dav4, Eav4 |
| Disk type support | Managed and unmanaged OS disk | Managed OS disk only |
| Region support | All regions | All regions |
| Data persistence | Data written to the OS disk is stored in Azure Storage | Data written to the OS disk is stored on local VM storage and isn't persisted to Azure Storage |
| Stop-deallocated state | VMs and scale set instances can be stop-deallocated and restarted from the stop-deallocated state | Not supported |
| Specialized OS disk support | Yes | No |
| OS disk resize | Supported during VM creation and after the VM is stop-deallocated | Supported during VM creation only |
| Resizing to a new VM size | OS disk data is preserved | Data on the OS disk is deleted; the OS is reprovisioned |
| Redeploy | OS disk data is preserved | Data on the OS disk is deleted; the OS is reprovisioned |
| Stop/Start of VM | OS disk data is preserved | Not supported |
| Page file placement | For Windows, the page file is stored on the resource disk | For Windows, the page file is stored on the OS disk (for both OS cache placement and temp disk placement) |
| Maintenance of VM/VMSS using healing | OS disk data is preserved | OS disk data is not preserved |
| Maintenance of VM/VMSS using Live Migration | OS disk data is preserved | OS disk data is preserved |

 

Placement options for Ephemeral OS disks

You can store ephemeral OS disks on the virtual machine's OS cache disk or on the temporary storage SSD (also known as the resource disk). When deploying a virtual machine or a virtual machine scale set, you can use the DiffDiskPlacement property to specify where to place the ephemeral OS disk: the cache disk (CacheDisk) or the resource disk (ResourceDisk).
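For example, here is a minimal Azure CLI sketch of a standalone VM deployment that places the ephemeral OS disk on the resource disk; the VM name, resource group, and admin user are illustrative placeholders:

az vm create \
  --name myVm \
  --resource-group myResourceGroup \
  --image Ubuntu2204 \
  --size Standard_E4bds_v5 \
  --ephemeral-os-disk true \
  --ephemeral-os-disk-placement ResourceDisk \
  --os-disk-caching ReadOnly \
  --admin-username azureuser \
  --generate-ssh-keys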

 

Size requirements

As mentioned above, you can choose to deploy ephemeral OS disks on the VM cache or VM temp disk. The image OS disk's size should be less than or equal to the temp/cache size of the chosen VM size. For example, if you opt for OS cache placement, the standard Windows Server images from the Marketplace are about 127 GiB, meaning that you need a VM size with a cache equal to or larger than 127 GiB. The Standard_DS3_v2 has a cache size of 172 GiB, which is large enough; in this case, the Standard_DS3_v2 is the smallest size in the DSv2 series that you can use with this image.

 

If you opt for temp disk placement instead: the standard Ubuntu Server image from the Marketplace is about 30 GiB, so the temp disk must be equal to or larger than 30 GiB to enable an ephemeral OS disk on the temporary storage. Standard_B4ms has a temporary storage size of 32 GiB, which can fit the 30 GiB OS disk. Upon creation of the VM, the remaining temp disk space would be 2 GiB.

 

If you place the ephemeral OS disk on the temporary storage disk, the final size of the temporary disk equals the initial temporary disk size minus the OS image size. In addition, the ephemeral OS disk shares its IOPS with the temporary storage disk, as per the VM size you selected. Ephemeral OS disks also require a VM size that supports Premium storage. These sizes usually have an s in the name, like DSv2 and EsV3. For more information, see Azure VM sizes for details about which sizes support Premium storage.
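You can check whether a given VM size supports ephemeral OS disks, and how large its cache and temporary disk are, with the az vm list-skus command. Here is a sketch; the region is an illustrative placeholder, and the CachedDiskBytes and MaxResourceVolumeMB capabilities report the cache and temp disk sizes in bytes and MiB, respectively:

az vm list-skus \
  --location westeurope \
  --size Standard_DS3_v2 \
  --query "[0].capabilities[?name=='EphemeralOSDiskSupported' || name=='CachedDiskBytes' || name=='MaxResourceVolumeMB']" \
  --output table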

 


Note

Ephemeral OS disks are not accessible through the Azure portal: you will receive a "Resource not found" or "404" error when trying to access one.

 

Unsupported features

Ephemeral OS disks do not support the following features:

 

  • Capturing VM images
  • Disk snapshots
  • Azure Disk Encryption
  • Azure Backup
  • Azure Site Recovery
  • OS Disk Swap

 

AKS and Ephemeral OS disks

Azure automatically replicates the data stored in the managed OS disk of a virtual machine to Azure Storage to avoid data loss in case the virtual machine needs to be relocated to another host. Generally speaking, containers are not designed to persist local state to the OS disk, so this behavior offers limited value to AKS-hosted workloads while introducing some drawbacks, such as slower node provisioning and higher read/write latency. There are a few exceptions where Kubernetes pods may need to persist data to the local storage of the agent nodes:

 

  • EmptyDir: an emptyDir volume is created when a pod is assigned to an agent node and exists as long as that pod is running on that node. As the name suggests, the emptyDir volume is initially empty. All containers in the pod can read and write the same files in the emptyDir volume, even though the volume can be mounted at the same or different paths in each container. When a pod is removed from a node, the data in the emptyDir is deleted permanently. EmptyDir volumes can be used in the following scenarios:
    • Checkpointing long computation or data sorting for recovery from crashes
    • Temporary storage area for application logs

 

Depending on your environment, emptyDir volumes are stored on whatever storage backs the agent nodes: a managed disk, a temporary storage SSD, or network storage. As we will see in the remainder of this article, AKS provides options to store emptyDir volumes on the OS disk or on the temporary disk of an agent node.

  • HostPath: a hostPath volume mounts a file or directory from the host agent node's filesystem into a pod. HostPath volumes present many security risks, and it is a best practice to avoid using this kind of volume whenever possible. When a hostPath volume must be used, it should be scoped to only the required file or directory and mounted as read-only. Here are a few situations where a hostPath volume can be useful (a minimal example follows the list):
    • Running a container that needs access to Docker internals; use a hostPath of /var/lib/docker
    • Running cAdvisor in a container; use a hostPath of /sys
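As a hypothetical sketch, the following pod mounts /sys from the agent node as a read-only hostPath volume; the pod, container, and volume names are illustrative:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-demo
spec:
  containers:
  - name: reader
    image: busybox
    # List the host's /sys tree, then keep the pod alive for inspection.
    command: ['sh', '-c', 'ls /host-sys ; sleep 3600']
    volumeMounts:
    - name: host-sys
      mountPath: /host-sys
      readOnly: true
  volumes:
  - name: host-sys
    hostPath:
      path: /sys
      type: Directory
EOF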

 

Ephemeral OS disks are stored only on the host machine, hence they provide lower read/write latency, along with faster node scaling and cluster upgrades.

When a user does not explicitly request managed OS disks (e.g., using the --node-osdisk-type Managed parameter in an az aks create or az aks nodepool add command), AKS defaults to ephemeral OS disks whenever possible for a given node pool configuration. The first prerequisite for using ephemeral OS disks is choosing a VM series that supports the feature; the second is making sure that the OS disk can fit in the VM cache or temporary storage SSD. Let's look at a couple of examples with two different VM series:

 

DSv2-series

The general-purpose DSv2-series supports Premium Storage and Premium Storage caching, and is therefore eligible for ephemeral OS disks.

 

 

This VM series provides both a VM cache and a temporary storage SSD. High-scale VMs like the DSv2-series that leverage Azure Premium Storage have a multi-tier caching technology called BlobCache, which uses a combination of the host RAM and local SSD for caching. This cache is available for Premium Storage persistent disks and VM local disks, and it can be used to host an ephemeral OS disk. When a VM series supports the VM cache, the cache size depends on the VM series and VM size; in the following table, it is indicated in parentheses next to the IO throughput ("cache size in GiB").

 

 

| Size | vCPU | Memory: GiB | Temp storage (SSD) GiB | Max data disks | Max cached and temp storage throughput: IOPS/MBps (cache size in GiB) | Max uncached disk throughput: IOPS/MBps | Max NICs | Expected network bandwidth (Mbps) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard_DS1_v2 | 1 | 3.5 | 7 | 4 | 4000/32 (43) | 3200/48 | 2 | 750 |
| Standard_DS2_v2 | 2 | 7 | 14 | 8 | 8000/64 (86) | 6400/96 | 2 | 1500 |
| Standard_DS3_v2 | 4 | 14 | 28 | 16 | 16000/128 (172) | 12800/192 | 4 | 3000 |
| Standard_DS4_v2 | 8 | 28 | 56 | 32 | 32000/256 (344) | 25600/384 | 8 | 6000 |
| Standard_DS5_v2 | 16 | 56 | 112 | 64 | 64000/512 (688) | 51200/768 | 8 | 12000 |

 

Using the AKS default VM size Standard_DS2_v2 with the default OS disk size of 100 GiB as an example: this VM size supports ephemeral OS disks but only has 86 GiB of cache. This configuration defaults to managed OS disks if the user does not explicitly specify an OS disk type. If a user explicitly requested ephemeral OS disks, they would receive a validation error.

If a user requests the same Standard_DS2_v2 with a 60 GiB OS disk, the configuration defaults to ephemeral OS disks: the requested size of 60 GiB is smaller than the maximum cache size of 86 GiB.

Using Standard_D8s_v3 with a 100 GiB OS disk: this VM size supports ephemeral OS disks and has 200 GiB of VM cache, so if the user does not specify the OS disk type, the node pool receives ephemeral OS disks by default.

When using the Azure CLI to create an AKS cluster or add a node pool to an existing cluster, ephemeral OS disks require at least version 2.15.0 of the Azure CLI.
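For example, the following sketch checks the CLI version and then explicitly requests an ephemeral OS disk small enough to fit the 86 GiB cache of Standard_DS2_v2; the node pool, cluster, and resource group names are illustrative:

# Ephemeral OS disks require Azure CLI 2.15.0 or later
az --version | head -1

az aks nodepool add \
  --name ephpool \
  --cluster-name myAksCluster \
  --resource-group myResourceGroup \
  --node-vm-size Standard_DS2_v2 \
  --node-osdisk-type Ephemeral \
  --node-osdisk-size 60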

 

Ebdsv5-series

The memory-optimized Ebdsv5-series supports Premium Storage and ephemeral OS disks hosted on the temporary storage.

 

 

Unlike older series, the latest-generation VM series, such as Ebdsv5, don't provide both a VM cache and a temporary storage disk; they only provide a larger temporary storage, as shown in the following table.

 

| Size | vCPU | Memory: GiB | Temp storage (SSD) GiB | Max data disks | Max temp storage throughput: IOPS/MBps | Max uncached storage throughput: IOPS/MBps | Max burst uncached disk throughput: IOPS/MBps | Max NICs | Network bandwidth (Mbps) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard_E2bds_v5 | 2 | 16 | 75 | 4 | 9000/125 | 5500/156 | 10000/1200 | 2 | 10000 |
| Standard_E4bds_v5 | 4 | 32 | 150 | 8 | 19000/250 | 11000/350 | 20000/1200 | 2 | 10000 |
| Standard_E8bds_v5 | 8 | 64 | 300 | 16 | 38000/500 | 22000/625 | 40000/1200 | 4 | 10000 |
| Standard_E16bds_v5 | 16 | 128 | 600 | 32 | 75000/1000 | 44000/1250 | 64000/2000 | 8 | 12500 |
| Standard_E32bds_v5 | 32 | 256 | 1200 | 32 | 150000/1250 | 88000/2500 | 120000/4000 | 8 | 16000 |
| Standard_E48bds_v5 | 48 | 384 | 1800 | 32 | 225000/2000 | 120000/4000 | 120000/4000 | 8 | 16000 |
| Standard_E64bds_v5 | 64 | 512 | 2400 | 32 | 300000/4000 | 120000/4000 | 120000/4000 | 8 | 20000 |

 

Using the Standard_E2bds_v5 with the default OS disk size of 100 GiB as an example: this VM size supports ephemeral OS disks but only has 75 GiB of temporary storage. This configuration defaults to managed OS disks if the user does not explicitly specify an OS disk type. If a user explicitly requested ephemeral OS disks, they would receive a validation error.

If a user requests the same Standard_E2bds_v5 with a 60 GiB OS disk, the configuration defaults to ephemeral OS disks: the requested size of 60 GiB is smaller than the maximum temporary storage of 75 GiB.

Using Standard_E4bds_v5 with a 100 GiB OS disk: this VM size supports ephemeral OS disks and has 150 GiB of temporary storage, so if the user does not specify the OS disk type, the node pool receives ephemeral OS disks by default.

 

Use Ephemeral OS on new clusters

You can configure an AKS cluster to use ephemeral OS disks at provisioning time. For example, when creating a new cluster with the Azure CLI, you can use the --node-osdisk-type Ephemeral parameter in an az aks create command, as shown below:

 

az aks create \
  --name myAksCluster \
  --resource-group myResourceGroup \
  --node-vm-size Standard_DS3_v2 \
  --node-osdisk-type Ephemeral

 

If you want to create a regular cluster using managed OS disks, you can do so by specifying --node-osdisk-type Managed.

 

Use Ephemeral OS on existing clusters

You can also configure a new node pool to use ephemeral OS disks at provisioning time. For example, when adding a node pool to an existing cluster with the Azure CLI, you can use the --node-osdisk-type Ephemeral parameter in an az aks nodepool add command, as shown below:

 

az aks nodepool add \
  --name myNodePool \
  --cluster-name myAksCluster \
  --resource-group myResourceGroup \
  --node-vm-size Standard_DS3_v2 \
  --node-osdisk-type Ephemeral

 

osDiskType and kubeletDiskType

As we have seen so far, when creating a new AKS cluster or adding a new node pool to an existing cluster, you can use the osDiskType parameter to specify the OS disk type:

 

  • Ephemeral (default): the OS disk is created as an ephemeral OS disk in the VM cache or temporary storage, depending on the selected VM series and size.
  • Managed: the OS disk is created as a network-attached managed disk.

 

Another setting that you can specify is kubeletDiskType. This parameter determines the placement of emptyDir volumes, container runtime data root, and Kubelet ephemeral storage.

 

  • OS (default): emptyDir volumes, the container runtime data root, and kubelet ephemeral storage are hosted on the OS disk, no matter whether it is managed or ephemeral.
  • Temporary: emptyDir volumes, the container runtime data root, and kubelet ephemeral storage are hosted on the temporary storage (a sketch follows this list).
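As a hedged sketch (the --kubelet-disk-type parameter may require a recent Azure CLI version or the aks-preview extension, and the names are illustrative), a node pool that keeps the OS on a managed disk but moves kubelet and containerd data to the temporary storage could be created like this:

az aks nodepool add \
  --name temppool \
  --cluster-name myAksCluster \
  --resource-group myResourceGroup \
  --node-vm-size Standard_E16bds_v5 \
  --node-osdisk-type Managed \
  --kubelet-disk-type Temporary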

 

I conducted some tests, trying out all the possible combinations of the kubeletDiskType and osDiskType values to understand how the location of container images, emptyDir volumes, and container logs varies depending on the selection.

I created four different AKS clusters with the Standard_D4s_v3 VM size and four AKS clusters with the Standard_E16bds_v5 VM size and conducted some tests. Here are the results.

 

Dsv3-series

The Dsv3-series supports Premium Storage, Premium Storage caching, and ephemeral OS disks.

 


As you can see in the following table, the Standard_D4s_v3 VM size has a temporary storage of 32 GiB and a VM cache of 100 GiB.

 

 

| Size | vCPU | Memory: GiB | Temp storage (SSD) GiB | Max data disks | Max cached and temp storage throughput: IOPS/MBps (cache size in GiB) | Max burst cached and temp storage throughput: IOPS/MBps | Max uncached disk throughput: IOPS/MBps | Max burst uncached disk throughput: IOPS/MBps | Max NICs / Expected network bandwidth (Mbps) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Standard_D2s_v3 | 2 | 8 | 16 | 4 | 4000/32 (50) | 4000/200 | 3200/48 | 4000/200 | 2/1000 |
| Standard_D4s_v3 | 4 | 16 | 32 | 8 | 8000/64 (100) | 8000/200 | 6400/96 | 8000/200 | 2/2000 |
| Standard_D8s_v3 | 8 | 32 | 64 | 16 | 16000/128 (200) | 16000/400 | 12800/192 | 16000/400 | 4/4000 |
| Standard_D16s_v3 | 16 | 64 | 128 | 32 | 32000/256 (400) | 32000/800 | 25600/384 | 32000/800 | 8/8000 |
| Standard_D32s_v3 | 32 | 128 | 256 | 32 | 64000/512 (800) | 64000/1600 | 51200/768 | 64000/1600 | 8/16000 |
| Standard_D48s_v3 | 48 | 192 | 384 | 32 | 96000/768 (1200) | 96000/2000 | 76800/1152 | 80000/2000 | 8/24000 |
| Standard_D64s_v3 | 64 | 256 | 512 | 32 | 128000/1024 (1600) | 128000/2000 | 80000/1200 | 80000/2000 | 8/30000 |

 

Here are the results of my tests with the Standard_D4s_v3 VM size. Let's review all the possible combinations of the kubeletDiskType and osDiskType values and how the location of container images, emptyDir volumes, and container logs varies with each selection.

 

 

osDiskType: Managed, kubeletDiskType: OS

The root directory / is hosted by the managed disk. This includes the /var/lib/kubelet directory that contains kubelet data and /var/lib/containerd directory that contains container images. The managed disk hosts the OS, emptyDir volumes, writeable layers, container images, and logs. You can run the lsblk command to list the block devices attached to the agent node VM.

azadmin@aks-managed-22502302-vmss000000:~$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0   100G  0 disk
├─sda1    8:1    0 99.9G  0 part /
├─sda14   8:14   0    4M  0 part
└─sda15   8:15   0  106M  0 part /boot/efi
sdb       8:16   0  32G  0 disk
└─sdb1    8:17   0  32G  0 part /mnt
sr0      11:0    1  728K  0 rom

 

The sda device is the managed disk, while the sdb device is the local temporary storage SSD. You can run the ls -alF /mnt command to list the files and directories under the temporary storage SSD.

 

azadmin@aks-managed-22502302-vmss000000:~$ ls -alF /mnt
total 32
drwxr-xr-x  5 root root  4096 Jun  1 09:36 ./
drwxr-xr-x 22 root root  4096 Jun  1 09:35 ../
drwxr-xr-x  2 root root  4096 Jun  1 09:36 containers/
drwxr-xr-x  2 root root  4096 Jun  1 09:36 docker/
drwx------  2 root root 16384 Jun  1 09:35 lost+found/

 

The emptyDir volume for a pod is in a directory under /var/lib/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/ on the managed disk sda. The total size for the kubelet data (including emptyDir volumes) and containerd data (e.g., container images) is equal to the total size of the managed disk, 100 GiB in this test, minus the space occupied by the Linux OS and other packages.

 

 

osDiskType: Managed, kubeletDiskType: Temporary

The root directory / is hosted by the managed disk. You can run the lsblk command to list the block devices attached to the agent node VM.

azadmin@aks-user-89377075-vmss000000:~$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0  100G  0 disk
├─sda1    8:1    0 99.9G  0 part /
├─sda14   8:14   0    4M  0 part
└─sda15   8:15   0  106M  0 part /boot/efi
sdb       8:16   0   32G  0 disk
└─sdb1    8:17   0   32G  0 part /mnt
sr0      11:0    1  732K  0 rom

 

The sda device is the managed disk, while the sdb device is the local temporary storage SSD. You can run the ls -alF /mnt command to list the files and directories under the temporary storage SSD.

 

azadmin@aks-user-89377075-vmss000000:~$ ls -alF /mnt
total 36
drwxr-xr-x  6 root root  4096 Jun  1 10:20 ./
drwxr-xr-x 22 root root  4096 Jun  1 10:19 ../
drwx--x--x  4 root root  4096 Jun  1 10:19 aks/
drwxr-xr-x  2 root root  4096 Jun  1 10:20 containers/
drwxr-xr-x  2 root root  4096 Jun  1 10:20 docker/
drwx------  2 root root 16384 Jun  1 10:19 lost+found/

 

You can install and run the tree command to see the files and directories under the /mnt/aks directory in the temporary storage.

 

azadmin@aks-user-89377075-vmss000000:~$ tree /mnt/aks -L 2
/mnt/aks
├── containers
│   ├── io.containerd.content.v1.content
│   ├── io.containerd.grpc.v1.cri
│   ├── io.containerd.metadata.v1.bolt
│   ├── io.containerd.runtime.v1.linux
│   ├── io.containerd.runtime.v2.task
│   ├── io.containerd.snapshotter.v1.aufs
│   ├── io.containerd.snapshotter.v1.btrfs
│   ├── io.containerd.snapshotter.v1.native
│   ├── io.containerd.snapshotter.v1.overlayfs
│   └── tmpmounts
└── kubelet
    ├── bootstrap-kubeconfig
    ├── cpu_manager_state
    ├── device-plugins
    ├── kubeconfig
    ├── memory_manager_state
    ├── pki
    ├── plugins
    ├── plugins_registry
    ├── pod-resources
    └── pods

 

/var/lib/kubelet is a bind mount of the /mnt/aks/kubelet directory. Likewise, /var/lib/containerd is a bind mount of the /mnt/aks/containers directory.
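You can verify these bind mounts on the agent node with findmnt, which ships with util-linux on the Ubuntu node image (a quick sketch):

findmnt --target /var/lib/kubelet
findmnt --target /var/lib/containerd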

The emptyDir volume for any pod is located in a directory under /var/lib/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/, which is a mount point of the /mnt/aks/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/ directory on the sdb device hosted by the local temporary storage. The total size for the kubelet data (including emptyDir volumes) and containerd data (including container images) is equal to the total size of the sdb device hosted by the temporary storage, which is ~32 GiB in this test.

 

osDiskType: Ephemeral, kubeletDiskType: OS

The root directory / is hosted by the ephemeral disk in the VM cache. This includes the /var/lib/kubelet directory that contains the kubelet data, and /var/lib/containerd directory that contains container images. Hence the OS, emptyDir volumes, writeable layers, container images, and logs are hosted by the ephemeral disk. You can run the lsblk command to list the block devices attached to the agent node VM.

azadmin@aks-user-28564134-vmss000000:~$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0  100G  0 disk
├─sda1    8:1    0 99.9G  0 part /
├─sda14   8:14   0    4M  0 part
└─sda15   8:15   0  106M  0 part /boot/efi
sdb       8:16   0   32G  0 disk
└─sdb1    8:17   0   32G  0 part /mnt
sr0      11:0    1  726K  0 rom

 

The sda device is the ephemeral disk in the VM cache, while the sdb device is the temporary storage SSD. You can run the ls -alF /mnt command to list the files and directories under the temporary storage SSD.

 

azadmin@aks-user-28564134-vmss000000:~$ ls -alF /mnt
total 32
drwxr-xr-x  5 root root  4096 May 13 10:32 ./
drwxr-xr-x 22 root root  4096 May 27 06:34 ../
drwxrwxrwx  2 root root  4096 May 13 10:32 containers/
drwxrwxrwx  2 root root  4096 May 13 10:32 docker/
drwxrwxrwx  2 root root 16384 May 13 10:31 lost+found/

 

The emptyDir volume for a pod is in a directory under /var/lib/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/ on the ephemeral disk sda hosted in the VM cache. The total size for the kubelet data (including emptyDir volumes) and containerd data (e.g., container images) is equal to the total size of the ephemeral disk, 100 GiB in this test, minus the space occupied by the Linux operating system and other packages.

osDiskType: Ephemeral, kubeletDiskType: Temporary

The root directory / is hosted by the ephemeral disk in the VM cache. You can run the lsblk command to list the block devices attached to the agent node VM.

azadmin@aks-user-23473020-vmss000000:~$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0  100G  0 disk
├─sda1    8:1    0 99.9G  0 part /
├─sda14   8:14   0    4M  0 part
└─sda15   8:15   0  106M  0 part /boot/efi
sdb       8:16   0   32G  0 disk
└─sdb1    8:17   0   32G  0 part /mnt
sr0      11:0    1  732K  0 rom

 

The sda device is the ephemeral disk in the VM cache, while the sdb device is the temporary storage SSD. You can run the ls -alF /mnt command to list the files and directories under the temporary storage SSD.

azadmin@aks-user-23473020-vmss000000:~$ ls -alF /mnt
total 36
drwxr-xr-x  6 root root  4096 Jun  1 09:29 ./
drwxr-xr-x 22 root root  4096 Jun  1 09:27 ../
drwx--x--x  4 root root  4096 Jun  1 09:28 aks/
drwxr-xr-x  2 root root  4096 Jun  1 09:29 containers/
drwxr-xr-x  2 root root  4096 Jun  1 09:29 docker/
drwx------  2 root root 16384 Jun  1 09:27 lost+found/

 

You can install and run the tree command to see the files and directories under the /mnt/aks directory in the temporary storage.

 

azadmin@aks-user-23473020-vmss000000:~$ tree /mnt/aks -L 2
/mnt/aks
├── containers
│   ├── io.containerd.content.v1.content
│   ├── io.containerd.grpc.v1.cri
│   ├── io.containerd.metadata.v1.bolt
│   ├── io.containerd.runtime.v1.linux
│   ├── io.containerd.runtime.v2.task
│   ├── io.containerd.snapshotter.v1.aufs
│   ├── io.containerd.snapshotter.v1.btrfs
│   ├── io.containerd.snapshotter.v1.native
│   ├── io.containerd.snapshotter.v1.overlayfs
│   └── tmpmounts
└── kubelet
    ├── bootstrap-kubeconfig
    ├── cpu_manager_state
    ├── device-plugins
    ├── kubeconfig
    ├── memory_manager_state
    ├── pki
    ├── plugins
    ├── plugins_registry
    ├── pod-resources
    └── pods

 

/var/lib/kubelet is a bind mount of the /mnt/aks/kubelet directory. Likewise, /var/lib/containerd is a bind mount of the /mnt/aks/containers directory.

The emptyDir volume for any pod is located in a directory under /var/lib/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/, which is a mount point of the /mnt/aks/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/ directory on the sdb device hosted by the local temporary storage. The total size for the kubelet data (including emptyDir volumes) and containerd data (including container images) is equal to the total size of the sdb device hosted by the temporary storage, which is ~32 GiB in this test.

 

 

Observations

  1. When setting kubeletDiskType to OS, the operating system, container images, emptyDir volumes, writable container layers, and container logs are all hosted on the OS disk, no matter whether the OS disk is managed or ephemeral.
  2. When setting kubeletDiskType to Temporary, the operating system is hosted on the OS disk, no matter whether the OS disk is managed or ephemeral, while container images, emptyDir volumes, and container logs are hosted on the temporary storage.
  3. When setting kubeletDiskType to Temporary, kubelet files (e.g., pod logs and emptyDir volumes) and containerd files (e.g., container images) are moved to the temporary storage SSD, whose size depends on the VM size. This configuration has a caveat: a disruptive host failure that loses the temporary storage disk would also lose kubelet data, which would require AKS to remediate the node (probably by reimaging it).
  4. Temporary storage and VM cache have the same performance characteristics, but N GiB of temporary storage effectively costs more than N GiB of VM cache/ephemeral OS disk. Hence, deploying a node pool with osDiskType equal to Ephemeral, kubeletDiskType equal to OS, and osDiskSize equal to the maximum VM cache size is the recommended approach when you need a lot of ephemeral disk space for container images and emptyDir volumes (see the sketch after this list).
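A hedged Azure CLI sketch of this recommended configuration for the Standard_D4s_v3 size (100 GiB VM cache); the node pool, cluster, and resource group names are illustrative:

az aks nodepool add \
  --name cachepool \
  --cluster-name myAksCluster \
  --resource-group myResourceGroup \
  --node-vm-size Standard_D4s_v3 \
  --node-osdisk-type Ephemeral \
  --node-osdisk-size 100 \
  --kubelet-disk-type OS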

 

Ebdsv5-series

The Standard_E16bds_v5 VM size has a temporary storage of 600 GiB and no VM cache. Let's review all the possible combinations of the kubeletDiskType and osDiskType values and how the location of container images, emptyDir volumes, and container logs varies with each selection.

 

 

osDiskType: Managed, kubeletDiskType: OS

The root directory / is hosted by the managed disk. This includes the /var/lib/kubelet directory that contains kubelet data, and /var/lib/containerd directory that contains container images. Hence the OS, emptyDir volumes, writeable layers, container images, and logs are hosted by the managed disk. You can run the lsblk command to list the block devices attached to the agent node VM.

azadmin@aks-user-39061525-vmss000001:~$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0  100G  0 disk
├─sda1    8:1    0 99.9G  0 part /
├─sda14   8:14   0    4M  0 part
└─sda15   8:15   0  106M  0 part /boot/efi
sdb       8:16   0  600G  0 disk
└─sdb1    8:17   0  600G  0 part /mnt
sr0      11:0    1  728K  0 rom

 

The total size of the temporary storage for the Standard_E16bds_v5 VM is 600 GiB, while the OS disk size (osDiskSize) configured for the node pool is 100 GiB. As you can easily observe, the lsblk command returns a size of 100 GiB for the managed OS disk (sda), and 600 GiB for the temporary storage (sdb).

 

azadmin@aks-user-39061525-vmss000001:~$ ls -alF /mnt
total 32
drwxr-xr-x  5 root root  4096 Jun  9 12:47 ./
drwxr-xr-x 22 root root  4096 Jun  9 12:46 ../
drwxr-xr-x  2 root root  4096 Jun  9 12:47 containers/
drwxr-xr-x  2 root root  4096 Jun  9 12:47 docker/
drwx------  2 root root 16384 Jun  9 12:46 lost+found/

 

The emptyDir volume for a pod is in a directory under /var/lib/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/ on the managed disk sda. The total size for the kubelet data (including emptyDir volumes) and containerd data (container images) is equal to the total size of the managed disk, 100 GiB in this test, minus the space occupied by the Linux operating system and other packages.

 

 

osDiskType: Managed, kubeletDiskType: Temporary

The root directory / is hosted by the managed disk. You can run the lsblk command to list the block devices attached to the agent node VM.

azadmin@aks-user-14788773-vmss000001:~$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0  100G  0 disk
├─sda1    8:1    0 99.9G  0 part /
├─sda14   8:14   0    4M  0 part
└─sda15   8:15   0  106M  0 part /boot/efi
sdb       8:16   0  600G  0 disk
└─sdb1    8:17   0  600G  0 part /mnt
sr0      11:0    1  732K  0 rom

 

The total size of the temporary storage for the Standard_E16bds_v5 VM is 600 GiB, while the OS disk size (osDiskSize) configured for the node pool is 100 GiB. As you can easily observe, the lsblk command returns a size of 100 GiB for the managed OS disk (sda), and 600 GiB for the temporary storage (sdb).  

 

azadmin@aks-user-14788773-vmss000001:~$ ls -alF /mnt
total 36
drwxr-xr-x  6 root root  4096 Jun  9 13:41 ./
drwxr-xr-x 22 root root  4096 Jun  9 13:40 ../
drwx--x--x  4 root root  4096 Jun  9 13:40 aks/
drwxr-xr-x  2 root root  4096 Jun  9 13:41 containers/
drwxr-xr-x  2 root root  4096 Jun  9 13:41 docker/
drwx------  2 root root 16384 Jun  9 13:40 lost+found/

 

You can install and run the tree command to see the files and directories under the /mnt/aks directory in the temporary storage.

 

azadmin@aks-user-14788773-vmss000001:~$ tree /mnt/aks -L 2
/mnt/aks
├── containers
│   ├── io.containerd.content.v1.content
│   ├── io.containerd.grpc.v1.cri
│   ├── io.containerd.metadata.v1.bolt
│   ├── io.containerd.runtime.v1.linux
│   ├── io.containerd.runtime.v2.task
│   ├── io.containerd.snapshotter.v1.aufs
│   ├── io.containerd.snapshotter.v1.btrfs
│   ├── io.containerd.snapshotter.v1.native
│   ├── io.containerd.snapshotter.v1.overlayfs
│   └── tmpmounts
└── kubelet
    ├── bootstrap-kubeconfig
    ├── cpu_manager_state
    ├── device-plugins
    ├── kubeconfig
    ├── memory_manager_state
    ├── pki
    ├── plugins
    ├── plugins_registry
    ├── pod-resources
    └── pods

 

/var/lib/kubelet is a bind mount of the /mnt/aks/kubelet directory. Likewise, /var/lib/containerd is a bind mount of the /mnt/aks/containers directory.

The emptyDir volume for any pod is located in a directory under /var/lib/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/, which is a mount point of the /mnt/aks/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/ directory on the sdb device hosted by the local temporary storage. The total size for the kubelet data (including emptyDir volumes) and containerd data (including container images) is equal to the total size of the sdb device hosted by the temporary storage, which is ~600 GiB in this test.

 

osDiskType: Ephemeral, kubeletDiskType: OS

The root directory / is hosted by the ephemeral disk. This includes the /var/lib/kubelet directory that contains the kubelet data, and /var/lib/containerd directory that contains container images. Hence the OS, emptyDir volumes, writeable layers, container images, and logs are hosted by the ephemeral disk in the local temporary storage. You can run the lsblk command to list the block devices attached to the agent node VM.

azadmin@aks-user-17557640-vmss000001:~$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0  100G  0 disk
├─sda1    8:1    0 99.9G  0 part /
├─sda14   8:14   0    4M  0 part
└─sda15   8:15   0  106M  0 part /boot/efi
sdb       8:16   0  500G  0 disk
└─sdb1    8:17   0  500G  0 part /mnt
sr0      11:0    1  728K  0 rom

 

The total size of the temporary storage for the Standard_E16bds_v5 VM is 600 GiB, while the OS disk size configured for the node pool is 100 GiB. As you can easily observe, the lsblk command returns a size of 100 GiB for the sda device which holds the ephemeral OS disk and 500 GiB for the sdb device. Since 100 + 500 = 600, we can conclude that both the sda and sdb devices are hosted in the local temporary storage.

 

azadmin@aks-user-17557640-vmss000001:~$ ls -alF /mnt
total 32
drwxr-xr-x  5 root root  4096 Jun  9 12:47 ./
drwxr-xr-x 22 root root  4096 Jun  9 12:46 ../
drwxr-xr-x  2 root root  4096 Jun  9 12:47 containers/
drwxr-xr-x  2 root root  4096 Jun  9 12:47 docker/
drwx------  2 root root 16384 Jun  9 12:46 lost+found/

 

The emptyDir volume for a pod is in a directory under /var/lib/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/ on the ephemeral OS disk sda hosted by the temporary storage. The total size for the kubelet data (including emptyDir volumes) and containerd data (container images) is equal to the total size of the ephemeral OS disk, 100 GiB in this test, minus the space occupied by the Linux operating system and other packages.

osDiskType: Ephemeral, kubeletDiskType: Temporary

The root directory / is hosted by the ephemeral disk in the local temporary storage. You can run the lsblk command to list the block devices attached to the agent node VM.

azadmin@aks-user-31173909-vmss000001:~$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda       8:0    0  100G  0 disk
├─sda1    8:1    0 99.9G  0 part /
├─sda14   8:14   0    4M  0 part
└─sda15   8:15   0  106M  0 part /boot/efi
sdb       8:16   0  500G  0 disk
└─sdb1    8:17   0  500G  0 part /mnt
sr0      11:0    1  732K  0 rom

 

The total size of the temporary storage for the Standard_E16bds_v5 VM is 600 GiB, while the OS disk size configured for the node pool is 100 GiB. As you can easily observe, the lsblk command returns a size of 100 GiB for the sda device which holds the ephemeral OS disk and 500 GiB for the sdb device. Since 100 + 500 = 600, we can conclude that both the sda and sdb devices are hosted in the local temporary storage.

 

azadmin@aks-user-31173909-vmss000001:~$ ls -alF /mnt
total 36
drwxr-xr-x  6 root root  4096 Jun  9 14:25 ./
drwxr-xr-x 22 root root  4096 Jun  9 14:24 ../
drwx--x--x  4 root root  4096 Jun  9 14:25 aks/
drwxr-xr-x  2 root root  4096 Jun  9 14:25 containers/
drwxr-xr-x  2 root root  4096 Jun  9 14:25 docker/
drwx------  2 root root 16384 Jun  9 14:24 lost+found/

 

You can install and run the tree command to see the files and directories under the /mnt/aks directory in the temporary storage.

 

azadmin@aks-user-31173909-vmss000001:~$ tree /mnt/aks -L 2
/mnt/aks
├── containers
│   ├── io.containerd.content.v1.content
│   ├── io.containerd.grpc.v1.cri
│   ├── io.containerd.metadata.v1.bolt
│   ├── io.containerd.runtime.v1.linux
│   ├── io.containerd.runtime.v2.task
│   ├── io.containerd.snapshotter.v1.aufs
│   ├── io.containerd.snapshotter.v1.btrfs
│   ├── io.containerd.snapshotter.v1.native
│   ├── io.containerd.snapshotter.v1.overlayfs
│   └── tmpmounts
└── kubelet
    ├── bootstrap-kubeconfig
    ├── cpu_manager_state
    ├── device-plugins
    ├── kubeconfig
    ├── memory_manager_state
    ├── pki
    ├── plugins
    ├── plugins_registry
    ├── pod-resources
    └── pods

 

/var/lib/kubelet is a bind mount of the /mnt/aks/kubelet directory. Likewise, /var/lib/containerd is a bind mount of the /mnt/aks/containers directory.

The emptyDir volume for any pod is located in a directory under /var/lib/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/, which is a mount point of the /mnt/aks/kubelet/pods/{podid}/volumes/kubernetes.io~empty-dir/ directory on the sdb device hosted by the local temporary storage. The total size for the kubelet data (including emptyDir volumes) and containerd data (container images) is about 500 GiB, that is, the total size of the local temporary storage, 600 GiB in this test, minus the space occupied by the ephemeral OS disk (sda), 100 GiB in this test.

 

Observations

  1. Using a recent VM series such as the Ebdsv5-series gives you a single, larger temporary storage rather than the smaller, separate temporary storage and VM cache of the Dsv3-series.
  2. You can set osDiskType equal to Ephemeral, kubeletDiskType equal to OS, and osDiskSize equal to the maximum temporary storage size.
  3. Alternatively, you can set osDiskType equal to Managed to host the operating system on a premium SSD, whose size and performance tier depend on the OS disk size, and dedicate the entire temporary disk to the kubelet data (including emptyDir volumes) and containerd data (including container images) by setting kubeletDiskType equal to Temporary. Both options are sketched after this list.
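Hedged Azure CLI sketches of the two options for the Standard_E16bds_v5 size (600 GiB temporary storage); the node pool, cluster, and resource group names are illustrative, and --kubelet-disk-type may require a recent CLI version or the aks-preview extension:

# Option 1: ephemeral OS disk sized to the whole temporary storage
az aks nodepool add \
  --name ephpool \
  --cluster-name myAksCluster \
  --resource-group myResourceGroup \
  --node-vm-size Standard_E16bds_v5 \
  --node-osdisk-type Ephemeral \
  --node-osdisk-size 600

# Option 2: managed OS disk; kubelet and containerd data on the temporary storage
az aks nodepool add \
  --name tmppool \
  --cluster-name myAksCluster \
  --resource-group myResourceGroup \
  --node-vm-size Standard_E16bds_v5 \
  --node-osdisk-type Managed \
  --kubelet-disk-type Temporary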

 

Show osDiskType and kubeletDiskType for an existing cluster

You can run the following Azure CLI command to find out the osDiskType and kubeletDiskType for each node pool of an existing AKS cluster:

 

az aks show \
  --name myAksCluster \
  --resource-group myResourceGroup \
  --query 'agentPoolProfiles[].{name:name,osDiskType:osDiskType,kubeletDiskType:kubeletDiskType}' \
  --output table \
  --only-show-errors

 

The command returns the osDiskType and kubeletDiskType for the node pools of the selected AKS cluster:

 

Name     OsDiskType    KubeletDiskType
-------  ------------  -----------------
gpu      Managed       OS
keda     Ephemeral     OS
main     Ephemeral     OS
managed  Managed       OS
user     Ephemeral     OS

 

Show osDiskType and kubeletDiskType for a node pool

You can run the following Azure CLI command to find out the osDiskType and kubeletDiskType for a given node pool of an existing AKS cluster:

 

az aks nodepool show \
  --name user \
  --cluster-name myAksCluster \
  --resource-group myResourceGroup \
  --query '{name:name,osDiskType:osDiskType,kubeletDiskType:kubeletDiskType}' \
  --output table \
  --only-show-errors

 

The command returns the osDiskType and kubeletDiskType for the specified node pool:

 

Name    OsDiskType    KubeletDiskType
------  ------------  -----------------
user    Ephemeral     OS

 

EmptyDir Test

You can use the following YAML manifest to create a pod with three containers, each mounting the same emptyDir volume at a different mount path and writing a separate file to the same underlying directory. Replace node-name with the name of one of your agent nodes, or remove the nodeName field to let the scheduler pick a node.

 

apiVersion: v1
kind: Pod
metadata:
  name: emptydir-pod
spec:
  nodeName: node-name
  containers:
  - image: busybox
    imagePullPolicy: IfNotPresent
    name: busybox-1
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    command: ['sh', '-c', 'echo "The Bench Container 1 is Running" > /demo1/demo1.txt ; sleep 3600']
    volumeMounts:
    - mountPath: /demo1
      name: demo-volume
  - image: busybox
    imagePullPolicy: IfNotPresent
    name: busybox-2
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    command: ['sh', '-c', 'echo "The Bench Container 2 is Running" > /demo2/demo2.txt ; sleep 3600']
    volumeMounts:
    - mountPath: /demo2
      name: demo-volume
  - image: busybox
    imagePullPolicy: IfNotPresent
    name: busybox-3
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
    command: ['sh', '-c', 'echo "The Bench Container 3 is Running" > /demo3/demo3.txt ; sleep 3600']
    volumeMounts:
    - mountPath: /demo3
      name: demo-volume
  volumes:
  - name: demo-volume
    emptyDir: {}
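Assuming the manifest is saved as emptydir-pod.yaml (an illustrative file name), you can apply it and then verify that the containers see each other's files through the shared emptyDir volume:

kubectl apply -f emptydir-pod.yaml
# busybox-1 wrote demo1.txt; busybox-2 sees it under its own mount path
kubectl exec emptydir-pod -c busybox-1 -- ls -l /demo1
kubectl exec emptydir-pod -c busybox-2 -- cat /demo2/demo1.txt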

 


Thanks

Thanks for reading this article! If you have any feedback, please write a comment below or submit an issue or a pull request on GitHub. If you found this article and the companion sample useful, please like the article and give a star to the project on GitHub.

 

Conclusion

The recommended configuration is setting osDiskType to Ephemeral, kubeletDiskType to OS, and osDiskSize to the maximum VM cache or temporary storage size, depending on the VM series selected for the agent nodes.

 
