Best practices to harden your AKS environment

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

Hi,

AKS takes up more and more space in the Azure landscape, and there are a few best practices that you can follow to harden the environment and make it as secure as possible. As a preamble, remember that all containers on a node share the host kernel through system calls, so the level of isolation in the container world is not as strong as with virtual machines, and weaker still than with separate physical hosts. Mistakes can quickly lead to security issues.

1. Hardening the application itself

This might sound obvious, but one of the best ways to defend against malicious attacks is to write bulletproof code. There is no way you'll be 100% bulletproof, but a few steps can be taken to maximize robustness:

Try to use up-to-date libraries in your code (NuGet, npm, etc.), because as you know, most of your code is actually not yours.
Make sure that all input is validated and, if you are not using a framework with managed memory, that every memory allocation is kept well under control. Many vulnerabilities are memory-related (buffer overflow, use-after-free, etc.).
Rely on well-known security standards and do not invent your own stuff.
Perform static code analysis with specialized SAST tools such as Snyk, Fortify, etc.
Try to integrate security-related tests into your integration tests.
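
As a small illustration of the input validation point, here is a minimal Python sketch of allow-list validation: the input is accepted only if it matches an explicit pattern, and everything else is rejected. The rule itself (a 3 to 32 character username) and the names USERNAME_RE and parse_username are hypothetical and only serve the example.

```python
import re

# Hypothetical rule: 3 to 32 characters, limited to letters, digits, '_' and '-'.
USERNAME_RE = re.compile(r"[A-Za-z0-9_-]{3,32}")


def parse_username(raw: str) -> str:
    """Return the username unchanged if it matches the allow-list, else raise."""
    if not isinstance(raw, str) or USERNAME_RE.fullmatch(raw) is None:
        raise ValueError("invalid username")
    return raw


if __name__ == "__main__":
    print(parse_username("alice_01"))
```

Rejecting anything that does not match a known-good pattern is usually easier to reason about than trying to enumerate every bad input.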

2. Hardening container images

I've seen countless environments where the Docker image itself is not hardened properly. I wrote a full blog post about this, so feel free to read it: https://techcommunity.microsoft.com/t5/azure-developer-community-blog/hardening-an-asp-net-container-running-on-kubernetes/ba-p/2542224. I'll summarize it here, in a nutshell:

Do not listen on ports below 1024, because binding to them as a non-root user requires an extra capability (NET_BIND_SERVICE)
Specify a user other than root
Change ownership of the container's file system to that user
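
To make these three rules concrete, here is a rough Python sketch of a tiny Dockerfile linter that flags each of them. It is only an illustration under simplifying assumptions: the function name lint_dockerfile is hypothetical, the parsing is line-based and ignores line continuations, and "ownership change" is approximated by looking for COPY/ADD --chown or a chown in a RUN instruction.

```python
def lint_dockerfile(text: str) -> list[str]:
    """Return findings for: ports below 1024, root user, no ownership change."""
    findings = []
    user = None
    chowned = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        instr = parts[0].upper()
        if instr == "FROM":
            # Each build stage starts again with the base image's default user.
            user = None
        elif instr == "EXPOSE":
            for port in parts[1:]:
                number = port.split("/")[0]  # strip an optional "/tcp" or "/udp"
                if number.isdigit() and int(number) < 1024:
                    findings.append(f"privileged port exposed: {number}")
        elif instr == "USER" and len(parts) > 1:
            user = parts[1]
        elif instr in ("COPY", "ADD"):
            if any(p.startswith("--chown=") for p in parts[1:]):
                chowned = True
        elif instr == "RUN" and "chown" in parts[1:]:
            chowned = True
    if user is None or user.split(":")[0] in ("root", "0"):
        findings.append("final user is root or not set")
    if not chowned:
        findings.append("no file ownership change found")
    return findings


if __name__ == "__main__":
    sample = "FROM ubuntu\nEXPOSE 80\n"
    for finding in lint_dockerfile(sample):
        print(finding)
```

A real scanner does far more than this; the point is only to show what "hardened" means at the Dockerfile level.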

3. Scanning container images

Most of the time, we use base images to build our own images, and most of the time, these base images have vulnerabilities. Use specialized software such as Snyk, Falco, Microsoft Defender for Containers, etc. to identify them. Once identified, you should:

Try to stick to the most up-to-date images as they often include security patches
Try to use a different base image. Light images such as Alpine-based ones are usually a good start because they embed fewer tools and libraries, and so are less likely to have vulnerabilities.
Make a risk assessment of the remaining vulnerabilities and see whether they really apply to your use case. A vulnerability does not automatically mean that you are at risk. You might have other mitigations in place that would prevent an exploit.

To push the shift-left principle to the maximum, you can use Snyk's docker scan operation right from the developer's machine to identify vulnerabilities early. Although Snyk is a paid product, you can scan a few images for free.

4. Hardening K8s deployments

In the same post as before (https://techcommunity.microsoft.com/t5/azure-developer-community-blog/hardening-an-asp-net-container-running-on-kubernetes/ba-p/2542224), I also explain how to harden the K8s deployment itself. In a nutshell:

Make sure to drop all capabilities and only add the needed ones if any
Do not use privileged containers and do not allow privilege escalation (make both values explicit)
Try to stick to a read-only file system whenever possible
Specify a user/group other than root
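
The four rules above map onto the container-level securityContext of a pod spec. The Python sketch below writes that block as a dictionary and adds a small checker that reports which rules a given securityContext violates. The field names are the standard Kubernetes ones; the UID/GID value 10001 and the names HARDENED_SECURITY_CONTEXT and security_findings are assumptions made for the example.

```python
import json

# Container-level securityContext reflecting the four rules above.
# 10001 is an arbitrary, hypothetical non-root UID/GID.
HARDENED_SECURITY_CONTEXT = {
    "capabilities": {"drop": ["ALL"]},
    "privileged": False,
    "allowPrivilegeEscalation": False,
    "readOnlyRootFilesystem": True,
    "runAsNonRoot": True,
    "runAsUser": 10001,
    "runAsGroup": 10001,
}


def security_findings(ctx: dict) -> list[str]:
    """Return one finding per rule that the securityContext does not satisfy."""
    findings = []
    if "ALL" not in ctx.get("capabilities", {}).get("drop", []):
        findings.append("capabilities are not all dropped")
    if ctx.get("privileged") is not False:
        findings.append("privileged is not explicitly false")
    if ctx.get("allowPrivilegeEscalation") is not False:
        findings.append("allowPrivilegeEscalation is not explicitly false")
    if ctx.get("readOnlyRootFilesystem") is not True:
        findings.append("root file system is writable")
    if ctx.get("runAsNonRoot") is not True or ctx.get("runAsUser", 0) == 0:
        findings.append("container may run as root")
    return findings


if __name__ == "__main__":
    # JSON is accepted wherever a YAML manifest is expected.
    print(json.dumps({"securityContext": HARDENED_SECURITY_CONTEXT}, indent=2))
```

The checker deliberately treats a missing value as a finding, which matches the advice to make the values explicit rather than rely on defaults.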

5. Requests and limits declaration

Although this might not be seen as a potential security issue, not specifying memory requests and limits can lead to the arbitrary eviction of other pods. Malicious users can take advantage of this to spread chaos within your cluster. So, you must always declare memory requests and limits. You can optionally declare CPU requests/limits, but this is not as important as memory, because CPU contention results in throttling rather than evictions.
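
As a minimal sketch, the check below walks a pod spec (expressed as a Python dictionary) and lists every container that lacks a memory request or a memory limit. The pod spec, the container names and the function name missing_memory_settings are hypothetical; resources.requests.memory and resources.limits.memory are the standard Kubernetes fields.

```python
def missing_memory_settings(pod_spec: dict) -> list[str]:
    """List containers that declare no memory request and/or no memory limit."""
    missing = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        for kind in ("requests", "limits"):
            if "memory" not in resources.get(kind, {}):
                missing.append(f"{container['name']}: no memory {kind}")
    return missing


# Hypothetical pod spec: "api" is fine, "sidecar" declares nothing.
EXAMPLE_POD_SPEC = {
    "containers": [
        {
            "name": "api",
            "image": "contoso.azurecr.io/shop/api:1.4.2",
            "resources": {
                "requests": {"memory": "128Mi", "cpu": "100m"},
                "limits": {"memory": "256Mi"},
            },
        },
        {"name": "sidecar", "image": "contoso.azurecr.io/shop/sidecar:0.3.0"},
    ]
}

if __name__ == "__main__":
    for problem in missing_memory_settings(EXAMPLE_POD_SPEC):
        print(problem)
```

A check like this can run in a CI pipeline before manifests ever reach the cluster.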

6. Namespace-level logical isolation

K8s is a world where logical isolation takes precedence over physical isolation. So, whatever you do, you should adhere to the least privilege principle through proper RBAC configuration and through network policies that control traffic within the cluster and, potentially, traffic leaving it (egress). Remember that by default, K8s is totally open, so every pod can talk to any other pod, whatever namespace it is located in. If you can't live with internal logical isolation only, you can also segregate workloads into different node pools and leverage Azure networking features such as NSGs to control network traffic at another level. I wrote an entire blog post on this: AKS, the elephant in the hub & spoke room, deep dive

6.1 RBAC

Role-based access control can be configured for both humans and systems, thanks to Azure AD and K8s RBAC. There are multiple flavors available for AKS. Whichever one you use, you should make sure to:

Define groups and grant them permissions using K8s roles
Define service accounts and let your applications leverage them
Prefer namespace-scoped permissions over cluster-scoped ones
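
The sketch below shows what a namespace-scoped grant to a group can look like: a Role limited to reading pods, and a RoleBinding that attaches it to a group. It is written as Python dictionaries and printed as JSON, which kubectl accepts in place of YAML. The role name, the namespace and the group identifier are hypothetical; with Azure AD integration the group is typically referenced by its object ID, which is assumed here.

```python
import json

RBAC_API = "rbac.authorization.k8s.io"


def namespaced_read_access(namespace: str, group_id: str) -> list[dict]:
    """Build a read-only Role on pods and bind it to a group, in one namespace."""
    role = {
        "apiVersion": f"{RBAC_API}/v1",
        "kind": "Role",
        "metadata": {"name": "pod-reader", "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group
                "resources": ["pods", "pods/log"],
                "verbs": ["get", "list", "watch"],
            }
        ],
    }
    binding = {
        "apiVersion": f"{RBAC_API}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "pod-reader-binding", "namespace": namespace},
        "subjects": [{"kind": "Group", "name": group_id, "apiGroup": RBAC_API}],
        "roleRef": {"kind": "Role", "name": "pod-reader", "apiGroup": RBAC_API},
    }
    return [role, binding]


if __name__ == "__main__":
    # Hypothetical namespace and group object ID.
    items = namespaced_read_access("team-a", "00000000-0000-0000-0000-000000000000")
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Using a Role and RoleBinding rather than a ClusterRole and ClusterRoleBinding keeps the permission confined to one namespace.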

6.2 Namespace-scoped & global network policies

Traffic can be controlled using plain K8s network policies or tools such as Calico. Network policies can be used to control pod-level ingress/egress traffic.
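
A common pattern with plain K8s network policies is to start from a default-deny policy per namespace and then open only the flows you need. The Python sketch below builds both objects as dictionaries and prints them as JSON. The namespace, labels and port are hypothetical, and it assumes that a network policy engine is enabled on the cluster, since policies have no effect otherwise.

```python
import json


def default_deny(namespace: str) -> dict:
    """Deny all ingress and egress for every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        # An empty podSelector selects all pods in the namespace.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_ingress(namespace: str, target_app: str, source_app: str, port: int) -> dict:
    """Allow pods labeled app=source_app to reach app=target_app on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{source_app}-to-{target_app}", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": target_app}},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": {"app": source_app}}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    policies = [default_deny("shop"), allow_ingress("shop", "api", "frontend", 8080)]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": policies}, indent=2))
```

Note that with egress denied by default, pods also lose DNS resolution until an egress rule for it is added.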

7. Layer 7 protection

Because defense-in-depth relies on multiple ways of validating whether an ongoing operation is legitimate or not, you should also use layer-7 protection, such as a Service Mesh or Dapr, which has some features that overlap with service meshes. The main difference between Dapr and a true Service Mesh is that applications using Dapr must be Dapr-aware, while they don't need to know anything about a service mesh. The purpose of layer-7 protection is to enable mTLS and fine-grained authorizations, in order to specify who can talk to whom (on top of network policies). Most solutions today allow for fine-grained authorizations targeting operation-level scopes when dealing with APIs. Dapr and Service Meshes come with many more juicy features that make you understand what a true cloud-native environment is.
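
To give an idea of what mTLS plus operation-level authorization looks like, here is a sketch that assumes Istio as the service mesh (other meshes and Dapr use different resources). It builds a PeerAuthentication that enforces strict mTLS in a namespace, and an AuthorizationPolicy that lets one service account call specific HTTP operations of one workload. The namespace, labels, service account and paths are hypothetical, and the caller is assumed to live in the same namespace with the default cluster.local trust domain.

```python
import json


def strict_mtls(namespace: str) -> dict:
    """Require mTLS for all workloads of the namespace."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": "STRICT"}},
    }


def allow_caller(namespace: str, app: str, caller_sa: str,
                 methods: list[str], paths: list[str]) -> dict:
    """Allow one service account to call the given methods/paths of app."""
    # Assumption: caller runs in the same namespace, default trust domain.
    principal = f"cluster.local/ns/{namespace}/sa/{caller_sa}"
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": f"{app}-allow-{caller_sa}", "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": {"app": app}},
            "action": "ALLOW",
            "rules": [
                {
                    "from": [{"source": {"principals": [principal]}}],
                    "to": [{"operation": {"methods": methods, "paths": paths}}],
                }
            ],
        },
    }


if __name__ == "__main__":
    objects = [
        strict_mtls("shop"),
        allow_caller("shop", "orders", "frontend", ["GET"], ["/orders/*"]),
    ]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": objects}, indent=2))
```

The identity in the rule comes from the caller's mTLS certificate, which is why strict mTLS and authorization policies go hand in hand.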

8. Azure Policy

Azure Policy is the cornerstone of tangible governance in Azure in general, and AKS is no exception. With Azure Policy, you'll have a continuous assessment of your cluster's configuration as well as a way to control what can be deployed to the cluster. Azure Policy leverages Gatekeeper to deny non-compliant deployments. You can start smoothly by setting everything to Audit mode and switch to Deny once ready. Azure Policy also allows you to allow-list known registries to make sure images cannot be pulled from just anywhere.
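
Assuming the registry allow-list is expressed as a regular expression over the image name, the Python sketch below shows the kind of pattern you might use and why anchoring it matters. The registry names are hypothetical, and the code only illustrates the matching logic; it is not how Azure Policy or Gatekeeper is implemented.

```python
import re

# Hypothetical allow-list: one private ACR plus the Microsoft container registry.
# The "/" right after the host name prevents look-alike hosts from matching.
ALLOWED_IMAGES = re.compile(r"^(contoso\.azurecr\.io|mcr\.microsoft\.com)/.+$")


def is_allowed(image: str) -> bool:
    """Return True if the image reference comes from an allowed registry."""
    return ALLOWED_IMAGES.match(image) is not None


if __name__ == "__main__":
    for image in (
        "contoso.azurecr.io/shop/api:1.4.2",
        "docker.io/library/nginx:latest",
        "contoso.azurecr.io.evil.example/shop/api:1.4.2",
    ):
        print(image, "->", "allowed" if is_allowed(image) else "denied")
```

It is worth testing such a pattern against look-alike registry names before switching the policy from Audit to Deny.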

9. Microsoft Defender for Containers

Microsoft recently merged Defender for Registries and Defender for Kubernetes into Defender for Containers. There is a little bit of overlap with Azure Policy, but Defender also deploys DaemonSets that check for real-time threats. All incidents are categorized using the popular MITRE ATT&CK framework. One of the selling points is that Defender can handle any cluster, whether hosted on Azure or not, so it is a multi-cloud solution. On top of assessing configuration and threats, Defender also ships with a built-in image scanning process leveraging Qualys behind the scenes. Images are scanned upon push operations, and then continuously, to detect newer vulnerabilities disclosed after the push.

10. Private API server

This one is an easy one. Make sure to isolate the API server from the internet. You can easily do that using Azure Private Link. If you can't do it for some reason, try to at least restrict access to authorized IP address ranges.
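
If you go for authorized IP ranges, it helps to check your CIDR list before applying it. The small Python sketch below uses the standard ipaddress module to test whether a client address falls inside a list of ranges. The ranges are documentation addresses chosen for the example, and the function name is_authorized is hypothetical; the actual enforcement is done by AKS, not by this code.

```python
import ipaddress

# Hypothetical authorized ranges (documentation address blocks).
AUTHORIZED_RANGES = ["203.0.113.0/24", "198.51.100.17/32"]


def is_authorized(client_ip: str, ranges: list[str] = AUTHORIZED_RANGES) -> bool:
    """Return True if client_ip belongs to at least one authorized CIDR range."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in ranges)


if __name__ == "__main__":
    for candidate in ("203.0.113.42", "198.51.100.18"):
        print(candidate, "->", is_authorized(candidate))
```

Keeping the list short and made of narrow ranges is what makes this fallback worthwhile.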

11. Cluster boundaries

Of course, an AKS cluster is by design inside an Azure Virtual Network. The cluster can expose some workloads to the outside through an ingress controller, and any workload may send traffic out of the cluster, through an egress controller and/or an appliance controlling the network traffic.

11.1 Ingress

Ingress traffic can come from either internet-facing or internal callers. A best practice is to isolate the AKS ingress controller (NGINX, Traefik, AGIC, etc.) from the internet by linking it to an internal load balancer. Traffic that must be exposed to the internet should be exposed through an Application Gateway, Front Door (using Private Link Service) or any other well-known non-Azure solution such as Barracuda, F5, etc. You should also distinguish pure UI traffic from API traffic. API traffic should also be filtered using an API gateway such as Azure APIM, Kong, Ambassador, etc. For "basic" scenarios, you might also offload JWT token validation to service meshes, but they will not have comparable features. You should definitely consider real API gateways for internet-facing APIs.
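
Linking an ingress controller to an internal load balancer on AKS comes down to one annotation on its Service of type LoadBalancer. The sketch below builds such a Service as a Python dictionary and prints it as JSON. The annotation is the standard Azure one; the Service name, namespace, selector label and ports are hypothetical, and in practice you would usually set the annotation through the ingress controller's Helm values rather than write the Service by hand.

```python
import json

INTERNAL_LB_ANNOTATION = "service.beta.kubernetes.io/azure-load-balancer-internal"


def internal_ingress_service(namespace: str) -> dict:
    """Build a LoadBalancer Service that gets a private IP in the virtual network."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": "ingress-internal",
            "namespace": namespace,
            # Annotation values must be strings, hence "true" and not True.
            "annotations": {INTERNAL_LB_ANNOTATION: "true"},
        },
        "spec": {
            "type": "LoadBalancer",
            "selector": {"app": "ingress-controller"},  # hypothetical label
            "ports": [
                {"name": "https", "protocol": "TCP", "port": 443, "targetPort": 8443}
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(internal_ingress_service("ingress"), indent=2))
```

With this in place, the controller is reachable only from inside the virtual network (or peered networks), and internet exposure goes through the front-end service of your choice.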

11.2 Egress

Pod-level egress traffic can be controlled by network policies or Calico, but also by most Service Meshes. Istio even has a dedicated egress gateway, which can act as a proxy. On top of handling egress from within the cluster itself, it is a best practice to have a next-gen firewall waiting outside, such as Azure Firewall or a third-party Network Virtual Appliance (NVA).

12. Keep consistency across clusters and across data centers

You start with one cluster, then two, then a hundred. To keep some sort of consistency across cluster configurations, you can leverage Azure Policy. If your clusters are running on-premises or in another cloud, you can also use Azure Arc. Microsoft recently launched Azure Kubernetes Fleet Manager, which I haven't tried yet but which is surely something to keep an eye on.