This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.
If you're hosting any real-time, highly transactional, synchronous workload, low network latency and the distributed resiliency of Kubernetes can be very relevant to you. In this post I hope to quantify the performance implications of these factors for L7 ingress in Kubernetes.
On We Go
Assuming you're leveraging the Azure CNI, and without diving into the many options in the Cloud Native Computing Foundation landscape, when thinking about AKS and ingress load balancing we typically have the option of choosing between Nginx and Azure Application Gateway (AppGW). The Application Gateway Ingress Controller (AGIC) within K8s will manage AppGW for you. Without going into specifics, the configuration for AGIC can be more straightforward and allows you to kill two birds with one stone if you need a Web Application Firewall (WAF) and will leverage AppGW to implement that WAF. However, there are known tradeoffs around performance and fault tolerance when leveraging AGIC in lieu of Nginx.
What performance and resiliency tradeoffs can we expect, though?
Well, generally speaking, I ponder:
- If requests originate from within the cluster, going through Nginx might be faster.
- Because you don't have to step outside the cluster to reach AppGW, you might get to your destination faster, despite the potential for proxying between nodes when the nginx controller is not hosted on the same node as the destination pod.
- AppGW backend pools are configured with the PodIPs that make up your service. This means traffic gets routed directly to the pod hosting your app - no additional hops, no additional proxying.
- AGIC relies on the Azure Resource Manager (ARM) API to make the necessary updates to AppGW whenever your PodIPs change. The ARM API is external to the K8s cluster, and relying on it will never be as agile as relying on traditional K8s ClusterIPs for K8s Services. If AGIC cannot update AppGW fast enough when a PodIP changes for any reason, some requests might not make it to their destination.
- This is not a problem you need to worry about with Nginx or any ingress controller configured as a k8s service.
After reading above, if you'd like a refresher on k8s service proxies, see here.
I'm going to perform some tests which will shed light on the relevance of the above points. The tests will be run on two configurations.
Config #1: Affinity: With a high level of pod affinity. Ref: Yaml Manifests
- 4 destination pods split between 2 nodes.
- 1 source pod generating traffic hosted in 1 of the nodes where the destination pods are present.
- 2 nginx pods split between 2 nodes, the same nodes as the destination pods.
- For Nginx, this scenario ensures a majority of the requests will have little to no proxying between nodes when reaching the destination.
- Perceived Performance Advantage: Nginx
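As a hedged sketch of what Config #1's scheduling constraint might look like (the labels and names here are illustrative, not necessarily those in the linked manifests), the destination deployment can prefer nodes that are already running the nginx controller pods:

```yaml
# Illustrative pod affinity for the destination pods (Config #1).
# Prefers scheduling onto nodes where the nginx controller pods run.
spec:
  template:
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: nginx-affinity    # hypothetical nginx controller label
              topologyKey: kubernetes.io/hostname
```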
Config #2: Anti-Affinity: With a high level of pod anti affinity. Ref: Yaml Manifests
- 4 destination pods, each hosted in a separate node
- 1 source pod generating traffic, hosted separately from the destination and nginx pods.
- 2 nginx pods hosted on the same node, separate from the destination and source pods.
- For Nginx, this scenario ensures there will be additional proxying on the way to the destination - this should cause additional latency.
- Perceived Performance Advantage: AGIC
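Conversely, a minimal sketch of Config #2's constraint (again with illustrative labels): required pod anti-affinity on the hostname topology key forces each destination pod onto a separate node:

```yaml
# Illustrative pod anti-affinity for the destination pods (Config #2).
# Each replica must land on a node with no other replica of the same app.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: destination-anti-affinity    # hypothetical label
            topologyKey: kubernetes.io/hostname
```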
Here is a layout of the cluster with both configs implemented:
Here's a rundown of the k8s Ingresses configured:
- Because all traffic for both configs goes through the same instance of AppGW, whereas there are two different Nginx controllers, I introduced host headers to the AGIC ingress to allow for traffic routing. I did the same for Nginx as well, to keep compute overhead at parity across all controllers.
- The destination pods are just nginx pods listening on /, serving up the standard nginx splash page. Any mention of "nginx" going forward refers to the L7 load balancer and not the destination pods.
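As a hedged sketch (the host name and service name here are placeholders, not the values from the actual manifests), the host-based routing for the AGIC ingress looks roughly like this; the Nginx variants are identical apart from the ingress class annotation:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: agic-affinity                              # illustrative name
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    # for the Nginx equivalent: kubernetes.io/ingress.class: nginx
spec:
  rules:
  - host: affinity.example.com                     # host header used to split traffic
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: destination-affinity             # illustrative service name
            port:
              number: 80
```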
Structure for Tests #1 and #2:
- There are 4 request bursts, each sending 100 requests - 1 burst for each combination of controller and host name.
- The requests are sent sequentially and averaged, i.e. each request is completed before sending the next.
- There's a warm-up before each burst.
- Refer to the code here: TrafficGenerate.cs
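The actual generator is TrafficGenerate.cs; as a rough illustration of the structure it follows (a warm-up, then sequential timed requests per controller/host combination, averaged), here's a sketch in Python. The URL and host names are placeholders:

```python
import time
import urllib.request

def send_burst(request_fn, count=100, warmup=10):
    """Run `warmup` untimed calls, then `count` timed sequential calls;
    return the average latency in milliseconds."""
    for _ in range(warmup):                      # warm-up before each burst
        request_fn()
    latencies = []
    for _ in range(count):                       # each request completes before the next is sent
        start = time.perf_counter()
        request_fn()
        latencies.append((time.perf_counter() - start) * 1000)
    return sum(latencies) / len(latencies)

def http_get(url, host):
    """GET `url` with an explicit Host header (used to route per controller)."""
    req = urllib.request.Request(url, headers={"Host": host})
    with urllib.request.urlopen(req) as resp:
        resp.read()

# One burst per controller/host combination, e.g.:
# avg = send_burst(lambda: http_get("http://<nginx-lb-ip>/", "affinity.example.com"))
```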
Test 1: Comparing request latency sent from outside of the cluster, through the internet.
We see negligible differences between AGIC and Nginx when sending requests from my local machine, through the internet:
Not much discrepancy is observed. Considering that all Nginx deployments have externalTrafficPolicy=Local, all traffic from the Azure LB is only sent to a node hosting an nginx affinity controller. And since those nginx controllers are already hosted on the nodes where the destination pods are also deployed, I expected a more noticeable improvement in latency, since minimal to no further proxying would be necessary to reach the destination. Hopefully the tests from within the cluster yield more interesting results.
Test 2a: Comparing latency of requests sent between services in the same cluster when going through Nginx vs AppGW - all traffic sent to external LoadBalancer IP for respective nginx service.
Here are the results from the affinity traffic generator:
Nginx affinity traffic is 47% faster compared to AppGW.
Here are the results from the anti-affinity traffic generator:
Nginx anti-affinity traffic is 15% faster compared to AppGW.
Nginx is 38% faster when averaging for both scenarios.
Test 2b: What if we run the same test but route nginx traffic to the ClusterIP instead of the External, LoadBalancer IP?
- Would the affinity config see a substantial improvement since there would be minimal to no reliance on Azure Networking?
Here are the results for the affinity traffic generator:
No difference compared to LB IP test - nginx is 47% faster for both tests.
Here are the results for the anti-affinity traffic generator:
Nginx is 12% faster, a decline from the 15% difference when using the LB IP.
Test 3: How reliable is AGIC in updating AppGW configurations when Pod IPs change? Can we expect requests to be completed when pods go down for any reason?
- Nginx will be able to manage pods going down more effectively than AGIC - but how much better will it be?
- Code Ref: AGICARMTesting.cs
For my last test I repeated the following process 10 times on my local machine, through the internet:
- Delete a random application pod asynchronously
- Send request to respective ingress
- Wait for the request to complete
- Wait 500 ms
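Roughly, the loop in AGICARMTesting.cs does something like the following sketch (the pod names, host header, and the thresholds used to classify a result as "prolonged" are all illustrative assumptions, not the actual test's values):

```python
import random
import subprocess
import time
import urllib.error
import urllib.request

def classify(status, latency_ms, baseline_ms=200, slow_factor=10):
    """Bucket a result: HTTP error, prolonged latency, or OK at baseline."""
    if status != 200:
        return "error"
    if latency_ms > baseline_ms * slow_factor:
        return "slow"
    return "ok"

def run_iteration(url, host, pods):
    # Delete a random destination pod without waiting for the delete to finish
    subprocess.Popen(["kubectl", "delete", "pod", random.choice(pods)])
    start = time.perf_counter()
    try:
        req = urllib.request.Request(url, headers={"Host": host})
        with urllib.request.urlopen(req) as resp:   # wait for the request to complete
            status = resp.status
            resp.read()
    except urllib.error.HTTPError as e:             # e.g. 502 Bad Gateway
        status = e.code
    latency_ms = (time.perf_counter() - start) * 1000
    time.sleep(0.5)                                 # wait 500 ms between iterations
    return classify(status, latency_ms)

# Repeated 10 times against each ingress in the actual test.
```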
You might think this test is somewhat unrealistic. You might be right. Go on.
These were the AGIC results for the affinity config:
- The median latency was ~165 ms, but there were two outliers at ~30,000 ms each, both of which returned OKs.
- 50% of the requests were negatively impacted:
- 3 requests returned 502 Bad Gateway errors.
- 2 requests completed with prolonged latencies.
- Considering that the requests were run over the internet, I would expect the results to be worse if the test were run from within the cluster. This is a concern if there's a dire dependence on low latency and fault tolerance.
- The requests that did succeed with baseline latency were lucky to land on a pod that was not impacted.
However unrealistic this test is, the real narrative is formed when we run the same test on the Nginx affinity config.
These were the Nginx results for the affinity config:
- There wasn't a single hiccup in the nginx environment.
- This is a testament to how quickly the underlay network is updated when a pod is destroyed, and it's a substantial advantage over AGIC when it comes to fault tolerance across the lifecycle of a pod.
I have some inferences and calls to action that I've taken away when comparing AppGW and Nginx performance.
- The discrepancy in latency through the internet is not very significant.
- Discrepancy in latency within the cluster is notable.
- Affinity matters. Proximity of services will yield dividends when leveraging Nginx - more so if there are many inter-dependent microservices.
- For many customers, the discrepancies in latency observed might not be too relevant.
- The cost savings on the other hand might be very relevant.
- Calls to action:
- Avoid AGIC if you have a need for consistent, low latency:
- If you're early in your cloud native adoption, deploying a monolithic app, and would like to leverage AppGW WAF functionality, you can likely get away with leveraging AGIC.
- The configurational overhead between Nginx and AGIC is about the same.
- Consider multiple nginx ingress classes and/or pod anti-affinity amongst the nginx pods.
- There's an improvement in latency when the destination is present on the same node where nginx is also receiving traffic.
- Multiple classes (and inherently multiple deployments) of nginx can allow for deployment to nodes/nodePools where inter-dependent services are deployed.
- Pod anti-affinity amongst the nginx pods of a deployment ensures nginx is spread across at least a minimum number of nodes, e.g. 4 nginx pods = 4 separate nodes.
- If you'd like to leverage the WAF from AppGW, do so - but don't use AGIC - this will remove the dependence on the ARM API for routing to the backend pods.
- Leverage nginx within the cluster, configure AppGW WAF, configure AppGW backend to the nginx service IP.
- You might consider having k8s services leverage nginx internally, and having external requests go through AppGW WAF.
- You could always leverage Nginx WAF if you'd like to remove dependency on AppGW altogether - that's outside the scope of this post.
- Be aware of the AKS Web App Routing add-on that's currently in preview.
- It will better streamline the deployment of nginx and its integration with Key Vault, Azure DNS, and OSM for e2e encryption.
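For the "AppGW WAF in front of in-cluster nginx" pattern described above, a hedged sketch of how the nginx controller could be exposed on an internal Azure load balancer, whose private IP then becomes the AppGW backend target (names and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-internal        # illustrative name
  annotations:
    # Provisions an internal (private) Azure LB instead of a public one
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: nginx-ingress                # selects the nginx controller pods
  ports:
  - port: 80
    targetPort: 80
  - port: 443
    targetPort: 443
```

Pointing the AppGW backend pool at this service's private IP means AppGW applies the WAF, nginx handles routing within the cluster, and the ARM API is no longer in the request path when pods churn.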