This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.
Monitoring the health and performance of an Azure Kubernetes Service (AKS) cluster effectively is a crucial task for any organization, because it ensures the stability, performance, and availability of the containerized applications running on the cluster. This article shows how to deploy an Azure Kubernetes Service (AKS) cluster, Azure Monitor managed service for Prometheus, and Azure Managed Grafana for monitoring the performance and health status of the cluster and workloads. The article also shows how to:
- Deploy the NGINX Ingress Controller via Helm and configure it to expose metrics in Prometheus format.
- Create an Azure Managed Grafana dashboard to analyze NGINX Ingress Controller metrics.
- Configure Azure Kubernetes Service (AKS) Network Observability.
- Create an Azure Managed Grafana dashboard to visualize Network Observability metrics in Prometheus format.
Prerequisites
- An active Azure subscription. If you don't have one, create a free Azure account before you begin.
- Visual Studio Code installed on one of the supported platforms along with the Bicep extension.
- Azure CLI version 2.50.0 or later installed. To install or upgrade, see Install Azure CLI.
- The aks-preview Azure CLI extension, version 0.5.145 or later, installed. You can run az --version to verify the installed versions. To install the extension, run az extension add --name aks-preview. To update to the latest released version of the extension, run az extension update --name aks-preview.
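The extension setup can be sketched as a small helper that installs aks-preview when it is missing and updates it otherwise. This is a minimal sketch, assuming the Azure CLI is already installed and signed in; the ensure_aks_preview function name is hypothetical and not part of the sample.

```shell
# Install the aks-preview extension if it is missing, otherwise update it.
# ensure_aks_preview is a hypothetical helper name used only in this sketch.
ensure_aks_preview() {
  if az extension show --name aks-preview >/dev/null 2>&1; then
    # The extension is already present: move to the latest released version.
    az extension update --name aks-preview
  else
    az extension add --name aks-preview
  fi
}
```

After calling ensure_aks_preview, run az --version and check that the reported aks-preview version is 0.5.145 or later.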
Architecture
This sample provides a set of Bicep modules to deploy an Azure Kubernetes Service (AKS) cluster, an Azure Monitor managed service for Prometheus resource, and an Azure Managed Grafana instance for monitoring the performance and health status of the cluster and workloads. The following diagram shows the architecture and network topology deployed by the sample:
Bicep modules are parametric, so you can choose any network plugin:
- Azure CNI with static IP allocation
- Azure CNI with dynamic IP allocation
- Azure CNI Powered by Cilium
- Azure CNI Overlay
- BYO CNI
- Kubenet
In addition, this sample shows how to deploy an Azure Kubernetes Service cluster with the following extensions and features:
- Istio-based service mesh add-on for Azure Kubernetes Service provides an officially supported and tested Istio integration for Azure Kubernetes Service (AKS).
- API Server VNET Integration allows you to enable network communication between the API server and the cluster nodes without requiring a private link or tunnel. AKS clusters with API Server VNET integration provide a series of advantages, for example, they can have public network access or private cluster mode enabled or disabled without redeploying the cluster. For more information, see Create an Azure Kubernetes Service cluster with API Server VNet Integration.
- Azure NAT Gateway to manage outbound connections initiated by AKS-hosted workloads.
- Event-driven Autoscaling (KEDA) add-on is a single-purpose and lightweight component that strives to make application autoscaling simple and is a CNCF Incubation project.
- Dapr extension for Azure Kubernetes Service (AKS) allows you to install Dapr, a portable, event-driven runtime that simplifies building resilient, stateless, and stateful applications that run on the cloud and edge and embrace the diversity of languages and developer frameworks. With its sidecar architecture, Dapr helps you tackle the challenges that come with building microservices and keeps your code platform agnostic.
- Flux V2 extension allows you to deploy workloads to an Azure Kubernetes Service (AKS) cluster via GitOps. For more information, see GitOps Flux v2 configurations with AKS and Azure Arc-enabled Kubernetes.
- Vertical Pod Autoscaling allows you to automatically set resource requests and limits on containers per workload based on past usage. VPA makes sure that pods are scheduled onto nodes that have the required CPU and memory resources. For more information, see Kubernetes Vertical Pod Autoscaling.
- Azure Key Vault Provider for Secrets Store CSI Driver provides a variety of methods of identity-based access to your Azure Key Vault.
- Image Cleaner to clean up stale images on your Azure Kubernetes Service cluster.
- Azure Kubernetes Service (AKS) Network Observability is an important part of maintaining a healthy and performant Kubernetes cluster. By collecting and analyzing data about network traffic, you can gain insights into how your cluster is operating and identify potential problems before they cause outages or performance degradation.
- Windows Server node pool allows running Windows Server containers on an Azure Kubernetes Service (AKS) cluster.
In a production environment, we strongly recommend deploying a private AKS cluster with Uptime SLA. For more information, see private AKS cluster with a Public DNS address. Alternatively, you can deploy a public AKS cluster and secure access to the API server using authorized IP address ranges.
The Bicep modules deploy the following Azure resources:
- Microsoft.ContainerService/managedClusters: A public or private Azure Kubernetes Service (AKS) cluster composed of:
  - A system node pool in a dedicated subnet. The default node pool hosts only critical system pods and services. The worker nodes have a node taint that prevents application pods from being scheduled on this node pool.
  - A user node pool hosting user workloads and artifacts in a dedicated subnet.
  - A windows node pool hosting Windows Server containers. This node pool is optionally created when the value of the windowsAgentPoolEnabled parameter equals true.
- Microsoft.ManagedIdentity/userAssignedIdentities: a user-defined managed identity used by the AKS cluster to create additional resources like load balancers and managed disks in Azure.
- Microsoft.Compute/virtualMachines: Bicep modules can optionally create a jump-box virtual machine to manage the private AKS cluster.
- Microsoft.Network/bastionHosts: a separate Azure Bastion is deployed in the AKS cluster virtual network to provide SSH connectivity to both agent nodes and virtual machines.
- Microsoft.Network/natGateways: a bring-your-own (BYO) Azure NAT Gateway to manage outbound connections initiated by AKS-hosted workloads. The NAT Gateway is associated to the SystemSubnet, UserSubnet, and PodSubnet subnets. The outboundType property of the cluster is set to userAssignedNatGateway to specify that a BYO NAT Gateway is used for outbound connections. NOTE: you can update the outboundType after cluster creation and this will deploy or remove resources as required to put the cluster into the new egress configuration. For more information, see Updating outboundType after cluster creation.
- Microsoft.Storage/storageAccounts: this storage account is used to store the boot diagnostics logs of both the service provider and service consumer virtual machines. Boot Diagnostics is a debugging feature that allows you to view console output and screenshots to diagnose virtual machine status.
- Microsoft.ContainerRegistry/registries: an Azure Container Registry (ACR) to build, store, and manage container images and artifacts in a private registry for all container deployments.
- Microsoft.KeyVault/vaults: an Azure Key Vault used to store secrets, certificates, and keys that can be mounted as files by pods using Azure Key Vault Provider for Secrets Store CSI Driver. For more information, see Use the Azure Key Vault Provider for Secrets Store CSI Driver in an AKS cluster and Provide an identity to access the Azure Key Vault Provider for Secrets Store CSI Driver.
- Microsoft.Network/privateEndpoints: an Azure Private Endpoint is created for each of the following resources:
  - Azure OpenAI Service
  - Azure Container Registry
  - Azure Key Vault
  - Azure Storage Account
  - API Server when deploying a private AKS cluster.
- Microsoft.Network/privateDnsZones: an Azure Private DNS Zone is created for each of the following resources:
  - Azure OpenAI Service
  - Azure Container Registry
  - Azure Key Vault
  - Azure Storage Account
  - API Server when deploying a private AKS cluster.
- Microsoft.Network/networkSecurityGroups: subnets hosting virtual machines and Azure Bastion Hosts are protected by Azure Network Security Groups that are used to filter inbound and outbound traffic.
- Microsoft.Monitor/accounts: An Azure Monitor workspace is a unique environment for data collected by Azure Monitor. Each workspace has its own data repository, configuration, and permissions. Log Analytics workspaces contain logs and metrics data from multiple Azure resources, whereas Azure Monitor workspaces currently contain only metrics related to Prometheus. Azure Monitor managed service for Prometheus allows you to collect and analyze metrics at scale using a Prometheus-compatible monitoring solution, based on the Prometheus project from the Cloud Native Computing Foundation. This fully managed service allows you to use the Prometheus query language (PromQL) to analyze and alert on the performance of monitored infrastructure and workloads without having to operate the underlying infrastructure. The primary method for visualizing Prometheus metrics is Azure Managed Grafana. You can connect your Azure Monitor workspace to an Azure Managed Grafana instance to visualize Prometheus metrics using a set of built-in and custom Grafana dashboards.
- Microsoft.Dashboard/grafana: an Azure Managed Grafana instance used to visualize the Prometheus metrics generated by the Azure Kubernetes Service (AKS) cluster deployed by the Bicep modules. Azure Managed Grafana (https://learn.microsoft.com/en-us/azure/managed-grafana/overview) is a fully managed service for analytics and monitoring solutions. It's supported by Grafana Enterprise, which provides extensible data visualizations. This managed service allows you to quickly and easily deploy Grafana dashboards with built-in high availability and control access with Azure security.
- Microsoft.OperationalInsights/workspaces: a centralized Azure Log Analytics workspace is used to collect the diagnostics logs and metrics from all the Azure resources:
  - Azure OpenAI Service
  - Azure Kubernetes Service cluster
  - Azure Key Vault
  - Azure Network Security Group
  - Azure Container Registry
  - Azure Storage Account
  - Azure jump-box virtual machine
- Microsoft.Resources/deploymentScripts: a deployment script is used to run the install-nginx-with-prometheus-metrics-and-create-sa.sh Bash script that creates the namespace and service account for the sample application and installs the following packages to the AKS cluster via Helm. For more information on deployment scripts, see Use deployment scripts in Bicep.
- Microsoft.Insights/actionGroups: an Azure Action Group to send emails and SMS notifications to system administrators when alerts are triggered.
The Bicep modules can optionally deploy the following resources:
- Microsoft.CognitiveServices/accounts: an Azure OpenAI Service with a GPT-3.5 model used by an AI application like a chatbot. Azure OpenAI Service gives customers advanced language AI with OpenAI GPT-4, GPT-3, Codex, and DALL-E models with Azure's security and enterprise promise. Azure OpenAI co-develops the APIs with OpenAI, ensuring compatibility and a smooth transition from one to the other.
- Microsoft.ManagedIdentity/userAssignedIdentities: a user-defined managed identity used by the chatbot application to acquire a security token via Azure AD workload identity to call the Chat Completion API of the ChatGPT model provided by the Azure OpenAI Service.
Note: you can find the architecture.vsdx file used for the diagram under the visio folder.
What is Bicep?
Bicep is a domain-specific language (DSL) that uses a declarative syntax to deploy Azure resources. It provides concise syntax, reliable type safety, and support for code reuse. Bicep offers the best authoring experience for your infrastructure-as-code solutions in Azure.
Azure Monitor managed service for Prometheus
Azure Monitor managed service for Prometheus is a fully managed, highly scalable, and reliable monitoring service available in Azure. It offers a turnkey solution for collecting, querying, and alerting on metrics from AKS clusters. With Azure Managed Prometheus, you no longer need to deploy and manage Prometheus and Grafana within your clusters using a Helm chart. Instead, you can focus on extracting meaningful insights from the collected metrics. You can use a single Azure Monitor workspace to collect Prometheus metrics from a group of AKS clusters, and use a single Azure Managed Grafana instance as a single pane of glass to visualize and aggregate the Prometheus metrics collected in the Azure Monitor workspace from one or multiple AKS clusters.
The following figure shows the Azure Monitor managed service for Prometheus overview diagram:
For more information on Azure Monitor workspace and Azure Managed Prometheus, see the following articles:
- Azure Monitor managed service for Prometheus
- Collect Prometheus metrics from an AKS cluster
- Disable Prometheus metrics collection from an AKS cluster
- Collect Prometheus metrics from an Arc-enabled Kubernetes cluster
- Default Prometheus metrics configuration in Azure Monitor
- Customize scraping of Prometheus metrics in Azure Monitor managed service for Prometheus
- Create and validate custom configuration file for Prometheus metrics in Azure Monitor
- Minimal ingestion profile for Prometheus metrics in Azure Monitor
- Scrape Prometheus metrics at scale in Azure Monitor
- Send Prometheus metrics to multiple Azure Monitor workspaces
- Troubleshoot collection of Prometheus metrics in Azure Monitor
- Configure remote write for Azure Monitor managed service for Prometheus using managed identity authentication
- Integrate KEDA with your Azure Kubernetes Service cluster
- Prometheus Azure Active Directory authorization proxy
Azure Managed Grafana
Azure Managed Grafana is a managed service that provides a comprehensive data visualization platform built on top of the Grafana software by Grafana Labs. It's made as a fully managed Azure service operated and supported by Microsoft. Grafana helps you combine metrics, logs and traces into a single user interface. With its extensive support for data sources and graphing capabilities, you can view and analyze your application and infrastructure telemetry data in real time.
Azure Managed Grafana is optimized for the Azure environment. It works seamlessly with many Azure services and provides the following integration features:
- Built-in support for Azure Managed Prometheus and Azure Data Explorer.
- User authentication and access control using Azure Active Directory identities.
- Direct import of existing charts from the Azure portal.
In particular, by integrating with Azure Monitor managed service for Prometheus, Azure Managed Grafana allows you to create rich and customizable dashboards to visualize the Prometheus metrics collected in an Azure Monitor workspace from one or more AKS clusters. Azure Managed Grafana enables you to gain deep visibility into your AKS clusters, troubleshoot issues, and make informed decisions based on real-time data. You can also set up Azure Monitor alerts and use them with Azure Managed Grafana.
For more information on Azure Managed Grafana, see the following articles:
- What is Azure Managed Grafana?
- Azure Managed Grafana service reliability
- Use Azure Monitor managed service for Prometheus as data source for Grafana using managed system identity
- Set up Azure Managed Grafana authentication and permissions
- How to configure data sources for Azure Managed Grafana
- Create a dashboard in Azure Managed Grafana
- Use Azure Monitor alerts with Grafana
- Set up private access using Azure Private Endpoints
- Connect to a data source privately using Azure Private Endpoints
- Enable zone redundancy in Azure Managed Grafana
- Troubleshoot issues for Azure Managed Grafana
- Grafana user interface
Data Collection Rules of an Azure Monitor Workspace
Data collection rules (DCRs) define the data collection process in Azure Monitor. DCRs specify what data should be collected, how to transform that data, and where to send it. Some DCRs are created and managed by Azure Monitor to collect a specific set of data to enable insights and visualizations. However, you can also create your own DCRs to define the set of data required for other scenarios.
Azure Monitor workspace, when combined with Azure Managed Prometheus, allows you to define Data Collection Rules (DCRs). These rules specify which metrics to collect and from which sources within the AKS cluster. By tailoring the data collection rules, you can focus on capturing the metrics most relevant to your monitoring needs.
Data collection rules can also be used to configure the Container Insights extension for an AKS cluster and configure an Azure Log Analytics workspace as a destination for the logs and metrics collected by Azure Monitor Agents on AKS.
Data Collection Endpoints of an Azure Monitor Workspace
An Azure Monitor workspace provides data collection endpoints that enable AKS clusters to send their metrics to a central location. By configuring the clusters to forward metrics to the Azure Monitor workspace, you ensure that all the necessary data is collected and available for further analysis. When you configure your Azure Kubernetes Service (AKS) cluster to send data to an Azure Monitor workspace, a containerized version of the Azure Monitor agent is installed in the kube-system namespace with a metrics extension. The Azure Monitor metrics agent's architecture utilizes a ReplicaSet and a DaemonSet. The ReplicaSet pod scrapes cluster-wide targets such as kube-state-metrics and custom application targets that are specified. The DaemonSet pods scrape targets solely on the node that the respective pod is deployed on, such as node-exporter. The data collection rules use data collection endpoints for ingesting Prometheus metrics from the Azure Monitor metrics agents running on your AKS clusters.
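To check that the agent described above is running, you can list the matching workloads in the kube-system namespace. The following is a minimal sketch: the ama-metrics name prefix is an assumption based on the naming commonly used by the Azure Monitor metrics add-on, and show_metrics_agents is a hypothetical helper name, so verify both against your own cluster.

```shell
# List the ReplicaSets, DaemonSets, and pods of the Azure Monitor metrics agent.
# Assumption: the agent workloads are named with the ama-metrics prefix.
show_metrics_agents() {
  kubectl get replicasets,daemonsets,pods --namespace kube-system | grep ama-metrics
}
```

The function returns a non-zero status when no matching workload is found, which can be handy in a post-deployment check.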
Azure Monitor managed service for Prometheus Rule Groups
Azure Managed Prometheus offers rule groups comprising alert and recording rules. These rule groups provide powerful capabilities to define conditions for recording metrics and triggering alerts:
- Prometheus Recording rules allow you to precompute frequently needed or computationally extensive expressions and store their result as a new set of time series. Time series created by recording rules are ingested back to your Azure Monitor workspace as new Prometheus metrics.
- Prometheus Alert rules let you create an Azure Monitor alert based on the results of a Prometheus Query Language (PromQL) query. Alerts fired by Azure Managed Prometheus alert rules are processed and trigger notifications in a similar way to other Azure Monitor alerts.
Azure Managed Prometheus rule groups, recording rules and alert rules can be created and configured using the Microsoft.AlertsManagement/prometheusRuleGroups resource type. Prometheus rule groups are defined with a scope of a specific Azure Monitor workspace. Prometheus rule groups can be created using Bicep, Azure Resource Manager (ARM) templates, Terraform, API, Azure CLI, or PowerShell. For more information, see Azure Monitor managed service for Prometheus rule groups.
Azure Managed Prometheus vs. Azure Log Analytics
Azure Log Analytics, Container Insights, Azure Managed Prometheus, and Azure Managed Grafana are all monitoring and observability solutions available in Azure. Here is a comparison between them:
Functionality
- Azure Log Analytics: It's a versatile data collection, analysis, and visualization tool. It mainly focuses on the analysis of the diagnostic logs generated by Azure services but can also handle metrics and provides advanced querying capabilities via the Kusto Query Language (KQL).
- Container Insights: It's a comprehensive monitoring solution specifically designed for Kubernetes clusters. It collects metrics, logs, and metadata about containers, nodes, and orchestrators. It offers features like performance analysis, auto-scaling recommendations, and anomaly detection. Container Insights stores its data in a Log Analytics workspace.
- Azure Managed Prometheus: It is a managed version of the open-source monitoring system Prometheus. It collects time-series metrics data and supports powerful querying and visualization features.
- Azure Managed Grafana: It is a managed version of the popular visualization tool Grafana. It provides customizable dashboards and visualizations for metrics and logs data.
Integration
- Azure Log Analytics: It integrates well with other Azure services and can collect data from various sources including Azure Monitor, Azure Kubernetes Service (AKS), Azure Functions, and more.
- Container Insights: It integrates tightly with Azure Log Analytics and collects memory and processor metrics generated by controllers, nodes, and containers in Azure Kubernetes Service, Azure Container Instances, or Azure Arc-enabled Kubernetes clusters.
- Azure Managed Prometheus: It integrates with Azure Monitor and can send metrics data directly to Azure Monitor for alerting and analysis.
- Azure Managed Grafana: It can connect to various data sources including Azure Monitor, Azure Log Analytics, and Azure Managed Prometheus to visualize the collected data.
Data Collection
- Azure Log Analytics: It can collect logs and metrics data from various sources like virtual machines, containers, applications, and custom data sources.
- Container Insights: It collects metrics, logs, and metadata specifically from Kubernetes clusters and containers.
- Azure Managed Prometheus: It collects time-series metrics data from applications or infrastructure components.
- Azure Managed Grafana: It visualizes the data collected by other monitoring solutions like Azure Monitor, Azure Log Analytics, and Azure Managed Prometheus.
Visualization
- Azure Log Analytics: It provides its own visualization capabilities with query-based visualizations and advanced workbook features.
- Container Insights: It provides built-in visualizations and dashboards specific to Kubernetes clusters.
- Azure Managed Prometheus: It supports powerful graphing and visualization capabilities using Prometheus Query Language (PromQL) and can be integrated with Grafana for more advanced visualizations.
- Azure Managed Grafana: It is a dedicated visualization tool that offers highly customizable and interactive dashboards.
Azure Log Analytics, Container Insights, Azure Managed Prometheus, and Azure Managed Grafana have different focuses and functionalities. While Azure Log Analytics and Container Insights are versatile monitoring solutions, Azure Managed Prometheus and Azure Managed Grafana provide specific metrics collection and visualization features. The choice between them depends on the specific monitoring needs and preferences.
Deploy the Bicep modules
You can deploy the Bicep modules in the bicep folder using the deploy.sh Bash script in the same folder. Specify a value for the following parameters in the deploy.sh script and main.parameters.json parameters file before deploying the Bicep modules.
- prefix: specifies a prefix for all the Azure resources.
- authenticationType: specifies the type of authentication when accessing the Virtual Machine. sshPublicKey is the recommended value. Allowed values: sshPublicKey and password.
- vmAdminUsername: specifies the name of the administrator account of the virtual machine.
- vmAdminPasswordOrKey: specifies the SSH Key or password for the virtual machine.
- aksClusterSshPublicKey: specifies the SSH Key or password for AKS cluster agent nodes.
- aadProfileAdminGroupObjectIDs: when deploying an AKS cluster with Azure AD and Azure RBAC integration, this array parameter contains the list of Azure AD group object IDs that will have the admin role of the cluster.
- keyVaultObjectIds: specifies the object ID of the service principals to configure in Key Vault access policies.
- windowsAgentPoolEnabled: specifies whether to create a Windows Server agent pool.
We suggest reading sensitive configuration data such as passwords or SSH keys from a pre-existing Azure Key Vault resource. For more information, see Use Azure Key Vault to pass secure parameter value during Bicep deployment.
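For orientation, a deployment along these lines can be sketched with the Azure CLI. This is not the content of the sample's deploy.sh script, which is not reproduced here: the main.bicep file name, the default resource group name and location, and the deploy_bicep_modules function name are all assumptions made for illustration.

```shell
# Hypothetical defaults: override them through environment variables.
RESOURCE_GROUP="${RESOURCE_GROUP:-MyAksRG}"
LOCATION="${LOCATION:-westeurope}"
TEMPLATE="${TEMPLATE:-main.bicep}"               # assumed entry-point file name
PARAMETERS="${PARAMETERS:-main.parameters.json}"

deploy_bicep_modules() {
  # Create the target resource group (succeeds if it already exists).
  az group create \
    --name "$RESOURCE_GROUP" \
    --location "$LOCATION" \
    --output none || return 1

  # Deploy the template with the values from the parameters file.
  az deployment group create \
    --resource-group "$RESOURCE_GROUP" \
    --template-file "$TEMPLATE" \
    --parameters "@$PARAMETERS"
}
```

Run deploy_bicep_modules from the folder that contains the template and the parameters file. Stopping at the first failing command avoids starting a deployment against a resource group that could not be created.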
Azure Managed Prometheus Bicep Module
The managedPrometheus.bicep Bicep module is used to deploy an Azure Monitor managed service for Prometheus workspace.
The Bicep module deploys the following Azure resources and child resources:
- An Azure Monitor workspace for Managed Prometheus
- A data collection endpoint used by the AKS-hosted Azure Monitor Agents to send Prometheus metrics to the Azure Monitor workspace.
- A data collection rule that uses the data collection endpoint defined by the previous step and defines the Azure Monitor workspace as a destination of the Prometheus metrics collected by the Azure Monitor Agents on the AKS cluster.
- A data collection rule association that binds the data collection rule with the AKS cluster.
- A series of Prometheus rule groups that define Prometheus recording rules and Prometheus alert rules for Linux and Windows node pools.
Deploying an Azure Monitor workspace automatically creates a data collection rule and a data collection endpoint. For instance, when you create an Azure Monitor workspace resource, a resource group named MA_<workspace-name>_<region>_managed is created, along with a data collection rule and a data collection endpoint that are associated with the Azure Monitor workspace.
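The naming pattern above can be turned into a quick lookup of the automatically created resources. This is a minimal sketch under the assumption that the managed resource group follows the MA_<workspace-name>_<region>_managed pattern exactly; both function names are hypothetical.

```shell
# Build the managed resource group name from the workspace name and region.
# $1: Azure Monitor workspace name, $2: region (for example westeurope)
managed_rg_name() {
  printf 'MA_%s_%s_managed\n' "$1" "$2"
}

# List the resources (data collection rule and endpoint) in that resource group.
list_managed_resources() {
  az resource list --resource-group "$(managed_rg_name "$1" "$2")" --output table
}
```

For example, list_managed_resources MyWorkspace westeurope queries the MA_MyWorkspace_westeurope_managed resource group, where MyWorkspace is a placeholder name.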
Azure Managed Grafana Bicep module
The managedGrafana.bicep Bicep module is used to deploy the Azure Managed Grafana instance.
The Bicep module creates an Azure Managed Grafana instance with a system-assigned managed identity. The azureMonitorWorkspaceIntegrations array contains the resource ID of the Azure Monitor managed service for Prometheus workspace. For more information on integrating Azure Managed Grafana with an Azure Monitor workspace, see Collect Prometheus metrics from an AKS cluster.
By default, when a Grafana instance is created, Azure Managed Grafana grants it the Monitoring Reader role for all Azure Monitor data and Log Analytics resources within a subscription. This means the new Grafana instance can access and search all monitoring data in the subscription. It can view the Azure Monitor metrics and logs from all resources, and any logs stored in Log Analytics workspaces in the subscription. The Bicep module manually assigns the Monitoring Reader role to the Azure Managed Grafana system-assigned managed identity at the workspace scope. For more information, see How to modify access permissions to Azure Monitor.
The Bicep module assigns the Monitoring Data Reader role to the Azure Managed Grafana system-assigned managed identity with the Azure Monitor workspace as the scope. For more information, see Use Azure Monitor managed service for Prometheus as data source for Grafana using managed system identity.
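The Bicep module performs this role assignment declaratively. For readers who want to reproduce it by hand, an equivalent Azure CLI call might look like the following sketch; it is not taken from the module, and the assign_monitoring_data_reader function name is hypothetical.

```shell
# Grant the Grafana managed identity read access to the Prometheus metrics
# stored in an Azure Monitor workspace.
# $1: object (principal) ID of the Grafana system-assigned managed identity
# $2: resource ID of the Azure Monitor workspace, used as the assignment scope
assign_monitoring_data_reader() {
  az role assignment create \
    --assignee-object-id "$1" \
    --assignee-principal-type ServicePrincipal \
    --role "Monitoring Data Reader" \
    --scope "$2"
}
```

Scoping the assignment to the workspace, rather than to the subscription, keeps the identity's access limited to the metrics it actually needs to read.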
The Bicep module creates a Grafana Admin role assignment on the Azure Managed Grafana instance for the Microsoft Entra ID user whose object ID is defined in the userId parameter. The Grafana Admin role provides full control of the instance, including managing role assignments and viewing, editing, and configuring data sources. For more information, see How to share access to Azure Managed Grafana.
Action Group Bicep Module
The actionGroup.bicep Bicep module is used to deploy an Action Group used to handle the alerts generated by the clusters. When Azure Monitor data indicates that there might be a problem with your infrastructure or application, an alert is triggered. Alerts can contain action groups, which are a collection of notification preferences. Azure Monitor, Azure Service Health, and Azure Advisor use action groups to notify users about the alert and take an action.
The Bicep module creates an action group with two actions:
- Email: when the emailAddress parameter is not empty, the module creates an email receiver to send notifications to the specified email address.
- SMS: when the countryCode and the phoneNumber parameters are not empty, the module creates an SMS receiver to send SMS notifications to the specified phone number.
Deployment Script
The sample makes use of a Deployment Script to run the install-nginx-with-prometheus-metrics-and-create-sa.sh Bash script that creates the namespace and service account for the sample application and installs the following packages to the AKS cluster via Helm. For more information on deployment scripts, see Use deployment scripts in Bicep. When you deploy ingresses, the add-on creates publicly accessible DNS names for endpoints on an Azure DNS zone.
This sample uses the NGINX Ingress Controller to expose Linux and Windows demo applications that you can find in the apps folder.
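As an illustration of what installing the NGINX Ingress Controller with Prometheus metrics enabled can look like, the following sketch uses the public ingress-nginx Helm chart. It is not the content of the sample's Bash script: the release name, the namespace argument, and the install_nginx_with_metrics function name are assumptions, while the chart values shown are the ones the ingress-nginx project documents for exposing metrics on port 10254.

```shell
# Install (or upgrade) the NGINX Ingress Controller with metrics enabled.
# $1: target namespace, for example ingress-basic (a hypothetical name)
install_nginx_with_metrics() {
  helm upgrade ingress-nginx ingress-nginx \
    --install \
    --repo https://kubernetes.github.io/ingress-nginx \
    --namespace "$1" \
    --create-namespace \
    --set controller.metrics.enabled=true \
    --set-string 'controller.podAnnotations.prometheus\.io/scrape=true' \
    --set-string 'controller.podAnnotations.prometheus\.io/port=10254'
}
```

The pod annotations mark the controller pods as scrape targets; whether they are picked up depends on the scrape configuration of the metrics agent in your cluster.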
The install-nginx-with-prometheus-metrics-and-create-sa.sh Bash script can run on a public AKS cluster or on a private AKS cluster using the az aks command invoke command. For more information, see Use command invoke to access a private Azure Kubernetes Service (AKS) cluster.
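A minimal sketch of the command invoke mechanism follows. The invoke_in_cluster function name and the argument values in the usage note are hypothetical; the az aks command invoke parameters are the standard ones.

```shell
# Run a command inside an AKS cluster through the Azure API, without direct
# network access to the cluster's API server.
# $1: resource group, $2: cluster name, $3: command to run in the cluster
invoke_in_cluster() {
  az aks command invoke \
    --resource-group "$1" \
    --name "$2" \
    --command "$3"
}
```

For example, invoke_in_cluster MyAksRG MyAks "kubectl get pods --namespace kube-system" lists the system pods of a private cluster from any machine that can reach Azure Resource Manager.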
The install-nginx-with-prometheus-metrics-and-create-sa.sh Bash script returns the following outputs to the deployment script:
- Cert-manager namespace
- NGINX ingress controller namespace
Alternatively, you can install the NGINX Ingress Controller and External DNS controller using the Azure Kubernetes Service (AKS) ingress with the application routing add-on. The application routing add-on configures an NGINX ingress controller in your Azure Kubernetes Service (AKS) cluster with SSL termination through certificates stored in Azure Key Vault.
Azure Monitor Workspace Metrics
Once you have deployed the Azure Monitor workspace for Managed Prometheus and Azure Managed Grafana, and configured an Azure Kubernetes Service (AKS) cluster to collect Prometheus metrics in the Azure Monitor workspace, you can access the Metrics of the workspace and use the Events Per Minute Ingested metric to verify that the workspace is properly receiving events from your AKS cluster, as shown in the following figure:
The following table lists the metrics available for the Microsoft.Monitor/accounts resource type.
Table headings
Metric - Metric display name followed by a description of the metric. The display name appears in the Azure portal.
Name - The name of the metric as referred to in the REST API.
Unit - The default units used for the metric.
Aggregation - The default aggregation type for this metric. Valid values: Average, Minimum, Maximum, Total, Count.
Dimensions - Dimensions available. For more information, see (link to dimensions information).
DS Export - Whether the metric is exportable to Azure Monitor Logs via Diagnostic Settings. You can access all metrics via the REST API.
Metric | Name | Unit | Aggregation | Dimensions | DS Export |
---|---|---|---|---|---|
Active Time Series % Utilization: The percentage of current active time series account limit being utilized | ActiveTimeSeriesPercentUtilization | Percent | Average | StampColor | No |
Active Time Series: The number of unique time series recently ingested into the account over the previous 12 hours | ActiveTimeSeries | Count | Maximum | StampColor | No |
Active Time Series Limit: The limit on the number of unique time series which can be actively ingested into the account | ActiveTimeSeriesLimit | Count | Maximum | StampColor | No |
Events Per Minute Ingested: The number of events per minute recently received | EventsPerMinuteIngested | Count | Maximum | StampColor | No |
Events Per Minute Ingested Limit: The maximum number of events per minute which can be received before events become throttled | EventsPerMinuteIngestedLimit | Count | Maximum | StampColor | No |
Events Per Minute Ingested % Utilization: The percentage of the current metric ingestion rate limit being utilized | EventsPerMinuteIngestedPercentUtilization | Percent | Average | StampColor | No |
Simple Data Samples Stored: The total number of samples stored for simple sampling types (like sum, count). For Prometheus this is equivalent to the number of samples scraped and ingested. | SimpleSamplesStored | Count | Maximum | StampColor | No |
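As a rough illustration, the metrics in the table above can be read with the az monitor metrics list Azure CLI command. The sketch below is not part of the sample: the function name and its argument (the resource ID of the Azure Monitor workspace) are placeholders of mine, and the command is wrapped in a function so that nothing runs until you call it.

```shell
# Hypothetical helper: prints the average active time series utilization
# of an Azure Monitor workspace in one-hour intervals.
# $1 = resource ID of the Microsoft.Monitor/accounts resource (placeholder)
show_workspace_utilization() {
  az monitor metrics list \
    --resource "$1" \
    --metric ActiveTimeSeriesPercentUtilization \
    --aggregation Average \
    --interval PT1H \
    --output table
}
```

Any other value from the Name column can be used in place of ActiveTimeSeriesPercentUtilization, with the aggregation listed for that metric.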
Enable the collection of Windows metrics in Prometheus format
Azure Managed Prometheus supports collecting metrics in Prometheus format from the nodes of a Windows Server agent pool. Onboarding to the Azure Monitor Metrics add-on enables the Windows DaemonSet pods to start running on your node pools. Both Windows Server 2019 and Windows Server 2022 are supported. You must follow these steps to enable the pods to collect metrics from your Windows node pools.
Manually install the windows-exporter DaemonSet on Windows nodes to scrape metrics in Prometheus format. This enables the following collectors:
- [defaults]
- container
- memory
- process
- cpu_info
You can download, customize, and deploy the windows-exporter-daemonset YAML manifest as follows:
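Assuming you already have a customized copy of the manifest at the path used in this sample, the deployment step is a single kubectl apply. The function name below is hypothetical, and the command is wrapped in a function so that nothing runs until you call it.

```shell
# Hypothetical helper: deploys the windows-exporter DaemonSet.
# Assumes kubectl already points at the AKS cluster and that the manifest
# exists at this relative path.
deploy_windows_exporter() {
  kubectl apply -f windows/windows-exporter-daemonset.yaml
}
```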
If you defined one or more taints on your Windows agent pools, make sure to add the necessary tolerations to the pod definition, as shown in the windows/windows-exporter-daemonset.yaml YAML manifest.
Then, apply the ama-metrics-settings-configmap to your cluster. Set the windowsexporter and windowskubeproxy Booleans to true. The following windows/ama-metrics-settings-configmap.yaml YAML manifest shows how to customize the ama-metrics-settings-configmap configmap. For more information, see Metrics add-on settings configmap.
You can deploy the configmap using the following command:
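A minimal sketch of that command, assuming the manifest path used in this sample and a kubectl context that points at the AKS cluster. The function name is a placeholder of mine.

```shell
# Hypothetical helper: applies the customized settings configmap.
# The manifest is expected to target the kube-system namespace itself.
apply_windows_metrics_settings() {
  kubectl apply -f windows/ama-metrics-settings-configmap.yaml
}
```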
Finally, you have to enable the recording rules that are required for the out-of-the-box dashboards:
- If onboarding using the Azure CLI, include the option --enable-windows-recording-rules.
- If onboarding using an ARM template, Bicep, or Azure Policy, enable the Microsoft.AlertsManagement/prometheusRuleGroups resources used to collect Windows metrics.
- If the cluster is already onboarded, use this ARM template and this parameter file to create the rule groups.
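For the Azure CLI option, onboarding might look like the sketch below. It assumes you enable the metrics add-on with az aks update. The function name and all four arguments are placeholders, and nothing runs until you call the function.

```shell
# Hypothetical helper: onboards a cluster to the metrics add-on and
# enables the Windows recording rules in the same call.
# $1 = resource group, $2 = AKS cluster name
# $3 = Azure Monitor workspace resource ID
# $4 = Azure Managed Grafana resource ID
enable_windows_recording_rules() {
  az aks update \
    --resource-group "$1" \
    --name "$2" \
    --enable-azure-monitor-metrics \
    --azure-monitor-workspace-resource-id "$3" \
    --grafana-resource-id "$4" \
    --enable-windows-recording-rules
}
```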
The Bicep module in this project automatically creates and enables the necessary rule groups to collect metrics from Windows nodes in Prometheus format. For more information on how to customize metrics scraping for a Kubernetes cluster with the metrics add-on in Azure Monitor, see Customize scraping of Prometheus metrics in Azure Monitor managed service for Prometheus.
Verify the Deployment of the Azure Monitor Agents
Run the following command to verify that the DaemonSet was correctly deployed on the Linux node pools:
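A sketch of that check. To my knowledge, ama-metrics-node is the name the metrics add-on gives to the Linux DaemonSet, so adjust it if your cluster shows a different name. The function name is hypothetical.

```shell
# Hypothetical helper: shows the desired/ready pod counts of the
# Linux metrics DaemonSet in the kube-system namespace.
verify_linux_agents() {
  kubectl get ds ama-metrics-node --namespace=kube-system
}
```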
The number of pods should equal the number of Linux nodes on the cluster. The output should resemble the following example:
Run the following command to verify that the DaemonSet was correctly deployed on the Windows node pools:
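The Windows check follows the same pattern. I am assuming the Windows DaemonSet is named ama-metrics-win-node, which is the name I know the add-on to use. The function name is hypothetical.

```shell
# Hypothetical helper: shows the desired/ready pod counts of the
# Windows metrics DaemonSet in the kube-system namespace.
verify_windows_agents() {
  kubectl get ds ama-metrics-win-node --namespace=kube-system
}
```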
The number of pods should equal the number of Windows nodes on the cluster. The output should resemble the following example:
Run the following command to verify that the two ReplicaSets were deployed properly:
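A sketch of that check, which simply lists every ReplicaSet in the kube-system namespace. The function name is a placeholder. In the output, look for the entries whose names start with ama-metrics.

```shell
# Hypothetical helper: lists the ReplicaSets in kube-system so that the
# ones created by the metrics add-on can be inspected.
verify_metrics_replicasets() {
  kubectl get rs --namespace=kube-system
}
```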
The output should resemble the following example:
Azure Managed Grafana Endpoint
You can retrieve the endpoint URL of an Azure Managed Grafana from the Azure Portal, as shown in the following figure:
Alternatively, you can run the az grafana show Azure CLI command to get the endpoint URL of an Azure Managed Grafana:
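A minimal sketch, assuming the Azure CLI extension that provides the az grafana commands is installed. The function name and both arguments are placeholders.

```shell
# Hypothetical helper: prints only the endpoint URL of the instance.
# $1 = Azure Managed Grafana name, $2 = resource group
get_grafana_endpoint() {
  az grafana show \
    --name "$1" \
    --resource-group "$2" \
    --query properties.endpoint \
    --output tsv
}
```

The --query option extracts the properties.endpoint field, so the function prints a bare URL that can be stored in a variable or opened in a browser.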
Azure Managed Prometheus Data Source
You can find the Azure Managed Prometheus data source under the Home > Administration > Data sources page in the Grafana portal, as shown in the following figure:
Alternatively, you can run the az grafana data-source list Azure CLI command to list all data sources of an Azure Managed Grafana instance:
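A minimal sketch of the listing step. The function name and both arguments are placeholders.

```shell
# Hypothetical helper: lists every data source of the instance as JSON.
# $1 = Azure Managed Grafana name, $2 = resource group
list_grafana_data_sources() {
  az grafana data-source list \
    --name "$1" \
    --resource-group "$2"
}
```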
If you know the name of the data source, you can use the az grafana data-source show Azure CLI command to get the details of a data source. The name of the data source is case-sensitive. For example, the following command:
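A sketch of such a command, wrapped in a function whose name and arguments are placeholders of mine. The data source name is passed in rather than hard-coded, because it depends on your deployment.

```shell
# Hypothetical helper: prints the details of one data source as JSON.
# $1 = Azure Managed Grafana name, $2 = resource group
# $3 = data source name (case-sensitive)
show_grafana_data_source() {
  az grafana data-source show \
    --name "$1" \
    --resource-group "$2" \
    --data-source "$3"
}
```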
returns the following result in JSON format:
Default Dashboards
The following default dashboards are automatically provisioned and configured by Azure Monitor managed service for Prometheus when you link your Azure Monitor workspace to an Azure Managed Grafana instance. The source code for these dashboards can be found in this GitHub repository. The dashboards are provisioned in the specified Azure Managed Grafana instance under the Managed Prometheus folder.
These are the standard open-source community dashboards for monitoring Kubernetes clusters with Prometheus and Grafana.
Kubernetes / Compute Resources / Cluster
Kubernetes / Compute Resources / Namespace (Pods)
Kubernetes / Compute Resources / Node (Pods)
Kubernetes / Compute Resources / Pod
Kubernetes / Compute Resources / Namespace (Workloads)
Kubernetes / Compute Resources / Workload
Kubernetes / Kubelet
Node Exporter / USE Method / Node
Node Exporter / Nodes
Kubernetes / Compute Resources / Cluster (Windows)
Kubernetes / Compute Resources / Namespace (Windows)
Kubernetes / Compute Resources / Pod (Windows)
Kubernetes / USE Method / Cluster (Windows)
Kubernetes / USE Method / Node (Windows)
The following figure shows the Kubernetes / Compute Resources / Cluster dashboard for an Azure Kubernetes Service (AKS) cluster configured to collect Prometheus metrics to an Azure Monitor workspace integrated with the Azure Managed Grafana instance.
The following figure shows the Kubernetes / Compute Resources / Cluster (Windows) dashboard for an Azure Kubernetes Service (AKS) cluster with a Windows agent pool.
NGINX Ingress Controller
The NGINX Ingress Controller can be configured to generate metrics in Prometheus format. If you deployed the NGINX Ingress Controller using a Helm chart, the easiest way to configure the controller for metrics is via helm upgrade. Assuming you installed the NGINX Ingress Controller as a Helm release named ingress-nginx in the ingress-basic namespace, you can run the command shown below:
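The sketch below follows the settings suggested in the upstream ingress-nginx monitoring documentation. The release name and namespace are the ones assumed in the text above, the function name is hypothetical, and the exact command used by the sample's deployment script may differ.

```shell
# Hypothetical helper: turns on the Prometheus metrics endpoint of the
# controller and annotates its pods so that they can be discovered.
enable_nginx_metrics() {
  helm upgrade ingress-nginx ingress-nginx \
    --repo https://kubernetes.github.io/ingress-nginx \
    --namespace ingress-basic \
    --set controller.metrics.enabled=true \
    --set-string controller.podAnnotations."prometheus\.io/scrape"="true" \
    --set-string controller.podAnnotations."prometheus\.io/port"="10254"
}
```

The backslash before .io keeps Helm from treating the dot in the annotation key as a path separator.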
In our sample, the bash script executed by the deployment script creates the NGINX ingress controller via Helm with the above settings. For more information, see Prometheus and Grafana installation using Service Monitors.
You can configure the Azure Monitor Agents on the Linux nodes of your AKS cluster to collect the NGINX ingress controller metrics in Prometheus format by customizing the ama-metrics-prometheus-config configmap in the kube-system namespace. The nginx/ama-metrics-prometheus-config-configmap.yaml YAML manifest shows how to customize the configmap to scrape the Prometheus metrics generated by the NGINX ingress controller and by any pod in any namespace with the annotation prometheus.io/scrape: true.
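To give an idea of the shape of such a configmap, here is a deliberately minimal sketch that writes one to a file. It is not the manifest shipped with the sample: the job name, scrape interval, and function name are my assumptions, and it only filters pods by the prometheus.io/scrape annotation. A real configuration would typically also rewrite the scrape address and path from the prometheus.io/port and prometheus.io/path annotations.

```shell
# Hypothetical helper: writes a minimal scrape configmap to the given file.
# $1 = output file (placeholder path)
write_annotation_scrape_config() {
  cat > "$1" <<'EOF'
kind: ConfigMap
apiVersion: v1
metadata:
  name: ama-metrics-prometheus-config
  namespace: kube-system
data:
  prometheus-config: |-
    global:
      scrape_interval: 30s
    scrape_configs:
    - job_name: annotated-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
EOF
}
```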
You can deploy the above configmap using the following command:
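A minimal sketch of that command, assuming the manifest path used in this sample. The function name is a placeholder, and nothing runs until you call it.

```shell
# Hypothetical helper: applies the custom scrape configuration.
# The manifest is expected to target the kube-system namespace itself.
apply_nginx_scrape_config() {
  kubectl apply -f nginx/ama-metrics-prometheus-config-configmap.yaml
}
```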
Generally, you can use four different configmaps to provide scrape configuration and other settings for the metrics add-on. All the configmaps should be applied to the kube-system namespace of the cluster. None of the four configmaps exist by default in the cluster when Azure Managed Prometheus is enabled. Depending on what needs to be customized, you need to deploy any or all of these four configmaps, with the exact names specified below, in the kube-system namespace. The AMA-Metrics pods pick up these configmaps after you deploy them to the kube-system namespace and restart in 2-3 minutes to apply the configuration settings specified in the configmap(s).
ama-metrics-settings-configmap
This config map has the following simple settings that can be configured. You can take the configmap from the above GitHub repo, change the settings as required, and apply/deploy the configmap to the kube-system namespace for your cluster.
- cluster alias (to change the value of the cluster label in every time-series/metric that's ingested from a cluster)
- enable/disable default scrape targets - Turn ON/OFF default scraping based on targets. Scrape configuration for these default targets is already pre-defined/built-in
- enable pod annotation based scraping per namespace
- metric keep-lists - this setting is used to control which metrics are allowed from each default target and to change the default behavior
- scrape intervals for default/pre-defined targets. 30 secs is the default scrape frequency and it can be changed per default target using this configmap
- debug-mode - turning this ON helps to debug missing metric/ingestion issues - see more on troubleshooting
ama-metrics-prometheus-config
This config map can be used to provide Prometheus scrape config for the add-on replica. The add-on runs a singleton replica, and any cluster-level services can be discovered and scraped by providing scrape jobs in this configmap. You can take the sample configmap from the above GitHub repo, add the scrape jobs that you need, and apply/deploy the config map to the kube-system namespace for your cluster.
ama-metrics-prometheus-config-node
This config map can be used to provide Prometheus scrape config for the add-on DaemonSet that runs on every Linux node in the cluster, and any node-level targets on each node can be scraped by providing scrape jobs in this configmap. When you use this configmap, you can use the $NODE_IP variable in your scrape config, which gets substituted by the corresponding node's IP address in the DaemonSet pod running on each node. This way you get access to scrape anything that runs on that node from the metrics add-on DaemonSet. Be careful when you use discoveries in the scrape config in this node-level config map, as every node in the cluster will set up and discover the target(s) and will collect redundant metrics. You can take the sample configmap from the above GitHub repo, add the scrape jobs that you need, and apply/deploy the config map to the kube-system namespace for your cluster.
ama-metrics-prometheus-config-node-windows
This config map can be used to provide Prometheus scrape config for the add-on DaemonSet that runs on every Windows node in the cluster, and node-level targets on each node can be scraped by providing scrape jobs in this configmap. When you use this configmap, you can use the $NODE_IP variable in your scrape config, which gets substituted by the corresponding node's IP address in the DaemonSet pod running on each node. This way you get access to scrape anything that runs on that node from the metrics add-on DaemonSet. Be careful when you use discoveries in the scrape config in this node-level config map, as every node in the cluster will set up and discover the target(s) and will collect redundant metrics. You can take the sample configmap from the above GitHub repo, add the scrape jobs that you need, and apply/deploy the config map to the kube-system namespace for your cluster.
For more information, see Customize scraping of Prometheus metrics in Azure Monitor managed service for Prometheus.
The complete list of Prometheus metrics exposed by the NGINX ingress controller is available in the documentation. NGINX provides two Grafana dashboards to visualize these metrics. I customized these two dashboards for Azure Managed Grafana. In particular, I added a cluster variable that lets you select one or all AKS clusters. You can find the definitions of these dashboards in JSON format under the nginx folder.
The Kubernetes / NGINX Ingress controller / By Cluster dashboard lets you filter by cluster, namespace, controller class, controller, and ingress, and shows the following metrics:
Controller Request Volume
Controller Connections
Controller Success Rate
Config Reloads
Last Config Failed
Ingress Request Volume
Ingress Success Rate
Network I/O Pressure
Average Memory Usage
Average CPU Usage
Ingress Percentile Response Times and Transfer Rates (P50 Latency, P90 Latency, P99 Latency, IN throughput, OUT throughput)
Ingress Percentile Response Times
Ingress Request Latency Heatmap
Ingress SSL Certificate Expiry
The Kubernetes / NGINX Ingress controller / Request Handling Performance dashboard lets you filter by cluster and ingress and shows the following metrics:
Request Latency Percentiles: The P50, P95, and P99 request latency percentiles.
Upstream Response Latency Percentiles: The P50, P95, and P99 upstream response latency percentiles.
Request Rate by Method and Path: The request rate by HTTP method and path.
Median Upstream Response Time by Method and Path: The median upstream response time by HTTP method and path.
Response Error Rate by Method and Path: The percentage of 4xx and 5xx responses among all responses by HTTP method and path.
Upstream Response Time by Method and Path: The sum of upstream request time by HTTP method and path.
Response Error Rate by Method and Path: The response error rate by HTTP method and path.
Average Response Size by Method and Path: The average response size by HTTP method and path.
Upstream Service Latency: The upstream service latency.
Network Observability
Network observability is essential to maintaining a healthy and performant Kubernetes cluster. By collecting and analyzing data about network traffic, you can gain insights into how your cluster operates and identify potential problems before they cause outages or performance degradation. For more information, see What is Azure Kubernetes Service (AKS) Network Observability?
The Network Observability add-on operates seamlessly on non-Cilium and Cilium data planes. Once enabled, this feature provides network administrators, cluster security administrators, and DevOps engineers with a tool to monitor network issues in an Azure Kubernetes Service (AKS) cluster.
When the Network Observability add-on is enabled, it allows for collecting and converting useful metrics into Prometheus format, which can then be visualized in Grafana. There are two options available for using Prometheus and Grafana in this context: Azure Managed Prometheus and Azure Managed Grafana or bring your own (BYO) in-cluster or out-cluster Prometheus and Grafana.
- Azure Managed Prometheus and Grafana: This option involves using a managed service provided by Azure, as shown in this article. The managed service takes care of the infrastructure and maintenance of Prometheus and Grafana, allowing you to focus on configuring and visualizing your metrics. This option is convenient if you prefer not to manage the underlying infrastructure and possibly want to share the same instances across multiple AKS clusters. For more information, see Setup Network Observability for Azure Kubernetes Service (AKS) Azure managed Prometheus and Grafana.
- BYO Prometheus and Grafana: Alternatively, you can choose to set up your own Prometheus and Grafana instances inside or outside the cluster. In this case, you're responsible for provisioning and managing the infrastructure required to run Prometheus and Grafana. You need to install and configure Prometheus to scrape and store the metrics generated by the Network Observability add-on. Similarly, Grafana needs to be set up to connect to Prometheus and visualize the collected data. For more information, see Setup Network Observability for Azure Kubernetes Service (AKS) BYO Prometheus and Grafana.
In both cases, you can use ID 18814 to import the Kubernetes / Networking dashboard from Grafana's public dashboard repository. This Grafana dashboard for Network Observability add-on for AKS provides the following benefits:
- Visibility into cluster-level network metrics such as packet drops, connection stats, and more.
- (GA) Access to pod-level metrics and network debuggability features
- Support for all Azure CNIs - AzureCNI and AzureCNI (Powered by Cilium)
- Support for all AKS node types - Linux and Windows
- Easy deployment using native Azure tools - AKS CLI, ARM templates, PowerShell, etc.
- Seamless integration with the Azure Managed Prometheus and Azure Managed Grafana offerings.
Under the network-observability folder, you can find the Kubernetes / Network Observability / Networking dashboard in JSON format, which customizes the Kubernetes / Networking dashboard by adding the possibility to filter by cluster.
The Network Observability add-on currently supports only node-level metrics on Linux and Windows platforms. The table below outlines the different metrics generated by the Network Observability add-on.
Metric Name | Description | Labels | Linux | Windows |
---|---|---|---|---|
kappie_forward_count | Total forwarded packet count | Direction, NodeName, Cluster | Yes | Yes |
kappie_forward_bytes | Total forwarded byte count | Direction, NodeName, Cluster | Yes | Yes |
kappie_drop_count | Total dropped packet count | Reason, Direction, NodeName, Cluster | Yes | Yes |
kappie_drop_bytes | Total dropped byte count | Reason, Direction, NodeName, Cluster | Yes | Yes |
kappie_tcp_state | TCP active socket count by TCP state. | State, NodeName, Cluster | Yes | Yes |
kappie_tcp_connection_remote | TCP active socket count by remote address. | Address, Port, NodeName, Cluster | Yes | No |
kappie_tcp_connection_stats | TCP connection statistics. (ex: Delayed ACKs, TCPKeepAlive, TCPSackFailures) | Statistic, NodeName, Cluster | Yes | Yes |
kappie_tcp_flag_counters | TCP packets count by flag. | Flag, NodeName, Cluster | Yes | Yes |
kappie_ip_connection_stats | IP connection statistics. | Statistic, NodeName, Cluster | Yes | No |
kappie_udp_connection_stats | UDP connection statistics. | Statistic, NodeName, Cluster | Yes | No |
kappie_udp_active_sockets | UDP active socket count | NodeName, Cluster | Yes | No |
kappie_interface_stats | Interface statistics. | InterfaceName, Statistic, NodeName, Cluster | Yes | Yes |
You can enable the Network Observability add-on when you deploy the Azure Kubernetes Service (AKS) cluster via Bicep by setting the value of the monitoringEnabled parameter to true. This sets the networkProfile.monitoring.enabled property of the AKS cluster to true in the aksCluster.bicep module.
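After the deployment, one way to check the setting is to read the property back with az aks show. This is a sketch under the assumption that your Azure CLI version (with the aks-preview extension) returns the networkProfile.monitoring block. The function name and arguments are placeholders.

```shell
# Hypothetical helper: prints the value of networkProfile.monitoring.enabled.
# $1 = resource group, $2 = AKS cluster name
show_network_observability_state() {
  az aks show \
    --resource-group "$1" \
    --name "$2" \
    --query networkProfile.monitoring.enabled \
    --output tsv
}
```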
Review deployed resources
You can use the Azure portal to list the deployed resources in the resource group, as shown in the following picture:
You can also use Azure CLI to list the deployed resources in the resource group:
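A minimal sketch of the listing command. The function name and the resource group argument are placeholders.

```shell
# Hypothetical helper: lists the resources of a resource group as a table.
# $1 = resource group
list_deployed_resources() {
  az resource list \
    --resource-group "$1" \
    --output table
}
```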
You can also use the Get-AzResource PowerShell cmdlet with the -ResourceGroupName parameter to list the deployed resources in the resource group.
Clean up resources
You can delete the resource group using the following Azure CLI command when you no longer need the resources you created. This will remove all the Azure resources.
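A minimal sketch of the cleanup command. The function name and the resource group argument are placeholders. Because the deletion is irreversible, the command is wrapped in a function and runs only when you call it explicitly.

```shell
# Hypothetical helper: deletes a resource group and everything in it.
# --yes skips the confirmation prompt; --no-wait returns immediately.
# $1 = resource group
delete_resource_group() {
  az group delete \
    --name "$1" \
    --yes \
    --no-wait
}
```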
Alternatively, you can use the Remove-AzResourceGroup PowerShell cmdlet to delete the resource group and all the Azure resources.