Azure Monitor for enterprises and beyond. How does the Azure CxP team leverage Policies at scale?

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs - .

Hi everyone,

I am a senior service engineer at Microsoft, working with our S500 customers around the world. Our CxP organization focuses on Customer Success Experience, and to achieve customer success we apply our best solutions and tools at scale.

Currently, my technical focus is a centralized approach to monitoring, governance, and automation. This is where Azure Policies, Azure Monitor, and Azure Lighthouse play a significant role. I would like to share our experience, based on our learnings, research, and work with Azure product groups.

I came from the on-premises world, where most things happened in the customer's data center and where operations and developers usually lived separately :) I am a huge fan of System Center Operations Manager. I have worked with it since 2008; SCOM was my first monitoring tool, and I am still in love with it. It helped me understand what monitoring should look like, and not just from a technical standpoint.

Back to Azure. It has so many services and tools for automation that one of the biggest challenges is finding the right tool or service to solve your problem. If you work with Azure Monitor, you already know that it is a mix of services such as Log Analytics, Metrics Storage, Application Insights, Azure Alert Rules, Processing Rules, Action Groups, Workbooks, Grafana Managed Instance, etc. All of them are great, but together they need a common, centralized approach from a manageability perspective.

If you have worked with SCOM or similar monitoring tools, you know that monitoring starts with the discovery of monitored objects. When you think at scale (in the enterprise world you should always think at scale, IMHO), discovery should be automated in the most efficient way and should be as native as possible.

So, the first step is discovery. How do we do this? The answer is Azure Policies, because they are native, because they are ready to work at scale on various levels (management groups, subscriptions, resource groups), and because they have a native “discovery” process: compliance assessments. This is how we discover newly added resources deployed by developers and partners in their Landing Zones or other subscription-based workloads.

Our Azure Policies use the DINE (“deployIfNotExists”) effect. Azure Policies also allow you to remediate existing resources that are not compliant. And they discover new virtual machines, application gateways, load balancers, Redis Cache instances, Azure SQL databases, and other resource types that should comply with the Policies.

If a new virtual machine is created, the DINE policy deploys a set of required alert rules designed and developed by the centralized monitoring team. It works like a charm; you do not need to care about discovery, because Azure Policies do it automatically.
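To make the mechanism more concrete, here is a minimal sketch in Python of what such a DINE policy definition could look like. It is not one of our production policies: the alert name suffix, the `cpuThreshold` parameter, the chosen metric, and the use of the built-in Contributor role are assumptions made for illustration, and the definition has not been validated against a live tenant.

```python
import json

# Built-in "Contributor" role ID. A DINE assignment's managed identity needs
# a role to run the deployment; a narrower role is preferable in practice.
CONTRIBUTOR_ROLE = (
    "/providers/Microsoft.Authorization/roleDefinitions/"
    "b24988ac-6180-42a0-ab88-20f7382dd24c"
)

ALERT_SUFFIX = "-cpu-alert"  # hypothetical naming convention


def build_alert_template() -> dict:
    """ARM template that creates one CPU metric alert for one VM."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {
            "vmId": {"type": "string"},
            "vmName": {"type": "string"},
            "cpuThreshold": {"type": "int"},
        },
        "resources": [{
            "type": "Microsoft.Insights/metricAlerts",
            "apiVersion": "2018-03-01",
            "name": f"[concat(parameters('vmName'), '{ALERT_SUFFIX}')]",
            "location": "global",
            "properties": {
                "severity": 2,
                "enabled": True,
                "scopes": ["[parameters('vmId')]"],
                "evaluationFrequency": "PT5M",
                "windowSize": "PT15M",
                "criteria": {
                    "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
                    "allOf": [{
                        "name": "HighCpu",
                        "criterionType": "StaticThresholdCriterion",
                        "metricName": "Percentage CPU",
                        "operator": "GreaterThan",
                        "threshold": "[parameters('cpuThreshold')]",
                        "timeAggregation": "Average",
                    }],
                },
            },
        }],
    }


def build_dine_policy(default_cpu_threshold: int = 90) -> dict:
    """Policy definition: for every VM, deploy the alert if it is missing."""
    return {
        "properties": {
            "displayName": "Deploy CPU alert for virtual machines (sketch)",
            "policyType": "Custom",
            "mode": "Indexed",
            "parameters": {
                "cpuThreshold": {"type": "Integer", "defaultValue": default_cpu_threshold},
            },
            "policyRule": {
                # "Discovery": the rule matches every VM in the assigned scope.
                "if": {"field": "type", "equals": "Microsoft.Compute/virtualMachines"},
                "then": {
                    "effect": "deployIfNotExists",
                    "details": {
                        "type": "Microsoft.Insights/metricAlerts",
                        "roleDefinitionIds": [CONTRIBUTOR_ROLE],
                        # The VM is compliant when an alert with this name exists.
                        "existenceCondition": {
                            "field": "name",
                            "equals": f"[concat(field('name'), '{ALERT_SUFFIX}')]",
                        },
                        "deployment": {"properties": {
                            "mode": "incremental",
                            "template": build_alert_template(),
                            "parameters": {
                                "vmId": {"value": "[field('id')]"},
                                "vmName": {"value": "[field('name')]"},
                                "cpuThreshold": {"value": "[parameters('cpuThreshold')]"},
                            },
                        }},
                    },
                },
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_dine_policy(), indent=2))
```

The printed JSON could be stored in a repository and deployed with whatever tooling you already use. The important parts are the `if` block, which selects the resource type, and the `existenceCondition`, which decides whether the alert rule still has to be deployed.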

What about design? Not every set of alert rules suits every application, and in some cases you need slightly different thresholds and scopes (again, think at scale). It is like having “Overrides” in SCOM for specific workloads, groups, or monitored objects. This is where Policies can be set differently on different scopes and levels of your workloads. When you set up a new Azure Policy Initiative, you configure a set of alert rules designed for a particular workload.
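As a rough illustration of the “Overrides” idea, the sketch below builds two assignment request bodies for the same policy definition: one with default thresholds at a management group, and one with a lower threshold for a single resource group. All names here (the `contoso-mg` management group, the `rg-sap` resource group, the `deploy-vm-cpu-alert` definition, the `cpuThreshold` parameter, and the `westeurope` location) are hypothetical.

```python
DEFAULT_THRESHOLDS = {"cpuThreshold": 90}  # hypothetical baseline values


def build_assignment(definition_id, display_name, overrides=None,
                     not_scopes=None, location="westeurope"):
    """Request body for one policy assignment; the scope itself goes in the URL."""
    # Workload-specific values replace the defaults, everything else is inherited.
    values = {**DEFAULT_THRESHOLDS, **(overrides or {})}
    return {
        # DINE assignments need a managed identity, which requires a location.
        "identity": {"type": "SystemAssigned"},
        "location": location,
        "properties": {
            "displayName": display_name,
            "policyDefinitionId": definition_id,
            "parameters": {name: {"value": v} for name, v in values.items()},
            "notScopes": list(not_scopes or []),
        },
    }


MG_SCOPE = "/providers/Microsoft.Management/managementGroups/contoso-mg"
DEFINITION_ID = (MG_SCOPE +
                 "/providers/Microsoft.Authorization/policyDefinitions/deploy-vm-cpu-alert")
SAP_RG = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-sap"

# Keyed by assignment scope: broad defaults plus one workload-specific override.
assignments = {
    MG_SCOPE: build_assignment(DEFINITION_ID, "VM CPU alert (default)",
                               not_scopes=[SAP_RG]),
    SAP_RG: build_assignment(DEFINITION_ID, "VM CPU alert (SAP)",
                             overrides={"cpuThreshold": 75}),
}
```

Assignments are cumulative rather than “most specific wins”, which is why the broad assignment in this sketch excludes the resource group through `notScopes`; otherwise both assignments would apply to the same virtual machines.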

We do monitoring design first. We practice actionable alerts only and always try to avoid noise, because every alert generates a ticket that is investigated by service desk teams. This is where your previous learnings and findings are key to making your monitoring successful.

We practice Infrastructure as Code and trust our source of truth: we want an exact record of each customer's configuration, which the customer could potentially change on their own. This is why we have built a centralized repository in our Azure DevOps project that includes 100+ Policies with default thresholds, plus sets of Policies focused on a workload-centric approach. Again, we practice this because we think at scale. If tomorrow a new customer is going to be managed and monitored by us, we are ready to onboard that customer at scale, leveraging learnings from previous successful engagements.

Summarizing the above,

Azure Monitor is a great service and is ready to be onboarded at scale. Leverage the best tools and practices, make alerts actionable, think at scale, and use automation wherever possible.
