Exploring Anomalies with Log Analytics using KQL

Detecting anomalies in your data can be a powerful and much-desired capability. Within Azure Monitor we provide a really easy method to alert on anomalies if they come from metrics (Creating Alerts with Dynamic Thresholds in Azure Monitor - Azure Monitor | Microsoft Docs). But what if the anomalies you want to detect aren’t a metric and instead sit in Application Insights or Log Analytics? Well, that’s where the Kusto query language comes to the rescue.

 

Detecting Anomalies with Kusto

Kusto has anomaly detection built in using series_decompose_anomalies.

 

series_decompose_anomalies() - Azure Data Explorer | Microsoft Docs

 

Now I’m not going to lie, the first time I read the above article I came away a little confused. But once you’ve built a few queries with it, it becomes fairly simple.

 

Some of the key things you need to do to utilize this are (a generic skeleton follows this list):

  • Pull the data that you want to detect anomalies on.
  • Order the results by time.
  • Create either a list or a series before you use series_decompose_anomalies.
  • Create a new column that flags the anomalies.
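
Here is that skeleton as a query (MyTable, MyMetric, and MyGroup are placeholder names, not a real schema):

MyTable
| summarize MyMetric=count() by MyGroup, bin(TimeGenerated, 30m) // 1. pull the data
| order by TimeGenerated asc // 2. order the results
| summarize MyMetric=make_list(MyMetric), TimeGenerated=make_list(TimeGenerated) by MyGroup // 3. make a list
| extend outliers=series_decompose_anomalies(MyMetric) // 4. flag the anomalies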

I think the best way to show this is to walk through a scenario.

 

Scenario

We want to look at the number of events occurring on each of our servers in the System event log and detect any anomalies where more events than normal happen on a server. To make this query even more useful, we’ll take the list of servers that have had anomalies and chart their events by EventID.

 

Step 1: Pulling the Data

Step one is to get the data that you want to detect anomalies on. The query below filters to only events in the “System” log and then creates a count of events for each server in 30-minute aggregates.

 

Event
| where TimeGenerated >= ago(7d)
| where EventLog == 'System'
| summarize EventCount=count() by Computer, bin(TimeGenerated, 30m)

 

So the output from just this query would look something like this:

[Screenshot: query results showing TimeGenerated, Computer, and EventCount columns]

 

Step 2: Order the results

Before we create a list or series, we need to order the results by the time generated. This is the simplest step, but it’s essential if you want accurate results! Note that order by sorts descending by default in Kusto, so specify asc to keep the series in chronological order. Just add the following line to your query:

 

| order by TimeGenerated asc

 

So now our query looks like this:

 

Event
| where TimeGenerated >= ago(7d)
| where EventLog == 'System'
| summarize EventCount=count() by Computer, bin(TimeGenerated, 30m)
| order by TimeGenerated asc

 

Step 3: Make a List

Now we have everything ready to create the list in Kusto. Below is the line you need to add to your query. It makes a list of both the TimeGenerated field and the EventCount field, so we end up with a single row for each server containing a list of the TimeGenerated values and a list of the EventCount values.

 

| summarize EventCount=make_list(EventCount), TimeGenerated=make_list(TimeGenerated) by Computer

 

So the query looks like this:

 

Event
| where TimeGenerated >= ago(7d)
| where EventLog == 'System'
| summarize EventCount=count() by Computer, bin(TimeGenerated, 30m)
| order by TimeGenerated asc
| summarize EventCount=make_list(EventCount), TimeGenerated=make_list(TimeGenerated) by Computer

 

And the output should look something like this:

[Screenshot: one row per Computer, with EventCount and TimeGenerated as lists]
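
As the list of key steps mentioned, you can build a series instead of a list. The make-series operator does the binning, the ordering, and the list creation in one step, and fills empty bins with a default value (0 here) so the series has no gaps. A sketch that would replace steps 1 through 3:

Event
| where TimeGenerated >= ago(7d)
| where EventLog == 'System'
| make-series EventCount=count() default=0 on TimeGenerated from ago(7d) to now() step 30m by Computer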

 

Step 4: Detect Anomalies

We have our query set up to detect the anomalies; we just need to pass the “EventCount” field to series_decompose_anomalies and create a new column with the results:

 

| extend outliers=series_decompose_anomalies(EventCount)

 

So the outliers field will contain one of three values for each data point:

  • 0 = normal
  • -1 = fewer events than normal
  • 1 = more events than normal
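
As an aside, the function returns more than the flags: per the docs it also produces an anomaly score and a fitted baseline series. If you want to inspect how far off each point was, you can capture all three outputs:

| extend (outliers, score, baseline) = series_decompose_anomalies(EventCount)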

So our query would now look like this:

 

Event
| where TimeGenerated >= ago(7d)
| where EventLog == 'System'
| summarize EventCount=count() by Computer, bin(TimeGenerated, 30m)
| order by TimeGenerated asc
| summarize EventCount=make_list(EventCount), TimeGenerated=make_list(TimeGenerated) by Computer
| extend outliers=series_decompose_anomalies(EventCount)

 

And the output should look something like this:

[Screenshot: the same rows with the new outliers column added]

As you can see, some anomalies were detected.

 

Extra Credit

When I get to this point, I always like to look at where the anomalies were detected and make sure that I would consider them anomalies. I won’t go into it in depth in this article, but you can adjust how sensitive the calculation is (there’s a sketch at the end of this section). To easily see where the anomalies were detected, I’ll use mv-expand on the lists that we’ve made:

 

| mv-expand TimeGenerated, EventCount, outliers
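
A quick aside: mv-expand outputs dynamic-typed columns. If you want the original types back, for example to sort or chart by TimeGenerated afterwards, the operator can cast during the expansion (the exact types to cast to depend on your data):

| mv-expand TimeGenerated to typeof(datetime), EventCount to typeof(long), outliers to typeof(double)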

 

Then look at the results to make sure that wherever I see a 1 (we’ll later filter to only positive anomalies) it makes sense:

[Screenshot: expanded rows showing points flagged with 1]

 

I’d also check on the ones that stay at 0 to make sure the EventCounts are pretty close:

[Screenshot: expanded rows where outliers stayed at 0]
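
As for the sensitivity tuning mentioned earlier: series_decompose_anomalies takes an optional second parameter, the anomaly threshold, which defaults to 1.5. Raising it flags only more extreme points; lowering it flags more points. A sketch with a stricter threshold of 2.5:

| extend outliers=series_decompose_anomalies(EventCount, 2.5)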

 

Step 5: Get Useful Information

I’m not saying what we have right now isn’t useful, but I like to use the anomaly info to get more detailed information. In this case we detected anomalies in the number of events on a server. Now that we know a server had an anomaly, maybe we want to graph its events by EventID to determine what caused the anomaly. Let’s start by expanding the lists like before, but then filtering to only the outliers that equal 1:

 

| mv-expand TimeGenerated, EventCount, outliers | where outliers == 1

 

This will give us every outlier detected for each server. What I really need is a list of each server that has had at least one anomaly, so I just add the following:

 

| distinct Computer

 

I want to feed the results into another query, so I’ll set a variable called “Computers” using a let statement. Our completed initial query looks like this:

 

let Computers = Event
| where TimeGenerated >= ago(7d)
| where EventLog == 'System'
| summarize EventCount=count() by Computer, bin(TimeGenerated, 30m)
| order by TimeGenerated asc
| summarize EventCount=make_list(EventCount), TimeGenerated=make_list(TimeGenerated) by Computer
| extend outliers=series_decompose_anomalies(EventCount)
| mv-expand TimeGenerated, EventCount, outliers
| where outliers == 1
| distinct Computer;

 

Now, in my next query, I can filter down to just these computers using “| where Computer in (Computers)”, where Computers is fed in from the query above.

 

….(prev query)
Event
| where TimeGenerated >= ago(7d)
| where EventLog == 'System'
| where Computer in (Computers)
| extend ChartName = strcat(Computer, '-', EventID)
| summarize EventCount=count() by ChartName, bin(TimeGenerated, 30m)
| render timechart

 

The only new thing here is the “ChartName” field, a combination of the Computer and the EventID. I’m using the same time period (7d) and the same aggregation (30m) for both the anomaly detection and this second query, so you’re looking at the same data in both.

 

Below is the complete query:

 

 

let Computers = Event
| where TimeGenerated >= ago(7d)
| where EventLog == 'System'
| summarize EventCount=count() by Computer, bin(TimeGenerated, 30m)
| order by TimeGenerated asc
| summarize EventCount=make_list(EventCount), TimeGenerated=make_list(TimeGenerated) by Computer
| extend outliers=series_decompose_anomalies(EventCount)
| mv-expand TimeGenerated, EventCount, outliers
| where outliers == 1
| distinct Computer;
Event
| where TimeGenerated >= ago(7d)
| where EventLog == 'System'
| where Computer in (Computers)
| extend ChartName = strcat(Computer, '-', EventID)
| summarize EventCount=count() by ChartName, bin(TimeGenerated, 30m)
| render timechart

 

Here are the results in my environment.

 

[Screenshot: timechart of EventCount by ChartName]

 

 

I just have one server in my environment, so charting it makes sense; if I had a bunch of servers, I would probably analyze the data in a table instead. But you can see really quickly that EventID 7036 is the one causing the anomalies for this server. The rest of the events stay stable, but that event varies a good bit on this server.

 

If you’re interested in another scenario where this same process can be useful check out my previous blog about “Detecting Azure Cost Anomalies.”

 

Summary

Once you get the hang of “series_decompose_anomalies”, it can be a very useful tool in your tool belt. I covered using it to visualize anomalies in the number of events occurring in the System event log. As long as you follow the steps above (get the data you want, order the data, make a list or series, and then detect anomalies), you can explore your data for anomalies. And just as we can visualize the data, we can also set up alerts through Azure Monitor.
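
For example, here is a sketch of how the anomaly query could back a log alert rule: end it at the distinct Computer step and configure the rule to fire when the number of results is greater than 0. The 1d lookback is an assumption; match it to your alert’s evaluation frequency:

Event
| where TimeGenerated >= ago(1d)
| where EventLog == 'System'
| summarize EventCount=count() by Computer, bin(TimeGenerated, 30m)
| order by TimeGenerated asc
| summarize EventCount=make_list(EventCount), TimeGenerated=make_list(TimeGenerated) by Computer
| extend outliers=series_decompose_anomalies(EventCount)
| mv-expand TimeGenerated, EventCount, outliers
| where outliers == 1
| distinct Computer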
