Calculating downtime using Window functions and Time Series functions in Azure Data Explorer

This post has been republished via RSS; it originally appeared at: Azure Data Explorer articles.

Customers interacting with Time Series, IoT Analytics, and Infra/App Logs often have a challenge identifying the total downtime of a device or an application. In this blog we will look at some of the common scenarios and see how we can use the power of Kusto Query Language to solve this challenge. Please note that these sample queries do not cover all the edge cases, please use them as a reference and adjust the logic for your scenario.

 

Scenario 1: Compute downtime based on health status

In this scenario, the devices are reporting health status as ‘Up’ or ‘Down’ or a binary value 1 or 0  or anything else. Our task is to identify how long the device was in a specific state. In this case we are interested in how long the device was down. Check the below sample query to identify the change in status and compute the start and end time of a given state.

 

Execute: [Web] [Desktop] https://help.kusto.windows.net/Samples

let DeviceHealth = datatable (TimeStamp:datetime, DeviceId:string, State:int)
[
'2020-04-24 10:40:00', '12987679', 0,
'2020-04-24 10:40:00', '21998045', 0,
'2020-04-24 11:00:00', '12987679', 1,
'2020-04-24 11:05:00', '12987679', 1,
'2020-04-24 11:49:00', '12987679', 0,
'2020-04-24 11:30:00', '21998045', 1,
'2020-04-24 12:30:00', '21998045', 0,
'2020-04-24 14:30:00', '21998045', 1,
'2020-04-24 12:45:00', '12987679', 0,
'2020-04-24 12:48:00', '12987679', 0,
'2020-04-24 13:00:00', '12987679', 1,
'2020-04-24 16:14:00', '21998045', 0,
'2020-04-24 16:30:00', '21998045', 1,
'2020-04-24 16:31:00', '21998045', 0
];
DeviceHealth
| where TimeStamp between (datetime('2020-04-23') .. datetime(2020-04-25))
| summarize TimeStampList = make_list(TimeStamp), StateList=make_list(State) by DeviceId
| mv-apply TimeStampList to typeof(datetime), StateList to typeof(string)
on
(
    order by TimeStampList asc
    | extend PrvTimeStamp=prev(TimeStampList)
    | extend PrvState=prev(StateList)
    | extend OutageStartTime= iff((PrvState == 1 or isempty(PrvState)) and StateList==0, TimeStampList, todatetime('')) 
    | extend OutageEndTime= iff(PrvState == 0 and StateList==1, TimeStampList,  todatetime('')) 
    | where isempty(OutageStartTime) <> isempty(OutageEndTime)
    | extend OutageStartTime = iif(isempty(OutageStartTime), prev(OutageStartTime),OutageStartTime)
    | extend OutageEndTime = iif(isempty(OutageEndTime ), next(OutageEndTime),OutageEndTime )
    | summarize by DeviceId, OutageStartTime, OutageEndTime, TotalDurationInMin = datetime_diff('minute',OutageEndTime, OutageStartTime)
)

Sample output:

Scenerio1Output.JPG

 

Explanation of the above sample query:

  1. Create an in-memory DeviceHealth table to get some sample data.
  2. Apply the time filter
  3. Create lists for TimeStamp and State column by DeviceID.
  4. Use mv-apply operator to expand each List in into a sub-table, apply a sub-query to each sub-table, and returns the union of the results of all sub-queries.
  5. In the sub queries, order the records by TimeStamp – this also materializes the records so that we can apply the window function prev() to get the previous row data to detect a state change.
  6. Finally, we eliminate the unwanted rows and return the summary of downtime start and end time by DeviceId.

 

Scenario 2: Compute downtime based on missing signals

In some cases, the devices don’t report specific health status, instead just stop sending signals. The below sample query identifies missing signals and compute the start and end time when the signal was missing.

 

Execute: [Web] [Desktop] https://help.kusto.windows.net/Samples

let DeviceTelemetry = datatable (TimeStamp:datetime, DeviceId:string, BatteryLevel:int, Temp:real, Humidity:real) 
[ 
'2020-04-23T23:11:51.903Z', '637086755205674255', 24, 64.8, 68, 
'2020-04-23T23:12:00.143Z', '637085868243706792', 34, 70.3, 61.8, 
'2020-04-23T23:13:07.308Z', '637086755205674255', 83, 61.3, 68.8, 
'2020-04-23T23:13:15.584Z', '637085868243706792', 69, 53.8, 64.6, 
'2020-04-23T23:14:22.714Z', '637086755205674255', 19, 67.9, 77.6, 
'2020-04-23T23:14:30.989Z', '637085868243706792', 6, 72.9, 67.7, 
'2020-04-23T23:15:37.516Z', '637086755205674255', 45, 53.5, 56.8, 
'2020-04-23T23:15:45.766Z', '637085868243706792', 66, 49.1, 72.3, 
'2020-04-23T23:16:52.951Z', '637086755205674255', 48, 70, 66.8, 
'2020-04-23T23:17:01.226Z', '637085868243706792', 50, 71.1, 70.3, 
'2020-04-23T23:59:52.951Z', '637086755205674255', 48, 70, 66.8, 
'2020-04-23T23:59:01.226Z', '637085868243706792', 50, 71.1, 70.3 
];
let StartDateTime = datetime('2020-04-23 20:00');
let EndDateTime = datetime('2020-04-24 05:00');
DeviceTelemetry
| make-series count() default=0 on TimeStamp from StartDateTime to EndDateTime step 1m by DeviceId
| mv-apply TimeStamp to typeof(datetime), count_ to typeof(int)
on 
(     
    order by TimeStamp asc
    | extend PrvTimeStamp=iif(isempty(prev(TimeStamp)), StartDateTime,prev(TimeStamp)) 
    | extend PrvCount=iif(isempty(prev(count_)), -1,prev(count_)) 
    | extend MissingDataStartTime= iff(PrvCount > 0 and count_==0, TimeStamp, iif(PrvCount==-1, TimeStamp,todatetime('')))
    | extend MissingDataEndTime= iff(PrvCount == 0 and count_>0, TimeStamp,  iif(isempty(next(TimeStamp)), TimeStamp,todatetime('')))
    | where isnotempty(MissingDataStartTime) or isnotempty(MissingDataEndTime)
    | extend MissingDataEndTime = iif(isempty(MissingDataEndTime), next(MissingDataEndTime),todatetime(''))
    | where PrvCount != 0
    | project DeviceId, MissingDataStartTime, MissingDataEndTime, TotalDurationInMin = datetime_diff('minute',MissingDataEndTime, MissingDataStartTime)
)

Sample output:

Scenerio2Output.JPG

 

Explanation of the above sample query:

  1. Create an in-memory DeviceHealth table to get some sample data.
  2. Apply the time filter
  3. Using make-series, create timeseries data set that returns a lists for TimeStamp and count of records aggregated at 1 min window by DeviceID.
  4. Use mv-apply operator to expand each List in into a sub-table, apply a sub-query to each sub-table, and returns the union of the results of all sub-queries.
  5. In the sub queries, order the records by TimeStamp – this also materializes the records so that we can apply the window function prev() and next() to get the previous/next row data to detect a state change.
  6. Finally, we eliminate the unwanted rows and columns and return the final dataset.

 

Scenario 3: Compute downtime based on threshold

Building on the scenario #2, we want to detect all the time slots where the device did not report any status in a 30 min window.

 

Execute: [Web] [Desktop] https://help.kusto.windows.net/Samples

let DeviceHealth = datatable (TimeStamp:datetime, DeviceId:string)
[
'2020-04-24 10:40:00', '12987679',
'2020-04-24 10:40:00', '21998045',
'2020-04-24 11:00:00', '12987679',
'2020-04-24 11:05:00', '12987679',
'2020-04-24 11:49:00', '12987679',
'2020-04-24 11:30:00', '21998045',
'2020-04-24 12:30:00', '21998045',
'2020-04-24 14:30:00', '21998045',
'2020-04-24 12:45:00', '12987679',
'2020-04-24 12:48:00', '12987679',
'2020-04-24 13:00:00', '12987679',
'2020-04-24 12:44:00', '21998045',
'2020-04-24 16:14:00', '21998045',
'2020-04-24 16:30:00', '21998045',
];
let outage_threshold=30m;
DeviceHealth
| order by DeviceId asc, TimeStamp asc
| extend dt=iff(DeviceId != prev(DeviceId), 0s, TimeStamp-prev(TimeStamp))
| extend State=iff(DeviceId != prev(DeviceId), 1, iff(dt >= outage_threshold, 0, 1))
| extend PrvTimeStamp=prev(TimeStamp)
| extend PrvState=prev(State)
| extend OutageStartTime= iff(State == 0, PrvTimeStamp, todatetime(''))
| extend OutageEndTime= iff(State==0, TimeStamp,  todatetime('')) 
| where isnotempty(OutageEndTime) and isnotempty(OutageEndTime)
| project DeviceId, OutageStartTime, OutageEndTime, TotalDurationInMin = datetime_diff('minute',OutageEndTime, OutageStartTime)

Sample output:

Scenerio3Output.JPG

 

Explanation of the above sample query:

  1. Create an in-memory DeviceHealth table to get some sample data.
  2. Apply the time filter.
  3. Order the records by Device ID and TimeStamp – this also materializes the records so that we can apply the window function prev() and next() to get the previous/next row data to detect a state change.
  4. Using the Window functions, we are identifying the records with a stage change i.e. records having gap of over 30 min (Outage Threshold).
  5. Identify the start time and end time of the outage.
  6. Finally, we eliminate the unwanted rows and columns and return the final dataset.

 

Scenario 4: Calculate uptime for each device for a given period

Finally, let’s build on Scenario #1 and calculate the total availability of the device. 

 

Execute: [Web] [Desktop] https://help.kusto.windows.net/Samples

let StartTime = datetime('2020-04-23');
let EndTime = datetime('2020-04-25');
let DeviceHealth = datatable (TimeStamp:datetime, DeviceId:string, State:int)
[
'2020-04-24 10:40:00', '12987679', 0,
'2020-04-24 10:40:00', '21998045', 0,
'2020-04-24 11:00:00', '12987679', 1,
'2020-04-24 11:05:00', '12987679', 1,
'2020-04-24 11:49:00', '12987679', 0,
'2020-04-24 11:30:00', '21998045', 1,
'2020-04-24 12:30:00', '21998045', 0,
'2020-04-24 14:30:00', '21998045', 1,
'2020-04-24 12:45:00', '12987679', 0,
'2020-04-24 12:48:00', '12987679', 0,
'2020-04-24 13:00:00', '12987679', 1,
'2020-04-24 16:14:00', '21998045', 0,
'2020-04-24 16:30:00', '21998045', 1,
];
DeviceHealth
| where TimeStamp between (StartTime .. EndTime)
| summarize TimeStampList = make_list(TimeStamp), StateList=make_list(State) by DeviceId
| mv-apply TimeStampList to typeof(datetime), StateList to typeof(string)
on
(
    order by TimeStampList asc
    | extend PrvTimeStamp=prev(TimeStampList)
    | extend PrvState=prev(StateList)
    | extend OutageStartTime= iff((PrvState == 1 or isempty(PrvState)) and StateList==0, TimeStampList, todatetime('')) 
    | extend OutageEndTime= iff(PrvState == 0 and StateList==1, TimeStampList,  todatetime('')) 
    | where isempty(OutageStartTime) <> isempty(OutageEndTime)
    | extend OutageStartTime = iif(isempty(OutageStartTime), prev(OutageStartTime),OutageStartTime)
    | where isnotempty(OutageEndTime)
    | project DeviceId, OutageStartTime, OutageEndTime, TotalDurationInMin = datetime_diff('minute',OutageEndTime, OutageStartTime)
)
| extend TotalMin = toreal(datetime_diff('minute', EndTime, StartTime))
| summarize Availability=((max(TotalMin)-sum(TotalDurationInMin))/max(TotalMin))*100  by DeviceId

Sample output:

Scenerio4Output.JPG  

Explanation of the above sample query:

  1. All steps same as scenario #1, except add two additional lines of code at the bottom to do the availability calculation.

 

Next steps

  1. Refer Azure Data Explorer documentation for additional reference.
  2. These queries can be further optimized for large data sets by using query hint called shufflekey, this will share the load on all cluster nodes where each node will process one partition of the data. You can also look into defining a data partitioning policy on your table. Queries in which the shufflekey is also the table's hash partition key are expected to perform better, as the amount of data required to move across cluster nodes is significantly reduced.

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.