This post has been republished via RSS; it originally appeared at: Azure Data Explorer articles.
Customers interacting with Time Series, IoT Analytics, and Infra/App Logs often have a challenge identifying the total downtime of a device or an application. In this blog we will look at some of the common scenarios and see how we can use the power of Kusto Query Language to solve this challenge. Please note that these sample queries do not cover all the edge cases, please use them as a reference and adjust the logic for your scenario.
Scenario 1: Compute downtime based on health status
In this scenario, the devices are reporting health status as ‘Up’ or ‘Down’ or a binary value 1 or 0 or anything else. Our task is to identify how long the device was in a specific state. In this case we are interested in how long the device was down. Check the below sample query to identify the change in status and compute the start and end time of a given state.
Execute: [Web] [Desktop] https://help.kusto.windows.net/Samples
let DeviceHealth = datatable (TimeStamp:datetime, DeviceId:string, State:int) [ '2020-04-24 10:40:00', '12987679', 0, '2020-04-24 10:40:00', '21998045', 0, '2020-04-24 11:00:00', '12987679', 1, '2020-04-24 11:05:00', '12987679', 1, '2020-04-24 11:49:00', '12987679', 0, '2020-04-24 11:30:00', '21998045', 1, '2020-04-24 12:30:00', '21998045', 0, '2020-04-24 14:30:00', '21998045', 1, '2020-04-24 12:45:00', '12987679', 0, '2020-04-24 12:48:00', '12987679', 0, '2020-04-24 13:00:00', '12987679', 1, '2020-04-24 16:14:00', '21998045', 0, '2020-04-24 16:30:00', '21998045', 1, '2020-04-24 16:31:00', '21998045', 0 ]; DeviceHealth | where TimeStamp between (datetime('2020-04-23') .. datetime(2020-04-25)) | summarize TimeStampList = make_list(TimeStamp), StateList=make_list(State) by DeviceId | mv-apply TimeStampList to typeof(datetime), StateList to typeof(string) on ( order by TimeStampList asc | extend PrvTimeStamp=prev(TimeStampList) | extend PrvState=prev(StateList) | extend OutageStartTime= iff((PrvState == 1 or isempty(PrvState)) and StateList==0, TimeStampList, todatetime('')) | extend OutageEndTime= iff(PrvState == 0 and StateList==1, TimeStampList, todatetime('')) | where isempty(OutageStartTime) <> isempty(OutageEndTime) | extend OutageStartTime = iif(isempty(OutageStartTime), prev(OutageStartTime),OutageStartTime) | extend OutageEndTime = iif(isempty(OutageEndTime ), next(OutageEndTime),OutageEndTime ) | summarize by DeviceId, OutageStartTime, OutageEndTime, TotalDurationInMin = datetime_diff('minute',OutageEndTime, OutageStartTime) )
Sample output:
Explanation of the above sample query:
- Create an in-memory DeviceHealth table to get some sample data.
- Apply the time filter
- Create lists for TimeStamp and State column by DeviceID.
- Use mv-apply operator to expand each List in into a sub-table, apply a sub-query to each sub-table, and returns the union of the results of all sub-queries.
- In the sub queries, order the records by TimeStamp – this also materializes the records so that we can apply the window function prev() to get the previous row data to detect a state change.
- Finally, we eliminate the unwanted rows and return the summary of downtime start and end time by DeviceId.
Scenario 2: Compute downtime based on missing signals
In some cases, the devices don’t report specific health status, instead just stop sending signals. The below sample query identifies missing signals and compute the start and end time when the signal was missing.
Execute: [Web] [Desktop] https://help.kusto.windows.net/Samples
let DeviceTelemetry = datatable (TimeStamp:datetime, DeviceId:string, BatteryLevel:int, Temp:real, Humidity:real) [ '2020-04-23T23:11:51.903Z', '637086755205674255', 24, 64.8, 68, '2020-04-23T23:12:00.143Z', '637085868243706792', 34, 70.3, 61.8, '2020-04-23T23:13:07.308Z', '637086755205674255', 83, 61.3, 68.8, '2020-04-23T23:13:15.584Z', '637085868243706792', 69, 53.8, 64.6, '2020-04-23T23:14:22.714Z', '637086755205674255', 19, 67.9, 77.6, '2020-04-23T23:14:30.989Z', '637085868243706792', 6, 72.9, 67.7, '2020-04-23T23:15:37.516Z', '637086755205674255', 45, 53.5, 56.8, '2020-04-23T23:15:45.766Z', '637085868243706792', 66, 49.1, 72.3, '2020-04-23T23:16:52.951Z', '637086755205674255', 48, 70, 66.8, '2020-04-23T23:17:01.226Z', '637085868243706792', 50, 71.1, 70.3, '2020-04-23T23:59:52.951Z', '637086755205674255', 48, 70, 66.8, '2020-04-23T23:59:01.226Z', '637085868243706792', 50, 71.1, 70.3 ]; let StartDateTime = datetime('2020-04-23 20:00'); let EndDateTime = datetime('2020-04-24 05:00'); DeviceTelemetry | make-series count() default=0 on TimeStamp from StartDateTime to EndDateTime step 1m by DeviceId | mv-apply TimeStamp to typeof(datetime), count_ to typeof(int) on ( order by TimeStamp asc | extend PrvTimeStamp=iif(isempty(prev(TimeStamp)), StartDateTime,prev(TimeStamp)) | extend PrvCount=iif(isempty(prev(count_)), -1,prev(count_)) | extend MissingDataStartTime= iff(PrvCount > 0 and count_==0, TimeStamp, iif(PrvCount==-1, TimeStamp,todatetime(''))) | extend MissingDataEndTime= iff(PrvCount == 0 and count_>0, TimeStamp, iif(isempty(next(TimeStamp)), TimeStamp,todatetime(''))) | where isnotempty(MissingDataStartTime) or isnotempty(MissingDataEndTime) | extend MissingDataEndTime = iif(isempty(MissingDataEndTime), next(MissingDataEndTime),todatetime('')) | where PrvCount != 0 | project DeviceId, MissingDataStartTime, MissingDataEndTime, TotalDurationInMin = datetime_diff('minute',MissingDataEndTime, MissingDataStartTime) )
Sample output:
Explanation of the above sample query:
- Create an in-memory DeviceHealth table to get some sample data.
- Apply the time filter
- Using make-series, create timeseries data set that returns a lists for TimeStamp and count of records aggregated at 1 min window by DeviceID.
- Use mv-apply operator to expand each List in into a sub-table, apply a sub-query to each sub-table, and returns the union of the results of all sub-queries.
- In the sub queries, order the records by TimeStamp – this also materializes the records so that we can apply the window function prev() and next() to get the previous/next row data to detect a state change.
- Finally, we eliminate the unwanted rows and columns and return the final dataset.
Scenario 3: Compute downtime based on threshold
Building on the scenario #2, we want to detect all the time slots where the device did not report any status in a 30 min window.
Execute: [Web] [Desktop] https://help.kusto.windows.net/Samples
let DeviceHealth = datatable (TimeStamp:datetime, DeviceId:string) [ '2020-04-24 10:40:00', '12987679', '2020-04-24 10:40:00', '21998045', '2020-04-24 11:00:00', '12987679', '2020-04-24 11:05:00', '12987679', '2020-04-24 11:49:00', '12987679', '2020-04-24 11:30:00', '21998045', '2020-04-24 12:30:00', '21998045', '2020-04-24 14:30:00', '21998045', '2020-04-24 12:45:00', '12987679', '2020-04-24 12:48:00', '12987679', '2020-04-24 13:00:00', '12987679', '2020-04-24 12:44:00', '21998045', '2020-04-24 16:14:00', '21998045', '2020-04-24 16:30:00', '21998045', ]; let outage_threshold=30m; DeviceHealth | order by DeviceId asc, TimeStamp asc | extend dt=iff(DeviceId != prev(DeviceId), 0s, TimeStamp-prev(TimeStamp)) | extend State=iff(DeviceId != prev(DeviceId), 1, iff(dt >= outage_threshold, 0, 1)) | extend PrvTimeStamp=prev(TimeStamp) | extend PrvState=prev(State) | extend OutageStartTime= iff(State == 0, PrvTimeStamp, todatetime('')) | extend OutageEndTime= iff(State==0, TimeStamp, todatetime('')) | where isnotempty(OutageEndTime) and isnotempty(OutageEndTime) | project DeviceId, OutageStartTime, OutageEndTime, TotalDurationInMin = datetime_diff('minute',OutageEndTime, OutageStartTime)
Sample output:
Explanation of the above sample query:
- Create an in-memory DeviceHealth table to get some sample data.
- Apply the time filter.
- Order the records by Device ID and TimeStamp – this also materializes the records so that we can apply the window function prev() and next() to get the previous/next row data to detect a state change.
- Using the Window functions, we are identifying the records with a stage change i.e. records having gap of over 30 min (Outage Threshold).
- Identify the start time and end time of the outage.
- Finally, we eliminate the unwanted rows and columns and return the final dataset.
Scenario 4: Calculate uptime for each device for a given period
Finally, let’s build on Scenario #1 and calculate the total availability of the device.
Execute: [Web] [Desktop] https://help.kusto.windows.net/Samples
let StartTime = datetime('2020-04-23'); let EndTime = datetime('2020-04-25'); let DeviceHealth = datatable (TimeStamp:datetime, DeviceId:string, State:int) [ '2020-04-24 10:40:00', '12987679', 0, '2020-04-24 10:40:00', '21998045', 0, '2020-04-24 11:00:00', '12987679', 1, '2020-04-24 11:05:00', '12987679', 1, '2020-04-24 11:49:00', '12987679', 0, '2020-04-24 11:30:00', '21998045', 1, '2020-04-24 12:30:00', '21998045', 0, '2020-04-24 14:30:00', '21998045', 1, '2020-04-24 12:45:00', '12987679', 0, '2020-04-24 12:48:00', '12987679', 0, '2020-04-24 13:00:00', '12987679', 1, '2020-04-24 16:14:00', '21998045', 0, '2020-04-24 16:30:00', '21998045', 1, ]; DeviceHealth | where TimeStamp between (StartTime .. EndTime) | summarize TimeStampList = make_list(TimeStamp), StateList=make_list(State) by DeviceId | mv-apply TimeStampList to typeof(datetime), StateList to typeof(string) on ( order by TimeStampList asc | extend PrvTimeStamp=prev(TimeStampList) | extend PrvState=prev(StateList) | extend OutageStartTime= iff((PrvState == 1 or isempty(PrvState)) and StateList==0, TimeStampList, todatetime('')) | extend OutageEndTime= iff(PrvState == 0 and StateList==1, TimeStampList, todatetime('')) | where isempty(OutageStartTime) <> isempty(OutageEndTime) | extend OutageStartTime = iif(isempty(OutageStartTime), prev(OutageStartTime),OutageStartTime) | where isnotempty(OutageEndTime) | project DeviceId, OutageStartTime, OutageEndTime, TotalDurationInMin = datetime_diff('minute',OutageEndTime, OutageStartTime) ) | extend TotalMin = toreal(datetime_diff('minute', EndTime, StartTime)) | summarize Availability=((max(TotalMin)-sum(TotalDurationInMin))/max(TotalMin))*100 by DeviceId
Sample output:
Explanation of the above sample query:
- All steps same as scenario #1, except add two additional lines of code at the bottom to do the availability calculation.
Next steps
- Refer Azure Data Explorer documentation for additional reference.
- These queries can be further optimized for large data sets by using query hint called shufflekey, this will share the load on all cluster nodes where each node will process one partition of the data. You can also look into defining a data partitioning policy on your table. Queries in which the shufflekey is also the table's hash partition key are expected to perform better, as the amount of data required to move across cluster nodes is significantly reduced.