Integration of Azure Data Explorer with Cosmos DB for near real-time analytics

This post has been republished via RSS; it originally appeared at: Azure Data Explorer articles.

Within the Azure data platform, Cosmos DB and Azure Data Explorer are two big data systems that complement each other across operational and analytical workloads respectively. Being able to analyse data in near real time helps you make better business decisions, and that is exactly what integrating these two systems enables. 

 

Azure Data Explorer (ADX) is an append-only big data analytical database built for low-latency, near real-time analytics scenarios. Azure Cosmos DB is a globally distributed, multi-model NoSQL database.  

 

In traditional data analytics solutions, you have to go through time-consuming curation processes to shape the data before end users can consume it. 

 

With these advanced systems, it is now practically possible to ingest and query raw operational data in an easy and effective manner, as depicted in the following solution architecture:  

 

[Figure: solution reference architecture (RefArch.png)]

 

Benefits of this pattern 

  • Readily available operational data for analysis as opposed to waiting for hours to get the data. 
  • Querying data without impacting the online transactional processing(OLTP) system's performance. 
  • Keeps the operational database's size and cost small by moving historical data to a cost-efficient store that is optimized for analytics. 
  • Drill-downs from analytical aggregates always point to fresh data. 

 

Benefits of Azure Data Explorer in this pattern 

  • Provides the ability to ingest fast-flowing, high volumes of streaming data with low latency. 
  • Runs extremely fast interactive queries over fresh, large data sets in a cost-efficient manner. 
  • Lets you easily build quick, performant, no-cost near real-time dashboards. 
  • Exports data to well-partitioned, compressed Azure Data Lake Storage using ADX external tables, and seamlessly queries across the warm and cold stores. 
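As a rough sketch of this export pattern, the following KQL commands create an external table over a data lake container and set up continuous export into it. The table names, schema, and storage URI below are illustrative assumptions, not taken from the lab:

```kusto
// Hypothetical external table backed by Azure Data Lake Storage,
// partitioned by day and stored as compressed Parquet
.create external table ColdEvents (Timestamp: datetime, DeviceId: string, Reading: real)
kind = storage
partition by (Date: datetime = bin(Timestamp, 1d))
dataformat = parquet
(
    h@'https://<storageaccount>.blob.core.windows.net/coldstore;impersonate'
)

// Continuously export new rows from a (hypothetical) warm table Events into ColdEvents
.create-or-alter continuous-export ExportColdEvents
over (Events)
to table ColdEvents
with (intervalBetweenRuns = 10m)
<| Events
```

Continuous export tracks a database cursor internally, so each run exports only the records ingested since the previous run.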

 

Key Features of Azure Data Explorer 

  • Low-latency batch and streaming ingestion for high volumes of data. 
  • Unmatched performance for querying large data sets in a cost-efficient manner. It supports KQL (Kusto Query Language), T-SQL, and inline R and Python. 
  • Supports structured, semi-structured (JSON and XML) and unstructured (free text) data. It has a rich set of capabilities for time series analysis, advanced analytics, geospatial features and analytics on logs of all types. 
  • Automatically indexes and compresses data on ingestion and stores it in an append-only columnar database. 
  • Provides enterprise grade features including VNet injection, BYOK, encryption at rest, RBAC, row level security, monitoring and DevOps. 
  • With an Azure Data Share follower cluster, you can easily and securely share data with people in your organisation or with external parties. 
  • Continuously exports data from ADX to an external table (an ADX schema entity that references data stored outside the ADX database). 
  • Cost optimization features such as pause/resume cluster, auto-scale, and easy caching configuration to define data retention in the warm and cold stores.   
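The caching and retention knobs mentioned above are plain management commands in KQL. In this sketch, the table name Events is a hypothetical example:

```kusto
// Keep the last 7 days of data in the hot cache (local SSD) for fast queries
.alter table Events policy caching hot = 7d

// Soft-delete data from the cluster after 30 days
.alter-merge table Events policy retention softdelete = 30d
```

Both policies can also be set at database level, and changing them later is a metadata-only operation.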

 

Use cases 

There are numerous potential benefits of this architecture from a business growth perspective. To give you an idea of its value proposition, which applies to most organizations across diverse industries, here are a few examples: 

  • In an e-commerce system, user activity can be logged for consumer behavior trend analysis, and user profile changes can be logged to track profile history. 
  • In the energy, manufacturing and mining industries, the state-change history of plants, machines and IoT devices supports troubleshooting, trend analysis, predictive maintenance, and avoiding scenarios with negative consequences. 
  • In the automotive industry, vehicle telemetry analytics yields predictive insights on vehicle health, and on road and driving safety. 

Similarly, in health, finance and many other industries, there are plenty of scenarios where you could make better business decisions using this pattern. 

 

Cost optimization 

The next obvious question is about data redundancy and the cost impact of this solution. You can optimize cost by managing data retention policies across these services. For example, Cosmos DB serves as the operational hot store where data is kept for a few days; Azure Data Explorer serves as the analytical warm store holding frequently accessed data; and the rest of the older data is exported to cold storage, which in this solution is Azure Blob or Data Lake Storage. Caching and data retention policies in ADX are easy to configure and change, and you can seamlessly query across the warm and cold stores depending on your requirements.  
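A warm-plus-cold query can be as simple as a union between a native table and an external table. The sketch below assumes a hypothetical warm table Events and an external table ColdEvents holding the exported history:

```kusto
// Daily event counts over six months, spanning the warm store and the cold store
Events
| union (external_table('ColdEvents'))
| where Timestamp > ago(180d)
| summarize Count = count() by bin(Timestamp, 1d)
| order by Timestamp asc
```

Because the query engine handles the external table transparently, the same KQL works whether the data lives on the cluster's SSD cache or in the data lake.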

 

Demonstration of the solution with a hands-on lab 

To help you understand the end-to-end flow of this solution, a hands-on lab with step-by-step guidance has been put together, along with working code samples, so you can try and test it on your own with simulated data. Here is a brief overview of what the lab covers: 

  • Sample data is simulated using a data generator component that inserts the data into Cosmos DB. 
  • The Cosmos DB change feed feature triggers an Azure Function to push every change in Cosmos DB downstream. 
  • Azure Data Explorer's streaming ingestion capability ingests the data streamed via Azure Event Hub. 
  • Run interactive queries using KQL (Kusto Query Language), with glimpses of advanced scenarios like forecasting, anomaly detection and time series analysis. 
  • In the last module of the lab, you will have a lot of fun building a near real-time dashboard using ADX dashboards. 
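To give a flavour of the advanced query scenarios mentioned above, here are sketches of anomaly detection and forecasting in KQL over a hypothetical Telemetry table (table name, time window, and tuning parameters are illustrative assumptions):

```kusto
// Detect anomalies in hourly event counts over the last week
Telemetry
| make-series EventCount = count() default = 0 on Timestamp from ago(7d) to now() step 1h
| extend Anomalies = series_decompose_anomalies(EventCount, 1.5)
| render anomalychart

// Forecast the next 24 hours by extending the series window past now()
Telemetry
| make-series EventCount = count() default = 0 on Timestamp from ago(7d) to now() + 24h step 1h
| extend Forecast = series_decompose_forecast(EventCount, 24)
| render timechart
```

Both functions decompose the series into trend, seasonal and residual components, which is why a week of hourly data is enough to pick up daily seasonality.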

The lab is publicly available on GitHub. 

Try it out and share your feedback! 

 

Note 

A near real-time analytics solution can be built in multiple ways using different Azure services; this lab describes one of the possible scenarios. Similar outcomes can be achieved using other Azure services that are not covered in this lab. 

 
