ADX Query Performance Unleashed: Best Practices and Pro Tips

This post has been republished via RSS; it originally appeared at: Azure Data Explorer Blog articles.

Azure Data Explorer (ADX) is a powerful tool that enables this by offering real-time data analytics at scale. However, to harness the full potential of ADX, it's essential to optimize query performance. In this article, we will explore the best practices for fine-tuning ADX queries to achieve maximum efficiency in real-world scenarios.

The Importance of ADX Query Performance Optimization

A query's performance depends directly on the amount of data it needs to process. The less data is processed, the quicker the query (and the fewer resources it consumes). Therefore, the most important best-practice is to structure the query in such a way that reduces the amount of data being processed.

Poorly performing queries can hinder these critical processes and result in less productivity. ADX, with its powerful query engine(KQL), is well-equipped to handle vast datasets. Still, to ensure optimal performance, one must understand how to fine-tune queries and apply best practices. This not only accelerates data retrieval but also reduces costs and enhances the overall user experience.

Fine-tuning Azure Data Explorer (ADX) queries using Kusto Query Language (KQL) with big data involves understanding the cluster settings, optimizing your queries to ensure they perform efficiently, especially when dealing with large datasets.

Understanding Cluster Sizing:

When it comes to choosing the right infrastructure for your ADX environment, selecting the appropriate SKU (Stock Keeping Unit) is crucial. ADX offers storage-optimized and compute-optimized SKUs, each tailored to specific workloads. Storage-optimized SKUs prioritize data storage capacity, making them ideal for scenarios where data retention and archival are paramount. On the other hand, compute-optimized SKUs emphasize processing power, making them suitable for workloads demanding higher computational performance.
If the queries involve extensive data retrieval, aggregation, and analysis, a compute-optimized SKU may be more appropriate, emphasizing processing power to efficiently handle computational demands.
On the other hand, if the primary focus is on data storage, retention, and archival without frequent, resource-intensive queries, a storage-optimized SKU might prove more cost-effective. This SKU prioritizes data storage capacity and is well-suited for scenarios where long-term storage needs take precedence over real-time computational performance.
So, conducting this quick check of cluster SKU allows optimize the cluster configuration for both query performance and cost efficiency, ensuring that resources are allocated effectively based on the unique requirements of the data and analytical tasks at hand.

Cluster Caching:

ADX, or Azure Data Explorer, provides the ability to optimize query performance through its hot cache settings at both the database and entity levels. In ADX, a hot cache refers to the portion of data stored in memory, readily accessible for quicker query responses. On the other hand, a cold cache involves data residing on disk, requiring more time to retrieve. When applied at the database level, these settings impact the entire data repository, influencing how data is cached and subsequently accessed.

At the entity level, which includes specific tables or materialized views, the hot cache settings can be fine-tuned to cater to the unique characteristics of individual data structures. This level of granularity allows for a more tailored optimization strategy, focusing on the specific entities that are frequently queried or require expedited access.

Checking the caching policy at database level:

Checking the caching at Entity (Table/Materialized view) level:

Note: In the above screenshot, since the hot cache policy was set at database level, the entity level policy is null.

If the cache policy is not set at cluster/entity level, use the below commands to set at cluster/table/materialized view level.

Checking Retention Policy:

A retention policy is like a rulebook for keeping or removing old data. A good retention policy in also helps with the 'hot cache' – it keeps the most important and frequently used data readily available.

First, we look into the cluster settings, hot cache, auto scaling, retention policy, to find any issues that might slow down our queries. If everything looks okay, below are some tips to make your queries work better. Let's dive in and make your queries work like a charm!

Tips for query performance

Use Efficient Query Syntax
- Write clear and concise queries and use filters and projections to minimize the amount of data retrieved.
- Avoid unnecessary columns usage in the query.
- Follow the best practices provided by ADX Query best practices
Use the Materialized Views
- Use materialized views for pre-aggregated or precomputed data to accelerate query processing instead of accessing the tables and deduping the data at runtime. You can create a materialized view of every table with required data to precompute the data that improves the query performance.
- Keep materialized views up to date with a scheduled refresh mechanism.
Store the query_results in the cache
- Set the cache time to store the query results for frequently executed queries
- For example : set query_results_cache_max_age = time(10m); <<your Query>>
  - (Stores the query result for 10mins, any requests for the same query execution will be returned from the cache without the need of execution again)
- Below is an example, where the 1st time query execution takes a long time due to the nature of the query. However, the subsequent query executions are faster as the results are stored in the cache within the specified time in the query.
Use hint.strategy for Joins & Summarize
- The 'hint.strategy' command allows you to provide hints to the query optimizer to improve query performance. For example, you can specify which indexes to use or how to join tables efficiently.
- When the cardinality of the query is very high, the right hint strategy boost the performance of the query when its rightly used.
- Below is an example of very high cardinality of a query that aggregates millions of data and shows the query execution difference when used without hint.strategy and with hint.strategy
Usage of the right "hint.strategy"
- Use the proper hint.strategies to speed the performance. Different strategies are helpful in different ways to improve the query execution.
  - hint.strategy = broadcast - Today, regular joins are executed on a single cluster node. Broadcast join is an execution strategy of join that distributes the join over cluster nodes. This strategy is useful when the left side of the join is small (up to several tens of MBs). In this case, a broadcast join is more performant than a regular join.
  - hint.strategy = shuffle - It's better to use the shuffle query strategy when the shuffle key (a join key, summarize key, make-series key or partition key) has a high cardinality and the regular operator query hits query limits.
    - when using summarize(), shuffle strategy improves the performance of the query. The shuffle strategy query with summarize operator shares the load on all cluster nodes, where each node processes one partition of the data.
    - However,In some cases, thehint.strategy=shufflewill be ignored, and the query won't run in shuffle strategy.
- To overcome the nested hint.strategy=shuffle issue and run in shuffle strategy, choose the key that is common for the summarize and join operations. Instead use hint.shufflekey = <<shuffleKey>>
Lookup (vs) Join usage:

Lookup operations in ADX involve searching for precomputed values based on specified criteria. This approach is particularly useful when dealing with reference or dimension tables that contain static data. By using lookups, you can swiftly retrieve relevant information without the need for complex join operations, resulting in faster query execution.

On the other hand, joins in ADX enable the merging of data from multiple tables, facilitating a more comprehensive analysis of interconnected information. This is especially beneficial when dealing with dynamic or transactional data spread across different tables.
Aggregation Techniques:
- Use aggregation functions to reduce the volume of data processed during queries. Aggregations can summarize data and provide insights without the need to scan the entire dataset. For example, use 'summarize' to aggregate data before analyzing it further.
Data Types and Compression:
- Use appropriate data types to minimize storage and improve query performance.
- Utilize compression settings to reduce storage requirements and speed up data retrieval.
Limit Query Output:
- When dealing with big data, limit the amount of data returned in your query results. You can use 'take,' 'limit,' or 'project' to restrict the number of rows and columns returned, making the query more efficient.

By implementing these strategies, you can efficiently extract insights from large datasets while optimizing query performance in Azure Data Explorer using KQL.

Conclusion

Optimizing ADX query performance is essential for harnessing the full potential of real-time data analytics. By implementing the best practices and techniques discussed in this article, you can unlock faster insights, reduce costs, and improve overall productivity.

Related Articles:

KQL Query best practices - Best practices for Kusto Query Language queries - Azure Data Explorer & Real-Time Analytics | Microsoft Learn

Optimization - Optimize for high concurrency with Azure Data Explorer - Azure Data Explorer | Microsoft Learn

Query Consistency - Query consistency - Azure Data Explorer | Microsoft Learn

Partial Query failures - Partial query failures - Azure Data Explorer | Microsoft Learn