This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.
Query optimizations in a relational query engine can be broadly classified as logical query optimizations and physical query optimizations. Logical query optimizations generally refer to query optimizations that can be derived based on relational algebra independent (Calcite) of the physical layer in which query is executed. Physical query optimizations are query optimizations that are cognizant of physical layer primitives. For Hive, the physical layer implies Map-Reduce and Tez primitives.
Currently logical query optimizations in Hive can be broadly categorized as follows:
- Projection Pruning
- Deducing Transitive Predicates
- Predicate Push down
- Merging of Select-Select, Filter-Filter in to single operator
- Multi-way Join
- Query Rewrite to accommodate for Join skew on some column values
Physical optimizations in Hive can be broadly classified as follows:
- Partition Pruning
- Scan pruning based on partitions and bucketing
- Scan pruning if query is based on sampling
- Apply Group By on the map side in some cases
- Perform Join on the Mapper
- Optimize Union so that union can be performed on map side only
- Decide which table to stream last, based on user hint, in a multi way join
- Remove unnecessary reduce sink operators
- For queries with limit clause, reduce the number of files that needs to be scanned for the table.
- For queries with Limit clause, restrict the output coming from mapper by restricting what Reduce Sink operator generates.
- Reduce the number of Tez jobs needed for answering user submitted SQL query
- Avoid Map-Reduce jobs in case of simple fetch query
- For simple fetch queries with aggregation, perform aggregation without Map-Reduce tasks
- Rewrite Group By Query to use index table instead of original table
- Use Index scans when predicates above table scan is equality predicates and columns in predicate have indexes on it.
Cost Based Optimization:
History:
Calcite is an open source, Apache Licensed, query planning and execution framework. Relational algebra is at the heart of Calcite. Every query is represented as a tree of relational operators. You can translate from SQL to relational algebra, or you can build the tree directly.
Planner rules transform expression trees using mathematical identities that preserve semantics. For example, it is valid to push a filter into an input of an inner join if the filter does not reference columns from the other input.
Calcite optimizes queries by repeatedly applying planner rules to a relational expression. A cost model guides the process, and the planner engine generates an alternative expression that has the same semantics as the original but a lower cost.
The planning process is extensible. You can add your own relational operators, planner rules, cost model, and statistics.
Enabling the CBO:
The CBO is enabled by default in Hive 0.14 and later. If you need to enable it manually, set the following property in hive-site.xml:
SET hive.cbo.enable=true;
For the physical optimizer, set the following properties in hive-site.xml to generate statistics:
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;
Hive Query Planning
CBO Flow
Gathering Statistics--Critical to the CBO:
The CBO requires both table-level and column-level statistics to generate efficient execution plans by examining the tables and conditions specified in the query, ultimately cutting down on query execution time and reducing resource utilization.
Table-level statistics:
Table-level statistics should always be collected. Make sure the following property is set as follows in hive-site.xml to collect table-level statistics:
SET hive.stats.autogather=true;
- This is applicable for newly created tables and/or partitions (that are populated through the INSERT OVERWRITE command), statistics are automatically computed by default.
- This would add to the execution time of the queries along with Reducers tasks being created.
If you have an existing table that does not have statistics collected, you can collect them by running the following query:
ANALYZE TABLE <table_name> COMPUTE STATISTICS;
Column-level statistics (critical):
Column-level statistics are expensive to compute and are not yet automated. The recommended process to use for Hive 0.14 and later is to compute column statistics for all of your existing tables using the following command:
ANALYZE TABLE <table_name> COMPUTE STATISTICS for COLUMNS;
As new partitions are added to the table, if the table is partitioned on "col1" and the new partition has the key "x," then you must also use the following command:
ANALYZE TABLE <table_name> partition (coll="x") COMPUTE STATISTICS for COLUMNS;
Please note: Analyze statements are recommended to run when the cluster is idle (no production load). This query runs are heavy and would acquire most resources on YARN and locks on the Hive tables.
Quick Overview:
Description |
Stored in |
Collected by |
Since |
Number of partition the dataset consists of |
Fictional metastore property: numPartitions |
computed during displaying the properties of a partitioned table |
|
Number of files the dataset consists of |
Metastore table property: numFiles |
Automatically during Metastore operations |
|
Total size of the dataset as its seen at the filesystem level |
Metastore table property: totalSize |
|
|
Uncompressed size of the dataset |
Metastore table property: rawDataSize |
Computed, these are the basic statistics. Calculated automatically when hive.stats.autogather is enabled. |
|
Number of rows the dataset consist of |
Metastore table property: numRows |
|
|
Column level statistics |
Metastore; TAB_COL_STATS table |
Computed, Calculated automatically when hive.stats.column.autogather is enabled. |
|
Validate the CBO with explain plans:
Hive provides an EXPLAIN command that shows the execution plan for a query.
Example:
Output:
Upcoming Hive 4.1 feature wrt Hive explain and CBO clause:
The CBO clause outputs the plan generated by Calcite optimizer. It can optionally include information about the cost of the plan using Calcite default cost model and cost model used for join reordering.
Available from: Hive release 4.0.0 (HIVE-17503 / HIVE-21184).
COST option prints the plan and cost calculated using Calcite default cost model.
JOINCOST option prints the plan and cost calculated using the cost model used for join reordering.
Sample output:
It will produce a similar plan, but the cost for each operator will be embedded next to the operator descriptors:
CBO PLAN:
References:
https://cwiki.apache.org/confluence/display/Hive/Cost-based+optimization+in+Hive