Performance Improvements to Apache Spark 3.1.2 in Azure Synapse

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

Azure Synapse Analytics is continually focused on delivering a highly performant and scalable platform for supporting Spark Workload. We are focused on improving the query performance for the typical workload patterns that we see with our customers. By combining the latest open-source updates in Apache Spark with our team’s focus on performance updates we have made significant performance gains in standard TPC-DS benchmarking tests.

 

balajisankaran_3-1632203699454.png

Apache Spark in Azure Synapse - TPC-DS  Benchmark Performance improvements

 

We have recently announced General Availability of Apache Spark 3.1.2 as a part of Azure Synapse Analytics. This delivers significant performance improvements over Apache Spark 2.4. In the new release of Azure Synapse Analytics, we have been able to achieve a 13% improvement in performance from the previous release and TPC-DS performance that is 202% faster than Apache Spark 3.1.2. This means you can do more with your data, faster and at a lesser cost.

 

balajisankaran_4-1632203699455.png

Improvements at query level

 

Previously, we had improved Apache Spark performance through Query Optimization, Autoscaling, Cluster Optimizations, Intelligent Caching, and Indexing. In the latest release we have further improved Apache Spark 3.1.2 in Azure Synapse Analytics by using the following 3 optimizations:

  • Limit Pushdown
  • Optimized Sorts
  • Bloom Filter Enhancements

Limit Pushdown

This optimization applies while performing Top-K queries by eliminating compute cycles involved in processing rows which are not part of the Top-K within the partition.

For example, when identifying the top-selling products across categories, where data is partitioned by categories, identifying the top-k rows within a shuffle and comparing just those that fall within the top-k across partitions will eliminate the need for processing other rows.

Statistics must be enabled to trigger this optimization.

 

Optimized Sort

Sorting is one of the most used and computationally expensive operations along with aggregations. In Synapse Spark, we have written an optimized an implementation of sorting which benefits from prior partitioning of data.

 

This new algorithm can leverage cardinality information to create multiple sorters and efficiently use prefix comparison. Prefix comparisons are way faster than record comparisons. For sorting on multiple columns, we reorder sorting columns to reduce the number of record comparisons required.

 

This is very useful for queries requiring window operation like getting top 100 highly paid employees in each department or getting 100 most selling products in different categories.

 

Bloom filter enhancements

In this release, we have extended support of Bloom filters to Sort merge joins in addition to Broadcast Hash joins which we talked about previously.

 

Shuffling is a bottleneck in query execution as it requires data to be written on the disk. We have further enhanced Bloom filter implementation in Synapse Spark to operate on sort merge joins. The idea is to create Bloom filters from the smaller tables and leverage them to prune large tables. This will help in reducing shuffle data and thus improving query performance. With this extension we were able to reduce shuffle in TPC-DS by 50%.

 

For example, given a fact table ’Sales’ and a dimension table ‘Items’, application of Bloom filter will drastically improve performance when we want to get total sales for selected items.

 

Limit Push-down and Optimized Sort are effective when Stats are enabled & Command Analyze Table is run

Summary

Continuous improvements to performance tuning and optimizations enable you to run your workloads cost-effectively and reduce processing times

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.