How to minimize data movements (Compatible and Incompatible Joins)

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

In Azure Synapse Analytics, data will be distributed across several distributions based on the distribution type (Hash, Round Robin, and Replicated). So, on an operation like Join condition we may have Compatible Joins or Incompatible Joins which depends on the type of the joined table distribution type and location on the join (LEFT or RIGHT).

 

As part of query performance troubleshooting/ tuning, one of the main factors to enhance the query performance is to minimize the data movement between distributions. Customers can modify the join types / table distribution type (if applicable) to achieve that.

 

Below are two examples; the first shows how to get compatible join, and the other one shows how to minimize the data movement by modifying the table (incompatible join).

 

Examples:

compatible join example, for this example we have two tables [dbo].[DimCustomer] and [dbo].[FactSurveyResponse]

  • [dbo].[DimCustomer] distributed as a hash on CustomerKey with integer as a data type
  • [dbo].[FactSurveyResponse] distributed as a hash on CustomerKey with integer as a data type as well.

Based on the below reference (Table 1: Compatible Joins) , if we are joining two tables on the same key and same date type no data movements will be required.

 

 

CREATE TABLE [dbo].[DimCustomer] ( [CustomerKey] [int] NOT NULL, [GeographyKey] [int] NULL, [CommuteDistance] [nvarchar](15) NULL ) WITH ( DISTRIBUTION = HASH ( [CustomerKey] ), CLUSTERED COLUMNSTORE INDEX ) CREATE TABLE [dbo].[FactSurveyResponse] ( [SurveyResponseKey] [int] NOT NULL, [DateKey] [int] NOT NULL, [CustomerKey] [int] NOT NULL, [ProductCategoryKey] [int] NOT NULL ) WITH ( DISTRIBUTION = HASH ( [CustomerKey] ), CLUSTERED COLUMNSTORE INDEX ) SELECT FSR.* , DC.LastName , DC.FirstName FROM [dbo].[FactSurveyResponse] FSR INNER JOIN [dbo].[DimCustomer] DC ON DC.[CustomerKey] = FSR.[CustomerKey]

 

 

By examining the produced query steps for the above query, no data movement happened as we have joined the two tables based on the same hash key (CustomerKey)

 

incompatible join example, for this example we have two tables [dbo].[ DimProductCategory] and [dbo].[FactSurveyResponse]

  • [dbo].[ DimProductCategory] distributed as a hash on ProductCategoryKey
  • [dbo].[FactSurveyResponse] distributed as a hash on CustomerKey as well.

Based on the below reference (Table 2: Incompatible Joins)  , if we are joining two tables on different keys additional data movements (e.g. BroadcastMoveOperation, ShuffleMoveOperation) will be required.

 

 

CREATE TABLE [dbo].[FactSurveyResponse] ( [SurveyResponseKey] [int] NOT NULL, [DateKey] [int] NOT NULL, [CustomerKey] [int] NOT NULL, [ProductCategoryKey] [int] NOT NULL ) WITH ( DISTRIBUTION = HASH ( [CustomerKey] ), CLUSTERED COLUMNSTORE INDEX ) CREATE TABLE [dbo].[DimProductCategory] ( [ProductCategoryKey] [int] NOT NULL, [ProductCategoryAlternateKey] [int] NULL, [EnglishProductCategoryName] [nvarchar](50) NOT NULL, [SpanishProductCategoryName] [nvarchar](50) NOT NULL, [FrenchProductCategoryName] [nvarchar](50) NOT NULL ) WITH ( DISTRIBUTION = HASH ( [ProductCategoryKey] ), CLUSTERED COLUMNSTORE INDEX ) SELECT FSR.* , DPC.[EnglishProductCategoryName] FROM [dbo].[FactSurveyResponse] FSR INNER JOIN [dbo].[DimProductCategory] DPC ON DPC.[ProductCategoryKey] = FSR.[ProductCategoryKey]

 

 

Steps-2.png 

Based on the above there was a BroadcastMoveOperation which moved 4 rows, since we have joined the two tables on different keys (incompatible Join). One of the ways to remove the BroadcastMoveOperation is to use REPLICATE distribution type.

 

Note: The customer needs to consider the changes not CSS, since this might affect other queries and requires a structure redesign for related objects. In the next step we will create a new table by using CTAS with REPLICATE distribution data type.

 

Steps to minimize the data movements (Just an example).

  • Create a new table with REPLICATE distribution by using CTAS, and verify that both left and right table has the predicate joins data type. (e.g. int = int)

Build the replicate cash.

 

 

CREATE TABLE [dbo].[DimProductCategory_Replicate] WITH (DISTRIBUTION=REPLICATE) AS SELECT * FROM [dbo].[DimProductCategory] SELECT FSR.* , DPC.[EnglishProductCategoryName] FROM [dbo].[FactSurveyResponse] FSR INNER JOIN [dbo].[DimProductCategory_Replicate] DPC ON DPC.[ProductCategoryKey] = FSR.[ProductCategoryKey]

 

  • Based on the Compatible Joins table we can get a compatible join either by having the replicated table on left or right.

By examining the produced query steps for the above query, no data movement happened. Based on the below reference (Table 1: Compatible Joins)  if the left table is distributed and the right table is replicated along with inner join no data movement is required.

 

Steps-3.png

Compatible and Incompatible Joins table reference:

 

Compatible Joins: is a join that doesn’t require data movement before each compute node.

 

Table 1: Compatible Joins

Join type

Left Table

Right Table

Compatibility

All join types

Replicated

Replicated

Compatible - no data movement required.

Inner Join

Right Outer Join

Cross Join

Replicated

Distributed

Compatible – no data movement required.

Inner Join

Left Outer Join

Cross Join

Distributed

Replicated

Compatible – no data movement required.

All join types, except cross joins, can be compatible.

Distributed

Distributed

Compatible – no data movement required if the join predicate is an equality join and if the predicate joins two distributed columns that have matching data types.

Cross Joins are always incompatible.

 

Example:

 

Incompatible Joins is a join that requires data movement before each compute node.

Note: Data movement operations take extra time and storage and can negatively impact query performance.

 

Table 2: Incompatible Joins

Join type

Left Table

Right Table

Compatibility

Left Outer Join

Full Outer Join

Replicated

Distributed

Incompatible – requires data movement before the join.

Right Outer Join

Full Outer Join

Distributed

Replicated

Incompatible – requires data movement before the join.

See Compatibility column for details.

Distributed

Distributed

Incompatible – requires data movement if joins have different keys and the predicate joins on the distributed columns have different data types.

Cross joins are always incompatible.

 

REMEMBER: these articles are REPUBLISHED. Your best bet to get a reply is to follow the link at the top of the post to the ORIGINAL post! BUT you're more than welcome to start discussions here:

This site uses Akismet to reduce spam. Learn how your comment data is processed.