Transforming Your Data Lake: Implementing Slow Change Dimension with Synapse

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

As organizations continue to collect and store large volumes of data in their data lakes, managing this data effectively becomes increasingly important. One key aspect of this is implementing Slow Change Dimension type 2, which allows organizations to track historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. In this blog post we will address the following scenario: a customer wants to implement Slow Change Dimension type 2 on top of their data lake.

For this example, we will use Serverless SQL Pool to demonstrate how this can be done. Additionally, in the next post, we will explore how the same approach can be used with Spark.

What is Slow Change Dimension Type 2?

Slow Change Dimension type 2: This method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. Check out this wiki to learn more, https://en.wikipedia.org/wiki/Slowly_changing_dimension

When implementing Slow Change Dimension type 2, there are various approaches that can be used depending on how the data is handled at the source. For example, if the source includes row version information or columns with flags for deleted or updated records, a different approach may be used compared to scenarios where this information is not available. In this particular scenario, we will follow a specific approach to address the challenges that are present.

Considerations for this Scenario:

This scenario is managing UPSERT and file versioning. Please note, updates are not supported on Serverless so I am workaround this with the external tables.
The files are not coming from the source with version, or information if the row was updated or deleted.
Deleted rows are not been handled on this solution, only inserted and updated ones.
Those files have a key value that is used to filter if they already exist or not on the destination. If you do not have a key column to compare. You will need to compare all the columns that the business considers as key values to determine if the information is new or not.
The source will send the information on the file regardless if it is a new row to be inserted or updated. In other words, the process of transformation and cleansing needs to understand if the value on the file refers to an updated row or a new row. So this solution tries to understand that scenario and based on that version it accordingly. If your source can send the rows in an accurate way in which it is clear which row is updated, you can change the steps accordingly.

Solution:

Consider as in the example below the folder sourcefiles_folder with all files in parquet format which were sent from the source to be compared to a destination. My destination is the external table: SCD_DimDepartmentGroup.

The file has the following columns:

Column	Datatype
DepartmentGroupKey	int	Not Null
ParentDepartmentGroupKey	int	NULL
DepartmentGroupName	nvarchar	Not Null

Here are the current values for this table - SCD_DimDepartmentGroup as Fig. 1, shows:

The last surrogate key is of the value of 8:

Fig. 1

Code - First Load and Dimension Creation:

CREATE EXTERNAL TABLE SCD_DimDepartmentGroup WITH ( LOCATION = '/SCD_DimDepartmentGroup', DATA_SOURCE = SCD_serveless_dim, FILE_FORMAT = Parquet_file ) AS SELECT ROW_NUMBER () OVER (ORDER BY DepartmentGroupKey) ID_Surr ,[DepartmentGroupKey] ,[ParentDepartmentGroupKey] ,[DepartmentGroupName] ,1 ID_valid ,0 ID_Deleted ,Curr_date as From_date ,Null as End_date FROM OPENROWSET( BULK 'https://Storage.blob.core.windows.net/Container/SCD/sourcefiles_folder/', FORMAT = 'PARQUET' ) AS [SCD_DimDepartmentGroup]

Slow Change Dimension in 5 steps:

The implementation of the SCD is quite simple is basically filtering, persisting the data filtered, and comparing.

Current values on the table:

ID_Valid column will show if the row was versioned or not. 1 for a new row, 0 for a versioned row.
ID_Delete column will show if the row is still valid or not. 0 for a new not, 1 for an invalid
Curr_date is the date where the row was inserted into the table, a column on the SCD will be FromDate
ID_Surr is a surrogate key created for this table.
EndDate will manage the version per date

1) Step 1 - Create a CETAS for new values

TableA_SCD_DimDepartmentGroup_NEW - In this example, this step will basically insert all the values that exist on the files and do not exist on the destination into an External Table.

The column DepartmentGroupKey defines if the row is new or not. You will need a column or columns to understand if this is considered new information or not in your dimension.

Please note three columns were added to this example:

As Curr_date is the date where the row was inserted into the table. So we are using getdate()
1 ID_Valid which means this is a new valid row
1 ID_Delete which means this row was not deleted Or versioned.

Code - First, let's create the data source that will be used across all the external tables:

CREATE EXTERNAL DATA SOURCE SCD_serveless_dim WITH ( LOCATION = 'https://Storage.blob.core.windows.net/Container/SCD/transformation_folder/', CREDENTIAL = [MSI] )

Now, let's create the external table for Step 1:

Please note that I am using the maximum value of the surrogate key column ID_Surr from the main table SCD_DimDepartmentGroup and adding 1 to it to generate new surrogate values for rows that require them. If some values repeat at this point, it is okay because I will reuse this information with the Department Key to create a new surrogate key in the last step (Step 5), thus ensuring a unique value for each row. I am using the max+1 approach to ensure that existing surrogate keys are not affected.

DECLARE @ID_Surr AS INT SELECT @ID_Surr = MAX (ID_Surr)+1 FROM SCD_DimDepartmentGroup CREATE EXTERNAL TABLE TableA_SCD_DimDepartmentGroup_NEW WITH ( LOCATION = '/TableA_SCD_DimDepartmentGroup_NEW', DATA_SOURCE = SCD_serveless_dim, FILE_FORMAT = Parquet_file ) AS SELECT @ID_Surr as ID_Surr ,[DepartmentGroupKey] ,[ParentDepartmentGroupKey] ,[DepartmentGroupName] ,1 ID_valid ,0 ID_Deleted ,Getdate() as Curr_date FROM OPENROWSET( BULK 'https://Storage.blob.core.windows.net/Container/SCD/sourcefiles_folder/', FORMAT = 'PARQUET' ---DELTA can be used here ) AS [SCD_DimDepartmentGroup_Silver] WHERE NOT EXISTS ( SELECT 1 FROM SCD_DimDepartmentGroup WHERE SCD_DimDepartmentGroup_Silver.[DepartmentGroupKey] = SCD_DimDepartmentGroup.DepartmentGroupKey )

Results in Fig. 2 - Step 1:

Fig. 2 - Step 1.

2) Step 2 - Create a CETAS for values that will be updated/versioned.

TableB_SCD_DimDepartmentGroup_OLD will be the next external table to be created by inserting all values that exist on the file exported from the source and do exist on the destination. Those value also should not exist on the external table created in Step 1 (A).

In this scenario, the data that will be versioned is not a new row (Step 1); it is rows that already exist in the destination, but something has changed. The DepartmentGroupKey is the key that will always remain the same, so if there is a new key, it means a new row, and the comparison will happen for any other column. Additionally, in this scenario, it is possible that the source may send the same information again without any changes. Therefore, comparing the data is necessary to ensure that the row sent is indeed a row to be versioned.

Please note for the rows to be versioned:

Flag ID_Valid was changed to 0
ID_Deleted was invalidated by changing the value to 1.
Curr_date here is the date the row was inserted on the table SCD_DimDepartmentGroup, so the From_date column is re-used.
NULL columns are been handled properly(ISNULL) to find the ones that in fact changed.

Code:

Please note I re-used my surrogate key column ID_Surr of the main table SCD_DimDepartmentGroup for the values that will be versioned.

CREATE EXTERNAL TABLE TableB_SCD_DimDepartmentGroup_OLD WITH ( LOCATION = '/TableB_SCD_DimDepartmentGroup_OLD', DATA_SOURCE = SCD_serveless_dim, FILE_FORMAT = Parquet_file ) AS CREATE EXTERNAL TABLE TableB_SCD_DimDepartmentGroup_OLD WITH ( LOCATION = '/TableB_SCD_DimDepartmentGroup_OLD', DATA_SOURCE = SCD_serveless_dim, FILE_FORMAT = Parquet_file ) AS SELECT ID_Surr ,[DepartmentGroupKey] ,[ParentDepartmentGroupKey] ,[DepartmentGroupName] ,0 ID_valid ,1 ID_Deleted ,From_date FROM [SCD_DimDepartmentGroup] WHERE NOT EXISTS ( SELECT 1 FROM OPENROWSET( BULK 'https://Storage.blob.core.windows.net/Container/SCD/sourcefiles_folder', FORMAT = 'PARQUET' ) AS [SCD_DimDepartmentGroup_Silver] WHERE SCD_DimDepartmentGroup_Silver.[DepartmentGroupKey] = SCD_DimDepartmentGroup.DepartmentGroupKey AND (ISNULL(SCD_DimDepartmentGroup_Silver.[ParentDepartmentGroupKey], 1) = ISNULL(SCD_DimDepartmentGroup.[ParentDepartmentGroupKey], 1) AND SCD_DimDepartmentGroup_Silver.[DepartmentGroupName] = SCD_DimDepartmentGroup.[DepartmentGroupName]) ) AND NOT EXISTS (SELECT 1 FROM TableA_SCD_DimDepartmentGroup_NEW WHERE TableA_SCD_DimDepartmentGroup_NEW.[DepartmentGroupKey] = SCD_DimDepartmentGroup.DepartmentGroupKey)

Results are Fig. 3 - Step 2 :

Fig. 3 - Step 2

3) Step 3 - Only updated values are to be versioned

TableC_SCD_DimDepartmentGroup_GOLD_OLD_INS - The next external table will get the new rows from the source file. The filter will be the rows that we already know something changed and needs to be versioned from Step 2.

Code:

DECLARE @ID_Surr AS INT SELECT @ID_Surr = MAX (ID_Surr)+1 FROM SCD_DimDepartmentGroup CREATE EXTERNAL TABLE TableC_SCD_DimDepartmentGroup_GOLD_OLD_INS WITH ( LOCATION = '/TableC_SCD_DimDepartmentGroup_GOLD_OLD_INS', DATA_SOURCE = SCD_serveless_dim, FILE_FORMAT = Parquet_file ) AS SELECT @ID_Surr as ID_Surr ,[DepartmentGroupKey] ,[ParentDepartmentGroupKey] ,[DepartmentGroupName] ,1 ID_valid ,0 ID_Deleted ,getdate() Curr_date FROM OPENROWSET( BULK 'https://Storage.blob.core.windows.net/Container/SCD/sourcefiles_folder/', FORMAT = 'PARQUET' ) AS [SCD_DimDepartmentGroup_Silver] WHERE EXISTS (SELECT 1 FROM TableB_SCD_DimDepartmentGroup_OLD WHERE SCD_DimDepartmentGroup_Silver.[DepartmentGroupKey] = TableB_SCD_DimDepartmentGroup_OLD.DepartmentGroupKey)

The results are in Fig. 4 - Step 3:

Fig. 4 - Step 3

4) Step 4 - Create another external table by Union

For all those external tables that were created on the last steps.

EndDatate should be null for new rows or new values versioned. If the value was versioned the respective row will contain until when it was valid.
ID_valid will be 1 if the row is not versioned, 0 if this row is no longer valid
ID_Deleted will be 0 if the row is not versioned, 1 if this row is no longer valid

CREATE EXTERNAL TABLE UNION_SCD_DimDepartmentGroup WITH ( LOCATION = '/UNION_SCD_DimDepartmentGroup', DATA_SOURCE = SCD_serveless_dim, FILE_FORMAT = Parquet_file ) AS SELECT ID_Surr ,[DepartmentGroupKey] ,[ParentDepartmentGroupKey] ,[DepartmentGroupName] ,ID_valid ,ID_Deleted ,getdate() as From_date ,Null as End_date FROM TableA_SCD_DimDepartmentGroup_NEW UNION ALL SELECT ID_Surr ,[DepartmentGroupKey] ,[ParentDepartmentGroupKey] ,[DepartmentGroupName] ,ID_valid ,ID_Deleted ,From_date as From_date ,Getdate() as End_date FROM TableB_SCD_DimDepartmentGroup_OLD UNION ALL SELECT ID_Surr ,[DepartmentGroupKey] ,[ParentDepartmentGroupKey] ,[DepartmentGroupName] ,ID_valid ,ID_Deleted ,getdate() as From_date ,Null as End_date FROM TableC_SCD_DimDepartmentGroup_GOLD_OLD_INS

Results are - Fig. 5 - Union :

Fig. 5 - Union

Here we have the new rows, new versioned rows consolidated on the external table.

5) Step 5.

Next, we need to transfer this to our main table, by recreating an external table with the new information and dropping the old one.

1) I need the data from the main table excluding the rows that were versioned on this round

2) I need the data from the external tables Union.

3) I need to keep the historical changes of the old/main table

4) On top of that I will recreate the surrogate key. But for this, we will order per ID_Surr and DepartmentGroupKey. So, the new surrogate key values will start from the max surrogate+1 as we defined in the first Steps.

SELECT ROW_NUMBER () OVER (ORDER BY ID_Surr, DepartmentGroupKey) ID_Surr ,[DepartmentGroupKey] ,[ParentDepartmentGroupKey] ,[DepartmentGroupName] ,ID_valid ,ID_Deleted ,From_date ,End_date FROM ( SELECT ID_Surr ,[DepartmentGroupKey] ,[ParentDepartmentGroupKey] ,[DepartmentGroupName] ,ID_valid ,ID_Deleted ,From_date ,End_date FROM SCD_DimDepartmentGroup WHERE NOT EXISTS ( SELECT 1 FROM TableB_SCD_DimDepartmentGroup_OLD WHERE SCD_DimDepartmentGroup.DepartmentGroupKey = TableB_SCD_DimDepartmentGroup_OLD.DepartmentGroupKey ) UNION SELECT ID_Surr ,[DepartmentGroupKey] ,[ParentDepartmentGroupKey] ,[DepartmentGroupName] , ID_valid ,ID_Deleted ,From_date ,End_date FROM UNION_SCD_DimDepartmentGroup UNION SELECT ID_Surr ,[DepartmentGroupKey] ,[ParentDepartmentGroupKey] ,[DepartmentGroupName] ,ID_valid ,ID_Deleted ,From_date ,End_date FROM SCD_DimDepartmentGroup WHERE SCD_DimDepartmentGroup.ID_valid = 0 )NEW_SCD

Results are in Fig. 6 - New Table. Please note in green there are the rows that are versioned, and in blue are the new rows:

Fig. 6 - New Table

This new table should replace the current SCD_DimDepartmentGroup, by recreating a new external table and dropping this old one.

Summary

Please review the considerations for the scenario to understand what was relevant for this implementation. As I mentioned earlier, there are different approaches that can be used to create a Slow Change Dimension, depending on how the data is handled at the source. In this post, I have provided examples of how to read new data, version the old data that has changed, and recreate new external tables with the updated information for a Slow Change Dimension type 2.

In the next post, I will demonstrate how to achieve the same using Spark. Stay tuned for more!

Liliam, UK.