New data flow functions for dynamic, reusable patterns

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

ADF has added columns() and byNames() functions to make it even easier to build ETL patterns that are reusable and flexible for generic handling of dimensions and other big data analytics requirements.

In this example below, I am making a generic change detection data flow that looks for changed column values by hashing the row. I can use static column names in the hashing function (I'm using sha2 in this case) or I can use columns() as a way to tell ADF to use all of the stream's incoming columns as a single function argument. This way, my pattern can be re-used with any source without needing to hardcode column names. This is all supported by the ADF schema drift feature and is achieved by using the new columns() function as the argument to the sha2() function:

sha2(256,columns())

In the second example above, I'm passing in a comma-separated list of values as a string that represent the specific columns that I wish to check for change detection. In this case, I don't want to use static columns and I don't want to look for changes from any column using columns() for the entire row. Instead, I want to parameterize the hash function. In this case, I use byNames() and split my string parameter to create an array of string column names that can be sent to the data flow activity at runtime, making the hashing dynamic.

sha2(256,byNames(split($cols,',')))

In this case, I'm going to hash the columns movies, title, and genres, and that will detect any changes to the values in those columns. This is achieved by using the new byNames() function as the argument to the sha2() hash function and using split() to create an array from the string parameter that contains column names.

Leave a Reply Cancel reply