HowTo: Azure Data Explorer integration into Azure Purview

This post has been republished via RSS; it originally appeared at: Azure Data Explorer articles.

After the announcement of the integration of Azure Data Explorer into the recently launched Azure Purview service, this blog post shows how to use ADX as a data source in Purview and highlights the new features in a few use cases that can be helpful in the context of ADX. 

 

Read the announcement blog post about our integration with Purview here. 

 

Setup: How to connect ADX as a data source in Azure Purview

 

Azure Data Explorer is available as a data source to scan within Azure Purview Studio. Just register a new source and choose “Azure Data Explorer (Kusto)” and the according cluster connection information. 

 

Image1.JPG

 

In order to connect to an ADX cluster you need to provide a service principal with “AllDatabaseViewer” rights on the target cluster/databases to Azure Purview through an Azure KeyVault

 

After that, register ADX as a data source in Azure Purview Studio. Go through the process of  managing the credentials, registering a new data source and configuring the first scan, as described here.

 

Then select all the databases that should be included in this specific scan. For example you can set up a full scan for all databases that should be done monthly or weekly, and set up another scan on specific databases that runs daily to keep the information in the data catalog up-to-date (full and incremental scans). 

 

Next you can specify which rule set you want to apply for this scan. There is a system default for every data source, which includes all default classifications.  

 

Image3.JPG

 

The power of custom classifications and scans

 

You can create your own scan rule sets to include only the scan rules that are relevant for your data or include your own custom rules that you defined in Purview

 

In the context of Azure Data Explorer, this can be very useful for a variety of use-cases: 

Custom rules can be helpful for example if you want to identify custom part numbers that you might ingest with your IoT Telemetry or Log data, or other patterns that might help you identify certain attributes in tables specific for your business domain. These classifications will then be attributes of the tables/columns in Purview and users will be able to search for example for tables containing data about specific device families, product lines or production processes in your data catalogue.

 

Classifications can also be applied manually after a scan directly within Azure Purview on the relevant data assets, like on a table or column. For example you could highlight specific columns in a dataset that you know is used to measure customer interaction, when is a feature used and how long does it take a customer to get there. In combination with the Business Glossary in Purview these additional attributes can significantly improve the search and discovery experience for many user groups within a company. Business analysts can leverage this customer interaction data and see if there's a correlation to any metric they might use.

 

If you give good thought about which classifications and business terms might be useful, you can make your IoT-, factory floor and device telemetry data much more accessible, democratizing access to these data that historically are often siloed within manufacturing systems.

 

You can always look at an overview of the recent scans in the run history: 

 

Image4.JPG

 

 

Browsing the data catalog: ADX data assets

 

After the scan(s) have finished, you can start browsing the newly discovered data assets. You can do that either by directly searching for a specific term (part of a table name, classification, etc.) or by clicking “Browse assets” in the main menu. 

 

In the case of ADX, an overview of the registered data assets can look something like this: 

 

Image5.JPG

 

Looking at an ADX table for example we get detailed information as to which database it belongs to and about the cluster this database is running on. We also see when it was last changed, we can add a description to the table and see/add classifications for easier discovery, as well as some ADX specific properties like a potential folder or the docstring. 

 

Also as mentioned you can add descriptions as well as classifications and associate terms from the business glossary with every ADX data asset, visible on the bottom right here:

 

Image6.JPG

 

The scans also pick up the table schema, showing all columns and their respective data type. And while giving the data consumers in your company the ability to discover data easily is very important, here you can also add a contact person, an owner, that people can talk to to learn more about the data asset, how to get access and how to use it.

 

Visualizing the data flows - data lineage information

 

A very powerful piece of information is located in the “Lineage” tab. This overview shows you the data flow between the assets in your data catalog. In the context of ADX this currently means that every data movement that you defined using Azure Data Factory involving an ADX table will be visualized in this tab. 

 

Image7.JPG

 

In addition to the automatically inferred lineage information from Azure Data Factory, we saw that many customers also use custom scripts for data ingestion, or Jupyter notebooks on a Spark cluster, or they ingest data programmatically using one of the ADX SDKs provided. In this case, to make these data flows transparent as well, you can use the well documented Apache Atlas REST API to create custom objects within Azure Purview.

 

You can find all the details about using the Azure Purview REST API here, including a sample postman collection to get you started.

 

 

An example: Visualize custom data flows within Azure Purview lineage

 

One feature within ADX that many customers use very frequently is the concept of Update Policies. When using them, you essentially chain two or more tables together and transform data between them in some form. These data transformations can be linear, e.g. from Table A -> Table B -> Table C, or they can "fan out", like filtering data per device family, e.g. from Table A -> Table B, as well as Table A -> Table C.

 

What that means is that update policies might be a very good candidate for visualization in the lineage tab. In order to achieve that, we use the Atlas API to create a new "Process" entity in Purview called "ADX Update Policy" with all the attributes it might have within ADX.

 

 

 

REST API Call (POST) against /api/atlas/v2/types/typedefs of our Purview instance: { "entityDefs" : [ { "superTypes" : [ "Process" ], "category" : "ENTITY", "name" : "adx_update_policy", "description" : "a type definition for azure data explorer update policies", "typeVersion" : "1.0", "attributeDefs" : [ { "name" : "IsEnabled", "typeName" : "string", "isOptional" : true, "cardinality" : "SINGLE", "valuesMinCount" : 1, "valuesMaxCount" : 1, "isUnique" : false, "isIndexable" : false }, {<additional attributes..>} ] } ] }

 

 

 

After the creation of this asset, all we need to do to use it in the lineage tab, is to link two tables together using our newly created "Update Policy" object. Now in order to do that, we first need to fetch the GUIDs of the according ADX tables from Azure Purview, so that the tool is actually able to uniquely identify them. For the sake of this blog post we assume that we looked them up manually, but you can of course also discover them via REST API Call (hint: /api/atlas/v2/search/advanced)

.

After we have all the information, the body of the API call could look like this:

 

 

REST API Call (POST) against /api/atlas/v2/entity/bulk of our Purview instance: { "entities": [ { "typeName": "adx_update_policy", "createdBy": "admin", "attributes": { "qualifiedName": "adx_update_policy", "uri": "adx_update_policy", "name": "adx_update_policy", "description": "transforms data between source and target table", "IsEnabled": "true", "Query": "KQLTransformationQuery()", "IsTransactional": "false", "PropagateIngestionProperties": "false", "inputs": [ { "guid": "<Purview ID of the source table>", "typeName": "azure_data_explorer_table" } ], "outputs": [ { "guid": "<Purview ID of the target table>", "typeName": "azure_data_explorer_table" } ] } } ] }

 

 

The result then can look something like this:

 

Image8.JPG

 

Of course these are only basic examples and they involve some scripting as well as orchestration of API calls and other sorts of automation, but you can be sure that we are hard at work with the Purview team to extend the automated lineage information to more data ingestion and analytics scenarios relevant for ADX, stay tuned for more.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.