This post has been republished via RSS; it originally appeared at: Azure Data Explorer Blog articles.
After the announcement of the integration of Azure Data Explorer into the recently launched Azure Purview service, this blog post shows how to use ADX as a data source in Purview and highlights the new features in a few use cases that can be helpful in the context of ADX.
Read the announcement blog post about our integration with Purview here.
Setup: How to connect ADX as a data source in Azure Purview
Azure Data Explorer is available as a data source to scan within Azure Purview Studio. Just register a new source and choose “Azure Data Explorer (Kusto)” and the according cluster connection information.
In order to connect to an ADX cluster you need to provide a service principal with “AllDatabaseViewer” rights on the target cluster/databases to Azure Purview through an Azure KeyVault.
After that, register ADX as a data source in Azure Purview Studio. Go through the process of managing the credentials, registering a new data source and configuring the first scan, as described here.
Then select all the databases that should be included in this specific scan. For example you can set up a full scan for all databases that should be done monthly or weekly, and set up another scan on specific databases that runs daily to keep the information in the data catalog up-to-date (full and incremental scans).
Next you can specify which rule set you want to apply for this scan. There is a system default for every data source, which includes all default classifications.
The power of custom classifications and scans
In the context of Azure Data Explorer, this can be very useful for a variety of use-cases:
Custom rules can be helpful for example if you want to identify custom part numbers that you might ingest with your IoT Telemetry or Log data, or other patterns that might help you identify certain attributes in tables specific for your business domain. These classifications will then be attributes of the tables/columns in Purview and users will be able to search for example for tables containing data about specific device families, product lines or production processes in your data catalogue.
Classifications can also be applied manually after a scan directly within Azure Purview on the relevant data assets, like on a table or column. For example you could highlight specific columns in a dataset that you know is used to measure customer interaction, when is a feature used and how long does it take a customer to get there. In combination with the Business Glossary in Purview these additional attributes can significantly improve the search and discovery experience for many user groups within a company. Business analysts can leverage this customer interaction data and see if there's a correlation to any metric they might use.
If you give good thought about which classifications and business terms might be useful, you can make your IoT-, factory floor and device telemetry data much more accessible, democratizing access to these data that historically are often siloed within manufacturing systems.
You can always look at an overview of the recent scans in the run history:
Browsing the data catalog: ADX data assets
After the scan(s) have finished, you can start browsing the newly discovered data assets. You can do that either by directly searching for a specific term (part of a table name, classification, etc.) or by clicking “Browse assets” in the main menu.
In the case of ADX, an overview of the registered data assets can look something like this:
Looking at an ADX table for example we get detailed information as to which database it belongs to and about the cluster this database is running on. We also see when it was last changed, we can add a description to the table and see/add classifications for easier discovery, as well as some ADX specific properties like a potential folder or the docstring.
Also as mentioned you can add descriptions as well as classifications and associate terms from the business glossary with every ADX data asset, visible on the bottom right here:
The scans also pick up the table schema, showing all columns and their respective data type. And while giving the data consumers in your company the ability to discover data easily is very important, here you can also add a contact person, an owner, that people can talk to to learn more about the data asset, how to get access and how to use it.
Visualizing the data flows - data lineage information
A very powerful piece of information is located in the “Lineage” tab. This overview shows you the data flow between the assets in your data catalog. In the context of ADX this currently means that every data movement that you defined using Azure Data Factory involving an ADX table will be visualized in this tab.
In addition to the automatically inferred lineage information from Azure Data Factory, we saw that many customers also use custom scripts for data ingestion, or Jupyter notebooks on a Spark cluster, or they ingest data programmatically using one of the ADX SDKs provided. In this case, to make these data flows transparent as well, you can use the well documented Apache Atlas REST API to create custom objects within Azure Purview.
You can find all the details about using the Azure Purview REST API here, including a sample postman collection to get you started.
An example: Visualize custom data flows within Azure Purview lineage
One feature within ADX that many customers use very frequently is the concept of Update Policies. When using them, you essentially chain two or more tables together and transform data between them in some form. These data transformations can be linear, e.g. from Table A -> Table B -> Table C, or they can "fan out", like filtering data per device family, e.g. from Table A -> Table B, as well as Table A -> Table C.
What that means is that update policies might be a very good candidate for visualization in the lineage tab. In order to achieve that, we use the Atlas API to create a new "Process" entity in Purview called "ADX Update Policy" with all the attributes it might have within ADX.
After the creation of this asset, all we need to do to use it in the lineage tab, is to link two tables together using our newly created "Update Policy" object. Now in order to do that, we first need to fetch the GUIDs of the according ADX tables from Azure Purview, so that the tool is actually able to uniquely identify them. For the sake of this blog post we assume that we looked them up manually, but you can of course also discover them via REST API Call (hint: /api/atlas/v2/search/advanced)
After we have all the information, the body of the API call could look like this:
The result then can look something like this:
Of course these are only basic examples and they involve some scripting as well as orchestration of API calls and other sorts of automation, but you can be sure that we are hard at work with the Purview team to extend the automated lineage information to more data ingestion and analytics scenarios relevant for ADX, stay tuned for more.