How to infer automatically table schema in Azure Data Explorer and more

This post has been republished via RSS; it originally appeared at: Azure Data Explorer articles.

Querying external data is a relatively new capability in Azure Data Explorer, but it's clear already that the feature was awaited by many as it helps solve two very important data engineering challenges: data fusion and cost efficiency. Two new functions, which have been added recently to Azure Data Explorer toolbox make working with external data even more easier. the first infers external table schema she second visualizes what files are going to be processed.

 

The first one allows inferring external data schema, which comes in handy when creating an external table definition on data with unknown schema. How come the data schema is unknown, you might ask? It might be the case when data was created by a different team or even a different organization (see recent post by Amazon's CTO about challenges of managing Data Lakes in rapidly growing organizations). Otherwise, it may be a very silly case when ETL logic is so complex to reverse-engineer and the number of dataset columns is hundreds, so it's just easier to look at the data itself and discover its schema from there. The plug-in accepts an object containing storage connection URI and parameters:

 

clipboard_image_4.png

 

By default, only the latest written file will be read and its schema will be returned. This is enough for most scenarios as it's usually suggested to only expand schema when evolving it. There is an option to tell the plug-in to use any file or all files by setting "any" or "all" to Mode parameter respectively (use the latter setting with care, since scanning all files may become an extremely resource intensive operation depending on the number of files). The result contains a schema string that can be used when defining an external table:

 

clipboard_image_6.png

Please note that it's advisable to review inferred types, especially for data formats with no embedded schema like delimited text or JSON.

 

The second function helps visualize what files are going to be processed when querying an external table. This doesn't mean that all these files will be downloaded and read at query time, since some of them (preferably, most of them) will be skipped by query predicates. This tool is great for validating external table definitions:

 

clipboard_image_7.png

Hope you find these new tools helpful, and please let us know what capabilities you're looking for with regard to external data support in Azure Data Explorer!

REMEMBER: these articles are SYNDICATED. Your best bet to get a reply is to follow the link at the top of the post to the ORIGINAL post! BUT you're more than welcome to start discussions here:

This site uses Akismet to reduce spam. Learn how your comment data is processed.