Introducing Automatic Schema Discovery with auto table creation for complex datatypes

Introduction

In this blog, we detail how GitHub leveraged the COPY command's automatic table creation feature to ingest complex data types present in Parquet files, ultimately improving ingestion efficiency and performance.

 

Azure Synapse Analytics brings the worlds of data integration, big data, and enterprise data warehousing together into a single service for end-to-end analytics at cloud scale. The ever-increasing volume and variety of data has led to the introduction of new data and file types. Apache Parquet is a columnar file format that is efficient and provides optimizations to speed up queries. Parquet supports nested data types and can represent a huge variety of data. The COPY command does not need strict ‘CONTROL’ permissions on the data warehouse to load the data.

 

Depending on the workload, the file-splitting feature of the COPY command provides better performance for ingesting data.

 

On 23rd September 2020, Microsoft announced automatic schema discovery within the COPY command, which gives customers the option to perform automatic table creation. With automatic schema discovery, customers can now ingest complex data types - a capability that was not previously easy to achieve. Complex data types ingested using the ‘AUTO_CREATE_TABLE’ flag in the COPY command are mapped to varchar(max) columns upon ingestion.
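
For illustration, here is a minimal sketch of what an auto-created table might look like for a hypothetical Parquet file containing an id column, a title string, and a labels array. The table name, column names, and exact scalar type mappings below are assumptions for illustration, not documented behavior:

-- Hypothetical auto-created table for a Parquet file with
-- columns (id: INT64, title: STRING, labels: LIST<STRING>)
CREATE TABLE dbo.SupportIssues
(
    id     bigint,          -- Parquet INT64
    title  varchar(8000),   -- Parquet STRING; the actual mapping may differ
    labels varchar(max)     -- complex type (array), ingested as JSON text
);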

 

Defining and mapping source data into target tables is a cumbersome process, especially when tables contain a large number of columns. Automatic schema discovery simplifies this process and works alongside the auto-table creation process. When used, the system automatically creates the target table based on the schema of the Parquet files, which it discovers by identifying the first and last alphanumerical files within the list.

 

With the increase in the variety of data, there was a need for the COPY command to support complex data types, such as arrays and maps. Because the COPY command provides both flexibility and performance, it can now recognize and ingest complex data types like maps and arrays.

 

The structured rows of Parquet files are organized into Rowgroups. A Rowgroup is a group of rows that is compressed into the Columnstore format. A Columnstore is data that is logically organized as a table with rows and columns, and physically stored in a column-wise data format. Previously, customers had to choose the ‘COMPRESSION’ option manually. With the Auto Compression feature, customers no longer need to choose the ‘COMPRESSION’ option; the auto-table creation process automatically discovers and handles input files that qualify for Rowgroup compression.
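
One way to confirm after a load that data landed in compressed Rowgroups is to query the columnstore metadata. A minimal sketch, assuming the dedicated SQL pool exposes the per-node Rowgroup catalog view with the same columns as sys.column_store_row_groups:

-- Summarize Rowgroup states (e.g. COMPRESSED vs. OPEN) across the pool
SELECT state_description,
       COUNT(*)        AS row_group_count,
       SUM(total_rows) AS total_rows
FROM sys.pdw_nodes_column_store_row_groups
GROUP BY state_description;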

 

Several customers have observed substantial performance improvements with these easy-to-use features: automatic schema discovery, automatic table creation, and automatic compression.

 

Use Case

 

GitHub

GitHub is the world’s leading developer platform. As the home to more than 73 million developers across the globe, GitHub is where developers can create, share, and ship the best code possible through a single, seamless experience.

 

Goal for GitHub

With the robust security and performance that Dedicated SQL pools within Azure Synapse Analytics offer, along with Power BI integration, GitHub wanted to add Dedicated SQL pools within Azure Synapse Analytics as a data frontend for their data warehouse alongside Trino/Hive (Trino was formerly PrestoSQL).

 

GitHub Architecture:

[Figure: GitHub architecture diagram]

 

1. The Parquet files containing complex data types, created using Trino/Hive, are stored in Azure Data Lake Storage (ADLS). Most of these Parquet files include attributes with complex data types like maps and arrays. Examples of data include:

  • Labels and assignees for a support issue are stored as an array
  • Details of A/B experiments are stored as an array of maps

2. The COPY command is used to ingest data from Azure Data Lake Storage (ADLS) into Dedicated SQL pools within Azure Synapse Analytics. The complex data types are then parsed with the OPENJSON(...) function, and the data is made available for reporting.

 

Note: GitHub is currently migrating from Trino/Hive to Dedicated SQL Pools within Azure Synapse Analytics. The above is a combination of current and future state architecture.

 

Challenge:

GitHub analyzes and derives insights from the various customer support tickets that it receives. The different categories of support tickets are tagged uniquely, creating an unlimited number of tags, which are stored in an array format. Customers want to analyze the data associated with a particular kind of tag.

 

As GitHub stores data in various complex data types, arrays being one of them, it needs to analyze and process these complex data types so that the data can be made available to customers for analysis and insight.

 

During the data ingestion process into the Dedicated SQL pools within Azure Synapse Analytics, PolyBase did not support complex data types such as maps and arrays. This lack of support prevented complex data types from being ingested.

 

Solution:

Automatic schema discovery, along with the automatic table creation feature embedded in the COPY command, was used, and complex Parquet data types were automatically mapped to varchar(max) columns. JSON functions are then used to parse the attributes of complex data types, as shown in the sketch below.
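
Because the array attributes land as JSON text in varchar(max) columns, OPENJSON can expand them into rows. A minimal sketch, assuming a hypothetical dbo.SupportIssues table whose labels column holds a JSON array such as '["bug","docs"]':

-- Expand each issue's labels array into one row per label
SELECT i.id,
       label.[value] AS label
FROM dbo.SupportIssues AS i
CROSS APPLY OPENJSON(i.labels) AS label;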

 

Feature details:

When the ‘AUTO_CREATE_TABLE’ flag is turned on in the ‘COPY’ statement, the data ingestion process creates a new table automatically by discovering the structure of the Parquet files, identifying the first and last alphanumerical files within the list.

 

T-SQL Syntax:

COPY INTO [table] FROM 'location' WITH (..., AUTO_CREATE_TABLE='On') 
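
A fuller, hypothetical example against ADLS (the storage account, container, folder path, and authentication method below are illustrative placeholders):

-- Ingest all Parquet files under the 'issues' folder; the target table
-- is created automatically from the discovered Parquet schema
COPY INTO dbo.SupportIssues
FROM 'https://<storageaccount>.dfs.core.windows.net/<container>/issues/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity'),
    AUTO_CREATE_TABLE = 'ON'
);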

 

[Screenshot: the new complex array data type from a Parquet file that is now supported]

 

Azure Portal - Azure Synapse Analytics: Synapse Pipeline

[Screenshot: Synapse pipeline in the Azure portal]

 

Conclusion:

This automatic schema discovery, along with the auto-table creation process, makes it easy for customers to automatically map and load complex data types present in Parquet files, such as arrays and maps, into Dedicated SQL pools within Azure Synapse Analytics. Rowgroup compression is automatically enabled when customers enable the auto-create table option within the COPY command. Start taking advantage of all these features today to simplify data ingestion with Azure Synapse Analytics.
