How to Organize your Data Lake


Data Lakes are one of the best outputs of the Big Data revolution, enabling cheap and reliable storage for all kinds of data, from relational to unstructured, from small to huge, from static to streaming. While on-prem implementations of this technology face administration and scalability challenges, public clouds have made our lives easier with data-lake-as-a-service offerings like Azure Data Lake, which provides unlimited scalability and integrated security.

 

But Data Lakes don’t enforce schema and can easily become Data Swamps: out-of-control data structures that deliver far less value than they could until the mess is cleaned up. In this blog post, you will see some suggestions to avoid this logical and organizational problem.

 

The Challenge 

 

Let’s start from a hypothetical scenario where the Data Lake is already deployed and multiple teams are using it for mobile apps, log management, IoT, EDW, and other Big Data scenarios. There are clients uploading images into intelligent agents, applications streaming terabytes of JSON files, multiple relational data silos sending data in CSV format, sensors sending IoT data, and much more. The data sources are in different time zones and currencies.

 

Still in our hypothetical scenario, let’s say the volume of data is increasing without control and folders (directories) are nested many levels deep. Developers are lost, unsure of which query engine to use, and the processes built so far are becoming slower and slower. And where should the outputs be saved? Loaded into a SQL database like Azure SQL DB? Or into an MPP database like Azure Synapse Analytics? Should the developers save the output of their transformations into the data lake itself? Using which format?

 

These are the ingredients of a hypothetical data swamp. After a few months you may have a perfect storm, where all analytics is compromised by costs, performance, missed SLAs, and wrong calculations.

 

The Solution 

 

Data Governance tools like Azure Data Catalog can help register and access the data assets, but that alone is not enough to avoid a data swamp. There isn’t a single measure or tool that prevents every possible problem with your data lake, but good practices will protect your environment from this kind of disaster. Here is a list of measures that you may take:

 

Folders Structure 

 

Now let’s start with some suggestions from our experience implementing many data lake projects. The first point is to define a clear directory structure that reflects its usage. Since a data lake is a distributed file system, everything will be a file within a folder. In collaboration with all teams, you can try to create a layered structure like the one below:

 

| Data Lake Layer | Usage | Expected Volume | Path – Per Project | Sub Folders (Granularity) |
|---|---|---|---|---|
| Raw Files | Files without any transformation, stored “as is”, every minute. This is the landing zone of your data lake. | ~ TBs / day | /project-name/raw-files | /year/month/day/hour/minute |
| Raw Data | All files are now in a queryable format: same time zone and currency, with special characters and duplicates removed. Small files are merged into bigger files, which is a best practice for big data workloads. | ~ GBs / day | /project-name/raw-data | /year/month/day |
| Business Data | Raw data plus business rules. Here you build the basic aggregations that will support all other analysis. It is a good idea to use parallel processing on top of the distributed file system for this heavy workload. This layer can also be used for data ingestion into DWs or DMs. | ~ MBs / day | /project-name/business-data | /year/month |

 

Some important points about the table above: 

  • Each layer is the input of the next one. 
  • You may want to add the data source after the project name. 
  • All these folders will contain sub folders per timestamp. Granularity decreases as you move to the next layer. This won’t be a problem for query engines, since they will leverage the metastore that maps the root folder as a table or container. 
  • If anything goes out of control, you can easily identify where the problem is happening. You can also create jobs to check and log the evolution of folder sizes. 
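
To make these conventions concrete, here is a minimal sketch, in Python, of how the layered paths above could be generated by your ingestion jobs. The build_layer_path helper and the granularity map are hypothetical illustrations of the naming scheme, not part of any SDK.

```python
from datetime import datetime, timezone

# Number of timestamp components per layer, matching the table above.
LAYER_GRANULARITY = {
    "raw-files": 5,      # /year/month/day/hour/minute
    "raw-data": 3,       # /year/month/day
    "business-data": 2,  # /year/month
}

def build_layer_path(project: str, layer: str, ts: datetime) -> str:
    """Build a layered path such as /project-name/raw-files/2024/05/17/10/30."""
    parts = [f"{ts.year:04d}", f"{ts.month:02d}", f"{ts.day:02d}",
             f"{ts.hour:02d}", f"{ts.minute:02d}"]
    return "/" + "/".join([project, layer] + parts[:LAYER_GRANULARITY[layer]])

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(build_layer_path("project-name", "raw-files", now))      # minute granularity
    print(build_layer_path("project-name", "business-data", now))  # month granularity
```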

 

Files Format 

 

It is very useful to avoid mixing different file formats or compression codecs in the same folder. This will help preserve the sanity of the developers and data engineers. Another table helps with the organization:

 

| Data Lake Layer | Files Format | Compression | Why |
|---|---|---|---|
| Raw Files | “as is” | Gzip | The same format as the original data, for fast data ingestion. Gzip delivers a good compression ratio for most file types. |
| Raw Data | Sequence Files | Snappy | Sequence files are Hadoop flat files that store data as binary key-value pairs. They are splittable, which allows faster parallel processing, and they make it easy to merge two or more files into one; merging the raw files into bigger ones is one of the key goals of this layer. Snappy doesn’t give the best compression ratio, but it has excellent compress/decompress performance. |
| Business Data | Parquet Files | Snappy | For interactive queries using Presto, Impala, or an external connection like Polybase, which allows you to query external data from Azure Synapse Analytics. Snappy is used again; Parquet files are columnar and compress well by nature, so the data volume isn’t a problem anymore. |
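
To illustrate how these formats fit together, below is a minimal PySpark sketch that promotes data from the raw-files layer to the raw-data layer (deduplicated, merged, Snappy-compressed sequence files) and then to the business-data layer (aggregated Snappy Parquet). The paths, the "id" and "customer_id" columns, and the aggregation are hypothetical placeholders; adapt them to your own schema and cluster.

```python
import json
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-layers-sketch").getOrCreate()

# Raw Files -> Raw Data: read the gzipped JSON "as is" (gzip is decompressed
# automatically based on the file extension), drop duplicates, and merge many
# small files into a few larger Snappy-compressed sequence files.
raw_files = spark.read.json("/project-name/raw-files/2024/05/17/*/*")
raw_data = raw_files.dropDuplicates()

(raw_data.rdd
    .map(lambda row: (row["id"], json.dumps(row.asDict(recursive=True), default=str)))
    .coalesce(16)  # fewer, bigger files
    .saveAsSequenceFile(
        "/project-name/raw-data/2024/05/17",
        compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec"))

# Raw Data -> Business Data: apply a business aggregation and store the result
# as columnar, Snappy-compressed Parquet for interactive query engines.
business = (raw_data
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_amount")))

(business.write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet("/project-name/business-data/2024/05"))
```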

 

With a fixed organization like this, it is easy to determine which tools should be used for data integration and analysis. In turn, the tool used to access the data will define which professionals have access to each layer. This point is addressed in the next topic.

 

Access Control 

 

Now let’s see how we can organize access to each layer. Again, let’s use a table for better visualization. 

 

| Data Lake Layer | Files Format | Compatible Tools | Access for | Access Type | How |
|---|---|---|---|---|---|
| Raw Files | “as is” | Hive, Pig, Java, Python | Data Engineers | Read / Write | Batch jobs |
| Raw Data | Sequence Files | Hive, Python, Impala, Hive2, Drill, Presto | Data Engineers, Data Scientists | Read / Write | Data exploration activities, keeping in mind that the data has not yet been subjected to the business rules. The size of this layer and the state of the data make it unsuitable for data analysts or end users. |
| Business Data | Parquet Files | Impala, Hive2, Drill, Presto, BI Tools, Polybase, Sqoop, Azure Data Factory | Data Engineers, Data Scientists, Data Analysts | Read Only | Here the data is user friendly and the format is optimized for interactive queries. Modern SQL databases that allow external data access can query this layer for extra data integrations or for a virtual data warehouse; Polybase is the key tool to do this on Azure, and the external data can also be used to feed data marts with the SQL database. Tools like Sqoop or ADF can be used to export the data into SQL databases. |
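
If the lake is Azure Data Lake Storage Gen2, one possible way to express the table above is with POSIX-style ACLs on each layer’s root folder. The following is only a minimal sketch, assuming the azure-storage-file-datalake Python SDK; the account URL, file system name, and Azure AD group object IDs are placeholders, and in practice you would usually combine ACLs with RBAC role assignments and apply them recursively.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account, file system, and Azure AD group object IDs.
ACCOUNT_URL = "https://mydatalake.dfs.core.windows.net"
FILE_SYSTEM = "project-name"
ENGINEERS = "<engineers-group-object-id>"
SCIENTISTS = "<scientists-group-object-id>"
ANALYSTS = "<analysts-group-object-id>"

# Per-layer named-group entries mirroring the access table:
# rwx = read/write, r-x = read only.
LAYER_ACLS = {
    "raw-files": f"group:{ENGINEERS}:rwx",
    "raw-data": f"group:{ENGINEERS}:rwx,group:{SCIENTISTS}:rwx",
    "business-data": (f"group:{ENGINEERS}:r-x,"
                      f"group:{SCIENTISTS}:r-x,"
                      f"group:{ANALYSTS}:r-x"),
}

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())
file_system = service.get_file_system_client(FILE_SYSTEM)

for layer, named_entries in LAYER_ACLS.items():
    directory = file_system.get_directory_client(layer)
    # Base owner/group/mask/other entries plus the named-group entries above.
    directory.set_access_control(
        acl=f"user::rwx,group::r-x,mask::rwx,other::---,{named_entries}")
```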

 

 

Conclusion 

 

If you create methods to enforce this big data architecture, most of the typical problems will be avoided. You can also create monitoring jobs to search for and log problems, which allows you to keep a record of the state of the data within your data lake. The human factor is decisive, and methodologies like TDSP are useful to guide conversations with the data science team. An open dialog with all stakeholders should be maintained before, during, and after the process implementations.
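
As a starting point for such monitoring jobs, here is a minimal sketch that logs the size of each layer folder over time, assuming the lake is reachable as a mounted file system (for example through a Databricks mount or blobfuse); the mount point, layer names, and CSV log format are hypothetical.

```python
import csv
import os
from datetime import datetime, timezone

# Hypothetical mount point and project layers to monitor.
LAKE_ROOT = "/mnt/datalake/project-name"
LAYERS = ["raw-files", "raw-data", "business-data"]
LOG_FILE = "lake_size_log.csv"

def folder_size_bytes(path: str) -> int:
    """Sum the size of every file under the given folder."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

def log_layer_sizes() -> None:
    """Append one CSV row per layer with a timestamp and the current size."""
    now = datetime.now(timezone.utc).isoformat()
    with open(LOG_FILE, "a", newline="") as handle:
        writer = csv.writer(handle)
        for layer in LAYERS:
            size = folder_size_bytes(os.path.join(LAKE_ROOT, layer))
            writer.writerow([now, layer, size])

if __name__ == "__main__":
    log_layer_sizes()
```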
