How to Organize your Data Lake

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

Data Lakes are one of the best outputs of the Big Data revolution, enabling cheap and reliable storage for all kinds of data: from relational to unstructured, from small to huge, from static to streaming. While on-premises implementations of this technology face administration and scalability challenges, public clouds have made our lives easier with data-lake-as-a-service offerings, like Azure Data Lake, which provides virtually unlimited scalability and integrated security.

But Data Lakes don’t enforce a schema and can easily become a Data Swamp: a useless, out-of-control data structure that won’t be as useful as it could be until the clutter is removed. In this blog post, you will see some suggestions to avoid this logical and organizational problem.

The Challenge

Let’s start from a hypothetical scenario where the Data Lake was already deployed and multiple teams are using it for mobile apps, logs management, IoT, EDW, and other Big Data Scenarios. There are clients uploading images into intelligent agents, applications streaming Terabytes of JSON files, multiple relational data silos sending data in CSV format, sensors sending IoT data and much more. The data sources are in different time zones and currencies.

Still in our hypothetical scenario, let's say the volume of data is growing unchecked and there are numerous folders (directories) nested within each other. Developers are lost, unsure of which query engine to use, and the processes built so far are becoming slower and slower. And where should the outputs be saved? Loaded into a SQL database like Azure SQL DB? Or into an MPP database like Azure Synapse Analytics? Should the developers save the output of their transformations into the data lake itself? In which format?

These are the foundations of a hypothetical data swamp. After a few months you may have a perfect storm, in which all analytics is compromised by costs, poor performance, missed SLAs, and wrong calculations.

The Solution

Data Governance tools like Azure Data Catalog can help to register and access the data assets, but they are not enough to avoid a data swamp. There isn’t a single measure or tool that avoids all possible problems with your data lake, but good practices will protect your environment from this kind of disaster. Here is a list of measures that you may take:

Folder Structure

Let’s start with some suggestions from our experience implementing many data lake projects. The first point is to define a clear directory structure that reflects its usage. Since a data lake is a distributed file system, everything will be a file within a folder. In collaboration with all teams, you can try to create a layered structure like the one below.

Data Lake Layer	Usage	Expected Volume	Path – Per Project	Sub Folders (Granularity)
Raw Files	Files without any transformation, stored “as is”, every minute. This is the landing zone of your data lake.	~ TBs / day	/project-name/raw-files	/year/month/day/hour/minute
Raw Data	All files are now in a queryable format, with the same time zone and currency. Special characters and duplicates have been removed. Small files are merged into bigger files, which is a best practice for big data workloads.	~ GBs / day	/project-name/raw-data	/year/month/day
Business Data	Raw data + business rules. Here you start to have the basic aggregations that will help all other analyses. It is a good idea to use parallel processing on top of the distributed file system to accomplish this heavy workload. This layer can also be used for data ingestion into DWs or DMs.	~ MBs / day	/project-name/business-data	/year/month
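
As an illustration of the cleanup that happens between Raw Files and Raw Data, here is a minimal sketch that converts timestamps to a single time zone (UTC), converts amounts to a single currency, and drops duplicates. The record fields (id, timestamp, amount, currency), the target currency, and the exchange rates are all hypothetical; real rates would come from a reference source, and a real job would run on a distributed engine rather than in plain Python. It also assumes every timestamp carries its UTC offset.

```python
from datetime import datetime, timezone
from decimal import Decimal

# Hypothetical exchange rates to the target currency (USD), for illustration only.
RATES_TO_USD = {
    "USD": Decimal("1"),
    "EUR": Decimal("1.10"),
    "BRL": Decimal("0.20"),
}


def normalize(record):
    """Return a new record with a UTC timestamp and the amount in USD."""
    # Assumes the timestamp is ISO 8601 with an explicit offset, e.g. +02:00.
    ts = datetime.fromisoformat(record["timestamp"]).astimezone(timezone.utc)
    amount = Decimal(record["amount"]) * RATES_TO_USD[record["currency"]]
    return {
        "id": record["id"],
        "timestamp": ts.isoformat(),
        "amount_usd": str(amount.quantize(Decimal("0.01"))),
    }


def deduplicate(records):
    """Keep the first record seen for each (id, timestamp) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (record["id"], record["timestamp"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Normalizing before deduplicating matters: the same event reported by two sources in different time zones only becomes recognizable as a duplicate once both timestamps are in UTC.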

Some important points about the table above:

Each layer is the input of the next one.
You may want to add the data source after the project name.
All these folders will contain subfolders per timestamp. Granularity will decrease as you move to the next layer. This won’t be a problem for query engines, since they will leverage the metastore that maps the root folder as a table or container.
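
To make the convention concrete, here is a minimal sketch of a helper that builds the folder path for a given project, layer, and timestamp. The function name, the zero-padded date parts, and the optional source argument are our own assumptions for illustration; adapt them to your naming rules.

```python
from datetime import datetime, timezone

# Time-partition depth per layer, following the table above.
LAYER_GRANULARITY = {
    "raw-files": ("%Y", "%m", "%d", "%H", "%M"),
    "raw-data": ("%Y", "%m", "%d"),
    "business-data": ("%Y", "%m"),
}


def build_path(project, layer, ts, source=None):
    """Build /project[/source]/layer/<time partitions> for a timestamp."""
    if layer not in LAYER_GRANULARITY:
        raise ValueError(f"unknown layer: {layer}")
    parts = [project]
    if source:
        # Optional data source folder, placed right after the project name.
        parts.append(source)
    parts.append(layer)
    # Zero-padded date parts are an assumption; they keep folders sorted.
    parts.extend(ts.strftime(fmt) for fmt in LAYER_GRANULARITY[layer])
    return "/" + "/".join(parts)


ts = datetime(2020, 3, 1, 8, 5, tzinfo=timezone.utc)
print(build_path("project-name", "raw-files", ts))
print(build_path("project-name", "business-data", ts, source="crm"))
```

Generating paths from one shared helper, instead of letting each team concatenate strings, is one simple way to keep the structure consistent across projects.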

If anything goes out of control, you can easily identify where the problem is happening. You can also create jobs to check and log how the folder sizes evolve.
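
A size-monitoring job could look like the sketch below. It assumes the data lake is reachable as a local or mounted file system path, which is a simplification; against a cloud data lake you would list files through the storage service's own SDK instead. The function names are hypothetical.

```python
import os


def folder_sizes(root):
    """Map each top-level subfolder of root to the total bytes beneath it."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_dir():
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    total += os.path.getsize(os.path.join(dirpath, name))
            sizes[entry.name] = total
    return sizes


def growth(previous, current):
    """Return the size change in bytes per folder between two snapshots."""
    return {
        name: size - previous.get(name, 0)
        for name, size in current.items()
    }
```

Running folder_sizes on a schedule, storing each snapshot, and comparing consecutive snapshots with growth shows which project or layer is growing faster than expected.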

File Format

It is very useful to avoid mixing different file formats or compression codecs in the same folder. This will help the sanity of the developers and data engineers. Another table will help the organization:

Data Lake Layer	File Format	Compression	Why
Raw Files	“as is”	Gzip	The same format as the original data, for fast data ingestion. Gzip delivers a good compression ratio for most file types.
Raw Data	Sequence Files	Snappy	Sequence files are Hadoop flat files that store data as binary key-value pairs. Because they are splittable, they can be processed faster in parallel. The main advantage of using sequence files is the ability to merge two or more files into one, and merging the raw files into bigger ones is one of the key goals of this layer. Snappy doesn’t give the best compression ratio, but it has excellent compress/decompress performance.
Business Data	Parquet Files	Snappy	For interactive queries using Presto, Impala, or an external connection like Polybase, which allows you to query external data from Azure Synapse Analytics. Snappy compression is used again; Parquet files are columnar, which makes them compress well by nature. The data volume isn’t a problem anymore.

With a fixed organization like this, it is easy to determine which tools should be used for data integration and analysis. And the tool used to access the data will define which kinds of professionals will have access to each layer. This point is addressed in the next topic.
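
The rule of one format and one compression per folder can be checked automatically. The sketch below walks a folder tree and reports every folder that holds more than one file suffix. It assumes the lake is reachable as a file system path and that file names contain no dots other than in the suffix; both are simplifications, and the function name is hypothetical.

```python
import os
from collections import defaultdict


def mixed_format_folders(root):
    """Return {folder: sorted suffixes} for folders with more than one suffix."""
    suffixes = defaultdict(set)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Everything after the first dot is treated as the suffix, so
            # "events.json" and "events.json.gz" count as different formats.
            _, _, suffix = name.partition(".")
            suffixes[dirpath].add(suffix.lower())
    return {
        folder: sorted(found)
        for folder, found in suffixes.items()
        if len(found) > 1
    }
```

An empty result means every folder is homogeneous; any entry in the result points to a folder that needs attention.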

Access Control

Now let’s see how we can organize access to each layer. Again, let’s use a table for better visualization.

Data Lake Layer	File Format	Compatible Tools	Access for	Access Type	How
Raw Files	“as is”	Hive, Pig, Java, Python	Data Engineers	Read / Write	Batch jobs
Raw Data	Sequence Files	Hive, Python, Impala, Hive2, Drill, Pres	Data Engineers, Data Scientists	Read / Write	Data exploration activities, keeping in mind that the business rules have not yet been applied to the data. The size of this layer and the state of the data make it unusable for data analysts or end users.
Business Data	Parquet Files	Impala, Hive2, Drill, Presto, BI Tools, Polybase, Sqoop, Azure Data Factory	Data Engineers, Data Scientists, Data Analysts	Read Only	Here the data is user friendly and the format is optimized for interactive queries. Modern SQL databases that allow external data access can query this data for extra data integrations or for a virtual data warehouse. Polybase is the key tool to do this on Azure. External data can also be used to feed data marts in the SQL database. Tools like Sqoop or ADF can also be used to export the data into SQL databases.
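
The access rules in the table can also be written down as data, which makes them easy to review and test. The sketch below is only an illustration: the role names and the is_allowed function are hypothetical, and in a real deployment the rules would be enforced by the storage service's own permission model rather than by application code.

```python
# Roles and access type per layer, transcribed from the table above.
ACCESS = {
    "raw-files": {
        "roles": {"data-engineer"},
        "write": True,
    },
    "raw-data": {
        "roles": {"data-engineer", "data-scientist"},
        "write": True,
    },
    "business-data": {
        "roles": {"data-engineer", "data-scientist", "data-analyst"},
        "write": False,  # listed as Read Only in the table
    },
}


def is_allowed(role, layer, action):
    """Return True if the role may "read" or "write" on the given layer."""
    if action not in ("read", "write"):
        raise ValueError(f"unknown action: {action}")
    policy = ACCESS.get(layer)
    if policy is None or role not in policy["roles"]:
        return False
    return action == "read" or policy["write"]


print(is_allowed("data-analyst", "business-data", "read"))
print(is_allowed("data-analyst", "raw-data", "read"))
```

Because the table lists the business layer as read only, the sketch denies writes there for every listed role.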

Conclusion

If you create methods to enforce this big data architecture, most of the typical problems will be avoided. You can also create monitoring jobs to search for and log problems, which allows you to keep a record of the state of the data within your data lake. The human factor is decisive, and methodologies like TDSP are useful to guide conversations with the data science team. An open dialog with all stakeholders should be maintained before, during, and after the implementation of these processes.