Connecting FHIR Data to Azure Databricks Delta Lake in Azure Health Data Services

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.

Authors: Mikael Weaver (Microsoft), Darnell Harris (Microsoft)

Co-authors: Bruce Nelson (Databricks), Dave Kelly (Databricks)

The HL7 Fast Healthcare Interoperability Resources (FHIR®) standard is gaining acceptance and adoption across the healthcare industry as healthcare companies implement FHIR as a method of exchanging Protected Health Information (PHI). The industry is now looking for ways to conduct analytics and Machine Learning (ML) on top of this data.

Microsoft’s Azure Health Data Services allows for unification of healthcare data in the cloud to make PHI easier to exchange across the care continuum. Its goal is to standardize diverse data streams such as clinical, imaging, device, and unstructured data using FHIR, DICOM, and MedTech services.

The FHIR service within Azure Health Data Services is well suited to ingesting, persisting, and managing healthcare data in the native FHIR format. Customers who want to conduct analytics on this data do, however, run into a few challenges: nested schemas and data duplication, both of which require the customer to extract the data before analyzing it. The FHIR to Data Lake OSS pipelines output data in a form that various analytical services can consume.

Data Lakehouse and Delta Lake

What is a Data Lakehouse

A Data Lakehouse is an open data architecture that combines existing features from traditional data lakes and data warehouses. With a Lakehouse you can store structured, semi-structured, and unstructured data in your data lake and perform analytics, data science, machine learning, and more on one platform. Lakehouse is available on multiple data platforms, including Azure Databricks and Azure Synapse Analytics. Databricks is a popular data analytics platform that runs natively on Azure. Databricks has shifted focus from data lake to Delta Lake. To learn more about Lakehouse, go to the Databricks Lakehouse documentation.

Why Delta Lake

Delta Lake has emerged as the leading storage framework that enables building a Lakehouse architecture on top of existing data lake technologies (like Azure Data Lake Storage (ADLS)). The primary components of Delta Lake are open source, though the platform itself is a first-party service with Azure. Delta Lake solves common data lake pain points with ACID (atomicity, consistency, isolation, and durability) transactions, unified streaming and batch ingestion, schema and governance, time travel, metadata handling, and a unified view for records with a historical view. Delta Lake can be used to ensure a consistent level of data quality for analytical workloads. To learn more about Delta Lake, visit the Delta Lake Documentation Page.

Delta Lake for FHIR Data

Delta Lake addresses a significant challenge: FHIR bundles are semi-structured, nested, and somewhat complex to consume and derive meaning from. Delta Lake plays a critical role in making this data easy to manage and in deriving structure and meaning from it. As data is created in the FHIR service and then periodically exported to the data lake, all versions of the resources are exported, so developers must filter multiple versions to retrieve the most current record. Delta Lake makes it possible to mark changes in an update and keep one version, the latest. The FHIR data output to the data lake typically needs processing before it can be integrated into analytics tools, because FHIR data typically has nested schemas. This problem is solved with the Delta Live Tables and Auto Loader features in Azure Databricks. To learn more about these features, see the Delta Live Tables Introduction and Auto Loader.
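To make the "latest version" idea concrete, here is a minimal plain-Python sketch of the filtering logic. It is not how Delta Lake implements updates internally; it only illustrates the rule of keeping one row per resource. The sample records are invented, and the sketch assumes each exported record still carries the standard FHIR meta.lastUpdated element.

```python
from datetime import datetime


def parse_instant(value):
    """Parse a FHIR instant such as '2023-01-05T10:00:00Z' into an aware datetime."""
    # datetime.fromisoformat() only accepts a trailing 'Z' from Python 3.11 on,
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def latest_versions(resources):
    """Keep only the most recently updated version of each (resourceType, id)."""
    latest = {}
    for resource in resources:
        key = (resource["resourceType"], resource["id"])
        updated = parse_instant(resource["meta"]["lastUpdated"])
        if key not in latest or updated > latest[key][0]:
            latest[key] = (updated, resource)
    return [resource for _, resource in latest.values()]


# Invented sample export: patient p1 appears in two versions.
exported = [
    {"resourceType": "Patient", "id": "p1", "gender": "unknown",
     "meta": {"versionId": "1", "lastUpdated": "2023-01-05T10:00:00Z"}},
    {"resourceType": "Patient", "id": "p1", "gender": "female",
     "meta": {"versionId": "2", "lastUpdated": "2023-02-01T08:30:00Z"}},
    {"resourceType": "Patient", "id": "p2", "gender": "male",
     "meta": {"versionId": "1", "lastUpdated": "2023-01-20T12:00:00Z"}},
]

current = latest_versions(exported)
print(sorted((r["id"], r["meta"]["versionId"]) for r in current))
# → [('p1', '2'), ('p2', '1')]
```

In a Lakehouse this filtering would normally be expressed as a merge or change-data operation on a Delta table rather than in application code, so that every downstream consumer sees the same single current row.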

Enabling Delta Lake with Azure Health Data Services

Referenced Architecture

The architecture below is a sample data analytics flow from Azure Health Data Services using Azure Databricks to set up a Lakehouse using Delta Lake on top of Azure Data Lake Storage Gen2. In the diagram, we leverage the FHIR to Data Lake open-source project to continually export data from a FHIR service. In addition, we use Databricks pipelines and other features to easily set up a Delta Lake Lakehouse. This enables big data analytics by any tool that works with Delta Lake, like Databricks Analytics, Synapse Analytics, and Power BI.

design-considerations-Overview.drawio.png

FHIR Data to Delta Lake (Auto Loader and Delta Live Tables)

When exporting FHIR data to Data Lake, the FHIR to Analytics OSS pipeline writes its output files in the open Parquet format. Azure Databricks has native tools to easily ingest these Parquet files into Delta Lake. Delta Live Tables provides a data processing pipeline framework to move data into a Delta Lake, and Auto Loader can be used inside these pipelines to automatically feed them with newly arrived data. The picture below shows an example of this:

design-considerations-Auto Loader.drawio.png
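As a rough sketch of what such a pipeline step could look like in Python, the function below registers one Delta Live Tables streaming table that Auto Loader feeds from a folder of exported Parquet files. The function name, the bronze_ table-name prefix, and the source path are illustrative assumptions and do not come from the sample project. The dlt module and the spark session are passed in as arguments only so that the sketch can be loaded outside a Databricks pipeline; inside a pipeline notebook both are provided by the runtime.

```python
def define_bronze_table(dlt, spark, resource_type, source_path):
    """Register a Delta Live Tables streaming table that Auto Loader feeds.

    `dlt` is the Delta Live Tables Python module and `spark` the active
    SparkSession, both as available in a Databricks pipeline notebook.
    """
    table_name = f"bronze_{resource_type.lower()}"  # hypothetical naming scheme

    @dlt.table(name=table_name, comment=f"Raw {resource_type} rows from the FHIR export")
    def load_bronze():
        return (
            spark.readStream.format("cloudFiles")     # "cloudFiles" selects Auto Loader
            .option("cloudFiles.format", "parquet")  # the export writes Parquet files
            .load(source_path)
        )

    return table_name
```

In a pipeline notebook you would run import dlt and then call define_bronze_table(dlt, spark, "Patient", path), where path points at the folder your export writes Patient files to. Auto Loader keeps track of which files it has already processed, so each pipeline update only picks up files that arrived since the previous one.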

Creating a Lakehouse with Delta Tables (Looking to Medallion Architecture)

A medallion architecture is a data design pattern used to logically organize data in a Lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables). Medallion architectures are sometimes also referred to as "multi-hop" architectures. To learn more, see the Databricks Glossary.

Creating a Medallion Architecture Lakehouse with Delta Live Tables

Bronze

The Bronze layer is where you land raw data as is from source systems. Data from the FHIR to Analytics pipeline is moved to the bronze layer as is, next to data from other systems. Provided below is an example of how this will work in Delta Lake.

For FHIR data from FHIR to Data Lake, the bronze layer is a copy of the data extracted from the Data Lake path exported from your FHIR service. Once this data has been moved successfully to bronze and you are confident in your Delta Lake deployment, you can begin to delete it from the Data Lake path. The bronze layer should be as close to the format of your source systems as possible.

Silver

The Silver layer holds the cleaned and transformed data from your bronze layer and provides an enterprise-wide view of each entity. Data is inspected for a baseline of quality, for example, “does the patient have an identifier?” Data is also transformed in Silver, which usually means flattening the nested FHIR JSON schema. These tables can be used for many different analytical and reporting use cases.

We recommend that you flatten the data in these tables as much as possible to simplify connecting downstream applications directly to your silver layer. FHIR data is heavily nested, and it's best to incur the complexity of this transformation once in your data platform. For example, you may have a column for your EMPI (enterprise master patient ID) that is a flattened, filtered result from the identifier element on the Patient resource.
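The flattening and the baseline quality check can be illustrated with a small plain-Python sketch. It works on one Patient resource in FHIR JSON form and produces a flat row with an EMPI column. The identifier system URIs, the column names, and the sample patient are invented for illustration. The quality rule shown is a narrowed version of "does the patient have an identifier" that looks specifically for an EMPI identifier.

```python
EMPI_SYSTEM = "http://example.org/empi"  # hypothetical identifier system URI


def flatten_patient(patient, empi_system=EMPI_SYSTEM):
    """Turn a nested FHIR Patient resource into one flat row (a plain dict)."""
    # Pick the identifier whose system matches the EMPI; None if there is none.
    empi = next(
        (i.get("value") for i in patient.get("identifier", [])
         if i.get("system") == empi_system),
        None,
    )
    # A Patient can carry several names; this sketch uses the first one.
    name = (patient.get("name") or [{}])[0]
    return {
        "patient_id": patient.get("id"),
        "empi": empi,
        "family_name": name.get("family"),
        "given_name": " ".join(name.get("given", [])) or None,
        "gender": patient.get("gender"),
        "birth_date": patient.get("birthDate"),
    }


def passes_quality_check(row):
    """Baseline rule for this sketch: the row must carry an EMPI identifier."""
    return row["empi"] is not None


# Invented sample resource with two identifiers, only one of which is the EMPI.
patient = {
    "resourceType": "Patient",
    "id": "p1",
    "identifier": [
        {"system": "http://example.org/mrn", "value": "MRN-123"},
        {"system": "http://example.org/empi", "value": "EMPI-987"},
    ],
    "name": [{"family": "Chalmers", "given": ["Peter", "James"]}],
    "gender": "male",
    "birthDate": "1974-12-25",
}

row = flatten_patient(patient)
print(row["empi"], row["given_name"], passes_quality_check(row))
# → EMPI-987 Peter James True
```

In Databricks the same transformation would typically be written once as a Spark or Delta Live Tables step over the bronze table, with the quality rule attached as an expectation, so that every consumer of the silver table gets the same flat columns.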

Gold

The Gold layer builds on top of silver and is consumption-ready data for specific use cases. This data is generally aggregated and has less fidelity than the silver layer. Trend reporting and dashboards generally point to the gold layer.
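As a small illustration of a gold-level aggregate, the plain-Python sketch below rolls flat patient rows up into counts per decade of birth. The row layout (a birth_date column holding a FHIR date string) and the choice of aggregate are assumptions made for this example; a real gold table would reflect whatever your reports and dashboards need.

```python
from collections import Counter


def patients_by_birth_decade(silver_rows):
    """Aggregate flat patient rows into counts per decade of birth."""
    counts = Counter()
    for row in silver_rows:
        birth_date = row.get("birth_date")
        if not birth_date:
            continue  # rows without a birth date are left out of this aggregate
        # FHIR dates start with a four-digit year (YYYY, YYYY-MM or YYYY-MM-DD).
        decade = int(birth_date[:4]) // 10 * 10
        counts[decade] += 1
    return [
        {"birth_decade": decade, "patient_count": count}
        for decade, count in sorted(counts.items())
    ]


# Invented silver rows; only the columns this aggregate needs are shown.
silver_rows = [
    {"patient_id": "p1", "birth_date": "1974-12-25"},
    {"patient_id": "p2", "birth_date": "1979-03-02"},
    {"patient_id": "p3", "birth_date": "1990-07"},
    {"patient_id": "p4", "birth_date": None},
]

gold = patients_by_birth_decade(silver_rows)
print(gold)
# → [{'birth_decade': 1970, 'patient_count': 2}, {'birth_decade': 1990, 'patient_count': 1}]
```

The output no longer identifies individual patients, which is the loss of fidelity described above: the table answers one reporting question well and nothing else.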

In summary, the way you create your medallion architecture, enforce data quality, and transform data should be driven by your institution's processes and procedures. Using the medallion architecture as a guide allows you to create progressively higher-quality tables for data out of FHIR. The need for gold and silver tables will be driven primarily by customer-defined needs: what counts as a "cleaned" dataset and which "business level aggregates" are required may differ from customer to customer. Please see the example below of a medallion architecture utilizing FHIR data.

design-considerations-Medallion.drawio.png

Takeaways

Azure Health Data Services integrates with all common data platforms through open standards.
The Lakehouse paradigm provides a modern data analytics platform that promotes data reuse and self-service across organizations.
Azure Databricks provides an easy solution to set up your Lakehouse using data from your FHIR service.
For more information follow http://aka.ms/fhir-azuredatabricks-delta-lake-sample

Do more with your data with Microsoft Cloud for Healthcare

With Azure Health Data Services, health organizations can transform their patient experience, discover new insights with the power of machine learning and AI, and manage PHI data with confidence. Enable your data for the future of healthcare innovation with Microsoft Cloud for Healthcare.

We look forward to being your partner as you build the future of health.

Learn more about Azure Health Data Services.
Learn more about Microsoft Cloud for Healthcare.

®FHIR is a registered trademark of Health Level Seven International, registered in the U.S. Trademark Office, and is used with their permission.
