End-to-end Fabric Git Repo for 220M rows of CMS data


This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.

Microsoft Fabric unites data and analytics tools within a single SaaS platform. Fabric serves many different user personas and tools, and new users are looking for ways to skill up. This article introduces a new GitHub repository that lets anyone with access to Fabric deploy an end-to-end solution built on 220 million rows of real open healthcare data from CMS. Without writing any code, users can follow the instructions in the Git repo to import the data into OneLake, serve it from the Lakehouse, and then query it from Power BI and Excel using the new Direct Lake connector. Below is an architectural diagram of the solution:




Here is a link to the Git repo: fabric-samples-healthcare/analytics-bi-directlake at main · isinghrana/fabric-samples-healthcare (github.com)


This is the first release of the GitHub repository, which will be a hub for new easy-to-deploy Fabric healthcare solutions going forward. The diagram above shows the simple steps of the solution:

  1. Download the files from CMS and upload them to Fabric OneLake.
  2. Combine the files into a single table in the Fabric Lakehouse, stored in Delta Parquet format. Don't worry, you can deploy the Spark Notebook without having to write code!
  3. Create a Fabric Direct Lake dataset that queries the table without caching any of the 220M+ rows.
  4. Create reports in Power BI and Excel that query the Lakehouse with impressive query performance.
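For readers curious about what step 2 might look like under the hood, here is a minimal PySpark sketch. It is not the notebook from the repo. The file names (`partd_2013.csv` and so on), the `Files/cms_raw/` folder, the added `Year` column, and the table name `cms_part_d_prescribers` are all hypothetical, chosen only for illustration. The sketch assumes one CSV per year with a header row, and a `spark` session like the one a Fabric notebook provides.

```python
import re
from typing import Iterable, Optional


def year_from_filename(name: str) -> Optional[int]:
    """Return the first four-digit year found in a file name if it falls
    within 2013-2021, else None.

    Assumes a hypothetical naming scheme such as 'partd_2013.csv'.
    """
    match = re.search(r"(?<!\d)(20\d{2})(?!\d)", name)
    if not match:
        return None
    year = int(match.group(1))
    return year if 2013 <= year <= 2021 else None


def combine_to_delta(spark, file_paths: Iterable[str],
                     table_name: str = "cms_part_d_prescribers"):
    """Read each yearly CSV, tag its rows with the year, union them,
    and save the result as a single Delta table in the Lakehouse."""
    # Imported here so the helper above works even without PySpark installed.
    from pyspark.sql import functions as F

    combined = None
    for path in file_paths:
        year = year_from_filename(path)
        if year is None:
            raise ValueError(f"cannot determine year for {path!r}")
        df = (
            spark.read.option("header", True).csv(path)
            .withColumn("Year", F.lit(year))  # hypothetical extra column
        )
        # unionByName matches columns by name, tolerating layout drift
        # between yearly files (allowMissingColumns needs Spark 3.1+).
        combined = df if combined is None else combined.unionByName(
            df, allowMissingColumns=True
        )

    if combined is None:
        raise ValueError("no input files given")

    combined.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return combined
```

In a notebook attached to a Lakehouse, a call might look like `combine_to_delta(spark, [f"Files/cms_raw/partd_{y}.csv" for y in range(2013, 2022)])`. Tagging each row with its year keeps the nine annual files distinguishable once they are combined into one table.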

The data used in the solution is real CMS open data for Medicare Part D Prescribers - By Prescriber and Drug. The data details drug names, physician names, geographical data, costs, beneficiary counts, and more. The data spans from 2013 to 2021, and totals over 220 million rows. 




This solution was created by Greg Beaumont and Inder Rana, who are Data & AI Technical Specialists for Microsoft Healthcare and Life Sciences:


Inder Rana

LinkedIn: https://www.linkedin.com/in/singhinderjit

Blog: https://isinghrana.medium.com/


Greg Beaumont

LinkedIn: https://www.linkedin.com/in/gregbeaumont

Twitter: https://twitter.com/grbeaumont 


Future planned releases for this GitHub repo include easy-to-deploy healthcare solutions such as:

  • Machine Learning and Predictive Analytics within Fabric
  • Comparing Direct Lake, SQL Serverless Endpoint, and Import models for Power BI
  • Comparing query performance for flattened data models versus star schemas and composite models
  • OpenAI integration with Fabric
  • Pass along your ideas and suggestions!

Here are a few of the instructional videos from the Git repo tutorial:


Import the files manually into Fabric OneLake


Use a Fabric Spark Notebook to create a table in the Lakehouse in delta parquet format


Create a Fabric Power BI dataset in Direct Lake mode to query 220M+ rows of data without caching

