Control PII and Sensitive Data Risk for Self-Service BI using Power BI DataFlows and Azure Data Lake

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

[Image: Power BI DataFlows Integrates Directly with an Azure Data Infrastructure]

 

For companies operating in highly regulated industries such as Healthcare, the promise of self-service Business Intelligence often takes a back seat to regulatory concerns about sensitive data such as Personally Identifiable Information (PII). Healthcare companies require capabilities to control the flow of sensitive data for both enterprise and self-service Business Intelligence. This article will review strategies for controlling access to sensitive data while still empowering users to gain value from Microsoft Business Intelligence and Analytics tools.

 

This article is the second in a series exploring how Power BI paired with Azure data tools creates a flexible, scalable, and achievable healthcare analytics architecture:

  • #1 - Unleash Massive Healthcare Data Volumes to Analytics using Power BI Aggregations
  • #2 - Control PII and Sensitive Data Risk for Self-Service BI using Power BI DataFlows and Azure Data Lake (this article)
  • #3 - Control PII and Sensitive Data Risk for Self-Service BI using Power BI Data Protection (Coming Soon)
  • #4 - Power BI with Azure SQL Data Warehouse enables Advanced Row Level Security with Variable Access at both the Summary and Detail Levels (Coming Soon)

Terms often associated with sensitive data include PII, PHI (Protected Health Information), and PIFI (Personally Identifiable Financial Information). Data that could be used for unfair financial market trades, often referred to as “insider information,” is also a consideration when granting users access to data. I am not an expert on these laws and the specifics of the associated requirements, but the tools and techniques below will hopefully provide value as you consider a plan for managing sensitive data.

 

Protecting and de-identifying sensitive data goes beyond simply removing names, addresses, Social Security numbers, and similar fields. Minimizing sensitive data risk becomes more complex when using Business Intelligence and Analytics tools. Here are a few examples:

  • Healthcare
    • Data that seems impersonal can sometimes be used to deduce identities. “Re-Identification” is a hot topic in Healthcare usually referring to methods where outside information and/or data mining techniques can be used to identify individual patients. Data in Business Intelligence Models is organized, compressed, and cleaned up in a manner that is prime for Re-Identification.
    • “Re-sharing” reports can transfer visibility from an approved to an unapproved viewer.
    • Rare disorders and characteristics can be used to identify people (e.g., one person in the whole state has a rare genetic disorder, and the ICD-10 code for that disorder appears on a report).
  • Financial & More
    • When viewing Sales or Financial data, can you deduce the next Quarter’s earnings? Who should have access to what level of detail?
    • “Exporting Data from Reports” takes data out of pre-defined security constraints.
    • Banking and Financial Data – How do you report without exposing Customers who can be identified by their Unique Investment Profiles?

Here are a few examples of how sensitive PII might accidentally be shared with the wrong person:

 

[Image: Examples of how Accidental PII Sharing can Happen]

IMAGE A

  • A.1 - Example #1 – User A exports data to Excel, a flat file, a document, or another file format. User A could then share that file with User B, who wasn’t supposed to see that data.
  • A.2 - Example #2 – User A builds a report for a group of users who have permission to see the PII. User C re-shares that report with User B, not knowing that User B was not supposed to see that data.
  • A.3 - Example #3 – User A scrubs obvious PII from data that is part of an enterprise reporting Data Model. User B knows that only one patient lives in a specific zip code and sees data for that zip code, so User B can tell whose data it is. This is still a PII violation.
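Example #3 can be caught before data is published by counting how many rows share each combination of seemingly impersonal columns. The following is a minimal sketch of that idea (often described as a k-anonymity style check), not a complete de-identification method. The column names, sample rows, and the threshold `k` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def small_groups(rows, quasi_identifiers, k):
    """Return the value combinations shared by fewer than k rows.

    rows: list of dicts (one per record)
    quasi_identifiers: column names that look impersonal but can narrow
        down an individual when combined (hypothetical names here)
    k: minimum group size considered safe to show (an assumption to tune)
    """
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical rows that have already had names and SSNs removed.
rows = [
    {"zip": "55401", "icd10": "E11.9"},
    {"zip": "55401", "icd10": "E11.9"},
    {"zip": "55401", "icd10": "E11.9"},
    {"zip": "56711", "icd10": "E11.9"},  # the only patient in this zip code
]

risky = small_groups(rows, ["zip"], k=3)
print(risky)  # {('56711',): 1}
```

Any combination that comes back from `small_groups` is a candidate for suppression or for rolling up to a coarser level (for example, a region instead of a zip code) before the data reaches a shared Model.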

When sensitive data is shared with approved users, there are tools and techniques which can help minimize the risk of accidental sharing. New Power BI Data Protection capabilities, Administrative Settings, Power BI DataFlows with Azure Data Lake, and Row Level Security can help simplify security and access to sensitive data.

 

Power BI Data Protection
Microsoft Information Protection and Cloud App Security tools are now built into Power BI. These capabilities are game-changers for the Business Intelligence field and were recently announced at Ignite. New capabilities include security for data after it has been exported from Power BI, and device-based security. Example A.1 above is directly affected by these new features. I will cover Data Protection in the next article of this series.

 

Power BI Administrative Settings
There are several settings in the Power BI Administrative Portal that can impact access to PII. A few I’d recommend understanding in detail for an enterprise deployment requiring PII governance include:

  • Create Workspaces
  • Use Datasets Across Workspaces
  • Enable Microsoft Information Protection Security Labels
  • Set Sensitivity Labels for Power BI Content and Data
  • Share Content with External Users
  • Publish to Web (most customers turn this off)
  • Export Data
  • Export Reports as PowerPoint Presentations or PDF Documents
  • Print Dashboards and Reports
  • Allow External Guest Users to Edit and Manage Content in the Organization
  • Email Subscriptions
  • Use Analyze in Excel with On-Premises Datasets
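One way to keep track of these settings is to write down an agreed baseline and compare it with what is currently configured. The sketch below assumes the current values have been recorded by hand from the Administrative Portal; it does not call any Power BI API. The `BASELINE` values are hypothetical examples, not recommendations, and the setting names are simply the labels from the list above.

```python
# Hypothetical baseline: True = enabled for the organization, False = disabled.
# These values are illustrative only; each organization sets its own policy.
BASELINE = {
    "Publish to Web": False,
    "Share Content with External Users": False,
    "Export Data": False,
    "Print Dashboards and Reports": False,
}


def find_drift(baseline, current):
    """Return settings whose recorded value differs from the baseline.

    A setting missing from `current` is reported with actual=None so that
    unreviewed settings are not silently treated as compliant.
    """
    drift = {}
    for name, expected in baseline.items():
        actual = current.get(name)
        if actual != expected:
            drift[name] = {"expected": expected, "actual": actual}
    return drift


# Hypothetical values recorded manually from the Administrative Portal.
current = {
    "Publish to Web": False,
    "Share Content with External Users": True,
    "Export Data": False,
}

for name, detail in find_drift(BASELINE, current).items():
    print(f"{name}: expected {detail['expected']}, found {detail['actual']}")
```

In this sample, the check reports the external sharing setting as changed and the print setting as not yet reviewed.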

DataFlows and Azure Data Lake
What is DataFlows? DataFlows is a self-service ETL/ELT tool in Power BI that is easy to use with a low-code/no-code interface. It also allows Power BI Administrators to control data available to self-service Power BI Model architects. Here is a comparison of DataFlows and enterprise ETL/ELT tools:

 

[Image: Differences between Power BI DataFlows and Enterprise ETL/ELT tools]

 IMAGE B

 

So how do Power BI DataFlows and Azure integrate for Sensitive Data Access Control? In the slide below, notice that the functional components of Power BI exist within the same secure Azure tenant as Azure Data Lake. Other Azure tools are also available for enterprise-grade ETL/ELT, Data Science / ML projects, and scaling up very large databases using Azure Synapse Analytics. Note that at the time of writing, Power BI DataFlows and Azure Data Lake integration is still in Preview:

 

[Image: Power BI DataFlows Integrates Directly with an Azure Data Infrastructure]

IMAGE C

 

Let’s review the key features marked by the numbered green arrows above:

  • C.1 - Low-Code / No-Code ETL/ELT - Power BI Pro users can create low-code/no-code ETL/ELT packages using DataFlows. Pro users can also use existing DataFlows as tables of data to build Power BI Models. By default, DataFlows are stored in a hidden SaaS Azure Data Lake that is part of Power BI. If you choose to add your own Azure Data Lake, you can access the content that DataFlows have loaded into it. Once in Azure Data Lake, data can be used in Databricks, ETL/ELT tools, Azure databases, and third-party applications outside of Azure. As a result, DataFlows does not trap your data in Power BI, and you can use those tables of data anywhere.
  • C.2 - Azure ML Integration - DataFlows also has native integration with Azure ML. If your Data Science team publishes models to Azure ML, DataFlows users can use those ML models to score tables of data during scheduled refreshes. The integration is point-and-click with no need to write code. Azure Cognitive Services can also be integrated in a similar way.
  • C.3 - Open Platform for Third-Party Connectivity – Third-party tools can pull data from Azure Data Lake, and connect to Power BI Data Models in Premium just like they can with Analysis Services.
  • C.4 - Open Platform for Third-Party Reporting - With Power BI, end users can build reports with the data visualization tool of their choice on top of those Premium Power BI Data Models.
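Because DataFlow output in your own Azure Data Lake is readable by other tools (C.1), it can also be inspected for sensitive columns. The sketch below assumes the DataFlow output follows the Common Data Model folder layout, in which a `model.json` file lists each entity and its attributes. The sample JSON, the entity and attribute names, and the `SENSITIVE_HINTS` keywords are all hypothetical, and a name-based scan like this is only a first pass, not a reliable PII detector.

```python
import json

# Hypothetical, heavily trimmed model.json content. A real file carries
# more fields (data types, partitions, and so on) than are shown here.
SAMPLE_MODEL_JSON = """
{
  "name": "ClaimsDataFlow",
  "entities": [
    {"name": "Patients",
     "attributes": [{"name": "PatientName", "dataType": "string"},
                    {"name": "ZipCode", "dataType": "string"}]},
    {"name": "Visits",
     "attributes": [{"name": "VisitDate", "dataType": "dateTime"},
                    {"name": "Icd10Code", "dataType": "string"}]}
  ]
}
"""

# Hypothetical keywords that suggest a column may hold sensitive data.
SENSITIVE_HINTS = ("name", "ssn", "address", "zip")


def flag_sensitive_attributes(model_json_text, hints=SENSITIVE_HINTS):
    """Map each entity name to the attribute names that match a hint."""
    model = json.loads(model_json_text)
    flagged = {}
    for entity in model.get("entities", []):
        hits = [
            attribute["name"]
            for attribute in entity.get("attributes", [])
            if any(hint in attribute["name"].lower() for hint in hints)
        ]
        if hits:
            flagged[entity["name"]] = hits
    return flagged


print(flag_sensitive_attributes(SAMPLE_MODEL_JSON))
# {'Patients': ['PatientName', 'ZipCode']}
```

A scan like this could help decide which DataFlows belong in a restricted Workspace and which are safe to offer for self-service use.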

 

So how can DataFlows help control access to PII? Consider the scenario in the following diagram:

 

[Image: Manage PII with DataFlows and Azure Data Lake]

IMAGE D

  • D.1 DataFlow 1 – This DataFlow contains PII that is intended for a select audience. User A is allowed to view this data when it is in an App, but is not allowed to use it for building new self-service Models. User B is not allowed to see data from this DataFlow anywhere.
  • D.2 DataFlow 2 – DataFlow 2 contains data scrubbed of all PII that can be viewed by both User A and User B. Both User A and User B can use this DataFlow to build new self-service Power BI Models.
  • D.3 DataFlow 1 Workspace (circled in purple) - User A and User B are not part of the Workspace containing DataFlow 1, so they cannot access the DataFlow to build new Models.
  • D.4 Power BI Model Workspace (circled in purple) - DataFlow 1 is used to build a Power BI Model stored in a separate Workspace. User A has privileges to use the App from this Workspace, and therefore can view the data. User B does not have permission to view this Model, and is blocked from viewing reports that use the Model as a source.

In the example above, User A can view the App containing PII but cannot build their own report from DataFlow 1 or re-share the App. User B cannot view anything from DataFlow 1.
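The scenario above can be reduced to a small access model. The sketch below is a deliberately simplified illustration of the idea, not a representation of Power BI's actual permission system: it treats Workspace membership as the right to build on content and App audience as view-only access. The `Workspace` class, the "Model Architect" user, and the "DataFlow 2 Workspace" name are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    name: str
    members: set = field(default_factory=set)       # may build on the content
    app_audience: set = field(default_factory=set)  # may only view the App


def can_build_from(user, workspace):
    """Only Workspace members can use its content to build new Models."""
    return user in workspace.members


def can_view_app(user, workspace):
    """Members and the App audience can view what the Workspace publishes."""
    return user in workspace.members or user in workspace.app_audience


# D.3: neither User A nor User B is a member of the DataFlow 1 Workspace.
dataflow1_ws = Workspace("DataFlow 1 Workspace", members={"Model Architect"})
# D.4: User A is in the App audience of the Model Workspace; User B is not.
model_ws = Workspace(
    "Power BI Model Workspace",
    members={"Model Architect"},
    app_audience={"User A"},
)
# D.2: both users may build from the scrubbed DataFlow 2.
dataflow2_ws = Workspace(
    "DataFlow 2 Workspace",
    members={"Model Architect", "User A", "User B"},
)

for user in ("User A", "User B"):
    print(
        user,
        "| build from DataFlow 1:", can_build_from(user, dataflow1_ws),
        "| view PII App:", can_view_app(user, model_ws),
        "| build from DataFlow 2:", can_build_from(user, dataflow2_ws),
    )
```

Keeping the DataFlow and the Model in separate Workspaces is what lets the two questions, who can build and who can view, have different answers for the same data.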

 

A few considerations when using DataFlows with Azure Data Lake:

 

[Image: Power BI DataFlows with Azure Data Lake]

IMAGE E

 

  • In the image above, note that Administrators with access to the Azure Data Lake can see all of the data from Power BI DataFlows. If there is PII in the DataFlows, Data Lake Administrators will have global access to that data.
  • Keeping DataFlows and Power BI Models in separate Workspaces allows you to restrict the level of access to PII. Someone with access to a DataFlow can blend that data with other sources, and share it with unapproved users. DataFlows containing PII should only be available to trusted and trained Power BI Model architects. Once data is in a Model, settings and permission levels can then minimize the risk of data from that Model ending up in another Model that has different users given permission to view it.

The next two articles in this series will also focus on capabilities that enable secure use of PII in an enterprise and self-service Business Intelligence environment:

 

  • #3 - Control PII and Sensitive Data Risk for Self-Service BI using Power BI Data Protection (Coming Soon)
  • #4 - Power BI with Azure SQL Data Warehouse enables Advanced Row Level Security with Variable Access at both the Summary and Detail Levels (Coming Soon)
