AI-powered Data Classification | Microsoft Purview


This post has been republished via RSS; it originally appeared at Microsoft Tech Community - Latest Blogs.

Identify, classify, and protect information at scale using Microsoft Purview’s AI-powered classifiers. Gain visibility into the data inside your organization and apply the right protections, which is especially important if you intend to use generative AI to create content based on the information accessible on your network.




Microsoft Purview’s AI-powered classifiers are extensively pre-trained and tested on vast categories of business data and Microsoft domain-specific knowledge, as well as synthetic data and sample files generated from large language models to detect sensitive content. Because these classifiers run on Microsoft’s AI supercomputer, with its specialized hardware and software stack, your sensitive information can be classified and protected at unparalleled speed and scale.


Tony Themelis, Microsoft Purview’s Principal Product Manager, shares how to keep data inside your organization protected and continuously compliant as you work.


Locate, classify, and protect information at scale.


Use AI to prepare for AI. Get started with Microsoft Purview’s AI-powered automatic data classification.


Build an automated policy to protect data.


See the admin experience for Microsoft Purview’s AI-powered classifiers.


See a 360-degree view in the content explorer.


Automate data classification and protection using AI-powered prebuilt classifiers. Check it out.





00:00 — How to keep data in your organization protected

01:12 — Policy tips & sensitivity labels

03:50 — How to build an automated policy

06:14 — 360-degree view of discovered sensitive information

07:15 — Create a custom classifier

08:58 — Fingerprint-based sensitive information type

09:36 — Wrap up



Video Transcript:

-Visibility into the data inside of your organization and applying the right protections is now more important than ever. In fact, having this level of discipline over your data is especially important if you intend to use generative AI to create content based on the information accessible on your network. If you’re worried about the manual effort, today I’m going to show you how instead you can automatically discover, classify, and protect your information at scale using Microsoft Purview’s AI-powered classifiers, which are extensively pre-trained and tested on vast categories of business data, and Microsoft domain-specific knowledge, as well as synthetic data and sample files generated from large language models to detect sensitive content. And because these classifiers run on Microsoft’s AI supercomputer with its specialized hardware and software stack, your sensitive information can be automatically discovered, classified, and protected at unparalleled speed and scale. In fact, Microsoft Purview can keep the data inside your organization protected and continuously compliant as you work.


-So here for example, we have a user developing internal guidance to create websites. They’ve collected content from various sources into OneNote that they want to include in the documentation. This includes sample code with secrets including an access key to an AWS S3 client and other sensitive information. They move it into Word in order to build more formatted guidance for their team. And as content is copied into Word, notice the policy tip at the top of the document that has been triggered. This alerts them of sensitive information at risk included in the document and recommends a sensitivity label. By clicking on show sensitive content, we can see why. The policy was triggered because the entire document has been classified as containing source code and it also found a credential secret, and this combination is super high risk for the organization. We’re also able to see the exact location of the credentials in the document.
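To make the idea of locating a credential inside a document concrete, here is a minimal sketch. It is not how Microsoft Purview's classifiers work internally (those are described above as AI models); it only illustrates reporting the exact position of a secret. It assumes the widely seen AWS access key ID shape of "AKIA" followed by 16 uppercase letters or digits, and the names `Finding` and `find_access_key_ids` are hypothetical.

```python
import re
from typing import NamedTuple

# Assumption: access key IDs follow the common "AKIA" + 16 uppercase
# alphanumerics shape. This is an illustrative pattern, not Purview's logic.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


class Finding(NamedTuple):
    line: int      # 1-based line number
    column: int    # 1-based column where the match starts
    match: str     # the matched text


def find_access_key_ids(text: str) -> list[Finding]:
    """Return the position of every string that looks like an access key ID."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for m in AWS_KEY_ID.finditer(line):
            findings.append(Finding(line_no, m.start() + 1, m.group()))
    return findings


# Sample text only; nothing here is imported or executed as AWS code.
sample = (
    "import boto3\n"
    's3 = boto3.client("s3", aws_access_key_id="AKIAIOSFODNN7EXAMPLE")\n'
)

for finding in find_access_key_ids(sample):
    print(f"line {finding.line}, column {finding.column}: {finding.match}")
```

A pattern like this only catches one well-known key format, which is part of why the transcript stresses classifiers that recognize source code and credentials more broadly.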


-Next, the user can decide to apply the recommended sensitivity label and save the file. The Highly Confidential label is applied, as you can see in the top bar, and this classification is also visible in Microsoft 365 locations. I’m in my SharePoint site and I can see that the web design file the user saved is flagged here for having sensitive information. I can click in to see the details on the sensitive information contained in the document, and you’ll see that access has automatically been blocked for everyone except the file owner, the person who last modified the document, and the owner of the SharePoint site. Additionally, if you’re wondering about the source content in the OneNote file, you’ll see that this was discovered and automatically classified too.


-Now, what I just showed you was just-in-time intervention. This goes beyond what’s possible with manual efforts and helps surface often unaccounted for and sensitive data inside of your organization so that the right guardrails from both a security and compliance perspective can be applied. And by the way, even if the user hadn’t acted on the recommendation to apply the suggested label in the first place, as an admin, you can ensure that the policy engine will automatically apply labels to certain sensitive content even if the user chooses not to. And these protections even extend beyond files to, for example, Teams chat and conversations so that if sensitive information is detected it can be blocked in real time. And of course, we’ve had this capability for a long time in Outlook email as well.


-In fact, this is a good time to now switch gears and look at the admin experience. The good news is that we’ve done much of the heavy lifting for you. Even without you configuring anything, a one-time scan is required to get everything working, and Microsoft Purview’s broad set of pre-trained, ready-to-use AI-enabled classifiers provides the depth to discover, classify, and protect data across common business functions, with more than 90 categories of sensitive content.


-Now, as I mentioned, all of this happens at scale and to save you time as you configure your labeling and protection policies, many of these controls are preset for you. Let me show you how you can build an automated policy using Microsoft Purview’s classifiers and sensitive information types to get started quickly. In Microsoft Purview’s Data Loss Prevention overview page you’ll find an insights tile. By viewing the detected documents, you’ll see a list of the sensitive information types automatically discovered in SharePoint and OneDrive. From here, I can quickly get started, and by scrolling through this insights card you can clearly see how many files align to categories for medical and healthcare information, IP and Trade Secrets, customer account data and more. I’ll choose to set up a policy for IP and Trade Secrets, the same one that was triggered in our user example.


-And if we take a look at what’s created, you’ll see two new policies, one for device endpoints and one for Microsoft 365 workloads. I’ll click into IP and Trade Secrets M365 Policy. This covers Exchange, SharePoint, OneDrive and Teams chat and channel messages. If I dig into the details under advanced DLP rules it starts with a preset condition for people outside of my organization, and here are the types of content the rule is looking for. Here’s the source code classifier we saw before. I can choose to add or remove items from the list but I’ll keep it as is for now. Scrolling down, you’ll see the action in this case is to restrict access to block people outside of my org, which I can also change if I want. And I can also customize who receives notifications and those messages. I’ll change this message and save it.
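The rule described above has three parts: a condition (the recipient is outside the organization), a list of content types to look for, and an action (restrict access). The sketch below models that shape in a few lines. It is a hypothetical illustration only: `DlpRule`, its fields, the action string, and the example domains are all assumptions and do not reflect Purview's actual policy schema or API.

```python
from dataclasses import dataclass


@dataclass
class DlpRule:
    """Hypothetical model of a rule: condition + content types + action."""
    name: str
    content_types: set[str]      # classifiers / info types the rule looks for
    internal_domain: str         # used to decide who counts as "outside my org"
    action: str = "block_external_access"

    def evaluate(self, detected_types: set[str], recipient: str) -> str:
        # Condition: the recipient is outside the organization.
        is_external = not recipient.lower().endswith("@" + self.internal_domain)
        # Content check: at least one detected type is on the rule's list.
        has_sensitive_content = bool(detected_types & self.content_types)
        if is_external and has_sensitive_content:
            return self.action
        return "allow"


rule = DlpRule(
    name="IP and Trade Secrets M365 Policy",
    content_types={"Source code", "Credentials"},
    internal_domain="contoso.com",
)

print(rule.evaluate({"Source code"}, "partner@fabrikam.com"))
print(rule.evaluate({"Source code"}, "alex@contoso.com"))
```

Adding or removing items from `content_types` corresponds to editing the list of content the rule looks for, and changing `action` corresponds to choosing a different response.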


-Now, let’s fast forward in time to our fully configured environment where we just flagged the source code and credentials. As an admin, we can audit everything that gets discovered from the Purview Compliance Portal. The content explorer, as part of data classification, is where you can find a 360-degree view of discovered sensitive information and where Trainable Classifiers have detected matching files, whether that’s in Exchange, OneDrive, SharePoint, or Teams. I’ll choose source code to find out what our AI model found with that classification. From here, I’ll head over to my SharePoint location and my site. We can see it flagged the user’s web design document from before, and I can click in to see a preview of the file right from here, which makes it easy for reviewers in the content explorer to see the source code and what triggered the match. In fact, I can drill into the doc to take a closer look and from here I’ll confirm the match to validate the AI classifier’s findings. So that’s how easy it is to automate classification and protection using our AI-powered prebuilt classifiers.


-That said, you might be wondering what you can do if you have organization or market vertical-specific documents that you need to protect. Now, in those cases you can always train the machine learning algorithms to your needs by creating your own custom classifier. To create one, all you need to do is give it a name. It’s also a good idea to provide a description. Then you’ll add one or more sites at SharePoint locations containing the files that reflect the classification type we’re training for. I’m going to choose a top level communication site. Then within the site, you can choose one or more folders. In my case, I’ll choose this one with documents related to customer operations. Here you’re giving the custom classifier seed content and you’ll typically want the folders to contain around 50 documents. And once I click create trainable classifier it will take up to 24 hours to build out the prediction model.


-Now, just to speed things up, I’ll explain what happens next with a model that I created previously. Once the custom classifier is ready and published, you’ll find it on this list. I’ll pick the third one down called Transaction Account Statements that has recently been added. Notice it has an accuracy of 94%, and you can see in the type column that this is a custom classifier. From here, you can tune the model to increase its accuracy by confirming whether it is indeed a match or conversely not a match. Once you’ve done this a few times you can track the percentage accuracy until it hits your target level. For example, here you can see we’ve now reached 99% accuracy up from the original 94%.
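One simple way to think about the accuracy figure is as the share of reviewed items that a reviewer confirmed as a match. The transcript does not say how Purview computes or updates the number, so the sketch below is an assumption for illustration, and `accuracy_percent` is a hypothetical helper.

```python
def accuracy_percent(feedback: list[bool]) -> float:
    """Percentage of reviewed predictions that were confirmed as correct.

    Each entry is True for "match" (the classifier was right) and
    False for "not a match" (the classifier was wrong).
    """
    if not feedback:
        raise ValueError("at least one reviewed item is required")
    return 100.0 * sum(feedback) / len(feedback)


# Illustrative review rounds: 94 of 100 confirmed, then 99 of 100.
first_round = [True] * 94 + [False] * 6
later_round = [True] * 99 + [False] * 1

print(accuracy_percent(first_round))
print(accuracy_percent(later_round))
```

This only measures agreement with reviewers; in the product, the confirmations also feed back into tuning the model itself.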


-That said, you have another option to protect custom file types. If you have important sensitive file types and templates, including standard forms and design docs, or even specific individual files that you want to explicitly protect as people share them with others, you can use a fingerprint-based sensitive information type simply by uploading a file. As you create the fingerprint, the configurable confidence levels correspond to future partial or full matches as those files are hashed and compared to this fingerprint. And once created, I can use this fingerprint as a trigger for new or existing compliance policies.
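The idea of hashing a file and mapping partial or full matches to confidence levels can be sketched as follows. This is not Purview's fingerprinting algorithm, which the transcript does not describe. It is a generic illustration that hashes overlapping three-word sequences and measures how much of the template reappears in a candidate file. The function names and the threshold values are assumptions.

```python
import hashlib


def shingle_hashes(text: str, size: int = 3) -> set[str]:
    """Hash every run of `size` consecutive words in the text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < size:
        chunks = [" ".join(words)]
    else:
        chunks = [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]
    return {hashlib.sha256(chunk.encode("utf-8")).hexdigest() for chunk in chunks}


def match_ratio(fingerprint: set[str], candidate: set[str]) -> float:
    """Fraction of the fingerprint's hashes that also appear in the candidate."""
    if not fingerprint:
        return 0.0
    return len(fingerprint & candidate) / len(fingerprint)


def confidence(ratio: float, high: float = 0.8, low: float = 0.3) -> str:
    """Map a match ratio to a level; the thresholds are illustrative assumptions."""
    if ratio >= high:
        return "high"
    if ratio >= low:
        return "low"
    return "none"


template = "patent disclosure form inventor name invention title description of invention"
filled_in = template + " Jane Doe widget"
unrelated = "lunch menu for friday pizza salad"

fingerprint = shingle_hashes(template)
for label, doc in [("filled-in form", filled_in), ("unrelated file", unrelated)]:
    ratio = match_ratio(fingerprint, shingle_hashes(doc))
    print(label, confidence(ratio))
```

Lowering the `low` threshold would catch files that reuse only a small part of the template, at the cost of more false matches, which is the trade-off configurable confidence levels are meant to expose.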


-So that was an overview of how AI-powered classifiers in Microsoft Purview work to automatically identify and classify the massive volumes of diverse data inside your organization at both depth and scale. This can help keep your data protected and continuously compliant. For more information on Microsoft Purview and intelligent and automated data security, check out our full series. Subscribe to Mechanics if you haven’t already, and thanks for watching.

