Extracting data from unstructured forms using Azure AI Document Intelligence.

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.

In this blog post, we take a scenario where our business offers a business-to-business (B2B) product that helps other businesses extract data from unstructured documents such as PDFs, emails, and websites.

In this scenario, relevant information is extracted manually, which is time-consuming and prone to error.

Let’s see how we can leverage Azure AI Document Intelligence to build a simple pipeline that ingests documents, processes them, and returns structured data.

What you need to get started.

Azure account with a subscription: Create one in the Azure portal. If you want to know what an Azure subscription is, see the Azure subscription documentation.
Azure Blob Storage: A storage account to hold the documents that need to be processed. Learn more in the Azure Blob Storage docs.

What is Azure AI Document Intelligence?

Azure AI Document Intelligence is a cloud-based document processing system that uses AI (artificial intelligence) and OCR (optical character recognition) to quickly extract text and structure from documents.

With this service, you can efficiently turn documents into usable data, allowing you to focus on acting on information rather than spending time compiling it.

Illustration of data extraction.

The image shows Azure AI Document Intelligence taking in the unstructured documents and using cognitive services and Azure OpenAI services to extract the data. It then responds to the server, which serves the response to the client.

Choosing the appropriate model.

Available models include:

Prebuilt Models
Custom Models

Prebuilt models process documents without any training on your part, so you can automatically extract relevant information from documents.

Custom models require training to extract distinct data from documents. This allows the system to learn the structure of your own documents.

Learn more about these models in the documentation on available models.

What should you consider when choosing the model?

Model type: Whether to use a prebuilt model or a custom model.
Document type: Different models are optimized for specific document types, e.g., invoices, forms, and receipts.
Accuracy: Evaluate whether the accuracy of the model meets your threshold.
Security and compliance: Protect sensitive information during extraction and ensure that the model complies with privacy regulations.
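As a rough illustration of the document-type consideration, the sketch below maps a document type to a prebuilt model ID. The IDs `prebuilt-invoice`, `prebuilt-receipt`, `prebuilt-layout`, and `prebuilt-read` are prebuilt model names in the service; the mapping itself, the `choose_model` helper, and the fallback choice are assumptions made for this example, not a recommendation from the service.

```python
# Hypothetical lookup table: which prebuilt model to try for a document type.
# The pairing of "form" with the layout model is an assumption for illustration.
PREBUILT_MODELS = {
    "invoice": "prebuilt-invoice",
    "receipt": "prebuilt-receipt",
    "form": "prebuilt-layout",
}

# Assumed fallback: the general-purpose Read (OCR) model used later in this post.
DEFAULT_MODEL = "prebuilt-read"


def choose_model(document_type: str) -> str:
    """Return a prebuilt model ID for the given document type (case-insensitive)."""
    return PREBUILT_MODELS.get(document_type.strip().lower(), DEFAULT_MODEL)


print(choose_model("Invoice"))          # matches the invoice entry
print(choose_model("business letter"))  # no entry, so the fallback is used
```

A custom model would simply be another entry whose value is the ID you gave the model when training it.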

A tour of Document Intelligence Studio.

Note: Azure Form Recognizer is now called Azure AI Document Intelligence.

Document Intelligence Studio is an online tool provided by Microsoft Azure that allows you to visually explore, understand, train models, and integrate features from the Document Intelligence service into your applications.

We shall be using the OCR Read model because it is a prebuilt model that extracts text from large, text-heavy documents such as PDFs, scanned images, and HTML documents.

The OCR Read model offers several development options: Document Intelligence Studio, the REST APIs, and the provided SDKs.
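For readers who prefer the REST option, here is a minimal sketch of what an analyze request to the Read model could look like, using only the Python standard library. It assumes the `2023-07-31` REST API version and its `formrecognizer` URL path; newer API versions use a different path, so check the version your resource supports. The endpoint, key, and document URL are placeholders, and `build_analyze_request` and `submit` are names made up for this example.

```python
import json
import urllib.request

API_VERSION = "2023-07-31"  # assumption: adjust to the API version of your resource


def build_analyze_request(endpoint: str, key: str, document_url: str,
                          model_id: str = "prebuilt-read") -> urllib.request.Request:
    """Build (but do not send) a POST request that asks the model to analyze a URL."""
    url = (f"{endpoint.rstrip('/')}/formrecognizer/documentModels/"
           f"{model_id}:analyze?api-version={API_VERSION}")
    body = json.dumps({"urlSource": document_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # the API key of the resource
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def submit(request: urllib.request.Request) -> str:
    """Send the request and return the Operation-Location URL to poll for the result."""
    with urllib.request.urlopen(request) as response:
        return response.headers["Operation-Location"]


# Building the request does not touch the network; all values are placeholders.
req = build_analyze_request(
    "https://example-resource.cognitiveservices.azure.com",
    "placeholder-key",
    "https://example.com/sample.pdf",
)
print(req.get_method(), req.full_url)
```

Calling `submit(req)` with a real endpoint and key starts the analysis. The service answers asynchronously, so the returned URL has to be polled until the result is ready.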

Step 1: Create a Document Intelligence resource

Use the following link to create a resource: Create Document Intelligence Resource. Alternatively, type "document intelligence" into the search box and search.

Choose Document Intelligence from the search results:

Click Create:

Step 2: Fill in the basic details:

Create a resource group with a name of your choice
Select a region near you
Give a unique name to your resource
Choose the pricing tier which suits your needs
Review and create

Leave the networking options on their defaults, then click Create. After the deployment is complete, click Go to resource:

Step 3: Copy your API key and your resource endpoint.

From the side navigation, expand Resource Management and choose Keys and Endpoint.
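If you later call the service from code, it is safer to keep the key and endpoint out of your source files. The sketch below reads them from environment variables. The variable names `DOCUMENT_INTELLIGENCE_ENDPOINT` and `DOCUMENT_INTELLIGENCE_KEY` and the `load_credentials` helper are assumptions for this example; use whatever names fit your setup.

```python
import os


def load_credentials(env=os.environ):
    """Read the endpoint and key from a mapping (the process environment by default).

    The variable names are hypothetical; they are not mandated by the service.
    """
    endpoint = env.get("DOCUMENT_INTELLIGENCE_ENDPOINT", "").strip()
    key = env.get("DOCUMENT_INTELLIGENCE_KEY", "").strip()
    if not endpoint or not key:
        raise RuntimeError(
            "Set DOCUMENT_INTELLIGENCE_ENDPOINT and DOCUMENT_INTELLIGENCE_KEY first."
        )
    # Drop a trailing slash so URLs can be built by simple concatenation.
    return endpoint.rstrip("/"), key


# Demo with an in-memory mapping so no real secret is needed.
demo_env = {
    "DOCUMENT_INTELLIGENCE_ENDPOINT": "https://example-resource.cognitiveservices.azure.com/",
    "DOCUMENT_INTELLIGENCE_KEY": "placeholder-key",
}
endpoint, key = load_credentials(demo_env)
print(endpoint)
```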

Step 4: Try different models in Document Intelligence Studio

As mentioned, we shall use the prebuilt OCR Read model.

Navigate to Document Intelligence Studio to start trying out the model. Use the following link to navigate: documentintelligence. To use Document Intelligence Studio, paste in the key and endpoint of your resource.

Now it is time to upload the files received in the B2B scenario so that we can extract data from them. I have stored an example PDF business letter in Azure Blob Storage. Here is a link to the file; copy it: https://storebusinessletters.blob.core.windows.net/businessletters/business%20letters.pdf
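Notice that the space in the file name appears as `%20` in the link. If you build such links in code, the blob name has to be percent-encoded. Here is a small sketch of that; `blob_url` is a hypothetical helper, and it assumes the standard `blob.core.windows.net` host and a blob that the service is allowed to read (for example, through public access or a SAS token appended to the URL).

```python
from urllib.parse import quote


def blob_url(account: str, container: str, blob_name: str) -> str:
    """Build the HTTPS URL of a blob, percent-encoding the blob name.

    quote() turns spaces into %20 and leaves "/" alone, so virtual folders
    inside the blob name are preserved.
    """
    return f"https://{account}.blob.core.windows.net/{container}/{quote(blob_name)}"


# Rebuild the sample link from its parts.
print(blob_url("storebusinessletters", "businessletters", "business letters.pdf"))
# → https://storebusinessletters.blob.core.windows.net/businessletters/business%20letters.pdf
```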

Click on Fetch from URL and paste the given link.

Configure options lets you customize what you need in the results, such as the range of pages to analyze. I have chosen to extract the first 5 pages.

After you are done, click Save and then Run analysis on your document. You should get the result in the side panel, where the processed data is available in JSON format.

We have achieved the goal of extracting data from unstructured documents: the data is now structured as JSON key-value pairs. It can then be used to make meaningful decisions.
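Behind the Run analysis button, the service works asynchronously: a request is submitted, and the result is fetched once its status becomes `succeeded`. If you use the REST API yourself, you poll the operation URL returned by the analyze call. The sketch below shows one way to do that. `fetch_status` and `wait_for_result` are hypothetical helpers, the interval and attempt limit are arbitrary, and the demo uses a fake fetcher so that it runs without a real resource.

```python
import json
import time
import urllib.request


def fetch_status(operation_url: str, key: str) -> dict:
    """GET the operation URL and return the decoded JSON payload."""
    request = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_result(operation_url, key, fetch=fetch_status,
                    interval=2.0, max_attempts=30):
    """Poll until the analysis succeeds, fails, or the attempts run out."""
    for _ in range(max_attempts):
        payload = fetch(operation_url, key)
        status = payload.get("status")
        if status == "succeeded":
            return payload["analyzeResult"]
        if status == "failed":
            raise RuntimeError(f"Analysis failed: {payload.get('error')}")
        time.sleep(interval)  # still "notStarted" or "running"
    raise TimeoutError("Analysis did not finish in time")


# Offline demo: a fake fetcher that reports "running" once, then "succeeded".
fake_responses = iter([
    {"status": "running"},
    {"status": "succeeded", "analyzeResult": {"content": "Dear customer"}},
])
result = wait_for_result(
    "https://example.invalid/operation", "placeholder-key",
    fetch=lambda url, key: next(fake_responses), interval=0,
)
print(result["content"])
```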
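To give a feel for working with that JSON, here is a small sketch that groups the recognized text lines by page. It assumes the result follows the Read model's shape, with a `pages` list whose items carry a `pageNumber` and a list of `lines`, each with a `content` string. The sample data is made up for this example, and `lines_by_page` is a hypothetical helper.

```python
import json


def lines_by_page(analyze_result: dict) -> dict:
    """Map each page number to the list of text lines found on that page."""
    return {
        page["pageNumber"]: [line["content"] for line in page.get("lines", [])]
        for page in analyze_result.get("pages", [])
    }


# Made-up sample in the assumed shape of the JSON result.
sample_result = json.loads("""
{
  "content": "Dear customer,\\nThank you for your order.\\nKind regards",
  "pages": [
    {"pageNumber": 1, "lines": [{"content": "Dear customer,"},
                                {"content": "Thank you for your order."}]},
    {"pageNumber": 2, "lines": [{"content": "Kind regards"}]}
  ]
}
""")

structured = lines_by_page(sample_result)
print(structured[1])
```

From here the lines can be written to a database, searched for keywords, or passed to whatever downstream step makes the decisions.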

Advantages of using Azure AI Document Intelligence.

Automated data extraction: The service automatically identifies relevant data in unstructured documents.
Scalability: Azure AI Document Intelligence is cloud-based and can handle large volumes of documents.
Accuracy and confidence scores: The service returns confidence scores that let you check whether your accuracy threshold is met. With a well-trained model, inaccurate extractions are reduced.
Language support: Whether your documents are in English, Spanish, Chinese, or another supported language, Azure AI Document Intelligence can handle them.
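To make the point about confidence scores concrete, the sketch below flags words whose confidence falls under a threshold so that they can be reviewed by a person. It assumes the Read model's result shape, where each page has a `words` list and each word has `content` and `confidence`. The threshold of 0.8, the sample data, and the `low_confidence_words` helper are assumptions for this example.

```python
def low_confidence_words(analyze_result: dict, threshold: float = 0.8) -> list:
    """Return (page number, word, confidence) for every word below the threshold."""
    flagged = []
    for page in analyze_result.get("pages", []):
        for word in page.get("words", []):
            confidence = word.get("confidence", 0.0)  # treat a missing score as 0
            if confidence < threshold:
                flagged.append((page["pageNumber"], word["content"], confidence))
    return flagged


# Made-up sample in the assumed shape of the JSON result.
sample_result = {
    "pages": [
        {"pageNumber": 1, "words": [
            {"content": "Invoice", "confidence": 0.99},
            {"content": "T0tal", "confidence": 0.42},
        ]},
        {"pageNumber": 2, "words": [
            {"content": "Regards", "confidence": 0.95},
        ]},
    ]
}

for page_number, text, confidence in low_confidence_words(sample_result):
    print(f"page {page_number}: {text!r} ({confidence:.2f})")
```

Which threshold is acceptable depends on the business case, so treat the value here as a starting point to tune against your own documents.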