Extracting data from unstructured forms using Azure AI Document Intelligence.

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.

 


In this blog post we take a scenario in which our business offers a business-to-business (B2B) product that helps other businesses extract data from unstructured sources such as PDFs, emails, and websites.

In this scenario, relevant information is currently extracted by hand, which is time-consuming and error-prone.

Let's see how we can use Azure AI Document Intelligence to build a simple pipeline that ingests documents, processes them, and returns structured data.

 

What you need to get started. 

  1. Azure account with a subscription: You can create one from the Azure portal. If you want to know what an Azure subscription is, see the Azure subscription documentation.
  2. Azure Blob Storage: A storage account to hold the documents you want to extract data from. Learn more in the Azure Blob Storage docs.

 

What is Azure AI Document Intelligence?

Azure AI Document Intelligence is a cloud-based document processing system that uses AI (artificial intelligence) and OCR (optical character recognition) to quickly extract text and structure from documents. 

With this service, you can efficiently turn documents into usable data, allowing you to focus on acting on information rather than spending time compiling it. 

 

Illustration of data extraction. 

The image shows Azure AI Document Intelligence taking in the unstructured documents and using Cognitive Services and Azure OpenAI services to extract the data. It then responds to the server, which serves the response to the client.

 


 

 

Choosing the appropriate model. 

Available models include: 

  1. Prebuilt models
  2. Custom models

Prebuilt models process documents without any training on your part, so you can extract relevant information from documents right away.

Custom models must be trained on your own documents before they can extract the data that is specific to them. Training lets the system learn how your documents are structured.

Learn more about the available models in the official documentation.

What should you consider when choosing the model? 

  1. Model type – Whether to use a prebuilt model or a custom model.
  2. Document type – Different models are optimized for specific document types, e.g., invoices, forms, and receipts.
  3. Accuracy – Evaluate whether the model's accuracy meets your threshold.
  4. Security and compliance – Protect sensitive information during extraction and ensure that the model complies with privacy regulations.
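
The first two considerations can be sketched as a small lookup from document type to model ID. This is only an illustration: `choose_model` and `PREBUILT_MODELS` are hypothetical names, the model IDs should be checked against the current model list for your API version, and falling back to `prebuilt-read` for unknown types is an assumption that suits text-heavy documents.

```python
from typing import Optional

# Prebuilt model IDs for a few common document types (verify against the
# model list for the API version you use).
PREBUILT_MODELS = {
    "invoice": "prebuilt-invoice",
    "receipt": "prebuilt-receipt",
    "text": "prebuilt-read",
}


def choose_model(document_type: str, custom_model_id: Optional[str] = None) -> str:
    """Pick a model ID: a trained custom model wins, then a matching prebuilt model."""
    if custom_model_id:
        # A custom model trained on your own documents takes priority.
        return custom_model_id
    # Assumption: unknown document types fall back to the general Read model.
    return PREBUILT_MODELS.get(document_type.lower(), "prebuilt-read")
```

For example, `choose_model("Invoice")` picks the invoice model, while a business letter falls through to the Read model used in the rest of this post.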

 

 

A tour of Document Intelligence Studio.

Note: Azure Form Recognizer is now called Azure AI Document Intelligence.

Document Intelligence Studio is an online tool provided by Microsoft Azure that allows you to visually explore, understand, train models, and integrate features from the Document Intelligence service into your applications. 

We shall be using the OCR Read model because it is a prebuilt model that extracts text from large, text-heavy documents such as PDFs, scanned images, and HTML documents.

The Read model can be used in several ways: through Document Intelligence Studio, the REST API, or the provided SDKs.

 

Step 1: Create a Document Intelligence resource

Use the following link to create a resource: Create Document Intelligence Resource. Alternatively, type "document intelligence" into the portal search box and search.

 

Choose Document Intelligence from the search results, then click Create.

 

 

 

Step 2: Fill in the basic details: 

  • Create a resource group with a name of your choice 
  • Select a region near you 
  • Give a unique name to your resource 
  • Choose the pricing tier which suits your needs 
  • Review and create 

 

Leave the networking options at their defaults, then click Create. After the deployment is complete, click Go to resource.

 

Step 3: Copy your API key and your resource endpoint. 

From the side navigation, expand Resource Management and choose Keys and Endpoint.
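
If you later call the service from code, it is safer to keep the copied key and endpoint out of source files. A minimal sketch, assuming you export them as environment variables; the names `DOCUMENTINTELLIGENCE_ENDPOINT` and `DOCUMENTINTELLIGENCE_API_KEY` and the helper `load_credentials` are choices made for this example, not something the portal sets for you.

```python
import os
from typing import Mapping, Optional, Tuple


def load_credentials(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Read the resource endpoint and API key from environment variables."""
    env = os.environ if env is None else env
    # Assumed variable names; use whatever names fit your deployment.
    endpoint = env.get("DOCUMENTINTELLIGENCE_ENDPOINT", "").rstrip("/")
    key = env.get("DOCUMENTINTELLIGENCE_API_KEY", "")
    if not endpoint or not key:
        raise RuntimeError("Set the endpoint and key environment variables first")
    return endpoint, key
```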

 


 

 

 

Step 4: Try different models in Document Intelligence Studio

As mentioned, we shall use the prebuilt OCR Read model.

Navigate to Document Intelligence Studio to start trying out the model, using the following link: documentintelligence. To use the Studio, paste in the key and endpoint of your resource.

Now it is time to upload the files received in the B2B scenario so that we can extract data from them. I have stored an example PDF business letter in Azure Blob Storage. Copy the link to the file: https://storebusinessletters.blob.core.windows.net/businessletters/business%20letters.pdf

Click Fetch from URL and paste the link.
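
Fetch from URL in the Studio has a rough equivalent in the REST API: a POST to the model's analyze operation with the document link in the body. The sketch below only builds that request with the standard library and does not send it. It assumes the `2023-07-31` API version and its `/formrecognizer/` path, so check the REST reference for the version your resource supports; `build_analyze_request` is a hypothetical helper name.

```python
import json
import urllib.request

API_VERSION = "2023-07-31"  # assumption: adjust to the version your resource supports


def build_analyze_request(endpoint: str, key: str, document_url: str,
                          model_id: str = "prebuilt-read") -> urllib.request.Request:
    """Build (but do not send) the POST that asks the service to analyze a document by URL."""
    url = (f"{endpoint.rstrip('/')}/formrecognizer/documentModels/"
           f"{model_id}:analyze?api-version={API_VERSION}")
    body = json.dumps({"urlSource": document_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # the key copied in Step 3
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would submit the job. The service accepts it asynchronously and returns an `Operation-Location` header that points at the result.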

 

 


 

 

 

The Configure options dialog lets you customize what appears in the results, such as the range of pages to analyze. I have chosen to extract the first 5 pages.
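
When calling the REST API instead of the Studio, the page range is passed as a `pages` query parameter (for example `1-5`). A small sketch of building that query string; `analyze_query` is a hypothetical helper, the API version is an assumption, and the validation pattern is a simplification made for this example.

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Accepts single pages and ranges separated by commas, e.g. "1-5" or "1,3,5-7".
_PAGES_PATTERN = re.compile(r"\d+(-\d+)?(,\d+(-\d+)?)*")


def analyze_query(pages: Optional[str] = None, api_version: str = "2023-07-31") -> str:
    """Return the query string for an analyze call, optionally limited to some pages."""
    params = {"api-version": api_version}
    if pages:
        compact = pages.replace(" ", "")
        if not _PAGES_PATTERN.fullmatch(compact):
            raise ValueError(f"Invalid page range: {pages!r}")
        params["pages"] = compact
    return urlencode(params)
```

The first 5 pages chosen above would be requested with `analyze_query("1-5")`.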

 

 


 

 

After you are done, click Save and then Run analysis on your document. The result appears in the side panel, where the processed data is available in JSON format.
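
The Studio waits for the analysis to finish on your behalf. Over the REST API you poll the operation yourself until its `status` becomes `succeeded` or `failed`. The sketch below shows one way to write that loop; `wait_for_result` is a hypothetical helper, and the `fetch` callable stands in for whatever HTTP GET you issue against the operation URL, so no network access is needed to try it.

```python
import time
from typing import Callable


def wait_for_result(fetch: Callable[[], dict], interval: float = 1.0,
                    max_attempts: int = 30,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch() until the analysis reports success, failure, or we give up."""
    for _ in range(max_attempts):
        payload = fetch()
        status = payload.get("status")
        if status == "succeeded":
            return payload
        if status == "failed":
            raise RuntimeError("Document analysis failed")
        # Any other status (e.g. "notStarted", "running"): wait, then retry.
        sleep(interval)
    raise TimeoutError("Document analysis did not finish in time")
```

The interval and attempt limit are arbitrary defaults chosen for the example; tune them to the size of your documents.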

 


 

 
We have achieved the goal of extracting data from unstructured documents: the content is now structured as JSON key-value pairs. The extracted data can then be used to make informed decisions.
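
To give a feel for working with that JSON, here is a minimal sketch that collects the recognized text lines per page. `SAMPLE_RESULT` is a heavily trimmed, made-up payload that only mimics the general shape of a Read result (an `analyzeResult` with `pages` and `lines`); it is not the output for the business letter above, and `lines_by_page` is a hypothetical helper.

```python
from typing import Dict, List

# Illustrative payload only; a real result contains many more fields.
SAMPLE_RESULT = {
    "status": "succeeded",
    "analyzeResult": {
        "modelId": "prebuilt-read",
        "pages": [
            {"pageNumber": 1,
             "lines": [{"content": "Dear Ms. Doe,"},
                       {"content": "Thank you for your order."}]},
            {"pageNumber": 2,
             "lines": [{"content": "Yours sincerely,"}]},
        ],
    },
}


def lines_by_page(result: dict) -> Dict[int, List[str]]:
    """Map each page number to the list of text lines found on that page."""
    pages = result.get("analyzeResult", {}).get("pages", [])
    return {
        page["pageNumber"]: [line["content"] for line in page.get("lines", [])]
        for page in pages
    }
```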

Advantages of using Azure AI Document Intelligence. 

  1. Automated data extraction: The service automatically identifies relevant data in unstructured documents, which removes the need for manual extraction.
  2. Scalability: Azure AI Document Intelligence is cloud-based and can handle large volumes of documents.
  3. Accuracy and confidence scores: The service returns confidence scores that tell you whether your threshold is met. A well-trained model reduces inaccurate data extraction.
  4. Language support: Whether your documents are in English, Spanish, Chinese, or one of the many other supported languages, Azure AI Document Intelligence can handle them.
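
The confidence scores in point 3 can be put to work by flagging anything below a threshold for human review. A minimal sketch, assuming the result carries per-word `confidence` values under each page's `words`; the 0.8 threshold, the sample data, and the helper name `words_below_threshold` are all made up for illustration.

```python
from typing import List, Tuple


def words_below_threshold(result: dict, threshold: float = 0.8) -> List[Tuple[int, str, float]]:
    """Return (page number, word, confidence) for every word under the threshold."""
    flagged = []
    for page in result.get("analyzeResult", {}).get("pages", []):
        for word in page.get("words", []):
            confidence = word.get("confidence", 0.0)  # treat a missing score as 0
            if confidence < threshold:
                flagged.append((page["pageNumber"], word["content"], confidence))
    return flagged


# Illustrative data only, not real service output.
sample = {"analyzeResult": {"pages": [{
    "pageNumber": 1,
    "words": [{"content": "Invoice", "confidence": 0.99},
              {"content": "T0tal", "confidence": 0.42}],
}]}}
```

Calling `words_below_threshold(sample)` flags only the second word, which a reviewer could then correct before the data is used downstream.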

Read more 

 

Code examples on SDKs 

 
