Process large-scale PDFs or images to extract information from forms using Applied AI Form Recognizer

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

Optimized large-scale forms processing using Applied AI Services

 

Thanks to co-authors  @Michael McKechney  @lee Hansen

 

Use Case

  • There are millions of forms to process
  • The volume is 15 to 20 million pages or more
  • Each form has multiple pages, but only a few pages hold the data to extract
  • A form might have 3 to 15 pages or more
  • The data to pull might sit on only 2 or 3 of those pages
  • Split out those pages before calling Form Recognizer to reduce AI cost
  • Use a Python AI library to filter the pages needed for the AI services
  • The process is split into 2 sections
    1. Select the pages needed for the AI services
    2. Send those pages to Form Recognizer
  • The idea here is to show how to preprocess PDFs or images so that AI Cognitive Services only process the pages that hold the needed information.
  • Both of the steps below can be scaled as needed based on requirements
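
To make the cost argument concrete, here is a minimal Python sketch of the arithmetic, using the page counts from the use case. The function name `estimate_savings` and the price per 1,000 pages are placeholders for illustration only; the price is not a real Azure rate.

```python
def estimate_savings(total_pages, pages_kept, price_per_1000_pages):
    """Return (pages_skipped, fraction_skipped, cost_avoided)."""
    if not 0 <= pages_kept <= total_pages:
        raise ValueError("pages_kept must be between 0 and total_pages")
    pages_skipped = total_pages - pages_kept
    fraction_skipped = pages_skipped / total_pages if total_pages else 0.0
    cost_avoided = pages_skipped / 1000 * price_per_1000_pages
    return pages_skipped, fraction_skipped, cost_avoided


# Figures from the use case: 15 million pages in, about 3 million sent on.
# The price below is a hypothetical placeholder, not an actual Azure price.
skipped, fraction, avoided = estimate_savings(
    15_000_000, 3_000_000, price_per_1000_pages=10.0
)
print(f"{skipped:,} pages skipped ({fraction:.0%} of the total)")
```

With these inputs, four out of five pages never reach the AI service, which is where the saving comes from.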

 

Architecture

 

Architecture - End to End processing

 

 

 

Two-part processing

 

Azure Python Function

  • A Python function processes each PDF and picks only the pages that need AI processing
  • Instead of 15 million pages, the volume can be reduced to 2 or 3 million pages
  • Existing open-source packages such as pytesseract are used to pull only the pages needed
  • PDF processing is scaled using Azure Functions
  • https://github.com/balakreshnan/PythonAIFunction
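
The page-selection step above might look roughly like the sketch below. It is not the code from the linked repository. It assumes the text of each page has already been extracted (for example with `pytesseract.image_to_string` on a rendered page image) and that the pages of interest can be recognized by keywords. The function name `select_pages`, the keywords, and the sample page text are all hypothetical.

```python
def select_pages(page_texts, keywords):
    """Return zero-based indexes of pages whose text mentions any keyword."""
    wanted = [keyword.lower() for keyword in keywords]
    selected = []
    for index, text in enumerate(page_texts):
        lowered = text.lower()
        if any(keyword in lowered for keyword in wanted):
            selected.append(index)
    return selected


# Hypothetical OCR output for a four-page form. In practice each string
# would come from an OCR call such as pytesseract.image_to_string(image).
pages = [
    "Cover sheet",
    "Section A: Applicant Name and Policy Number",
    "Terms and conditions",
    "Section B: Claim Amount",
]
keep = select_pages(pages, ["policy number", "claim amount"])
print(keep)  # indexes of the pages worth sending to Form Recognizer
```

Only the pages listed in `keep` would then be written to a smaller PDF or image set and passed on to the second stage.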

 

Azure C# function to call Form Recognizer

  • Functions take the reduced set of pages and send them to Form Recognizer
  • Form Recognizer output is processed and saved to SQL for further reporting
  • Azure analytics is used for further data processing
  • Functions scale as needed to process forms
  • The reduced set results in 2 to 3 million requests to the AI services rather than 15 million
  • https://github.com/balakreshnan/HighThroughputFormRecognizer
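
The linked function is written in C#, but the "save output to SQL" step can be illustrated in a few lines of Python. This is a sketch under stated assumptions, not the repository's implementation: `sqlite3` stands in for the real SQL database, and the table name, column names, and the shape of the `fields` dictionary (field name mapped to a value and a confidence score) are hypothetical.

```python
import sqlite3


def save_fields(conn, document_id, page_number, fields):
    """Insert one row per extracted field and return the number of rows.

    `fields` maps a field name to a (value, confidence) pair. This shape is
    an assumption for illustration, not the Form Recognizer response format.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS extracted_fields (
               document_id TEXT,
               page_number INTEGER,
               field_name  TEXT,
               field_value TEXT,
               confidence  REAL
           )"""
    )
    rows = [
        (document_id, page_number, name, value, confidence)
        for name, (value, confidence) in fields.items()
    ]
    conn.executemany("INSERT INTO extracted_fields VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# In-memory database as a stand-in for the reporting database.
conn = sqlite3.connect(":memory:")
inserted = save_fields(
    conn,
    "form-0001",
    2,
    {"PolicyNumber": ("ABC-123", 0.98), "ClaimAmount": ("1500.00", 0.91)},
)
count = conn.execute("SELECT COUNT(*) FROM extracted_fields").fetchone()[0]
print(inserted, count)
```

Storing one row per field keeps the table schema stable across form types, which makes downstream reporting queries simpler.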

 

The process above shows how to handle large-scale PDFs and images for various use cases while controlling Azure Applied AI cost. The same process can be used for event-driven and batch processing.
