This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.
Building a Document Intelligence Custom Classification Model with the Python SDK
Introduction:
In the world of document processing and automation, one of the most frequent use cases is categorizing and organizing documents into predefined classes. For instance, an organization may have a process that ingests documents that then need to be classified into separate categories such as “invoices”, “contracts”, “reports”, etc. Azure AI Document Intelligence custom classification models can address these needs and offer a powerful way to bring order to document management.
Document Intelligence is a cloud-based Azure AI service that uses machine learning models to automate document processing in applications and workflows. New users and those unfamiliar with Document Intelligence's capabilities may be interested in starting their journey using Document Intelligence Studio—an online tool to visually explore, understand, train, and implement features from the Document Intelligence service without having to write a single line of code. However, more advanced use cases and integrations may necessitate interacting with the Document Intelligence service programmatically. This can be achieved using the Document Intelligence REST API or SDKs available for .NET, Java, JavaScript, and Python. In this article we'll focus specifically on building a custom classification model using Python, one of the more popular languages amongst data science and machine learning developers.
Those wanting to get a head start creating a custom classification model programmatically may look to utilize the existing sample_build_classifier.py code sample from the azure-sdk-for-python repository. However, for this sample script to work, the classifier training data set must already include ocr.json files for each document. Optical Character Recognition (OCR) is a critical step in converting scanned documents into editable and searchable data. While Azure AI Document Intelligence Studio automatically generates the OCR files behind-the-scenes when building a custom classification model using the visual interface, those utilizing the Python SDK may find themselves at a crossroads due to the lack of this built-in functionality.
The Challenge:
The Document Intelligence Python SDK provides a powerful set of tools for extracting information from forms and documents. However, one key limitation is its lack of a method to easily generate ocr.json files from layout analysis results, a feature that is completely integrated and handled automatically in Document Intelligence Studio.
As described in the documentation here, the required ocr.json files can be created by analyzing each training document with Document Intelligence's prebuilt layout model and saving the results in the proper API response format. There is a sample Python script, sample_analyze_layout.py, but since the SDK's layout results object is structured differently from the API's layout results object, there isn't a clear way to generate the required ocr.json files strictly using the Python SDK. This blog post delves into the custom solution we developed to code this process manually, addressing a common problem discussed in the Microsoft community.
Our Custom Solution:
Step-by-Step Guide to Building the Classifier:
1. Run analyze_layout.py, which will iterate through the files in the specified directory (TRAINING_DOCUMENTS) and analyze each document using Azure AI Document Intelligence. It saves the results in a .ocr.json file alongside the original document. This format mirrors the OCR output of Document Intelligence Studio, maintaining consistency and compatibility.

```python
with open(document_file_path, "rb") as f:
    # Use begin_analyze_document to start the analysis process, and use a
    # callback in order to receive the raw response
    poller = document_analysis_client.begin_analyze_document(
        "prebuilt-layout",
        document=f,
        cls=lambda raw_response, _, headers: create_ocr_json(ocr_json_file_path, raw_response),
    )
```
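The `create_ocr_json` callback itself is the piece the SDK sample leaves out. A minimal sketch of what it might look like follows; it assumes the `raw_response` passed to the `cls` callback is an azure-core pipeline response whose `http_response.text()` returns the raw JSON body of the layout call, which is exactly the API-format result the classifier expects.

```python
import json


def create_ocr_json(ocr_json_file_path, raw_response):
    """Persist the raw layout API response as a .ocr.json file.

    `raw_response` is the pipeline response handed to the `cls` callback;
    its `http_response.text()` is the unmodified JSON body returned by the
    prebuilt-layout model, so writing it out verbatim reproduces the file
    format Document Intelligence Studio generates behind the scenes.
    """
    body = raw_response.http_response.text()
    # Parse and re-serialize so a non-JSON (e.g. error) body fails fast here
    # rather than producing a corrupt training file.
    with open(ocr_json_file_path, "w", encoding="utf-8") as f:
        json.dump(json.loads(body), f)
```

Because the callback writes the body verbatim instead of going through the SDK's result object, the saved file keeps the API's field layout rather than the SDK's restructured one.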
2. Run upload_documents.py, which will upload all the training documents, along with the .ocr.json files and a .jsonl file that will be used in building the classifier to reference each of the documents. The .jsonl file allows us to process multiple documents in a batch, improving the efficiency of the training process.
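For reference, each class's .jsonl file list is simply one JSON object per line pointing at a training document, with paths relative to the container root. A small helper along these lines could generate it (the `write_file_list` name and the one-folder-per-class layout are illustrative, not part of the actual sample):

```python
import json
from pathlib import Path


def write_file_list(class_dir, jsonl_path):
    """Write one {"file": ...} JSON line per training document in class_dir.

    The .ocr.json companions are skipped: the classifier build discovers
    those on its own, sitting next to each referenced document.
    """
    class_dir, jsonl_path = Path(class_dir), Path(jsonl_path)
    docs = sorted(
        p for p in class_dir.iterdir()
        if p.is_file() and not p.name.endswith(".ocr.json")
    )
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps({"file": f"{class_dir.name}/{doc.name}"}) + "\n")
    return len(docs)
```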
3. The build_classifier.py script initiates the process of building a custom document classifier using the document types and labeled data from the .jsonl files. It utilizes the DocumentModelAdministrationClient and the BlobServiceClient, which are used to interface with the Document Intelligence and Azure Blob Storage services to retrieve and process the training data uploaded in the previous step. Once finished, it prints the results, including the classifier ID, API version, description, and document classes used for training.
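Under the hood, the build request maps each document class to the file list uploaded for it. Sketched at the REST level (field names follow the service's documentClassifiers:build operation; the function name, parameters, and container URL here are illustrative placeholders), the request body might be assembled like this:

```python
def build_classifier_request_body(classifier_id, class_names, container_sas_url,
                                  description=""):
    """Assemble the JSON body for a documentClassifiers:build request.

    Assumes each class has a matching "<class>.jsonl" file list uploaded to
    the training container, as done in the upload step.
    """
    return {
        "classifierId": classifier_id,
        "description": description,
        "docTypes": {
            name: {
                "azureBlobFileListSource": {
                    "containerUrl": container_sas_url,
                    "fileList": f"{name}.jsonl",
                }
            }
            for name in class_names
        },
    }
```

The SDK's begin_build_document_classifier method sends an equivalent payload; building the dictionary explicitly just makes visible how the classes, the container, and the .jsonl file lists fit together.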
4. Finally, the classification script uses two requests together to classify a document using the trained document classifier. The first request sends the document for classification, and the second retrieves the results of the classification process. This approach allows for asynchronous processing of document classification, where the analysis can take some time to complete, especially for large or complex documents.

- POST request: The _post_to_classification_model function performs a POST request to the Azure AI classification model for prediction. It uses the specified Document Intelligence key and model specifications to post the document for classification. The request URL includes the classifier model ID and the API version. The function reads the document as binary data and sends it in the request body along with the necessary headers. If the POST request is successful, it returns the response.

```python
post_url = (
    ENDPOINT + f"/documentintelligence/{API_TYPE}/{MODEL_ID}:analyze?api-version={API_VERSION}"
)
```

- GET request: The _get_classification_results function retrieves the classification results from the Azure AI classification model. It takes the response from the POST request as input and extracts the operation-location URL from the response headers. It then makes GET requests to this URL in a loop, waiting for the analysis to complete. It retries the GET request multiple times until the analysis succeeds, fails, or reaches a maximum number of retries. Once the analysis is complete, it returns the classification results as a JSON object.

```python
get_url = post_response.headers["operation-location"]
resp = get(
    url=get_url,
    headers={"Ocp-Apim-Subscription-Key": FORM_RECOGNIZER_KEY},
)
...
result = _get_classification_results(request)["analyzeResult"]
```
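The retry loop described above can be factored into a small, client-agnostic helper. In this sketch, `get_fn` is any callable with the (url, headers) signature of `requests.get`, injected so the polling logic can be shown (and tested) without tying it to a particular HTTP client; the function name, retry count, and interval are illustrative choices, not part of the original script.

```python
import time


def poll_for_result(operation_url, headers, get_fn, max_retries=10, interval=2.0):
    """Poll the operation-location URL until the analysis finishes.

    Each response's JSON body carries a `status` field that moves from
    "running" (or "notStarted") to "succeeded" or "failed".
    """
    for _ in range(max_retries):
        resp = get_fn(url=operation_url, headers=headers)
        body = resp.json()
        status = body.get("status")
        if status == "succeeded":
            return body
        if status == "failed":
            raise RuntimeError(f"Classification failed: {body}")
        time.sleep(interval)  # still running; wait before the next attempt
    raise TimeoutError(f"Analysis incomplete after {max_retries} attempts")
```

A fixed sleep keeps the example short; a production version might honor the service's Retry-After header or back off exponentially instead.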
Conclusion: