Automating document indexing into Azure Cosmos DB with Logic Apps

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.

Introduction

Effectively managing large document volumes is essential for modern applications, particularly to maintain fast and reliable querying. With Azure Logic Apps, you can now automate document indexing into Azure Cosmos DB, in addition to the existing capability of indexing in AI Search, offering the flexibility to use either service as a vector store.

Logic Apps offers a rich set of connectors that allow seamless integration with various document sources such as Azure Blob Storage, SharePoint, and OneDrive, enabling automated workflows for document ingestion from multiple locations. Whether you're working with PDFs, Word documents, or structured data files like CSVs, Logic Apps supports parsing different document types efficiently.

For larger documents, Logic Apps can also implement chunking, breaking down files into manageable parts to optimize processing and indexing. This ensures even complex or large datasets are handled smoothly without overwhelming system resources.
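As a rough illustration of what chunking involves, the Python sketch below splits text into fixed-size pieces that overlap slightly, so that a sentence cut at a boundary still appears whole in one of the chunks. The function name `chunk_text`, the character-based sizing, and the default values are assumptions made for this example; the Logic Apps chunking action has its own settings, which this sketch does not try to reproduce.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Consecutive chunks share `overlap` characters. Both parameters are
    illustrative assumptions, not Logic Apps settings.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be at least 0 and smaller than max_chars")

    step = max_chars - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", max_chars=4, overlap=1):
        print(piece)
```

A larger overlap keeps more context across chunk boundaries at the cost of storing and embedding more text.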

In terms of integration with Azure Cosmos DB, the Logic Apps Cosmos DB connector supports multiple authentication methods, including Managed Identity, Shared Key Authentication, and Azure Active Directory OAuth, providing flexibility depending on your security requirements. Additionally, Logic Apps can meet various networking needs, such as integrating with private endpoints or using VNet integration to secure communication between services.

In this post, we’ll walk through a scenario where Logic Apps automates the ingestion and indexing of documents, such as PDFs, into Azure Cosmos DB. This approach not only reduces operational overhead but also ensures that your data remains highly accessible and queryable.

Why use Logic Apps for document indexing in Cosmos DB?

Automated Workflows: By automating document indexing, you eliminate manual tasks and ensure that documents are indexed as soon as they are uploaded.
Scalability: As your document volume grows, Azure Cosmos DB’s global distribution ensures your data remains scalable and highly available.
Seamless Integration: Logic Apps lets you integrate easily with other Azure services, such as Blob Storage and AI models, enhancing your document indexing with intelligence and automation.

Scenario Overview

In this scenario, we automate ingesting document content from Azure Blob Storage, parsing it, and indexing it into Azure Cosmos DB. When a blob (such as a PDF or text document) is uploaded, a Logic App workflow is triggered to process the document and store its data in a Cosmos DB container, making it easily retrievable and queryable.

Pre-requisites

To set up the scenario, make sure you have:

An Azure Cosmos DB resource to index data into
An Azure Storage account to upload the content to be indexed

Setting up Azure Cosmos DB

Once you have created the resource:

Navigate to the Azure Cosmos DB resource
From the “Settings” menu, select “Features”
Enable the feature for “Vector Search in Azure Cosmos DB for NoSQL”

The steps can also be found in detail in this blog post from Azure Cosmos DB. Now that you have the Cosmos DB resource set up as an index store, let’s create a new database and a container for the vector store.

To create a new container:

Navigate to “Data Explorer”
Create a “New Container” with the following fields set:
1. Database id: This is your database ID; in our case it is ‘docs’.
2. Container id: The container in which your documents will be stored; we have defined it as ‘category’.
3. Partition key: Used for data distribution. We have defined it as ‘/category’, given that you may want to query other categories of indexed documents.
Container Vector Policy: This is where we set the vector properties for ‘Vector Embedding 1’:
1. Path: The path of the property that holds the vector embeddings. In our case it is ‘/vector’.
2. Data type: float32
3. Distance function: Used to measure the distance between nearest neighbors. In our case, set it to ‘cosine’.
4. Dimensions: 1056
5. Index type: diskANN, as it is a low-cost, scalable option with improved latency for finding Approximate Nearest Neighbors (ANN)

You can find more information on the container setup in this GitHub tutorial.
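For readers who prefer to see the container settings as data rather than portal fields, the sketch below writes the same values as the JSON-style policies that Azure Cosmos DB for NoSQL uses for vector search. The path, data type, distance function, dimensions, and index type come from the steps above; excluding the vector path from the regular index is an added assumption based on common guidance for faster inserts, not something the portal steps above require.

```python
import json

# Values taken from the container settings described above.
VECTOR_PATH = "/vector"
DIMENSIONS = 1056  # should match the length of the embeddings you store

# Container vector policy: describes the embedding stored at VECTOR_PATH.
vector_embedding_policy = {
    "vectorEmbeddings": [
        {
            "path": VECTOR_PATH,
            "dataType": "float32",
            "distanceFunction": "cosine",
            "dimensions": DIMENSIONS,
        }
    ]
}

# Indexing policy: adds a diskANN vector index on the same path.
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/*"}],
    # Assumption: keep the vector out of the regular index to speed up writes.
    "excludedPaths": [{"path": VECTOR_PATH + "/*"}],
    "vectorIndexes": [{"path": VECTOR_PATH, "type": "diskANN"}],
}

if __name__ == "__main__":
    print(json.dumps(vector_embedding_policy, indent=2))
    print(json.dumps(indexing_policy, indent=2))
```

Recent versions of the azure-cosmos Python SDK appear to accept dictionaries of this shape when creating a container, but check the SDK documentation for your version before relying on that.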

Document Structure for Indexing in Azure Cosmos DB

In this Logic Apps workflow, we're indexing document embeddings into Azure Cosmos DB. Below is a breakdown of the key fields we’re mapping and indexing:

content: This field holds the main body of the document or the actual text content that has been processed. For example, this could be the textual data extracted from a document like a contract, invoice, or any other file type.
documentName: The name or title of the document being indexed. This field helps in identifying the document based on its file name, making it easier to search and retrieve the document by its original name.
vector: This represents the embeddings vector of the document, which is a numerical representation of the content. These vectors are used to perform similarity searches on documents, allowing for AI-driven insights or matching based on content similarity.
docId: A unique identifier generated for each document. This ensures that each document has a distinct ID, which is crucial for querying and updating specific items in the Cosmos DB container.
category: This is where the document type or category is assigned. In our case, we’re using "documents" as the value for this field. This helps in classifying and grouping documents, which can be useful when querying for specific types of documents within the database.
id: Another unique identifier, often auto-generated or derived by concatenating values. This ID can be used to ensure that there is no duplication and that each document is properly referenced.

The payload that we compose in the Logic App workflow and pass to Azure Cosmos DB carries the fields listed above.
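To make the field list concrete, here is a minimal Python sketch of one such payload item. The helper name `build_item`, the `chunk_index` parameter, and the way `id` is derived by joining `docId` with the chunk index are assumptions for illustration only; the workflow itself composes this payload with Logic Apps actions and may build `id` differently.

```python
import json
import uuid
from typing import Optional


def build_item(content: str, document_name: str, vector: list[float],
               category: str = "documents", doc_id: Optional[str] = None,
               chunk_index: int = 0) -> dict:
    """Compose one item with the fields described above (illustrative only)."""
    doc_id = doc_id or str(uuid.uuid4())  # unique identifier for the document
    return {
        # Assumption: id is derived by concatenating docId and the chunk index.
        "id": f"{doc_id}-{chunk_index}",
        "docId": doc_id,
        "documentName": document_name,
        "category": category,  # also the partition key value ('/category')
        "content": content,
        "vector": [float(x) for x in vector],
    }


if __name__ == "__main__":
    item = build_item("Sample contract text", "contract.pdf", [0.12, -0.03, 0.58])
    print(json.dumps(item, indent=2))
```

In a real item the `vector` array would have as many entries as the dimensions configured in the container vector policy, rather than the three shown here.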

Key Steps in the Workflow

I have added a GitHub sample for the workflow project. The workflow consists of the following steps:

Blob Upload Detection: The Logic App starts by detecting when a new blob (document) is added or updated in Azure Blob Storage.
Read Blob Content: The workflow reads the content of the uploaded blob and prepares it for further processing.
Document Parsing: Logic Apps parses the document, extracting the relevant content, such as text or metadata. This can include PDF extraction or text chunking for larger documents.
Chunk Text (if needed): For larger documents, the content is split into manageable chunks to ensure smooth processing and indexing.
Generate Embeddings Using AI: Using Azure AI, the Logic App generates embeddings from the document content. These embeddings allow for enhanced data processing, categorization, and structure mapping within Cosmos DB.
Map to Schema: The extracted data and embeddings are mapped to a predefined schema to ensure consistency in how documents are indexed within Cosmos DB. The properties indexed are the fields described in the document structure section above.
Bulk Update in Cosmos DB: Finally, the processed document is stored and indexed in Cosmos DB. The "Create or update many items in bulk" action takes the database ID, the container ID, and the items produced by the previous action, and writes those items to the container in bulk.
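The last three steps can be sketched in Python as a single function that embeds each chunk, maps it to the schema, and returns the list of items a bulk action would receive. This is only an outline of the data flow, not the workflow's implementation: `build_bulk_items`, the injected `embed` callable, and `toy_embed` are hypothetical names, and `toy_embed` is a placeholder that does not produce meaningful embeddings.

```python
from typing import Callable, Optional

Embedder = Callable[[str], list[float]]


def build_bulk_items(document_name: str, doc_id: str, chunks: list[str],
                     embed: Embedder, category: str = "documents",
                     expected_dims: Optional[int] = None) -> list[dict]:
    """Embed each chunk and map it to the item schema used for indexing."""
    items = []
    for index, chunk in enumerate(chunks):
        vector = embed(chunk)
        # Catch vectors whose length does not match the container policy.
        if expected_dims is not None and len(vector) != expected_dims:
            raise ValueError(
                f"chunk {index}: got {len(vector)} dimensions, expected {expected_dims}"
            )
        items.append({
            "id": f"{doc_id}-{index}",  # assumption: docId plus chunk index
            "docId": doc_id,
            "documentName": document_name,
            "category": category,
            "content": chunk,
            "vector": vector,
        })
    return items


def toy_embed(text: str) -> list[float]:
    """Placeholder only: stands in for a real embedding model call."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


if __name__ == "__main__":
    items = build_bulk_items("report.pdf", "doc-001",
                             ["first chunk", "second chunk"],
                             toy_embed, expected_dims=2)
    for item in items:
        print(item["id"], item["content"])
```

Passing the embedder in as a parameter keeps the mapping logic testable without calling an AI service, and the dimension check surfaces a mismatch with the container's vector policy before anything is written.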

Conclusion

By leveraging Azure Logic Apps to automate document indexing into Azure Cosmos DB, you can streamline data workflows, reduce manual intervention, and ensure your data is organized for optimal performance. This powerful integration simplifies the process, making it easier for teams to manage large volumes of documents and scale as needed.

What’s next

Currently, Logic Apps supports efficient document indexing in Cosmos DB, but Vector Search for AI-driven retrieval is not yet available. This much-anticipated feature will enhance Cosmos DB as a powerful vector store. Stay tuned for this update!