Generate embeddings with the Azure AI Vision multi-modal embeddings API


Welcome to a new learning series about image similarity search with pgvector, an open-source vector similarity search extension for PostgreSQL databases. Throughout this series, we will explore the basics of vector search, familiarize ourselves with the multi-modal embeddings APIs of Azure AI Vision, and build an image similarity search application using Azure Cosmos DB for PostgreSQL.

 

Our Project

 

In this series, we will create an application that enables users to search for paintings based on either a reference image or a text description. We will use the SemArt Dataset, which contains approximately 21k paintings gathered from the Web Gallery of Art. Each painting comes with various attributes, such as a title, a description, and the name of the artist.

 

The project is divided into two parts: the data pipeline and the vector search pipeline. In the data pipeline, embeddings for the images are generated using Azure AI Vision, and the data is then uploaded into an Azure Cosmos DB for PostgreSQL table. The vector search pipeline involves utilizing the pgvector extension to perform a similarity search on the generated embeddings. This workflow is illustrated in the following image:

 

Image similarity search workflow

Introduction

 

Conventional search systems retrieve similar items by relying on exact matches of properties such as keywords, tags, or other metadata, on lexical similarity, or on the frequency of word occurrences. Recently, vector similarity search has transformed the search process. It leverages machine learning to capture the meaning of data, allowing you to find similar items based on their content. The key idea behind vector search is to convert unstructured data, such as text, images, videos, and audio, into high-dimensional vectors (also known as embeddings) and apply nearest neighbor algorithms to find similar data.

 

In this tutorial, you will:

  • Describe vector embeddings and vector similarity search.
  • Use the multi-modal embeddings API of Azure AI Vision for generating vectors for images and text.
  • Generate vector embeddings for a collection of images of paintings using the Vectorize Image API of Azure AI Vision.

The complete working project can be found in my GitHub repository. If you want to follow along, you can fork the repository and clone it to have it locally available.

 

Prerequisites

 

To proceed with this tutorial, ensure that you have the following prerequisites installed and configured:

 

Concepts

 

Vector embeddings

 

Unlike numerical and structured data, which can be easily compared by performing mathematical operations, unstructured data is challenging to compare. What if we could convert unstructured data, such as text and images, into a numerical representation? We could then calculate their similarity using standard mathematical methods.

These numerical representations are called vector embeddings. An embedding is a high-dimensional and dense vector that summarizes the information contained in the original data. Vector embeddings can be computed using machine learning algorithms that capture the meaning of the data, recognize patterns, and identify similarities between the data.

 

Visualization of word embeddings in a 2-dimensional vector space. Words that are semantically similar are located close together, while dissimilar words are placed farther apart.

 

Vector similarity

 

The numerical distance between two embeddings, or equivalently, their proximity in the vector space, represents their similarity. Vector similarity is commonly calculated using distance metrics such as Euclidean distance, inner product, or cosine distance.

Cosine similarity is the metric used by Azure AI Vision. It measures the angle between two vectors and is not affected by their magnitudes. Mathematically, cosine similarity is defined as the cosine of the angle between two vectors, which is equal to the dot product of the vectors divided by the product of their magnitudes.
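
To make this concrete, here is a minimal sketch of cosine similarity in Python (the three-dimensional vectors below are made-up stand-ins for real 1024-dimensional embeddings, for illustration only):

import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "embeddings" -- hypothetical values, not produced by any real model
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(cat, dog))  # ~0.99: semantically close
print(cosine_similarity(cat, car))  # ~0.30: dissimilar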

Vector similarity can be used in various industry applications, including recommender systems, fraud detection, text classification, and image recognition. For example, systems can use vector similarities between products to identify similar products and create recommendations based on a user's preferences.

 

Vector similarity search

 

A vector search system works by comparing the vector embedding of a user’s query with a set of pre-stored vector embeddings to find a list of vectors that are the most similar to the query vector. The diagram below illustrates this workflow.
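
As an illustration, a brute-force version of this comparison takes only a few lines (a sketch that reuses the cosine_similarity helper above; production systems such as pgvector rely on indexes and optimized distance computations rather than scanning every vector):

def top_k_similar(query_vector, stored_vectors, k=5):
    # Score every stored vector against the query, then keep the k best matches.
    scores = [
        (i, cosine_similarity(query_vector, v))
        for i, v in enumerate(stored_vectors)
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]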

 

Overview of vector similarity search flow

 

Create vector embeddings with Azure AI Vision

 

Azure AI Vision provides two APIs for vectorizing image and text queries: the Vectorize Image API and the Vectorize Text API. This vectorization converts images and text into coordinates in a 1024-dimensional vector space, enabling users to search a collection of images using text and/or images without the need for metadata, such as image tags, labels, or captions.

 

Let’s learn how the multi-modal embeddings APIs work.

 

Create an Azure AI Vision resource

  1. Open the Azure CLI.
  2. Create a resource group using the following command:
    az group create --name your-group-name --location your-location
  3. Create an Azure AI Vision resource in the resource group that you created using the following command:
    az cognitiveservices account create --name ai-vision-resource-name --resource-group your-group-name --kind ComputerVision --sku S1 --location your-location --yes

Note: The multi-modal embeddings APIs are available in the following regions: East US, France Central, Korea Central, North Europe, Southeast Asia, West Europe, West US.

 

Before using the multi-modal embeddings APIs, you need to store the key and the endpoint of your Azure AI Vision resource in an environment (.env) file.
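
For example, you can look up these values with the Azure CLI (the resource and group names below are the same placeholders used above):

    az cognitiveservices account show --name ai-vision-resource-name --resource-group your-group-name --query "properties.endpoint"
    az cognitiveservices account keys list --name ai-vision-resource-name --resource-group your-group-name

Then store them in a .env file, using the variable names that the code in the next section reads:

    VISION_ENDPOINT="https://your-resource-name.cognitiveservices.azure.com/"
    VISION_KEY="your-key"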

 

Use the Vectorize Image API

 

Let’s review the following example. Given the filename of an image, the get_image_embedding function sends a POST request to the retrieval:vectorizeImage endpoint, with the binary image data included in the HTTP request body. The call returns a JSON object containing the vector embedding of the image.

 

 

import os

import requests
from dotenv import load_dotenv

# Load environment variables
load_dotenv()
endpoint = os.getenv("VISION_ENDPOINT") + "computervision/"
key = os.getenv("VISION_KEY")


def get_image_embedding(image):
    with open(image, "rb") as img:
        data = img.read()

    # Vectorize Image API
    version = "?api-version=2023-02-01-preview&modelVersion=latest"
    vectorize_img_url = endpoint + "retrieval:vectorizeImage" + version
    headers = {
        "Content-type": "application/octet-stream",
        "Ocp-Apim-Subscription-Key": key,
    }

    try:
        r = requests.post(vectorize_img_url, data=data, headers=headers)
        if r.status_code == 200:
            image_vector = r.json()["vector"]
            return image_vector
        else:
            print(f"An error occurred while processing {image}. Error code: {r.status_code}.")
    except Exception as e:
        print(f"An error occurred while processing {image}: {e}")
    return None


image_filename = "images/image (1).jpg"
image_vector = get_image_embedding(image_filename)

 

 

To vectorize a remote image, you would put the URL of the image in the request body.
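
As a sketch, a hypothetical helper for remote images might look like the following (it sends a JSON body with a url field; verify the exact request shape against the API reference for your API version):

def get_remote_image_embedding(image_url):
    # Same endpoint as before, but the image is passed by URL in a JSON body
    # instead of as raw bytes.
    version = "?api-version=2023-02-01-preview&modelVersion=latest"
    vectorize_img_url = endpoint + "retrieval:vectorizeImage" + version
    headers = {
        "Content-type": "application/json",
        "Ocp-Apim-Subscription-Key": key,
    }
    r = requests.post(vectorize_img_url, json={"url": image_url}, headers=headers)
    return r.json()["vector"] if r.status_code == 200 else None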

 

Use the Vectorize Text API

 

Similarly to the example above, the get_text_embedding function sends a POST request to the retrieval:vectorizeText endpoint, passing the text prompt in a JSON request body.

 

 

import json


def get_text_embedding(prompt):
    text = {"text": prompt}

    # Vectorize Text API
    version = "?api-version=2023-02-01-preview&modelVersion=latest"
    vectorize_txt_url = endpoint + "retrieval:vectorizeText" + version
    headers = {
        "Content-type": "application/json",
        "Ocp-Apim-Subscription-Key": key,
    }

    try:
        r = requests.post(vectorize_txt_url, data=json.dumps(text), headers=headers)
        if r.status_code == 200:
            text_vector = r.json()["vector"]
            return text_vector
        else:
            print(f"An error occurred while processing the prompt '{prompt}'. Error code: {r.status_code}.")
    except Exception as e:
        print(f"An error occurred while processing the prompt '{prompt}': {e}")
    return None


text_prompt = "a blue house"
text_vector = get_text_embedding(text_prompt)

 

 

Generate vector embeddings for a collection of paintings

 

Now that you've familiarized yourself with the Vectorize Image API for computing image vector embeddings, let's generate embeddings for the images in our dataset.

 

Data preprocessing

 

For our application, we'll be working with a subset of the SemArt Dataset. In my GitHub repository, you can find the data_preprocessing.ipynb Jupyter Notebook which cleans up the dataset and removes unnecessary information. After running this notebook, your dataset will comprise 11,206 images of paintings.

 

You are now all set up to generate embeddings for your images.

 

Compute vector embeddings

 

To generate embeddings for the images, our process can be summarized as follows:

  1. Retrieve the filenames of the images in the dataset (a minimal sketch of this step follows the list).
  2. Divide the data into batches, and for each batch, perform the following steps:
    1. Compute the vector embedding for each image in the batch using the Vectorize Image API of Azure AI Vision.
    2. Save the vector embeddings of the images along with the filenames into a file.
  3. Update the dataset by inserting the vector embedding of each image.
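
For step 1, a minimal sketch of gathering the filenames might look like this (images_folder is the same images directory used by the code later in this section; the actual repository script may differ):

image_names = sorted(
    f for f in os.listdir(images_folder)
    if f.lower().endswith((".jpg", ".jpeg", ".png"))
)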

The code for vector embeddings generation can be found at data_processing/generate_embeddings.py. In the following sections, we will discuss specific segments of the code.

 

Compute embeddings for the images in the dataset

 

The compute_embeddings function computes the vector embeddings for all the images in our dataset. It uses a ThreadPoolExecutor to generate vector embeddings for each batch of images efficiently, using multiple threads. The tqdm library provides progress bars that visualize the embedding generation process.

 

 

def compute_embeddings(image_names: list[str]) -> None:
    """
    Computes vector embeddings for the provided images and saves the embeddings
    alongside their corresponding image filenames in a CSV file.

    :param image_names: A list containing the filenames of the images.
    """
    image_names_batches = [
        image_names[i:(i + BATCH_SIZE)]
        for i in range(0, len(image_names), BATCH_SIZE)
    ]

    for batch in tqdm(range(len(image_names_batches)), desc="Computing embeddings"):
        images = image_names_batches[batch]

        with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
            embeddings = list(
                tqdm(
                    executor.map(
                        lambda x: get_image_embedding(
                            image=os.path.join(images_folder, x),
                        ),
                        images,
                    ),
                    total=len(images),
                    desc=f"Processing batch {batch+1}",
                    leave=False,
                )
            )

        valid_data = [
            [images[i], str(embeddings[i])]
            for i in range(len(images))
            if embeddings[i] is not None
        ]
        save_data_to_csv(valid_data)

 

 

Once the embeddings for all the images in a batch are computed, the data is saved into a CSV file.

 

 

def save_data_to_csv(data: list[list[str]]) -> None:
    """
    Appends a list of image filenames and their associated embeddings to a CSV file.

    :param data: The data to be appended to the CSV file.
    """
    with open(embeddings_filepath, "a", newline="") as csv_file:
        write = csv.writer(csv_file)
        write.writerows(data)

 

 

Azure AI Vision API rate limits

 

The Azure AI Vision API imposes rate limits on its usage. The free tier allows only 20 transactions per minute, while the standard tier allows up to 30 transactions per second, depending on the operation (Source: Microsoft Docs). If you exceed your rate limit, you'll receive a 429 HTTP error code.

 

For our application, it is recommended to use the standard tier during the embeddings generation process and limit the number of requests per second to approximately 10 to avoid potential issues.
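
One common way to stay within the limits is to back off and retry whenever a 429 is returned. The following is a sketch of that idea (a hypothetical helper, not the approach used in the repository's script, which instead bounds its throughput with batches and a fixed-size thread pool):

import time
import requests

def post_with_backoff(url, max_retries=5, **kwargs):
    # Retry a POST on HTTP 429 (rate limited), honoring the Retry-After
    # header when the service provides one, otherwise backing off exponentially.
    for attempt in range(max_retries):
        r = requests.post(url, **kwargs)
        if r.status_code != 429:
            return r
        wait = float(r.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    return r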

 

Generate the dataset

 

After computing the vector embeddings for all images in the dataset, we update the dataset by inserting the vector embedding of each image. In the generate_dataset function, the merge method of pandas.DataFrame combines the original dataset with the embeddings using a database-style inner join.

 

 

def generate_dataset() -> None:
    """
    Appends the corresponding vector to each row of the original dataset
    and saves the updated dataset as a CSV file.
    """
    dataset_df = pd.read_csv(dataset_filepath, sep="\t", dtype="string")
    embeddings_df = pd.read_csv(
        embeddings_filepath,
        dtype="string",
        names=[IMAGE_FILE_CSV_COLUMN_NAME, EMBEDDINGS_CSV_COLUMN_NAME],
    )
    final_dataset_df = dataset_df.merge(
        embeddings_df, how="inner", on=IMAGE_FILE_CSV_COLUMN_NAME
    )
    final_dataset_df.to_csv(final_dataset_filepath, index=False, sep="\t")

 

 

Next steps

 

In this post, you’ve learned the basics of vector search and computed vector embeddings for a collection of images using the Azure AI Vision Vectorize Image API. In the next post, you will store and query the vector embeddings on Azure Cosmos DB for PostgreSQL using the pgvector extension.

 

