This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.
In a Retrieval-Augmented Generation (RAG) setup, user-specified filters, whether implied or explicit, can often be overlooked during vector searches, as the vector search primarily focuses on semantic similarity.
In some scenarios, it’s essential to ensure that specific queries are answered exclusively using a predefined (sub)set of the documents. By using “metadata” or tags, you can enforce the type of documents that should be used for each type of user query. This can even serve as a security overlay policy: when each user's queries are tagged with their credentials / auth level and turned into filters, their queries are answered only with documents at their auth level.
When RAG data consists of numerous separate data objects (e.g., files), each data object can be tagged with a predefined set of metadata. These tags then can serve as filters during vector or hybrid search. Metadata can be incorporated into the search index alongside vector embeddings and subsequently used as filters.
In this blog, we will demonstrate an example implementation…
For the sake of demonstration, this blog post uses Wikipedia articles about movies as our documents. We will then tag these movie files with metadata such as genre, releaseYear, and director, and later use this metadata to filter RAG generations.
Please note that, for deployment at a larger scale, an LLM can also be used to “classify” the documents before they are uploaded to the search index. When a user enters a prompt, an additional LLM call can classify the prompt (match it to a set of metadata), and the result can then be used to filter the search results. This blog post demonstrates a simpler use case in which the RAG documents (the Wikipedia pages saved as PDF files) are pre-tagged with the movie metadata.
1. Classify documents and tag with metadata
movies = [
    {"id": "1", "title": "The Shawshank Redemption", "genre": "Drama", "releaseYear": 1994, "director": "Frank Darabont"},
    {"id": "2", "title": "The Godfather", "genre": "Crime", "releaseYear": 1972, "director": "Francis Ford Coppola"},
    {"id": "3", "title": "The Dark Knight", "genre": "Action", "releaseYear": 2008, "director": "Christopher Nolan"},
    {"id": "4", "title": "Schindler's List", "genre": "Biography", "releaseYear": 1993, "director": "Steven Spielberg"},
    {"id": "5", "title": "Pulp Fiction", "genre": "Crime", "releaseYear": 1994, "director": "Quentin Tarantino"},
    {"id": "6", "title": "The Lord of the Rings: The Return of the King", "genre": "Fantasy", "releaseYear": 2003, "director": "Peter Jackson"},
    {"id": "7", "title": "The Good, the Bad and the Ugly", "genre": "Western", "releaseYear": 1966, "director": "Sergio Leone"},
    {"id": "8", "title": "Fight Club", "genre": "Drama", "releaseYear": 1999, "director": "David Fincher"},
    {"id": "9", "title": "Forrest Gump", "genre": "Drama", "releaseYear": 1994, "director": "Robert Zemeckis"},
    {"id": "10", "title": "Inception", "genre": "Sci-Fi", "releaseYear": 2010, "director": "Christopher Nolan"}
]
2. Creating the Azure AI Search index…
We need to create an Azure AI Search index in which the metadata fields are “filterable” and, for the string fields, “searchable”. Below is the schema definition we will use.
First, define the schema in JSON:
{
  "name": "movies-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "filterable": false, "sortable": false },
    { "name": "title", "type": "Edm.String", "filterable": true, "searchable": true },
    { "name": "genre", "type": "Edm.String", "filterable": true, "searchable": true },
    { "name": "releaseYear", "type": "Edm.Int32", "filterable": true, "sortable": true },
    { "name": "director", "type": "Edm.String", "filterable": true, "searchable": true },
    { "name": "content", "type": "Edm.String", "filterable": false, "searchable": true },
    {
      "name": "contentVector",
      "type": "Collection(Edm.Single)",
      "searchable": true,
      "retrievable": true,
      "dimensions": 1536,
      "vectorSearchProfile": "my-default-vector-profile"
    }
  ],
  "vectorSearch": {
    "algorithms": [
      {
        "name": "my-hnsw-config-1",
        "kind": "hnsw",
        "hnswParameters": {
          "m": 4,
          "efConstruction": 400,
          "efSearch": 500,
          "metric": "cosine"
        }
      }
    ],
    "profiles": [
      {
        "name": "my-default-vector-profile",
        "algorithm": "my-hnsw-config-1"
      }
    ]
  }
}
Then run the following script to create the index with a REST API call to Azure AI Search service...
RESOURCE_GROUP="[your-resource-group]"
SEARCH_SERVICE_NAME="[your-search-service-name]"
API_VERSION="2023-11-01"
API_KEY="[your-AI-Search-API-key]"
SCHEMA_FILE="movies-index-schema.json"
curl -X POST "https://${SEARCH_SERVICE_NAME}.search.windows.net/indexes?api-version=${API_VERSION}" \
  -H "Content-Type: application/json" \
  -H "api-key: ${API_KEY}" \
  -d "@${SCHEMA_FILE}"
Once the Azure AI Search index is created, confirm in the portal that the metadata fields are marked as filterable and, for the string fields, searchable.
3. Embed and upload document chunks to the Azure AI Search index with their metadata
The documents we will use are the Wikipedia pages for the movies, saved as PDF files. To integrate the documents with LLMs in a RAG pattern, we first pre-process them. In the code below, the extract_text_from_pdf function opens a specified PDF file, reads its content using the PdfReader class, and extracts the text from each page, combining all the text into a single string. The normalize_text function takes a text string and collapses every run of whitespace into a single space, so the text becomes one continuous string. The chunk_text function then takes this normalized text and splits it into smaller chunks of approximately a specified size (default 6000 characters). It does this by tokenizing the text into sentences and grouping whole sentences into chunks, starting a new chunk whenever the next sentence would exceed the size limit. This keeps sentences intact and makes the text easier to manage and process in smaller segments.
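# Imports assumed here: PdfReader from pypdf (PyPDF2 exposes the same class)
# and sent_tokenize from NLTK (requires NLTK's Punkt sentence tokenizer data).
from pypdf import PdfReader
from nltk.tokenize import sent_tokenize

# Function to extract text from a PDF
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PdfReader(file)
        for page in reader.pages:
            # Guard against pages that yield no text
            text += (page.extract_text() or "") + "\n"
    return text

# Function to normalize text (collapse all whitespace runs to single spaces)
def normalize_text(text):
    return ' '.join(text.split())

# Function to chunk text into smaller pieces made of whole sentences
def chunk_text(text, chunk_size=6000):
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_length = 0
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit;
        # the current_chunk check avoids emitting an empty first chunk.
        if current_chunk and current_length + len(sentence) > chunk_size:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentence]
            current_length = len(sentence)
        else:
            current_chunk.append(sentence)
            current_length += len(sentence)
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks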
We then embed each chunk and upload the embedding, along with the document metadata, to the previously created Azure AI Search index.
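As a rough sketch of this step, the snippet below builds the request body for the Azure AI Search document upload endpoint (`POST /indexes/movies-index/docs/index`). The helper name `build_upload_batch`, the `<movie id>-<chunk number>` key scheme, and the `embed` callable (standing in for an embedding function such as the `generate_embeddings` used in the next section) are assumptions made for illustration, not the code used to produce the results in this post.

```python
from typing import Callable, Dict, List


def build_upload_batch(
    movie: Dict,
    chunks: List[str],
    embed: Callable[[str], List[float]],
) -> Dict:
    """Build an index-documents request body: one search document per chunk."""
    documents = []
    for i, chunk in enumerate(chunks):
        documents.append({
            "@search.action": "upload",
            # Assumed key scheme: the index key must be unique per chunk.
            "id": f"{movie['id']}-{i}",
            # Every chunk carries the metadata of the movie it came from.
            "title": movie["title"],
            "genre": movie["genre"],
            "releaseYear": movie["releaseYear"],
            "director": movie["director"],
            "content": chunk,
            "contentVector": embed(chunk),
        })
    return {"value": documents}
```

The returned dictionary can be sent as JSON with the same `api-key` header that was used to create the index. Copying the movie metadata onto every chunk is what allows a filter to be applied at the chunk level later on.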
4. Unfiltered Vector Search
First, let's run a vector search with a relatively generic prompt that matches multiple document chunks. Notice the explicit filter statement in the prompt: “The movie was cast in 2010”. Because vector search ranks chunks by semantic similarity alone, it cannot interpret the stated filter, so incorrect results (movies released long before 2010) are also returned.
import json
import requests

# Note: url, headers and generate_embeddings are assumed to be defined earlier
# (the search endpoint of the index, the api-key header, and the embedding helper).

# Generate embedding for the plot prompt
plot_prompt = "An individual faces insurmountable odds and undergoes a transformative journey, \
uncovering hidden strengths and forming unexpected alliances. Through resilience and cunning, \
they navigate a world filled with corruption, betrayal, and a fight for justice, \
ultimately discovering their true purpose. The movie was cast in 2010"
prompt_embedding_vector = generate_embeddings(plot_prompt)

# Pure vector query: no "filter" key is sent
payload = {
    "count": True,
    "select": "title, content, genre",
    "vectorQueries": [
        {
            "kind": "vector",
            "vector": prompt_embedding_vector,
            "exhaustive": True,
            "fields": "contentVector",
            "k": 5
        }
    ],
    # "filter": "genre eq 'Drama' and releaseYear ge 1990 and director eq 'Christopher Nolan'"
}
response = requests.post(url, headers=headers, data=json.dumps(payload))
if response.status_code == 200:
    results = response.json()
    print("Results without pre-filter:")
    for result in results['value']:
        print(result)
else:
    print(f"Error: {response.status_code}")
    print(response.json())
Results without pre-filter:
{'@search.score': 0.83729386, 'title': 'The Shawshank Redemption', 'genre': 'Drama', 'content': '...'}
{'@search.score': 0.83415884, 'title': 'The Shawshank Redemption', 'genre': 'Drama', 'content': '...'}
{'@search.score': 0.8314112, 'title': 'Inception', 'genre': 'Sci-Fi', 'content': '...'}
{'@search.score': 0.8308051, 'title': 'The Lord of the Rings: The Return of the King', 'genre': 'Fantasy', 'content': '...'}
5. (pre)Filtered Vector Search
Next, add a filter to the vector search. The filter keeps only document chunks whose releaseYear metadata value (Int32) is equal to 2010, and setting vectorFilterMode to preFilter applies it before the vector search runs. In this case only the correct search results, document chunks from the movie “Inception”, are returned.
payload_with_release_year_filter = {
    "count": True,
    "select": "title, content, genre, releaseYear, director",
    "filter": "releaseYear eq 2010",
    "vectorFilterMode": "preFilter",
    "vectorQueries": [
        {
            "kind": "vector",
            "vector": prompt_embedding_vector,
            "exhaustive": True,
            "fields": "contentVector",
            "k": 5
        }
    ]
}
Results with pre-filter:
{'@search.score': 0.8314112, 'title': 'Inception', 'genre': 'Sci-Fi', 'releaseYear': 2010, 'director': 'Christopher Nolan', 'content': '...'}
{'@search.score': 0.83097535, 'title': 'Inception', 'genre': 'Sci-Fi', 'releaseYear': 2010, 'director': 'Christopher Nolan', 'content':'...'}
{'@search.score': 0.83029956, 'title': 'Inception', 'genre': 'Sci-Fi', 'releaseYear': 2010, 'director': 'Christopher Nolan', 'content': '...'}
{'@search.score': 0.82646775, 'title': 'Inception', 'genre': 'Sci-Fi', 'releaseYear': 2010, 'director': 'Christopher Nolan', 'content': '...'}
{'@search.score': 0.8255407, 'title': 'Inception', 'genre': 'Sci-Fi', 'releaseYear': 2010, 'director': 'Christopher Nolan', 'content': '...'}
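When the metadata to filter on is only known at query time, the `filter` string can be assembled from a dictionary rather than hard-coded. The helper below is a minimal sketch under that assumption: `build_odata_filter` is a hypothetical name, and it only covers equality (`eq`) conditions joined with `and` on the string and integer fields of the schema above.

```python
from typing import Dict, Union


def build_odata_filter(metadata: Dict[str, Union[str, int]]) -> str:
    """Combine metadata equality conditions into one OData filter string."""
    clauses = []
    for field, value in metadata.items():
        if isinstance(value, str):
            # OData string literals use single quotes; a quote inside is doubled.
            escaped = value.replace("'", "''")
            clauses.append(f"{field} eq '{escaped}'")
        else:
            clauses.append(f"{field} eq {value}")
    return " and ".join(clauses)


print(build_odata_filter({"releaseYear": 2010}))
# releaseYear eq 2010
print(build_odata_filter({"genre": "Drama", "releaseYear": 1994}))
# genre eq 'Drama' and releaseYear eq 1994
```

The resulting string can be placed in the `filter` field of the search payload. Escaping the quote matters for titles such as "Schindler's List", which would otherwise produce an invalid filter expression.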
Conclusion:
This blog presented a simple scenario where document chunks are embedded and uploaded to an Azure AI Search index with document metadata as searchable and filterable fields.
The concept can be extended so that an additional LLM query step "classifies" user prompts and infers the metadata to apply when pre- or post-filtering the chunks matched by the vector search. Likewise, the documents themselves can be tagged with metadata using an LLM call rather than relying on static human annotation as demonstrated in this example.
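To make the prompt-classification idea concrete, here is a deliberately simple, rule-based stand-in for that LLM step. It is not an LLM call: the function name `infer_prompt_metadata`, the `KNOWN_GENRES` list, and the regular expressions are all assumptions chosen for illustration, and a real deployment would replace the body with a model call that returns the same kind of dictionary.

```python
import re
from typing import Dict, Union

# Assumed vocabulary: the genres used to tag the ten example movies.
KNOWN_GENRES = ["Drama", "Crime", "Action", "Biography", "Fantasy", "Western", "Sci-Fi"]


def infer_prompt_metadata(prompt: str) -> Dict[str, Union[str, int]]:
    """Rule-based stand-in for an LLM call that classifies a user prompt."""
    metadata: Dict[str, Union[str, int]] = {}
    # Treat the first four-digit year (1900-2099) as the release year.
    year = re.search(r"\b(19|20)\d{2}\b", prompt)
    if year:
        metadata["releaseYear"] = int(year.group(0))
    # Match the first known genre that appears as a whole word, ignoring case.
    for genre in KNOWN_GENRES:
        if re.search(rf"(?<!\w){re.escape(genre)}(?!\w)", prompt, re.IGNORECASE):
            metadata["genre"] = genre
            break
    return metadata
```

For a prompt ending in "The movie was cast in 2010", this returns a dictionary containing `releaseYear` set to 2010, which corresponds to the filter expression `releaseYear eq 2010` used in section 5.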
References:
- Filters in vector queries documentation
- Create a vector query in Azure AI Search documentation
Hope you enjoyed the content. Let me know any comments / feedback below...
Ozgur Guler
July 24, Istanbul