This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.
Teach ChatGPT to Answer Questions Based on PDF content: Using Azure Cognitive Search and Azure OpenAI
Can't I just copy and paste text from a PDF file to teach ChatGPT?
This tutorial is related to the following topics
Learning objectives
Prerequisites
Microsoft Cloud Technologies used in this Tutorial
Table of Contents
Series 1: Extract Key Phrases for Search Queries Using Azure Cognitive Search
Series 2: Implement a ChatGPT Service with Azure OpenAI
1. Create a Blob Container
2. Store PDF Documents in Azure Blob Storage
3. Create a Cognitive Search Service
NOTE:
NOTE:
4. Connect to Data from Azure Blob Storage
5. Add Cognitive Skills
6. Customize Target Index and Create an Indexer
1. content (Edm.String)
This field indicates the actual content of the stored data.
2. metadata_storage_content_type (Edm.String)
This field indicates the MIME content type of the stored data.
Ex) The metadata_storage_content_type of `example.pdf` is `application/pdf`.
3. metadata_storage_size (Edm.Int64)
This field indicates the size of the stored data, in bytes, stored as an integer.
Ex) The metadata_storage_size of `example.pdf` is `487743` (bytes).
4. metadata_storage_last_modified (Edm.DateTimeOffset)
This field indicates the most recent modification date and time of the stored data.
Ex) The metadata_storage_last_modified of `example.pdf` is `2023-10-06T18:45:32+00:00`.
5. metadata_storage_content_md5 (Edm.String)
This field indicates a checksum value for the data, which is used to validate the
integrity of the content during transmission or storage. The MD5 hash value is
represented as a string of 32 hexadecimal characters.
Ex) The metadata_storage_content_md5 of `example.pdf` is
`d41d8cd98f00b204e9800998ecf8427e`
6. metadata_storage_name (Edm.String)
This field indicates the file name stored in blob storage.
Ex) The metadata_storage_name of `example.pdf` is `example.pdf`.
7. metadata_storage_path (Edm.String)
This field indicates the storage path where the data file or object resides within the
Azure storage architecture.
Ex) The metadata_storage_path of `example.pdf` is
`https://yourstorageaccount.blob.core.windows.net/testcontainer/example.pdf`
8. metadata_storage_file_extension (Edm.String)
This field indicates the file extension.
Ex) The metadata_storage_file_extension of `example.pdf` is `.pdf`.
9. metadata_content_type (Edm.String)
This field indicates the nature of the internal content, such as whether it is text,
HTML, JSON, etc.
Ex) The metadata_content_type of `example.pdf` is `text`.
10. metadata_language (Edm.String)
This field indicates the language in which the content is written, facilitating language-specific processing and searching.
Ex) The metadata_language of `example.pdf` is `EN`.
11. metadata_creation_date (Edm.DateTimeOffset)
This field indicates the date and time when the data was originally created.
Ex) The metadata_creation_date of `example.pdf` is `2023-09-30T14:32:10+00:00`.
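Putting the fields above together, an indexed document for `example.pdf` can be pictured as the following Python dict. All values are illustrative samples taken from the examples above, not output from a real index:

```python
# A sketch of one indexed blob document, using the metadata fields described above.
# All values are illustrative.
example_document = {
    'content': 'Full extracted text of the PDF...',
    'metadata_storage_content_type': 'application/pdf',
    'metadata_storage_size': 487743,  # Edm.Int64, in bytes
    'metadata_storage_last_modified': '2023-10-06T18:45:32+00:00',
    'metadata_storage_content_md5': 'd41d8cd98f00b204e9800998ecf8427e',
    'metadata_storage_name': 'example.pdf',
    'metadata_storage_path': 'https://yourstorageaccount.blob.core.windows.net/testcontainer/example.pdf',
    'metadata_storage_file_extension': '.pdf',
    'metadata_content_type': 'text',
    'metadata_language': 'EN',
    'metadata_creation_date': '2023-09-30T14:32:10+00:00',
}

# Note that the file name is the last segment of the storage path.
assert example_document['metadata_storage_path'].rsplit('/', 1)[-1] == \
    example_document['metadata_storage_name']
```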
7. Extract Key Phrases for Search Queries Using Azure Cognitive Search
Series 2: Implement a ChatGPT Service with Azure OpenAI
1. Change your indexer settings to use Azure OpenAI
"outputFieldMappings": [
{
"sourceFieldName": "/document/content/pages/*/keyphrases/*",
"targetFieldName": "keyphrases"
},
{
"sourceFieldName": "/document/content/pages/*",
"targetFieldName": "pages"
}
]
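For orientation, the `outputFieldMappings` snippet above lives inside the indexer definition JSON. A minimal sketch of the surrounding structure as a Python dict (the indexer, data-source, and index names here are placeholders, not values from the tutorial):

```python
# Sketch of an indexer definition carrying the output field mappings above.
# The names are placeholders; edit the real definition in the portal's
# indexer JSON view.
indexer_definition = {
    'name': 'azureblob-indexer1',
    'dataSourceName': 'your-data-source',
    'targetIndexName': 'azureblob-index1',
    'outputFieldMappings': [
        {
            'sourceFieldName': '/document/content/pages/*/keyphrases/*',
            'targetFieldName': 'keyphrases'
        },
        {
            'sourceFieldName': '/document/content/pages/*',
            'targetFieldName': 'pages'
        }
    ]
}

# Each mapping routes an enriched value to a field in the target index.
targets = [m['targetFieldName'] for m in indexer_definition['outputFieldMappings']]
print(targets)  # ['keyphrases', 'pages']
```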
2. Create an Azure OpenAI
3. Set up the project and install the libraries
mkdir azure-proj
cd azure-proj
mkdir gpt-proj1
cd gpt-proj1
python -m venv .venv
.venv\Scripts\activate.bat
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install openai
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install langchain
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install faiss-cpu
(.venv) C:\Users\sms79\azure-proj\gpt-proj1>pip install tiktoken
4. Set up the project in VS Code
# Library imports
from collections import OrderedDict
import requests
# Langchain library imports
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
# Azure Search Service settings
SEARCH_SERVICE_NAME = 'your-search-service-name' # 'test-search-service1'
SEARCH_SERVICE_ENDPOINT = f'https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/'
SEARCH_SERVICE_KEY = 'your-search-service-key'
SEARCH_SERVICE_API_VERSION = 'your-API-version' # '2023-07-01-preview'
# Azure Search Service Index settings
SEARCH_SERVICE_INDEX_NAME1 = 'your-search-service-index-name' # 'azureblob-index1'
# Azure Cognitive Search Service Semantic configuration settings
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME = 'your-semantic-configuration-name' # 'test-configuration'
# Azure OpenAI settings
AZURE_OPENAI_NAME = 'your-openai-name' # 'testopenai1004'
AZURE_OPENAI_ENDPOINT = f'https://{AZURE_OPENAI_NAME.lower()}.openai.azure.com/'
AZURE_OPENAI_KEY = 'your-openai-key'
AZURE_OPENAI_API_VERSION = 'your-API-version' # '2023-08-01-preview'
5. Search with Azure Cognitive Search
# Configuration imports
from config import (
    SEARCH_SERVICE_ENDPOINT,
    SEARCH_SERVICE_KEY,
    SEARCH_SERVICE_API_VERSION,
    SEARCH_SERVICE_INDEX_NAME1,
    SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
    AZURE_OPENAI_ENDPOINT,
    AZURE_OPENAI_KEY,
    AZURE_OPENAI_API_VERSION,
)

# Cognitive Search Service header settings
HEADERS = {
    'Content-Type': 'application/json',
    'api-key': SEARCH_SERVICE_KEY
}

# Function to search documents using Azure Cognitive Search
def search_documents(question):
    url = (SEARCH_SERVICE_ENDPOINT + 'indexes/' +
           SEARCH_SERVICE_INDEX_NAME1 + '/docs')
    params = {
        'api-version': SEARCH_SERVICE_API_VERSION,
        'search': question,
        'select': '*',
        '$top': 3,
        'queryLanguage': 'en-us',
        'queryType': 'semantic',
        'semanticConfiguration': SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
        '$count': 'true',
        'speller': 'lexicon',
        'answers': 'extractive|count-3',
        'captions': 'extractive|highlight-false'
    }
    resp = requests.get(url, headers=HEADERS, params=params)
    return resp.json()
# Extract documents that score above a certain threshold in semantic search
def extract_documents(search_results):
    file_content = OrderedDict()
    for result in search_results['value']:
        # '@search.rerankerScore' ranges from 1.00 to 4.00; a higher score
        # indicates a stronger semantic match.
        if result['@search.rerankerScore'] > 1.5:
            file_content[result['metadata_storage_path']] = {
                'chunks': result['pages'][:10],
                'captions': result['@search.captions'][:10],
                'score': result['@search.rerankerScore'],
                'file_name': result['metadata_storage_name']
            }
    return file_content
def main():
    QUESTION = 'Tell me about effective prompting strategies'

    # Search for documents with Azure Cognitive Search
    search_results = search_documents(QUESTION)
    file_content = extract_documents(search_results)
    print('Total Documents Found: {}, Top Documents: {}'.format(
        search_results['@odata.count'], len(search_results['value'])))

    # 'chunks' corresponds to the 'pages' field that you set up in the
    # Cognitive Search index. Convert each chunk into a Document and count them.
    docs = []
    for value in file_content.values():
        for page in value['chunks']:
            docs.append(Document(page_content=page,
                                 metadata={'source': value['file_name']}))
    print('Number of chunks: ', len(docs))

# Execute the main function
if __name__ == '__main__':
    main()
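To see what `extract_documents` keeps without calling the live service, here is the same threshold-filtering logic run on a mocked response. The scores, paths, and chunk texts below are entirely made up:

```python
from collections import OrderedDict

# A mocked Cognitive Search response with two hits; only the first clears
# the 1.5 reranker-score threshold. All values are made up.
mock_results = {
    'value': [
        {
            '@search.rerankerScore': 2.7,
            'metadata_storage_path': 'https://acct.blob.core.windows.net/c/a.pdf',
            'metadata_storage_name': 'a.pdf',
            'pages': ['chunk one', 'chunk two'],
            '@search.captions': [{'text': 'caption one'}],
        },
        {
            '@search.rerankerScore': 1.1,
            'metadata_storage_path': 'https://acct.blob.core.windows.net/c/b.pdf',
            'metadata_storage_name': 'b.pdf',
            'pages': ['chunk three'],
            '@search.captions': [],
        },
    ]
}

# Same logic as extract_documents: keep only hits above the threshold.
file_content = OrderedDict()
for result in mock_results['value']:
    if result['@search.rerankerScore'] > 1.5:
        file_content[result['metadata_storage_path']] = {
            'chunks': result['pages'][:10],
            'captions': result['@search.captions'][:10],
            'score': result['@search.rerankerScore'],
            'file_name': result['metadata_storage_name'],
        }

print(len(file_content))  # 1 -- only a.pdf survives the threshold
```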
6. Get answers from PDF content using Azure OpenAI and Cognitive Search
Now that Azure Cognitive Search is working well in VS Code, it's time to start using
Azure OpenAI.
In this chapter, we'll create functions related to Azure OpenAI and ultimately create
and run a program in `main.py` that answers a question with Azure OpenAI based on
the search information from Azure Cognitive Search.
1. We will create functions related to Azure OpenAI and LangChain and run them from
the main function.
- Add the following functions above the main function.
# Function to create an embedding model
def create_embeddings():
    return OpenAIEmbeddings(
        openai_api_type='azure',
        openai_api_key=AZURE_OPENAI_KEY,
        openai_api_base=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_API_VERSION,
        deployment='text-embedding-ada-002',
        model='text-embedding-ada-002',
        chunk_size=1
    )

# Function to create a vector store
def create_vector_store(docs, embeddings):
    return FAISS.from_documents(docs, embeddings)
# Function to retrieve an answer using LangChain and ChatGPT
def search_with_langchain(vector_store, question):
    llm = AzureChatOpenAI(
        openai_api_key=AZURE_OPENAI_KEY,
        openai_api_base=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_API_VERSION,
        openai_api_type='azure',
        deployment_name='gpt-35-turbo',
        temperature=0.0,
        max_tokens=500
    )
    chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=vector_store.as_retriever(),
        return_source_documents=True
    )
    return chain({'question': question})
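A note on `chain_type='stuff'`: this chain type simply "stuffs" all retrieved chunks into a single prompt alongside the question, so it works well when the top chunks fit in the model's context window. The sketch below is only an illustration of that idea; LangChain's actual prompt template differs:

```python
# Illustrative only: roughly how a 'stuff'-style chain assembles its prompt.
# The sources and chunk texts here are examples, not LangChain internals.
retrieved = [
    ('Prompting GPT-3 To Be Reliable.pdf', 'Simple prompts improve reliability...'),
    ('Prompting GPT-3 To Be Reliable.pdf', 'Calibrating output probabilities...'),
]
question = 'Tell me about effective prompting strategies'

# Concatenate every retrieved chunk with its source label.
context = '\n\n'.join(
    'Content: {}\nSource: {}'.format(text, source) for source, text in retrieved
)
prompt = (
    'Answer the question using only the sources below, and cite them.\n\n'
    + context + '\n\nQuestion: ' + question
)

print(prompt)
```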
2. Add the code below to your main function.
    # Create an embedding model
    embeddings = create_embeddings()

    # Create a vector store
    vector_store = create_vector_store(docs, embeddings)

    # Get an answer using LangChain and ChatGPT
    result = search_with_langchain(vector_store, QUESTION)
    print('Question: ', QUESTION)
    print('Answer: ', result['answer'])
    print('Reference: ', result['sources'].replace(',', '\n'))
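For intuition before running it: `FAISS.from_documents` embeds each chunk with the embedding model and builds an index for nearest-neighbor search, and `as_retriever()` later pulls the chunks whose vectors lie closest to the question's vector. A toy pure-Python sketch of that idea, using made-up 2-D vectors in place of real ada-002 embeddings:

```python
# Toy nearest-neighbor search over made-up 2-D "embeddings".
# Real FAISS indexes high-dimensional ada-002 vectors, but the idea is the same.
docs = {
    'prompting chunk': [0.9, 0.1],
    'weather chunk':   [0.1, 0.9],
}

def dot(a, b):
    # Similarity as a dot product between two vectors.
    return sum(x * y for x, y in zip(a, b))

query_vec = [0.8, 0.2]  # pretend embedding of the question

# Pick the chunk most similar to the query (FAISS does this efficiently at scale).
best = max(docs, key=lambda name: dot(docs[name], query_vec))
print(best)  # prompting chunk
```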
3. Now let's run it and see if it answers your question.
- The result of executing the code:
```
Total Documents Found: 5, Top Documents: 3
Number of chunks: 10
Question: Tell me about effective prompting strategies
Answer: Effective prompting strategies for improving the reliability of GPT-3 include
establishing simple prompts that improve GPT-3's reliability in terms of generalizability,
social biases, calibration, and factuality. These strategies include prompting with
randomly sampled examples from the source domain, using examples sampled from
a balanced demographic distribution and natural language intervention to reduce
social biases, calibrating output probabilities, and updating the LLM's factual
knowledge and reasoning chains. Natural language intervention can also effectively
guide model predictions towards better fairness.
Reference: Prompting GPT-3 To Be Reliable.pdf
```
Note: Full code for example.py and config.py
config.py:

# Azure Search Service settings
SEARCH_SERVICE_NAME = 'your-search-service-name'  # 'test-search-service1'
SEARCH_SERVICE_ENDPOINT = f'https://{SEARCH_SERVICE_NAME.lower()}.search.windows.net/'
SEARCH_SERVICE_KEY = 'your-search-service-key'
SEARCH_SERVICE_API_VERSION = 'your-API-version'  # '2023-07-01-preview'

# Azure Search Service Index settings
SEARCH_SERVICE_INDEX_NAME1 = 'your-search-service-index-name'  # 'azureblob-index1'

# Azure Cognitive Search Service Semantic configuration settings
SEARCH_SERVICE_SEMANTIC_CONFIG_NAME = 'your-semantic-configuration-name'  # 'test-configuration'

# Azure OpenAI settings
AZURE_OPENAI_NAME = 'your-openai-name'  # 'testopenai1004'
AZURE_OPENAI_ENDPOINT = f'https://{AZURE_OPENAI_NAME.lower()}.openai.azure.com/'
AZURE_OPENAI_KEY = 'your-openai-key'
AZURE_OPENAI_API_VERSION = 'your-API-version'  # '2023-08-01-preview'

example.py:

# Library imports
from collections import OrderedDict
import requests

# LangChain library imports
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Configuration imports
from config import (
    SEARCH_SERVICE_ENDPOINT,
    SEARCH_SERVICE_KEY,
    SEARCH_SERVICE_API_VERSION,
    SEARCH_SERVICE_INDEX_NAME1,
    SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
    AZURE_OPENAI_ENDPOINT,
    AZURE_OPENAI_KEY,
    AZURE_OPENAI_API_VERSION,
)

# Cognitive Search Service header settings
HEADERS = {
    'Content-Type': 'application/json',
    'api-key': SEARCH_SERVICE_KEY
}

# Function to search documents using Azure Cognitive Search
def search_documents(question):
    url = (SEARCH_SERVICE_ENDPOINT + 'indexes/' +
           SEARCH_SERVICE_INDEX_NAME1 + '/docs')
    params = {
        'api-version': SEARCH_SERVICE_API_VERSION,
        'search': question,
        'select': '*',
        '$top': 3,
        'queryLanguage': 'en-us',
        'queryType': 'semantic',
        'semanticConfiguration': SEARCH_SERVICE_SEMANTIC_CONFIG_NAME,
        '$count': 'true',
        'speller': 'lexicon',
        'answers': 'extractive|count-3',
        'captions': 'extractive|highlight-false'
    }
    resp = requests.get(url, headers=HEADERS, params=params)
    return resp.json()

# Extract documents that score above a certain threshold in semantic search
def extract_documents(search_results):
    file_content = OrderedDict()
    for result in search_results['value']:
        # '@search.rerankerScore' ranges from 1.00 to 4.00; a higher score
        # indicates a stronger semantic match.
        if result['@search.rerankerScore'] > 1.5:
            file_content[result['metadata_storage_path']] = {
                'chunks': result['pages'][:10],
                'captions': result['@search.captions'][:10],
                'score': result['@search.rerankerScore'],
                'file_name': result['metadata_storage_name']
            }
    return file_content

# Function to create an embedding model
def create_embeddings():
    return OpenAIEmbeddings(
        openai_api_type='azure',
        openai_api_key=AZURE_OPENAI_KEY,
        openai_api_base=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_API_VERSION,
        deployment='text-embedding-ada-002',
        model='text-embedding-ada-002',
        chunk_size=1
    )

# Function to create a vector store
def create_vector_store(docs, embeddings):
    return FAISS.from_documents(docs, embeddings)

# Function to retrieve an answer using LangChain and ChatGPT
def search_with_langchain(vector_store, question):
    llm = AzureChatOpenAI(
        openai_api_key=AZURE_OPENAI_KEY,
        openai_api_base=AZURE_OPENAI_ENDPOINT,
        openai_api_version=AZURE_OPENAI_API_VERSION,
        openai_api_type='azure',
        deployment_name='gpt-35-turbo',
        temperature=0.0,
        max_tokens=500
    )
    chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type='stuff',
        retriever=vector_store.as_retriever(),
        return_source_documents=True
    )
    return chain({'question': question})

def main():
    QUESTION = 'Tell me about effective prompting strategies'

    # Search for documents with Azure Cognitive Search
    search_results = search_documents(QUESTION)
    file_content = extract_documents(search_results)
    print('Total Documents Found: {}, Top Documents: {}'.format(
        search_results['@odata.count'], len(search_results['value'])))

    # 'chunks' corresponds to the 'pages' field that you set up in the
    # Cognitive Search index. Convert each chunk into a Document and count them.
    docs = []
    for value in file_content.values():
        for page in value['chunks']:
            docs.append(Document(page_content=page,
                                 metadata={'source': value['file_name']}))
    print('Number of chunks: ', len(docs))

    # Create an embedding model
    embeddings = create_embeddings()

    # Create a vector store
    vector_store = create_vector_store(docs, embeddings)

    # Get an answer using LangChain and ChatGPT
    result = search_with_langchain(vector_store, QUESTION)
    print('Question: ', QUESTION)
    print('Answer: ', result['answer'])
    print('Reference: ', result['sources'].replace(',', '\n'))

# Execute the main function
if __name__ == '__main__':
    main()
Congratulations!
In this tutorial, we walked through integrating Azure Blob Storage, Azure Cognitive Search, and Azure OpenAI into a working search-and-answer pipeline.