Building Intelligent Applications with Local RAG in .NET and Phi-3: A Hands-On Guide


Hi!

 

In this blog post we will learn how to do Retrieval Augmented Generation (RAG) entirely with local resources in .NET! We’ll show you how to combine the Phi-3 language model, local embeddings, and Semantic Kernel to create a RAG scenario.

What is RAG?

Before we dive into the demo, let’s quickly recap what RAG is. RAG is a hybrid approach that enhances the capabilities of a language model by incorporating external knowledge. For example, with a RAG approach we can retrieve relevant documents from a knowledge base and use them to generate more informed and accurate responses. This is particularly useful in scenarios where an LLM needs up-to-date information or specific domain knowledge that isn’t contained in its original training data. The sketch below shows the core pattern.
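To make the flow concrete, here is a minimal sketch of the pattern in C#. SearchKnowledgeBaseAsync and CompleteAsync are hypothetical helpers standing in for any vector search and any LLM client; the real implementations in this post come later, via Semantic Kernel.

// Minimal RAG flow: retrieve, augment, generate.
// SearchKnowledgeBaseAsync and CompleteAsync are hypothetical placeholders.
async Task<string> AnswerWithRagAsync(string question)
{
    // 1. Retrieve: find the documents most semantically similar to the question
    IReadOnlyList<string> facts = await SearchKnowledgeBaseAsync(question, limit: 3);

    // 2. Augment: inject the retrieved facts into the prompt
    string prompt =
        $"Answer the question using only these facts:\n{string.Join("\n", facts)}\n" +
        $"Question: {question}";

    // 3. Generate: the model now answers from the provided context
    return await CompleteAsync(prompt);
}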

The Components

To build our RAG setup, we’ll be using the following components:

 

  1. Phi-3: Our local LLM, which is a powerful tool for generating human-like text. Check out the Phi-3 Cookbook for more details.

  2. Smart Components Local Embeddings: This package will help us create embeddings, which are numerical representations of text that capture its semantic meaning. You can learn more about it in the Smart Components Local Embeddings documentation.

  3. Semantic Kernel: This acts as the main orchestrator, integrating Phi-3 and Smart Components to create a seamless RAG pipeline. Visit the Semantic Kernel GitHub page for more information.
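If you want to follow along in your own console app, you will also need the corresponding NuGet packages. Package names can shift between preview releases, but at the time of writing the relevant ones are Microsoft.SemanticKernel, Microsoft.SemanticKernel.Plugins.Memory (prerelease), and SmartComponents.LocalEmbeddings.SemanticKernel, which supplies the AddLocalTextEmbeddingGeneration() extension used later in the sample.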

Demo Scenario

 

The demo scenario below is designed to answer a specific question, “What is Bruno’s favourite super hero?”, using two different approaches.

 

  • Ask the question directly to the Phi-3 model. The model will decline to answer: Phi-3 knows nothing about Bruno.
  • Ask the question to the Phi-3 model again, this time with a semantic memory object preloaded with fan facts. Now the response will be based on the semantic memory content.

This is the app running:

 

[Animated GIF: the RAG with Phi-3 demo app running]

Code Sample

Let’s jump to the code. The code below is a C# console application that demonstrates the use of a local model hosted in Ollama and semantic memory for search. 
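Before running it, make sure Ollama is installed and the phi3 model has been pulled (ollama pull phi3). By default Ollama listens on http://localhost:11434, which is exactly the endpoint the sample points at.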

Here’s a step-by-step breakdown of the program:

 

  1. The program starts by defining the question and announcing the two approaches it will use to answer it. The first approach is to ask the question directly to the Phi-3 model, and the second approach is to add facts to a semantic memory and ask the question again.
  2. The program creates the kernel using the Kernel.CreateBuilder() method. It adds chat completion backed by the local model and local text embedding generation to the builder, then builds the kernel.
  3. The program then asks the question directly to the Phi-3 model and prints the response.
  4. The program gets the embeddings generator service and creates a new semantic text memory with a volatile memory store and the embedding generator.
  5. The program adds facts to the memory collection. These facts are about Bruno and Gisela’s favourite super heroes and the last super hero movies they watched.
  6. The program creates a new text memory plugin with the semantic text memory and imports the plugin into the kernel.
  7. The program sets up the prompt execution settings and the kernel arguments, which include the question and the memory collection.
  8. Finally, the program asks the question again, this time using the semantic memory, and prints the response.
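One step deserves a closer look: when the TextMemoryPlugin is imported into the kernel (step 6), the memory’s similarity search becomes available as a kernel function, and the {{Recall}} token in the prompt template invokes it when the prompt is rendered. Combined with the collection argument pointing at fanFacts, this is what pulls the best-matching facts out of the semantic memory and splices them into the prompt before it ever reaches Phi-3.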

The program uses several external libraries, including:

  • Microsoft.Extensions.Configuration and Microsoft.Extensions.DependencyInjection for dependency injection and configuration.
  • Microsoft.KernelMemory, Microsoft.SemanticKernel, Microsoft.SemanticKernel.ChatCompletion, Microsoft.SemanticKernel.Connectors.OpenAI, Microsoft.SemanticKernel.Embeddings, Microsoft.SemanticKernel.Memory, and Microsoft.SemanticKernel.Plugins.Memory for the semantic kernel and memory functionalities.

This program is a great example of how AI can be used to answer questions using both direct model querying and semantic memory.

 

// Copyright (c) 2024
// Author : Bruno Capuano
// Change Log :
// - Sample console application to use a local model hosted in ollama and semantic memory for search
//
// The MIT License (MIT)
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in
// all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
// THE SOFTWARE.

#pragma warning disable SKEXP0001
#pragma warning disable SKEXP0003
#pragma warning disable SKEXP0010
#pragma warning disable SKEXP0011
#pragma warning disable SKEXP0050
#pragma warning disable SKEXP0052

using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.KernelMemory;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using Microsoft.SemanticKernel.Embeddings;
using Microsoft.SemanticKernel.Memory;
using Microsoft.SemanticKernel.Plugins.Memory;

var question = "What is Bruno's favourite super hero?";
Console.WriteLine($"This program will answer the following question: {question}");
Console.WriteLine("1st approach will be to ask the question directly to the Phi-3 model.");
Console.WriteLine("2nd approach will be to add facts to a semantic memory and ask the question again");
Console.WriteLine("");

// Create a chat completion service
var builder = Kernel.CreateBuilder();
builder.AddOpenAIChatCompletion(
    modelId: "phi3",
    endpoint: new Uri("http://localhost:11434"),
    apiKey: "apikey");
builder.AddLocalTextEmbeddingGeneration();
Kernel kernel = builder.Build();

Console.WriteLine($"Phi-3 response (no memory).");
var response = kernel.InvokePromptStreamingAsync(question);
await foreach (var result in response)
{
    Console.Write(result);
}

// separator
Console.WriteLine("");
Console.WriteLine("==============");
Console.WriteLine("");

// get the embeddings generator service
var embeddingGenerator = kernel.Services.GetRequiredService<ITextEmbeddingGenerationService>();
var memory = new SemanticTextMemory(new VolatileMemoryStore(), embeddingGenerator);

// add facts to the collection
const string MemoryCollectionName = "fanFacts";

await memory.SaveInformationAsync(MemoryCollectionName, id: "info1",
    text: "Gisela's favourite super hero is Batman");
await memory.SaveInformationAsync(MemoryCollectionName, id: "info2",
    text: "The last super hero movie watched by Gisela was Guardians of the Galaxy Vol 3");
await memory.SaveInformationAsync(MemoryCollectionName, id: "info3",
    text: "Bruno's favourite super hero is Invincible");
await memory.SaveInformationAsync(MemoryCollectionName, id: "info4",
    text: "The last super hero movie watched by Bruno was Aquaman II");
await memory.SaveInformationAsync(MemoryCollectionName, id: "info5",
    text: "Bruno don't like the super hero movie: Eternals");

TextMemoryPlugin memoryPlugin = new(memory);

// Import the text memory plugin into the Kernel.
kernel.ImportPluginFromObject(memoryPlugin);

OpenAIPromptExecutionSettings settings = new()
{
    ToolCallBehavior = ToolCallBehavior.AutoInvokeKernelFunctions,
};

var prompt = @"
    Question: {{$input}}
    Answer the question using the memory content: {{Recall}}";

var arguments = new KernelArguments(settings)
{
    { "input", question },
    { "collection", MemoryCollectionName }
};

Console.WriteLine($"Phi-3 response (using semantic memory).");
response = kernel.InvokePromptStreamingAsync(prompt, arguments);
await foreach (var result in response)
{
    Console.Write(result);
}

Console.WriteLine($"");

 

The full source code is available here: Program.cs.

Test This Scenario for Free Using Codespaces in the Phi-3 Cookbook

To help you get started with Phi-3 and experience its capabilities firsthand, we are thrilled to introduce support for Codespaces in the Phi-3 Cookbook.

 

The C# Ollama Labs let you test Phi-3 with C# samples directly in GitHub Codespaces, making it easy for anyone to try out Phi-3 entirely in the browser.

 

Check the guide here: Phi-3CookBook/md/07.Labs/CsharpOllamaCodeSpaces/CsharpOllamaCodeSpaces.md at main · microsoft/Phi-3CookBook (github.com)

 

Conclusion

Phi-3, local embeddings, and Semantic Kernel are a great combination for running RAG scenarios entirely locally.

And because Semantic Kernel abstracts the model connectors, it’s easy to later switch to Azure OpenAI Service and scale to enterprise level!
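As a rough sketch of what that switch looks like, only the connector registration changes; the memory, plugin, and prompt code stay the same. The deployment name, endpoint, and key below are placeholders for your own Azure OpenAI resource.

// Swap the local Ollama connector for Azure OpenAI.
// deploymentName, endpoint and apiKey below are placeholders.
var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion(
    deploymentName: "my-gpt-deployment",
    endpoint: "https://your-resource.openai.azure.com/",
    apiKey: "your-azure-openai-key");
builder.AddLocalTextEmbeddingGeneration(); // embeddings can stay local
Kernel kernel = builder.Build();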

 

Happy Coding!

Bruno Capuano

 
