Getting Started Using Phi-3-mini-4k-instruct-onnx for Text Generation with NLP Techniques

The Phi-3 mini models are small language models from Microsoft. The short context version, Phi-3-mini-4k-instruct, supports a context length of 4K tokens, while the long context version (Phi-3-mini-128k-instruct) accepts much longer prompts, up to 128K tokens.

In this tutorial, we will use the short context version of the Phi-3 ONNX models (Phi-3-mini-4k-instruct-onnx), downloading the model from Hugging Face.


Before we begin, it is important to install the Git Large File Storage (Git LFS) extension and the Hugging Face CLI. These tools are necessary for downloading the ONNX models. Additionally, this tutorial focuses on running the models on the CPU. If you have a GPU, you can use the DirectML or NVIDIA CUDA setup for optimal performance, depending on your operating system.

Setting up your Python Environment

Navigate to your project directory using the cd command.
For example:

cd path/to/your/project

Create a new virtual environment by running the following command:

python -m venv .venv

This will create a .venv directory in your project folder, containing an isolated Python environment.

Activate the virtual environment

On Windows:

.venv\Scripts\activate

On macOS/Linux:

source .venv/bin/activate

You’ll see the virtual environment name in your command prompt (e.g., (.venv)). Now you can install Python packages specific to your project without affecting the global Python installation. If you prefer a different name for the virtual environment, replace .venv in the commands above with your chosen name.
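
To confirm the virtual environment is active, you can check which Python interpreter is in use; the printed path should point inside your .venv folder:

python -c "import sys; print(sys.executable)"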

 

Run the Phi-3-mini-4k-instruct-onnx model on CPU, or on GPU with DirectML or NVIDIA CUDA


Prerequisites: Install Git Large File Storage (LFS) support

For Windows
First, install the prerequisites. You can use the winget tool to install and manage applications (see: Use the winget tool to install and manage applications | Microsoft Learn).

After App Installer is installed, you can run winget by typing 'winget' from a Command Prompt.

winget install -e --id GitHub.GitLFS

For macOS

brew install git-lfs

For Linux

apt-get install git-lfs

We now need to initialize Git LFS:

git lfs install
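
To confirm Git LFS is available, you can check its version:

git lfs version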

Deploying the Phi-3 model from Hugging Face 

Install the Hugging Face CLI

pip install huggingface-hub[cli]

Now we are going to download the Phi-3 model and run it on the device CPU.

Downloading Phi-3 from Hugging Face

Download the Phi-3-mini-4k-instruct-onnx model. Below is a batch script that downloads the version of the Phi-3 model you choose (CPU, CUDA, or DirectML). Save it with a .bat extension (e.g., download_phi3_model.bat) and run it:

@echo off
setlocal

REM Select which model to download
echo.
echo Choose an option:
echo 1. Download the Phi-3 Model for CPU
echo 2. Download the Phi-3 Model for Nvidia Cuda
echo 3. Download the Phi-3 Model for DirectML
set /p option=Enter the option number: 

if "%option%"=="1" (
    huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir .
) else if "%option%"=="2" (
    huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include cuda/cuda-int4-rtn-block-32/* --local-dir .
) else if "%option%"=="3" (
    huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx --include directml/* --local-dir .
) else (
    echo Invalid option. Please choose 1, 2, or 3.
)

endlocal

Choosing option 1 downloads the CPU model into a folder called cpu_and_mobile; options 2 and 3 download into the cuda and directml folders, respectively.
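
To confirm the files arrived, you can list the contents of the downloaded folder with a short Python snippet (this is only a sanity check; the exact file names may differ between model revisions):

import pathlib

# Print every file under the downloaded CPU model folder
for path in pathlib.Path("cpu_and_mobile").rglob("*"):
    if path.is_file():
        print(path)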


Below is a batch script that lets you select which ONNX Runtime package to install. Save this script with a .bat extension (e.g., install_onnx_runtime.bat) and run it:

@echo off
setlocal

REM Install the numpy library
pip install numpy

REM Pick which ONNX runtime to install
echo.
echo Choose an option:
echo 1. For CPU (onnxruntime-genai)
echo 2. For GPU (onnxruntime-genai-cuda)
echo 3. For DirectML (onnxruntime-genai-directml)
set /p option=Enter the option number: 

if "%option%"=="1" (
    pip install --pre onnxruntime-genai
) else if "%option%"=="2" (
    pip install --pre onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
) else if "%option%"=="3" (
    pip install --pre onnxruntime-genai-directml
) else (
    echo Invalid option. Please choose 1, 2, or 3.
)

endlocal
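
After the installation finishes, a quick way to verify the runtime is usable is to import it from Python; the import name onnxruntime_genai is the same one used by the script later in this post:

python -c "import onnxruntime_genai; print('onnxruntime-genai is installed')"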


Run the model using a Python script, with a switch to select which model to run. The core of the script is shown below; the complete, runnable version appears later in this post:

import onnxruntime_genai as og  
import argparse
import time

def main(args):
    # If verbose mode is on, print loading model message
    if args.verbose: print("Loading model...")
    
    # If timings mode is on, initialize timing variables
    if args.timings:
        started_timestamp = 0
        first_token_timestamp = 0

    # Load the model
    model = og.Model(f'{args.model}')
    if args.verbose: print("Model loaded")
    
    # Initialize the tokenizer with the model
    tokenizer = og.Tokenizer(model)
    tokenizer_stream = tokenizer.create_stream()
    if args.verbose: print("Tokenizer created")
    
    # Print a newline for readability if verbose mode is on
    if args.verbose: print()
    
    # Create a dictionary of search options from the command line arguments
    search_options = {name:getattr(args, name) for name in ['do_sample', 'max_length', 'min_length', 'top_p', 'top_k', 'temperature', 'repetition_penalty'] if name in args}
    
    # Set a default max length if one is not provided
    if 'max_length' not in search_options:
        search_options['max_length'] = 2048

    # Define a template for the chat input
    chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

    # Main loop: ask for input and generate responses
    while True:
        # Get user input
        text = input("Input: ")
        
        # If the input is empty, print an error message and continue to the next iteration
        if not text:
            print("Error, input cannot be empty")
            continue

        # If timings mode is on, record the start time
        if args.timings: started_timestamp = time.time()

        # Format the input with the chat template
        prompt = f'{chat_template.format(input=text)}'

        # Tokenize the input
        input_tokens = tokenizer.encode(prompt)

        # Set up the generator parameters
        params = og.GeneratorParams(model)
        params.try_use_cuda_graph_with_max_batch_size(1)
        params.set_search_options(**search_options)
        params.input_ids = input_tokens
        
        # Create the generator
        generator = og.Generator(model, params)
        if args.verbose: print("Generator created")

        # Print a message if verbose mode is on
        if args.verbose: print("Running generation loop ...")
        
        # If timings mode is on, initialize variables for the generation loop
        if args.timings:
            first = True
            new_tokens = []

        # Print the output prompt
        print()
        print("Output: ", end='', flush=True)​

Once you have installed the requirements for CPU, DirectML, or CUDA support, you can run the Python file above with the corresponding switch.

For CPU

python filename.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4

For DirectML

python filename.py -m directml\directml-int4-awq-block-128

For CUDA

python filename.py -m cuda/cuda-int4-rtn-block-32

Running the complete script

Below is the complete, runnable Python script based on the code above. Save it to a .py file and run it with python your_script_name.py, passing the path to the ONNX model folder you downloaded via --model (or -m).

import onnxruntime_genai as og
import argparse
import time

def main(args):
    # If verbose mode is on, print loading model message
    if args.verbose: print("Loading model...")
    
    # If timings mode is on, initialize timing variables
    if args.timings:
        started_timestamp = 0
        first_token_timestamp = 0

    # Load the model
    model = og.Model(f'{args.model}')
    if args.verbose: print("Model loaded")
    
    # Initialize the tokenizer with the model
    tokenizer = og.Tokenizer(model)
    tokenizer_stream = tokenizer.create_stream()
    if args.verbose: print("Tokenizer created")
    
    # Print a newline for readability if verbose mode is on
    if args.verbose: print()
    
    # Create a dictionary of search options from the command line arguments
    search_options = {name:getattr(args, name) for name in ['do_sample', 'max_length', 'min_length', 'top_p', 'top_k', 'temperature', 'repetition_penalty'] if name in args}
    
    # Set a default max length if one is not provided
    if 'max_length' not in search_options:
        search_options['max_length'] = 2048

    # Define a template for the chat input
    chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

    # Main loop: ask for input and generate responses
    while True:
        # Get user input
        text = input("Input: ")
        
        # If the input is empty, print an error message and continue to the next iteration
        if not text:
            print("Error, input cannot be empty")
            continue

        # If timings mode is on, record the start time
        if args.timings: started_timestamp = time.time()

        # Format the input with the chat template
        prompt = f'{chat_template.format(input=text)}'

        # Tokenize the input
        input_tokens = tokenizer.encode(prompt)

        # Set up the generator parameters
        params = og.GeneratorParams(model)
        params.try_use_cuda_graph_with_max_batch_size(1)
        params.set_search_options(**search_options)
        params.input_ids = input_tokens
        
        # Create the generator
        generator = og.Generator(model, params)
        if args.verbose: print("Generator created")

        # Print a message if verbose mode is on
        if args.verbose: print("Running generation loop ...")
        
        # If timings mode is on, initialize variables for the generation loop
        if args.timings:
            first = True
            new_tokens = []

        # Print the output prompt
        print()
        print("Output: ", end='', flush=True)

        # Token generation loop, following the standard onnxruntime-genai streaming pattern:
        # generate one token at a time and print it as soon as it is decoded
        try:
            while not generator.is_done():
                generator.compute_logits()
                generator.generate_next_token()

                # Record the time of the first generated token if timings mode is on
                if args.timings:
                    if first:
                        first_token_timestamp = time.time()
                        first = False

                new_token = generator.get_next_tokens()[0]
                print(tokenizer_stream.decode(new_token), end='', flush=True)
                if args.timings: new_tokens.append(new_token)
        except KeyboardInterrupt:
            print("  --Ctrl+C pressed, aborting generation--")
        print()

        # Release the generator before asking for the next prompt
        del generator

        # If timings mode is on, report prompt length and generation throughput
        if args.timings:
            prompt_time = first_token_timestamp - started_timestamp
            run_time = time.time() - first_token_timestamp
            print(f"Prompt length: {len(input_tokens)}, New tokens: {len(new_tokens)}, "
                  f"Time to first token: {prompt_time:.2f}s, "
                  f"New tokens per second: {len(new_tokens)/run_time:.2f} tps")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run the chatbot script")
    parser.add_argument("--model", type=str, required=True, help="Path to the ONNX model file")
    parser.add_argument("--verbose", action="store_true", help="Enable verbose mode")
    parser.add_argument("--timings", action="store_true", help="Enable timings mode")
    args = parser.parse_args()
    main(args)
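
For example, to run the CPU model with verbose output and timing statistics (both flags are defined by the script above):

python your_script_name.py -m cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 --verbose --timings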

In conclusion, the Phi-3 mini models are powerful AI tools for text generation using NLP techniques. These models can be run on a variety of devices, including GPUs and CPUs. By following the instructions in this tutorial, you can easily download and run these models on your own computer.
