Getting Started – Generative AI with Phi-3-mini: Running Phi-3-mini on an Intel AI PC

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

In 2024, AI is ushering in the era of the AI PC. On May 20, Microsoft introduced the concept of the Copilot+ PC, a PC that can run SLMs and LLMs more efficiently with the support of an NPU. We can combine models from the Phi-3 family with the new AI PCs to build a simple personalized Copilot application for individuals. This article uses an Intel AI PC together with Intel's OpenVINO, the Intel NPU Acceleration Library, and Microsoft's DirectML to create a local Copilot. An on-demand recording of the Microsoft Copilot+ PC event from May 20 is available.

Introduce Phi-3 Family

Phi-3-Mini is a Transformer-based language model with 3.8 billion parameters. It was trained on high-quality, educationally useful data, augmented with new data sources consisting of various synthetic NLP texts and both internal and external chat datasets, which significantly improves its chat capabilities. The model belongs to the Phi-3 family, and the Mini version comes in two variants, 4K and 128K, named for the context length (in tokens) each supports.

Phi-3-mini is a 3.8B parameter language model, available in two context lengths, 128K and 4K.

Phi-3-Small is a Transformer-based language model with 7 billion parameters. It was trained on high-quality, educationally useful data, augmented with new data sources consisting of various synthetic NLP texts and both internal and external chat datasets, which significantly improves its chat capabilities. Phi-3-Small is also trained more intensively on multilingual datasets than Phi-3-Mini. It comes in two variants, 8K and 128K, named for the context length (in tokens) each supports.

Phi-3-small is a 7B parameter language model, available in two context lengths, 128K and 8K.

Phi-3-Medium is a Transformer-based language model with 14 billion parameters. It was trained on high-quality, educationally useful data, augmented with new data sources consisting of various synthetic NLP texts and both internal and external chat datasets, which significantly improves its chat capabilities. It comes in two variants, 4K and 128K, named for the context length (in tokens) each supports.

Phi-3-medium is a 14B parameter language model, available in two context lengths, 128K and 4K.

Phi-3-Vision is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data for both text and vision. The model belongs to the Phi-3 model family, and the multimodal version supports a 128K context length (in tokens). The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.

Phi-3-vision is a 4.2B parameter multimodal model with language and vision capabilities.

Among these models, I personally recommend Phi-3-mini for the AI PC. Phi-3-small, Phi-3-vision, and Phi-3-medium are better suited to running on NVIDIA CUDA devices.
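To make the size difference concrete, here is a rough back-of-the-envelope sketch of the memory needed just to hold each model's weights. It uses the parameter counts quoted above and assumes 2 bytes per parameter for FP16 and 0.5 bytes for INT4. It ignores the KV cache, activations, and quantization metadata, so real usage will be higher. The helper name weight_memory_gib is purely illustrative.

```python
# Parameter counts as quoted in the model descriptions above.
PHI3_PARAMS = {
    "Phi-3-mini": 3.8e9,
    "Phi-3-vision": 4.2e9,
    "Phi-3-small": 7e9,
    "Phi-3-medium": 14e9,
}

# Assumed storage cost per parameter (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int4": 0.5}


def weight_memory_gib(num_params, precision):
    """Approximate weight storage in GiB for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


for name, params in PHI3_PARAMS.items():
    fp16 = weight_memory_gib(params, "fp16")
    int4 = weight_memory_gib(params, "int4")
    print(f"{name}: about {fp16:.1f} GiB in FP16, about {int4:.1f} GiB in INT4")
```

By this estimate, Phi-3-mini in INT4 needs under 2 GiB for its weights, which is one reason it fits comfortably in the shared memory of a typical AI PC.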

What is an NPU

An NPU (Neural Processing Unit) is a dedicated processor or processing unit on a larger SoC designed specifically for accelerating neural network operations and AI tasks. Unlike general-purpose CPUs and GPUs, NPUs are optimized for data-driven parallel computing, making them highly efficient at processing massive multimedia data like videos and images and at processing data for neural networks. They are particularly adept at handling AI-related tasks, such as speech recognition, background blurring in video calls, and photo or video editing processes like object detection.

NPU vs GPU

While many AI and machine learning workloads run on GPUs, there’s a crucial distinction between GPUs and NPUs. GPUs are known for their parallel computing capabilities, but not all GPUs are equally efficient beyond processing graphics. NPUs, on the other hand, are purpose-built for complex computations involved in neural network operations, making them highly effective for AI tasks.

In summary, NPUs are the math whizzes that turbocharge AI computations, and they play a key role in the emerging era of AI PCs!

This example is based on Intel’s latest Intel Core Ultra processor.

1. Use NPU to run Phi-3 model

Intel® NPU device is an AI inference accelerator integrated with Intel client CPUs, starting from Intel® Core™ Ultra generation of CPUs (formerly known as Meteor Lake). It enables energy-efficient execution of artificial neural network tasks.

Intel NPU Acceleration Library

The Intel NPU Acceleration Library (https://github.com/intel/intel-npu-acceleration-library) is a Python library designed to boost the efficiency of your applications by leveraging the power of the Intel Neural Processing Unit (NPU) to perform high-speed computations on compatible hardware.

Install the Python Library with pip


   pip install intel-npu-acceleration-library

Note: The project is still under development, but the reference model is already quite complete.

Running Phi-3 with Intel NPU Acceleration Library

Using Intel NPU acceleration through this library does not change the usual coding workflow. You only need to use the library to quantize the original Phi-3 model to a format such as FP16 or INT4:


from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import intel_npu_acceleration_library
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"

# Load the original model and tokenizer from Hugging Face
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", use_cache=True, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("Compile model for the NPU")
# Compile the model for the NPU in FP16
model = intel_npu_acceleration_library.compile(model, dtype=torch.float16)

After quantization succeeds, continue by calling the NPU to run the Phi-3 model.


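# Wrap the compiled model in a text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy decoding: sampling is disabled, so the output is deterministic
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

# Phi-3 chat format: system and user turns, then the assistant tag
query = "<|system|>You are a helpful AI assistant.<|end|><|user|>Can you introduce yourself?<|end|><|assistant|>"

output = pipe(query, **generation_args)

print(output[0]['generated_text'])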
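The query string above is written by hand in the Phi-3 chat format. As a small sketch, the same string can be assembled from a list of role/content messages. The helper name build_phi3_prompt is hypothetical and simply mirrors the tag layout used in the query above; it is not part of any library.

```python
def build_phi3_prompt(messages):
    """Join chat messages into a Phi-3 style prompt string.

    Each message is a dict with a "role" ("system" or "user") and "content".
    The trailing <|assistant|> tag asks the model to produce the next reply.
    """
    parts = [f"<|{m['role']}|>{m['content']}<|end|>" for m in messages]
    return "".join(parts) + "<|assistant|>"


messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you introduce yourself?"},
]
query = build_phi3_prompt(messages)
print(query)
```

Building the prompt from a message list makes it easier to keep a running conversation history for a Copilot-style application.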

When executing the code, we can view the running status of the NPU through Task Manager.

2. Use DirectML + ONNX Runtime to run Phi-3 Model

What is DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.

When used standalone, the DirectML API is a low-level DirectX 12 library and is suitable for high-performance, low-latency applications such as frameworks, games, and other real-time applications. The seamless interoperability of DirectML with Direct3D 12 as well as its low overhead and conformance across hardware makes DirectML ideal for accelerating machine learning when both high performance is desired, and the reliability and predictability of results across hardware is critical.

Note: The latest DirectML already supports NPUs (https://devblogs.microsoft.com/directx/introducing-neural-processor-unit-npu-support-in-directml-developer-preview/).

Comparing DirectML and CUDA in terms of their capabilities and performance:

DirectML is a machine learning library developed by Microsoft. It is designed to accelerate machine learning workloads on Windows devices, including desktops, laptops, and edge devices.

DX12-Based: DirectML is built on top of DirectX 12 (DX12), which provides a wide range of hardware support across GPUs, including both NVIDIA and AMD.
Wider Support: Since it leverages DX12, DirectML can work with any GPU that supports DX12, even integrated GPUs.
Image Processing: DirectML processes images and other data using neural networks, making it suitable for tasks like image recognition, object detection, and more.
Ease of Setup: Setting up DirectML is straightforward, and it doesn’t require specific SDKs or libraries from GPU manufacturers.
Performance: In some cases, DirectML performs well and can be faster than CUDA, especially for certain workloads.
Limitations: However, there are instances where DirectML may be slower, particularly for large float16 batch sizes.

CUDA is NVIDIA’s parallel computing platform and programming model. It allows developers to harness the power of NVIDIA GPUs for general-purpose computing, including machine learning and scientific simulations.

NVIDIA-Specific: CUDA is tightly integrated with NVIDIA GPUs and is specifically designed for them.
Highly Optimized: It provides excellent performance for GPU-accelerated tasks, especially when using NVIDIA GPUs.
Widely Used: Many machine learning frameworks and libraries (such as TensorFlow and PyTorch) have CUDA support.
Customization: Developers can fine-tune CUDA settings for specific tasks, which can lead to optimal performance.
Limitations: However, CUDA’s dependency on NVIDIA hardware can be limiting if you want broader compatibility across different GPUs.

Choosing Between DirectML and CUDA:

The choice between DirectML and CUDA depends on your specific use case, hardware availability, and preferences. If you’re looking for broader compatibility and ease of setup, DirectML might be a good choice. However, if you have NVIDIA GPUs and need highly optimized performance, CUDA remains a strong contender. In summary, both DirectML and CUDA have their strengths and weaknesses, so consider your requirements and available hardware when making a decision.
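With ONNX Runtime, this choice can be made at run time instead of being hard-coded. The sketch below picks the first available execution provider from a preference list. The provider names are the ones ONNX Runtime uses for CUDA, DirectML, and CPU; the preference order and the helper name pick_provider are assumptions for illustration only.

```python
# Assumed preference order: CUDA if present, then DirectML, then plain CPU.
PREFERENCE = (
    "CUDAExecutionProvider",
    "DmlExecutionProvider",
    "CPUExecutionProvider",
)


def pick_provider(available, preference=PREFERENCE):
    """Return the first preferred provider found in the available list."""
    for name in preference:
        if name in available:
            return name
    raise RuntimeError(f"None of {preference} found in {list(available)}")


# Example with a hard-coded list, as an Intel AI PC with DirectML might report:
print(pick_provider(["DmlExecutionProvider", "CPUExecutionProvider"]))
```

In a real application the available list would come from onnxruntime.get_available_providers(), and the result would be passed in the providers argument of onnxruntime.InferenceSession.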

Generative AI with ONNX Runtime

In the era of AI, the portability of AI models is very important. ONNX Runtime makes it easy to deploy trained models to different devices. Developers do not need to worry about the underlying inference framework and can use a unified API to run model inference. For generative AI, ONNX Runtime has also been optimized at the code level (https://onnxruntime.ai/docs/genai/). With the optimized ONNX Runtime, quantized generative AI models can run inference on different devices. Generative AI with ONNX Runtime exposes its inference API through Python, C#, and C/C++. Deployment on iPhone can also take advantage of the C++ API of Generative AI with ONNX Runtime.

Compile the Generative AI with ONNX Runtime library


winget install --id=Kitware.CMake  -e

git clone https://github.com/microsoft/onnxruntime.git

cd .\onnxruntime\

./build.bat --build_shared_lib --skip_tests --parallel --use_dml --config Release

cd ../

git clone https://github.com/microsoft/onnxruntime-genai.git

cd .\onnxruntime-genai\

mkdir ort

mkdir ort\include

mkdir ort\lib

copy ..\onnxruntime\include\onnxruntime\core\providers\dml\dml_provider_factory.h ort\include

copy ..\onnxruntime\include\onnxruntime\core\session\onnxruntime_c_api.h ort\include

copy ..\onnxruntime\build\Windows\Release\Release\*.dll ort\lib

copy ..\onnxruntime\build\Windows\Release\Release\onnxruntime.lib ort\lib

python build.py --use_dml

Install library


pip install .\onnxruntime_genai_directml-0.3.0.dev0-cp310-cp310-win_amd64.whl
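Once the wheel is installed, a run might look like the sketch below. It follows the token-by-token generator loop of onnxruntime-genai 0.3.x, the version built above; later releases changed this API, so treat it as tied to that version. The model folder is a hypothetical path to a DirectML-optimized Phi-3-mini ONNX model, and stream_reply is an illustrative helper name.

```python
def stream_reply(model_dir, user_text, max_length=512):
    """Stream a Phi-3 reply token by token with onnxruntime-genai 0.3.x."""
    # Imported lazily so this file still loads where the wheel is missing.
    import onnxruntime_genai as og

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    tokenizer_stream = tokenizer.create_stream()

    # Phi-3 chat format for a single user turn.
    prompt = f"<|user|>\n{user_text} <|end|>\n<|assistant|>"

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    params.input_ids = tokenizer.encode(prompt)

    generator = og.Generator(model, params)
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        new_token = generator.get_next_tokens()[0]
        # Decode incrementally so text appears as it is generated.
        print(tokenizer_stream.decode(new_token), end="", flush=True)
    print()


# Hypothetical usage; replace the path with your own model folder:
# stream_reply(r".\phi3-mini-4k-directml-int4", "Can you introduce yourself?")
```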

This is the running result.

3. Use Intel OpenVINO to run Phi-3 Model

What is OpenVINO

OpenVINO is an open-source toolkit for optimizing and deploying deep learning models. It provides boosted deep learning performance for vision, audio, and language models from popular frameworks like TensorFlow, PyTorch, and more. OpenVINO can also be used in combination with the CPU and GPU to run the Phi-3 model.

Note: At the time of writing, OpenVINO does not support the NPU.

Install OpenVINO Library


 pip install git+https://github.com/huggingface/optimum-intel.git

 pip install git+https://github.com/openvinotoolkit/nncf.git

 pip install openvino-nightly

Running Phi-3 with OpenVINO

As with the NPU approach, OpenVINO runs generative AI models in quantized form. We first need to quantize the Phi-3 model, which can be done on the command line with optimum-cli.

INT4


optimum-cli export openvino --model "microsoft/Phi-3-mini-4k-instruct" --task text-generation-with-past --weight-format int4 --group-size 128 --ratio 0.6  --sym  --trust-remote-code ./openvinomodel/phi3/int4

FP16


optimum-cli export openvino --model "microsoft/Phi-3-mini-4k-instruct" --task text-generation-with-past --weight-format fp16 --trust-remote-code ./openvinomodel/phi3/fp16
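The INT4 command above uses symmetric quantization (--sym) with a group size of 128, meaning each block of 128 weights shares one scale. The sketch below illustrates that idea in plain Python. It is a simplified teaching example under those assumptions, not the algorithm NNCF actually runs, and the function names are hypothetical.

```python
def quantize_group_sym_int4(weights, group_size=128):
    """Symmetric per-group quantization to the signed 4-bit range [-8, 7].

    Returns a list of (scale, quantized_values) pairs, one per group.
    """
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # One shared scale per group; fall back to 1.0 for all-zero groups.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        q = [max(-8, min(7, round(w / scale))) for w in group]
        groups.append((scale, q))
    return groups


def dequantize(groups):
    """Rebuild approximate float weights from (scale, values) groups."""
    return [scale * q for scale, qs in groups for q in qs]


weights = [0.70, -0.35, 0.02, 0.10, -1.40, 0.90, 0.00, 0.33]
groups = quantize_group_sym_int4(weights, group_size=4)
restored = dequantize(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"largest reconstruction error: {max_error:.4f}")
```

Smaller groups track the weights more closely at the cost of storing more scales. The --ratio 0.6 flag additionally controls how much of the model is compressed to INT4, with the remainder kept at higher precision.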

The converted model looks like this.

Load the model through OVModelForCausalLM, passing the model path (model_dir), the related configuration (ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}), and the hardware-accelerated device (GPU.0):


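from transformers import AutoConfig
from optimum.intel.openvino import OVModelForCausalLM

# Output directory of the INT4 export above
model_dir = "./openvinomodel/phi3/int4"
ov_config = {"PERFORMANCE_HINT": "LATENCY", "NUM_STREAMS": "1", "CACHE_DIR": ""}

ov_model = OVModelForCausalLM.from_pretrained(
     model_dir,
     device='GPU.0',
     ov_config=ov_config,
     config=AutoConfig.from_pretrained(model_dir, trust_remote_code=True),
     trust_remote_code=True,
)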
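With the model loaded, text generation follows the usual Hugging Face pattern of tokenize, generate, decode. The helper below is a minimal sketch of that pattern. The name generate_reply is hypothetical, and it assumes a tokenizer loaded from the same model_dir with AutoTokenizer.from_pretrained.

```python
def generate_reply(model, tokenizer, prompt, max_new_tokens=128):
    """Generate a reply and return only the newly produced text."""
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = len(inputs["input_ids"][0])
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the model's answer is decoded.
    new_tokens = output_ids[0][prompt_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Hypothetical usage with the ov_model loaded above:
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained(model_dir)
# prompt = "<|user|>Can you introduce yourself?<|end|><|assistant|>"
# print(generate_reply(ov_model, tok, prompt))
```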

When executing the code, we can view the running status of the GPU through Task Manager.

Note: Each of the three methods above has its own advantages, but NPU acceleration is recommended for AI PC inference.

Resources

Phi-3 Microsoft Blog https://aka.ms/phi3blog-april
Phi-3 technical report https://aka.ms/phi3-tech-report
Phi-3 Cookbook https://aka.ms/Phi-3CookBook