Bringing GenAI Offline: running SLMs like Phi-2/Phi-3 and Whisper Models on Mobile Devices

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

In today's digitally interconnected landscape, language models stand at the forefront of technological innovation, reshaping the way we engage with various platforms and applications. These sophisticated algorithms have become indispensable tools in tasks ranging from text generation to natural language processing, driving efficiency and productivity across diverse sectors.

Yet, the reliance on cloud-based solutions presents a notable obstacle in certain contexts. In environments characterized by limited internet connectivity or stringent data privacy regulations, accessing cloud services may prove impractical or even impossible. This dependency on external servers introduces latency issues, security concerns, and operational challenges that hinder the seamless integration of language models into everyday workflows.

Enter the solution: running language models offline. By bringing the computational power of sophisticated models like Phi-2/Phi-3 and Whisper directly to mobile devices, this approach circumvents the constraints of cloud reliance, empowering users to leverage advanced language processing capabilities irrespective of connectivity status.

In this blog, we delve into the significance of enabling offline capabilities for LLMs and explore the practicalities of running SLMs on mobile devices, offering insights into the transformative potential of this technology.

How are LLMs deployed today?

In a typical Large Language Model (LLM) deployment scenario, the LLM is hosted on public cloud infrastructure such as Microsoft Azure, using tools like Azure Machine Learning, and exposed as an API endpoint. This API serves as the interface through which external applications, such as web applications and mobile apps on Android and iOS devices, interact with the LLM to perform natural language processing tasks. When a user initiates a request through the mobile app, the app sends the input data to the API endpoint, specifying the desired task, such as text generation or sentiment analysis.

The API processes the request, utilizing the LLM to perform the required task, and returns the result to the mobile app. This architecture enables seamless integration of LLM capabilities into mobile applications, allowing users to leverage advanced language processing functionalities directly from their devices while offloading the computational burden to the cloud infrastructure.
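
As a rough illustration of the cloud round trip described above, the sketch below packages a task and its input as a JSON POST request. The endpoint URL, the API key, and the payload fields ("task", "input") are all hypothetical placeholders and do not reflect the schema of any particular Azure service.

```python
import json
import urllib.request

# Hypothetical endpoint and key, for illustration only; a real deployment
# would supply its own scoring URL and credentials.
ENDPOINT_URL = "https://example.invalid/score"
API_KEY = "<your-api-key>"


def build_request(task: str, text: str) -> urllib.request.Request:
    """Package the task and input text as a JSON POST request."""
    payload = json.dumps({"task": task, "input": text}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


request = build_request("text-generation", "Summarize today's safety briefing.")
# Sending the request needs network access, which is exactly the dependency
# that on-device inference removes:
# with urllib.request.urlopen(request) as response:
#     result = json.loads(response.read())
```

Every call of this kind depends on the network being reachable, which is the constraint the rest of this post works around.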

Relying on internet connectivity limits where users can reach their safety copilot: remote sites, basements, and underground facilities often have no coverage. To remove that limitation while safeguarding privacy, the optimal solution is to run Large Language Models (LLMs) on-device, offline. By deploying LLMs directly on users' devices, such as mobile phones and tablets, we eliminate the need for continuous internet access and the associated back-and-forth communication with remote servers. This approach empowers users to access their safety copilot anytime, anywhere, without dependency on network connectivity.

What are Small Language Models (SLMs)?

Small Language Models (SLMs) represent a focused subset of artificial intelligence tailored for specific enterprise needs within Natural Language Processing (NLP). Unlike their larger counterparts like GPT-4, SLMs prioritize efficiency and precision over sheer computational power. They are trained on domain-specific datasets, enabling them to navigate industry-specific terminologies and nuances with accuracy. In contrast to Large Language Models (LLMs), which may lack customization for enterprise contexts, SLMs offer targeted, actionable insights while minimizing inaccuracies and the risk of generating irrelevant information. SLMs are characterized by their compact architecture, lower computational demands, and enhanced security features, making them cost-effective and adaptable for real-time applications like chatbots. Overall, SLMs provide tailored efficiency, enhanced security, and lower latency, addressing specific business needs effectively while offering a promising alternative to the broader capabilities of LLMs.

Small Language Models (SLMs) offer enterprises control and customization, efficient resource usage, effective performance, swift training and inference, and resource-efficient deployment. They scale easily, adapt to specific domains, facilitate rapid prototyping, enhance security, and provide transparency. Their limitations are clearly defined and they are cost-efficient, making them an attractive option for businesses seeking AI capabilities without extensive resource investment.


Why is running SLMs offline at the edge a challenge?

Running small language models (SLMs) offline on mobile phones enhances privacy, reduces latency, and broadens access. Users can interact with LLM-based applications, receive critical information, and perform tasks even in offline environments, ensuring accessibility and control over personal data. Real-time performance and independence from centralized infrastructure unlock new opportunities for innovation in mobile computing, offering a seamless and responsive user experience. However, running SLMs offline on mobile phones presents several challenges due to the constraints of mobile hardware and the complexities of running LLM tasks. Here are some key challenges:

Limited Processing Power: Mobile devices, especially smartphones, have limited computational resources compared to desktop computers or servers. SLMs often require significant processing power to execute tasks such as text generation or sentiment analysis, which can strain the capabilities of mobile CPUs and GPUs.
Memory Constraints: SLMs typically require a significant amount of memory to store model parameters and intermediate computations. Mobile devices have limited RAM compared to desktops or servers, making it challenging to load and run large language models efficiently.
Battery Life Concerns: Running resource-intensive tasks like NLP on mobile devices can drain battery life quickly. Optimizing SLMs for energy efficiency is crucial to ensure that offline usage remains practical without significantly impacting battery performance.
Storage Limitations: Storing large language models on mobile devices can be problematic due to limited storage space. Balancing the size of the model with the available storage capacity while maintaining performance is a significant challenge.
Update and Maintenance: Keeping SLMs up to date with the latest improvements and security patches presents challenges for offline deployment on mobile devices. Ensuring seamless updates while minimizing data usage and user inconvenience requires careful planning and implementation.
Real-Time Performance: Users expect responsive performance from mobile applications, even when running complex NLP tasks offline. Optimizing SLMs for real-time inference on mobile devices is crucial to provide a smooth user experience.
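
To put the memory and storage constraints above into numbers, here is a back-of-the-envelope sketch that estimates the size of a model's weights at different numeric precisions. It assumes roughly 2.7 billion parameters for Phi-2 and counts the weights only; activations, the KV cache, and runtime overhead are ignored, so real usage will be higher.

```python
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_size_gib(num_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in GiB (no activations or KV cache)."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 3)


PHI2_PARAMS = 2.7e9  # assumption: Phi-2 has roughly 2.7 billion parameters

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_size_gib(PHI2_PARAMS, dtype):.2f} GiB")
```

Under these assumptions the weights come to roughly 10 GiB at float32 and about 1.3 GiB at 4 bits, which is why reduced precision matters so much on phones.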

How to deploy SLMs on mobile devices?

Deploying Large Language Models (LLMs) on mobile devices involves a sophisticated integration of MediaPipe and WebAssembly technologies to optimize performance and efficiency. MediaPipe, renowned for its on-device ML capabilities, provides a robust framework for running LLMs entirely on mobile devices, thereby eliminating the need for constant network connectivity and for offloading computation to remote servers. With the experimental MediaPipe LLM Inference API, developers can seamlessly integrate popular LLMs like Gemma, Phi-2, Falcon, and Stable LM into their mobile applications. This breakthrough is facilitated by a series of optimizations across the on-device stack, including the integration of new operations, quantization techniques, caching mechanisms, and weight sharing strategies. MediaPipe leverages WebAssembly (Wasm) to further enhance the deployment of LLMs on mobile devices.

Wasm's compact binary format and compatibility with multiple programming languages ensure efficient execution of non-JavaScript code within the mobile environment. By time-slicing GPU access and ensuring platform neutrality, Wasm optimizes GPU usage and facilitates seamless deployment across diverse hardware environments, thus enhancing the performance of LLMs on mobile devices. Additionally, advances such as the WebAssembly System Interface for Neural Networks (WASI-NN) standard enhance Wasm's capabilities, promising a future where it plays a pivotal role in democratizing access to AI-grade compute power on mobile devices. Through the combined use of MediaPipe and WebAssembly, developers can deploy LLMs on mobile devices with greater efficiency and performance across various platforms.
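
Quantization is one of the optimizations mentioned above. The sketch below shows the general idea with a simple symmetric 8-bit scheme that uses a single scale per tensor. It is an illustration only and is not claimed to be the scheme MediaPipe uses internally; the function names are made up for this example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.05, 0.0, 0.61]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half the scale per weight.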

MediaPipe's LLM Inference API lets you run large language models (LLMs) directly on your Android device. With this tool, you can execute various tasks like text generation, natural language information retrieval, and document summarization without relying on external servers. It integrates with multiple text-to-text LLMs, enabling you to use generative AI models within your Android applications, with support for popular SLMs like Phi-2, Gemma, Falcon-RW-1B, and StableLM-3B.


The LLM Inference API uses the com.google.mediapipe:tasks-genai library. Add this dependency to the build.gradle file of your Android app:

dependencies {
    implementation 'com.google.mediapipe:tasks-genai:0.10.11'
}

Convert model to MediaPipe format

The model conversion process requires the MediaPipe PyPI package. The conversion script is available in all MediaPipe packages after 0.10.11.

Install and import the dependencies with the following:

$ python3 -m pip install mediapipe

Use the genai.converter library to convert the model:

import mediapipe as mp
from mediapipe.tasks.python.genai import converter

def phi2_convert_config(backend):
    # Build the conversion settings for Phi-2; backend is 'cpu' or 'gpu'
    return converter.ConversionConfig(
        input_ckpt='/content/phi-2', ckpt_format='safetensors', model_type='PHI_2',
        backend=backend, output_dir='/content/intermediate/phi-2/',
        combine_file_only=False, vocab_model_file='/content/phi-2/',
        output_tflite_file=f'/content/converted_models/phi2_{backend}.bin')

# Run the conversion; writes the file named in output_tflite_file
converter.convert_checkpoint(phi2_convert_config('cpu'))

Parameter	Description	Accepted Values
`input_ckpt`	The path to the `model.safetensors` or `pytorch.bin` file. Note that safetensors checkpoints are sometimes sharded into multiple files, e.g. `model-00001-of-00003.safetensors`, `model-00002-of-00003.safetensors`. You can specify a file pattern, like `model*.safetensors`.	PATH
`ckpt_format`	The model file format.	{"safetensors", "pytorch"}
`model_type`	The LLM being converted.	{"PHI_2", "FALCON_RW_1B", "STABLELM_4E1T_3B", "GEMMA_2B"}
`backend`	The processor (delegate) used to run the model.	{"cpu", "gpu"}
`output_dir`	The path to the output directory that hosts the per-layer weight files.	PATH
`output_tflite_file`	The path to the output file. For example, "model_cpu.bin" or "model_gpu.bin". This file is only compatible with the LLM Inference API, and cannot be used as a general `tflite` file.	PATH
`vocab_model_file`	The path to the directory that stores the `tokenizer.json` and `tokenizer_config.json` files. For Gemma, point to the single `tokenizer.model` file.	PATH
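
Before starting a long conversion, it can help to check that a file pattern such as `model*.safetensors` really matches the checkpoint shards. The helper below is a hypothetical convenience function, not part of MediaPipe, and the demonstration uses empty placeholder files in a temporary directory rather than a real checkpoint.

```python
import glob
import os
import tempfile


def find_checkpoint_shards(model_dir: str, pattern: str = "model*.safetensors"):
    """Return the sorted checkpoint files in model_dir that match the pattern."""
    return sorted(glob.glob(os.path.join(model_dir, pattern)))


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as model_dir:
    for name in ("model-00001-of-00002.safetensors",
                 "model-00002-of-00002.safetensors",
                 "tokenizer.json"):
        open(os.path.join(model_dir, name), "w").close()
    shard_names = [os.path.basename(p) for p in find_checkpoint_shards(model_dir)]

print(shard_names)
```

An empty result is a quick signal that the path or pattern passed as `input_ckpt` needs another look.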

Push model to the device

Push the converted model file (the file written to output_tflite_file during conversion) to the Android device.

$ adb shell rm -r /data/local/tmp/llm/ # Remove any previously loaded models
$ adb shell mkdir -p /data/local/tmp/llm/
$ adb push model.bin /data/local/tmp/llm/model_phi2.bin
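
If you redeploy models often, the same three steps can be scripted. The sketch below only builds and prints the adb commands so that it runs without a device attached; the function name is hypothetical, and the file and directory names are taken from the commands above.

```python
import posixpath
import shlex


def build_push_commands(local_model: str, device_dir: str, device_name: str):
    """Return the adb commands (as argument lists) that replace the model on the device."""
    target = posixpath.join(device_dir, device_name)
    return [
        ["adb", "shell", "rm", "-r", device_dir],     # remove previously loaded models
        ["adb", "shell", "mkdir", "-p", device_dir],  # recreate the directory
        ["adb", "push", local_model, target],         # copy the converted model
    ]


commands = build_push_commands("model.bin", "/data/local/tmp/llm/", "model_phi2.bin")
for command in commands:
    print(shlex.join(command))
    # To execute for real (requires adb and a connected device), import
    # subprocess and call: subprocess.run(command, check=True)
```

Keeping the commands as argument lists avoids shell quoting problems if a model file name ever contains spaces.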

Create the task

The MediaPipe LLM Inference API uses the createFromOptions() function to set up the task. The createFromOptions() function accepts values for the configuration options. For more information on configuration options, see Configuration options.

The following code initializes the task using basic configuration options:

// Set the configuration options for the LLM Inference task
val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/.../")
        .setMaxTokens(1000)
        .setTopK(40)
        .setTemperature(0.8f)
        .setRandomSeed(101)
        .build()

// Create an instance of the LLM Inference task
val llmInference = LlmInference.createFromOptions(context, options)

Configuration options

Use the following configuration options to set up an Android app:

Option Name	Description	Value Range	Default Value
`modelPath`	The path to where the model is stored within the project directory.	PATH	N/A
`maxTokens`	The maximum number of tokens (input tokens + output tokens) the model handles.	Integer	512
`topK`	The number of tokens the model considers at each step of generation. Limits predictions to the top k most-probable tokens. When setting `topK`, you must also set a value for `randomSeed`.	Integer	40
`temperature`	The amount of randomness introduced during generation. A higher temperature results in more creativity in the generated text, while a lower temperature produces more predictable generation. When setting `temperature`, you must also set a value for `randomSeed`.	Float	0.8
`randomSeed`	The random seed used during text generation.	Integer	0
`resultListener`	Sets the result listener to receive the results asynchronously. Only applicable when using the async generation method.	N/A	N/A
`errorListener`	Sets an optional error listener.	N/A	N/A
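
To make the interaction between `topK`, `temperature`, and `randomSeed` more concrete, here is a small sketch of top-k sampling with temperature scaling. It illustrates the general technique and is not MediaPipe's internal sampler; the logits are made-up scores for a four-token vocabulary and the function name is hypothetical.

```python
import math
import random


def sample_top_k(logits, top_k, temperature, rng):
    """Pick a token index from the top_k highest logits, after temperature scaling."""
    # Keep only the top_k most likely candidates.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature-scaled softmax weights (shifted by the max for numerical stability).
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(101)        # fixed seed, so repeated runs give the same picks
picks = [sample_top_k(logits, top_k=2, temperature=0.8, rng=rng) for _ in range(10)]
print(picks)
```

With `top_k=2` only the two highest-scoring tokens can ever be chosen, a lower temperature shifts more probability onto the top one, and the fixed seed makes the sequence of picks reproducible.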

Prepare data

The LLM Inference API accepts the following inputs:

prompt (string): A question or prompt.

val inputPrompt = "Compose an email to remind Brett of lunch plans at noon on Saturday."

Run the task

Use the generateResponse() method to generate a text response to the input text provided in the previous section (inputPrompt). This produces a single generated response.

val result = llmInference.generateResponse(inputPrompt)
logger.atInfo().log("result: $result")

To stream the response, use the generateResponseAsync() method.

val options = LlmInference.LlmInferenceOptions.builder()
  ...
  .setResultListener { partialResult, done ->
    logger.atInfo().log("partial result: $partialResult")
  }
  .build()

llmInference.generateResponseAsync(inputPrompt)

Handle and display results

The LLM Inference API returns an LlmInferenceResult, which includes the generated response text. For the prompt above, the response might look like this:

Here's a draft you can use:

Subject: Lunch on Saturday Reminder

Hi Brett,

Just a quick reminder about our lunch plans this Saturday at noon.
Let me know if that still works for you.

Looking forward to it!

Best,
[Your Name]