This is based on a talk given by Beatriz Viñal Murciano (LinkedIn) at droidcon SF (June 2024, video, slides) and Víctor Julián García Granado (LinkedIn) at droidcon Lisbon (September 2024, video) about how we use AI models in our app.
The SwiftKey team is one of the many teams working on Android at Microsoft. We build the Microsoft SwiftKey AI Keyboard app for Android (“SwiftKey” from now on).

SwiftKey is a keyboard app that lets you type by tapping on keys or sliding your finger over them, correcting and predicting what you type and learning from your typing. It supports over 700 languages (up to 6 of them at the same time) and it has an ever-increasing number of other features such as emoji, GIFs, clipboard and integration of new generative AI products like Copilot and Designer.
As we evolve the product, we’ve made an effort to keep the app size small. SwiftKey’s download size is currently around 20MB and its install size right after installation is around 55MB, although it increases over time.
Size matters to our users and partners, who prefer apps that are as small as possible. It matters to us too: apart from our wish to respect users' data and storage, a bigger app makes users more likely to uninstall it and makes partnering with device manufacturers more difficult. We discussed app size at length in App size reduction at Microsoft SwiftKey.
We’ve noticed that an increasing number of the latest feature proposals include some form of AI, i.e. they require some sort of probabilistic model that takes some data as input, does some processing, and returns some data as output.

To keep the diagrams simple, this AI is represented by a green box labelled "AI feature". The box can include one or more models, the engine that runs them and any other related data they need to work. All of this can get quite big (usually due to the size of the model or models) and may need to be handled carefully to avoid unnecessarily increasing the app size.
How big does an AI feature need to be before it needs careful handling? Each team will have to decide what counts as a significant increase to their app's size. The thought process for deciding how to implement a feature is the same regardless of its size.
We’ve thought about how to handle features that are too big for us (~5% of our download size or bigger) a few times and it usually comes down to the same trade-offs. We’d like to share them with you so you can make more informed decisions if you find yourselves in similar situations.
AI in the cloud vs on device
When we have to use one of these models, there are two main alternatives regarding where the model can be: in the cloud or on device.
Using a model that is in the cloud is straightforward. The service is hosted somewhere. We make a request with some data, the data is processed in the cloud, and we receive a response with some other data.
When we use a model on-device, the data is processed on the device and never leaves it. That's more complicated if we're using our own model, because we need to manage that model in the Android app; we'll focus on that later.
First, let’s look at the differences between the cloud and the on-device approaches.
There is no good or bad option. We need to understand the implications of each option to decide which one is more suitable for our needs.
In the cloud

As said, when we work with a model that is in the cloud, the app behaviour is simple: we make a request with some data and we wait for a response with some other data. Using a cloud endpoint is straightforward for app developers. The complexity is in the cloud and we don’t need to worry about it from the app’s perspective.
The advantages of having the model in the cloud are that:
- There can be more storage and computing power available in the cloud, so bigger and more complex models can run in a reasonable amount of time
- The app size can stay small, given that the model is never downloaded to the device
- The model can be updated any time, for all users at once
The problems with this approach are that:
- Privacy can be more difficult to achieve, depending on what we’re doing with data, because the data leaves the device. We need to think about what data we are sending, whether it contains personal data, what data will be stored in the cloud, who will have access to that data, how long it will be stored for, how we comply with regulations, etc.
- There’s a cost per API usage, so it can be very expensive to make frequent requests
- And this might be obvious, but if an app needs data that is in a service somewhere, it requires an internet connection and it requires the service to work
This makes hosting models in the cloud a great choice if models won't fit or won't work on a device (in which case it's the only choice) and for features that aren't frequently used (to keep costs low) where latency is acceptable. We're talking about models of tens of GBs or bigger.

This is an approach we’re using with most generative AI features, for example our sticker generation feature that you can use to create stickers like these.
The image shows how it works. You enter a prompt such as “Android dancing”, you wait while stickers are being created, and you get some stickers where Androids are dancing that you can send to other people or save for later.
The feature uses DALL·E 3 under the hood, which currently doesn’t fit on a device so it has to be in the cloud. It’s not used very frequently, since most of the time people use a keyboard to type and once in a while they’ll need to generate a custom sticker, which makes the costs manageable. It’s also a process where everyone expects some latency, because we have lots of experience with waiting for searches and content generation. We’ve all seen a spinner before and we understand it — it’s all good as long as it keeps making progress.
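From the app's point of view, using a cloud-hosted model like this boils down to a network call: send a prompt, wait, parse the response. Here's a minimal sketch of what that request/response cycle could look like; the endpoint, payload and field names below are invented for illustration and are not our real service.

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONObject

// Hypothetical sticker-generation call. The URL and JSON shape are made up;
// a real implementation also needs authentication, retries and error handling.
// Must be called off the main thread.
fun requestStickers(client: OkHttpClient, prompt: String): List<String> {
    val body = JSONObject().put("prompt", prompt).toString()
        .toRequestBody("application/json".toMediaType())
    val request = Request.Builder()
        .url("https://example.com/v1/stickers") // placeholder endpoint
        .post(body)
        .build()
    client.newCall(request).execute().use { response ->
        check(response.isSuccessful) { "Sticker request failed: HTTP ${response.code}" }
        val json = JSONObject(response.body!!.string())
        val urls = json.getJSONArray("stickerUrls") // assumed response field
        return (0 until urls.length()).map { urls.getString(it) }
    }
}
```

All the interesting work happens server-side; the app just shows a spinner while it waits.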

Generative AI features tend to use this approach. Here are two more:
- Designer, where you enter a prompt for the image you want and it’s generated for you (similar to stickers but for any kind of image), e.g. images for “a Droidcon SF talk”.
- Tone change, where you type a sentence and it’s rewritten into different styles, e.g. I typed “Droidcon SF is great!”, I asked for a professional tone and I got “I find Droidcon SF to be an excellent event”.
On-device

When the model is on the device, well, it’s on the device and it can be queried locally. What were advantages of models in the cloud turn into disadvantages and vice versa.
The pros of models on device are:
- There’s no money cost per API usage
- The model is available all the time — no internet or service required
- Privacy may be easier to guarantee because the data doesn’t leave the device, but it depends on what you do with data
- Model updates will vary depending on the implementation, as we’ll see in the next section
And the cons are:
- Size and computing power are limited (although they keep getting better)
- The app size will generally be bigger
Generally speaking, models on device are more suitable when models are small and fast, used frequently, and when performance matters. We're talking about models of tens or hundreds of MBs, which aren't huge by industry standards but are pretty big for our app (remember: ~20MB download size, ~55MB install size).
There are different options for distributing our own models in Android apps and we’ll go over them in the next section, with more examples.
A note on latency

We’ve said the cloud is more suitable when latency is acceptable and on-device when performance matters.
However, that doesn’t mean on-device is always faster.
When we do an operation in the cloud, we need some time for the request to reach the server and for the response to come back, represented by purple blocks in the slide. However, there’s more processing power in the cloud, so transforming the input into an output can be quicker.
This will all depend on implementations and there are too many nuances to consider. The fact that there can be more processing power in the cloud doesn’t mean that it is available for you, but let’s assume that it is for the sake of the argument.
In general, if an operation is very quick, the time spent sending the request and receiving the response may be longer than simply doing the whole thing on device, even if the processing itself is slower there.
But if an operation takes a long time, the request and response time will be small compared to the processing time, and using a cloud model can be quicker.
As usual, you need to look into the specifics of the options available to you to work out what will be quicker. From the app's point of view, if the model is in the cloud the app does a network request but no processing, and if the model is on-device it does no network request but some processing. You need to be aware of what's going on and see what meets your needs in terms of how long an operation takes and how much battery it uses. If you wanted to consider all energy usage, you'd also need to take into account the energy the server uses, but that's complicated and we're not going to get into that.
Summary

If the model is too big or too complex, it’ll have to be in the cloud. If you have a choice, you need to check other restrictions to make a decision: Can you afford to have it in the cloud? What latency/availability/app size increase/way of updating models is acceptable?
AI on device
Using a cloud endpoint is straightforward for app developers because the complexity is outside the app. Let's look at what happens when the model is on-device and we deal with more complexity ourselves.

When the model runs on-device, it has to be on the device to run, taking up storage space.
However, when you control the model, it doesn’t always have to be there. Perhaps the feature that uses it will only work for certain users, or on devices with certain capabilities. Using storage space unnecessarily is disrespectful towards users and it can lead to more uninstalls, as said.
We can start thinking about who is going to get this model, or how the model is going to get to the device, coming up with different alternatives and analysing the trade-offs like we did before. This time, latency and privacy considerations will be the same for all options. We need to think about download cost, availability, app size and updates.
Again, there are no good or bad options, only options that are more or less suitable for your needs, and we'll look into them one by one.
Before we go into it, consider whether you can use the models that are part of the operating system.
Part of the operating system

Now, we can have on-device execution of Gemini Nano powered by Android AICore on some devices (at the time of giving the talk, the Google Pixel 8 Pro and Samsung S24 series, but this list has grown quickly). It's part of the operating system and you can just use it.
If this is enough for your needs, you don't have to build a model and the app size increase is minimal (just the Google AI Edge SDK for Android that you have to include to communicate with the model), which is great. What's not so great is that you have no control over what's in the model, its updates or its availability, which is currently limited but likely to expand soon.
Therefore this is a good approach if Gemini Nano without customisations suits your needs.
Things get more interesting when you have your own model and you have to distribute it. In that case, consider if…
It’s part of the main app

The easiest option is to put the model in the assets folder or raw folder in the resources and make the app read it from there. The model will be downloaded from the app store with the app, so the app store owner (Google if we’re using Google Play) will pay for the download cost. The model will always be available since it’s always part of the app.
The cons are that the download size increases, that all users will have the model taking up their storage space (even if they can’t use the feature) and that the model can only be updated when the app is updated.
Since it’s easy and quick, it’s a great option for prototypes that won’t be released.
In production, it can be useful if all users can use the feature and all of them use the same model. Since they're all going to need the model, it may as well come as part of the app. We haven't released any such features so far; we only put some engines that all users use in the main app (our typing engine and ONNX). Other apps that don't depend so much on language may find this approach useful.
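For reference, reading a bundled model out of the assets folder can be as simple as memory-mapping it. This is a minimal sketch, assuming an engine that accepts a ByteBuffer (as TensorFlow Lite does) and an asset that is stored uncompressed in the APK; the file name is made up.

```kotlin
import android.content.Context
import java.io.FileInputStream
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Minimal sketch: memory-map a model bundled in src/main/assets/ so an
// on-device engine such as TensorFlow Lite can use it without copying it to
// internal storage first. The asset must be stored uncompressed in the APK
// (e.g. via the build file's noCompress option) for openFd() to work; the
// default file name here is illustrative.
fun loadBundledModel(
    context: Context,
    assetName: String = "feature_model.tflite"
): MappedByteBuffer =
    context.assets.openFd(assetName).use { fd ->
        FileInputStream(fd.fileDescriptor).channel.use { channel ->
            channel.map(FileChannel.MapMode.READ_ONLY, fd.startOffset, fd.declaredLength)
        }
    }
```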
It uses on-demand dynamic delivery

We can avoid some of these issues if we put the model in a dynamic feature or assets pack instead. In short, these are optional parts (the app can work without them) that can be installed at install time, when some conditions are met or on demand. Note that Google Play supports dynamic delivery but other app stores don’t, so this may not be an option for you. You can read more about this in the Play Feature Delivery and Play Asset Delivery documentation.
Since the dynamic features or asset packs are installed from Google Play, Google will pay for the download costs as well. However, this time we can configure dynamic features or asset packs to be installed on demand (when we want, not necessarily at install time). Then the download size won't increase, and we download the dynamic feature or asset pack only for users or devices that can or want to use it.
The disadvantages are that dynamic delivery is relatively complex. We need to do some more work to handle the installation of the dynamic feature or asset pack. The installation may fail (e.g. if the user has no internet) and then the feature won’t be available. Testing can also be more difficult. We still need to update the app to update the model.
This approach is suitable if the feature can only be used by some users and all of those users can use the same model.
This can scale a little if we have very few models for each feature, for example having two dynamic features or asset packs with different models and choosing which one to install. For us, it won’t work for language-based features, since we won’t be able to support 700+ languages multiplied by a number of features. Google don’t recommend having more than 50 dynamic features. There’s also a limit of 200MB compressed download size for the main app download and every subsequent dynamic feature download. You can deliver bigger files (up to 1.5GB) with asset packs.

An example where we use a dynamic feature is our camera feature. It allows users to take and edit photos and videos, and it includes a lot of Snapchat filters for video, which track users' faces or hands in real time and transform them or overlay something on them.
These filters are relatively big and we realise not all of our users will be interested in this feature — some want a keyboard to type — so we download it on demand when users try to use the feature for the first time. When this happens we show a spinner while the feature is being downloaded and installed and from that moment on it can be used like any other feature. Users who like videos will be happy to have the feature even if it needs some storage, and users who just want to type can save some storage.
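For reference, requesting an on-demand module with the Play Feature Delivery API looks roughly like the sketch below. The "camera" module name is illustrative, and a real implementation also registers a SplitInstallStateUpdatedListener to drive the spinner and react to states such as INSTALLED, FAILED or REQUIRES_USER_CONFIRMATION.

```kotlin
import android.content.Context
import com.google.android.play.core.splitinstall.SplitInstallManagerFactory
import com.google.android.play.core.splitinstall.SplitInstallRequest

// Simplified sketch of installing a dynamic feature on demand with Play
// Feature Delivery. Progress and completion are reported to a
// SplitInstallStateUpdatedListener, which is not shown here.
fun requestCameraModule(
    context: Context,
    onAlreadyInstalled: () -> Unit,
    onRequestFailed: (Exception) -> Unit
) {
    val manager = SplitInstallManagerFactory.create(context)
    if ("camera" in manager.installedModules) {
        onAlreadyInstalled() // nothing to download, use the feature directly
        return
    }
    val request = SplitInstallRequest.newBuilder()
        .addModule("camera")
        .build()
    manager.startInstall(request)
        .addOnSuccessListener { sessionId ->
            // Request accepted: download progress and completion arrive via
            // the SplitInstallStateUpdatedListener for this sessionId.
        }
        .addOnFailureListener { exception ->
            // e.g. no network: the feature stays unavailable for now
            onRequestFailed(exception)
        }
}
```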
It’s downloaded at runtime

Another option is to host models on a server and download them when needed, which lets us deliver many different models to different users.
This is similar to the on-demand dynamic feature approach in the sense that the download size will stay small and we’ll be able to only download the model we need when we need it, so we won’t use unnecessary storage space either. We can update models independently from app updates, if we build an updating mechanism (being careful with backwards compatibility).
The cons are that we have to handle the model download, which may fail, and, especially, the cost. Hosting data that can be downloaded has a cost that we’re paying this time, not the app store owner.
This is a recurrent cost that we’ll have every time there is a new model.
Therefore, this approach is suitable if we need different users to have many different models and especially if these models are small (to keep costs down).
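As a rough idea of the app-side work, downloading one model into the app's private storage could look like the sketch below. Retries, integrity checks and the versioning a real updating mechanism needs are left out, and the names are made up.

```kotlin
import okhttp3.OkHttpClient
import okhttp3.Request
import java.io.File

// Rough sketch: download a model from our own server into the app's private
// files directory, writing to a temporary file first so a failed or
// interrupted download never leaves a half-written model behind. Must run
// off the main thread.
fun downloadModel(client: OkHttpClient, modelUrl: String, targetDir: File, fileName: String): File {
    val target = File(targetDir, fileName)
    if (target.exists()) return target // already downloaded
    val temp = File(targetDir, "$fileName.tmp")
    val request = Request.Builder().url(modelUrl).build()
    client.newCall(request).execute().use { response ->
        check(response.isSuccessful) { "Model download failed: HTTP ${response.code}" }
        response.body!!.byteStream().use { input ->
            temp.outputStream().use { output -> input.copyTo(output) }
        }
    }
    check(temp.renameTo(target)) { "Could not move downloaded model into place" }
    return target
}
```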

We use this approach for example for the language packs that allow our users to type. We support over 700 languages but handling over 700 dynamic features both isn’t recommended and doesn’t sound fun, so language packs are in a server.
When users want to download new languages, they go to the languages screen, which has a list of languages. When they tap on the language they want, the app downloads a file from a server that contains all they need to type in that language.
As said, there are over 700 of these language packs and their size varies between around 2MB and around 20MB. Nobody wants hundreds of MB of storage dedicated to languages they can't speak, so only the selected languages (our initial guess plus whatever users select manually) are downloaded and take up space.

Anything to do with typing in different languages follows this approach. For example:
- Emoji. We have emoji prediction both as you’re typing (so you type “love” and one of the predictions becomes an emoji with hearts) and in a predictive emoji panel, where you can type something and get a list of relevant emoji. All of this needs different models for different languages.
- Handwriting, which is only available for some languages like Mandarin, where the number of characters is extremely high and drawing the character you want instead of looking for it in a long list is a popular input method.
Summary

If you can use Gemini Nano, it's probably the easiest option. If you have your own model, the easiest option is to make it part of the main app, but if you don't want to take up storage for users who can't use the feature, you'll have to use dynamic delivery (if there aren't many models) or download models from a server otherwise.
Conclusion
We hope seeing what we’ve been thinking about helps you make better decisions in the future.

Useful links
- SwiftKey info
- Download SwiftKey
- Our take on app size
- Play Feature Delivery
- Play Asset Delivery
- Google Play maximum size limits
- Android AICore
