android-llm-bridge
Design doc · M4+ reference

Picking the on-device model for a commercial Android product, worldwide.

Design note · 2026-04-21 · owner: sky · status: recommendation

Companion to the on-device agent design note. That page answers why and how the host talks to the device. This page answers which model, and what hardware it needs — for products shipped globally, with a 5–8 year lifetime, and legal teams that read licenses carefully.

TL;DR

Phi-3.5-mini (MIT) + llama.cpp, on ARMv8.2 devices with ≥4 GB RAM.

  • Default: Microsoft Phi-3.5-mini Q4_K_M (~2.4 GB). MIT license, strong English reasoning, tool calling works.
  • RAM-constrained (≤3 GB): IBM Granite 3.2-2B (Apache 2.0). Explicit patent grant, enterprise-friendly.
  • Hard floor: ARMv8.2-A dotprod + 3 GB RAM + arm64 Android 10+. Below that, don't ship on-device agent — route everything to host.
  • Avoid for 5–8 year global SKUs: Gemma (unilateral policy changes), Llama (700M MAU clause), Mistral non-commercial, Apple OpenELM (distribution restricted).

Four constraints that rule everything else out.

Global + commercial + hardware is stricter than most LLM demos ever face. These four, in order, decide what is even on the table.

  1. Licensing legal can greenlight at a glance. No "700M MAU and we renegotiate" (Llama), no unilaterally-updatable Prohibited Use Policy (Gemma), no "Built with X" branding obligation. Once a device ships, you cannot retroactively renegotiate terms that change under you.
  2. Hardware ceilings vary wildly. Entry-level devices have 2–4 GB RAM, mid-range 4–6, high-end 8+. SoCs span Snapdragon 8-series flagships to Allwinner A53 parts that sit orders of magnitude apart in effective compute. Industrial and kiosk builds usually have no GMS or AICore, which kicks Google's on-device stack out of contention.
  3. What the device actually has to do. English is enough (host translates). Structured output + tool calling is the core job. Chinese and other languages are the host's problem, not the device's.
  4. Product lifetime, not demo lifetime. Hardware lives 5–8 years in the field. Weights and licenses have to last that long. A model that depends on a vendor portal still being up in 2032 is already a risk.
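
Constraint 3 says the core job is structured output plus tool calling. A minimal sketch of what that means in practice: the device model emits a machine-parseable call, and the agent runtime validates it before executing. The schema here (field names `tool`, `args`, `request_id`, and the `read_logcat` tool) is illustrative, not the project's actual protocol.

```python
import json

# Hypothetical tool-call emission from the device model. Field names and the
# read_logcat tool are illustrative assumptions, not the project's real schema.
raw = """{
  "tool": "read_logcat",
  "args": {"tag": "BatteryService", "lines": 50},
  "request_id": "r-0017"
}"""

def parse_tool_call(text: str) -> dict:
    """Parse and minimally validate a model-emitted tool call."""
    call = json.loads(text)
    for key in ("tool", "args"):
        if key not in call:
            raise ValueError(f"missing field: {key}")
    if not isinstance(call["args"], dict):
        raise ValueError("args must be a JSON object")
    return call

call = parse_tool_call(raw)
print(call["tool"], call["args"]["lines"])
```

English-only structured output like this is exactly what lets the host own translation: the device never needs multilingual weights.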

Candidates worth shipping.

All rows below are MIT or Apache 2.0 — clean enough for a BOM review without a special legal cycle.

Model | Params | License | Cleanliness | Capability | Hardware fit
Phi-3.5-mini-instruct | 3.8B | MIT | ★★★★★ | Top-tier English reasoning at this size, tool calling | Q4_K_M ≈ 2.4 GB · 4 GB+ RAM
Phi-4-mini-instruct | 3.8B | MIT | ★★★★★ | Stronger than 3.5, native function calling | Same as above
IBM Granite 3.2-2B | 2B | Apache 2.0 | ★★★★★ | Enterprise-trained; structured output + tools are the focus | Q4 ≈ 1.3 GB · 3 GB RAM works
IBM Granite 3.2-8B | 8B | Apache 2.0 | ★★★★★ | Strong | Q4 ≈ 4.5 GB · 8 GB+ RAM
SmolLM2-1.7B-Instruct | 1.7B | Apache 2.0 | ★★★★★ | Modest, but fully open: weights + data + training code | Q4 ≈ 1.0 GB · 2–3 GB RAM
Qwen2.5-3B-Instruct | 3B | Apache 2.0 | ★★★★☆ | Excellent English, best-in-class tool calling | Q4 ≈ 1.9 GB · Alibaba origin raises an extra legal question in some markets
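
The file sizes in the table follow from one piece of arithmetic: parameter count times average bits per weight. A quick sanity-check sketch, assuming Q4_K_M averages roughly 4.8 bits/weight (the exact figure varies per model, and tokenizer/metadata add a few percent on top):

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: parameters x average bits per weight.
    Ignores tokenizer/metadata overhead (a few percent)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# ~4.8 bits/weight for Q4_K_M is an assumption; check your actual quant output.
for name, params in [("Phi-3.5-mini", 3.8), ("Granite 3.2-2B", 2.0), ("SmolLM2-1.7B", 1.7)]:
    print(f"{name}: ~{gguf_size_gb(params, 4.8):.1f} GB")
```

The estimates land within about 10% of the table's measured sizes, which is close enough for BOM-stage storage planning.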

Why these are off the table.

Model | License | Why not, for a 5–8 year global SKU
Gemma 3 family | Gemma Terms | Google reserves the right to update the Prohibited Use Policy. Unsafe over a multi-year lifetime.
Llama 3.2 1B / 3B | Llama Community License | 700M MAU clause + AUP + "Built with Llama" labeling: every legal pass is a fresh review.
Mistral Ministral 3B | Research-only | No commercial grant.
OpenELM (Apple) | Apple Sample Code License | Redistribution restricted; cannot be shipped inside an app.
Gemini Nano (AICore) | Google AICore Terms | Requires Pixel 8+ / AICore-blessed GMS device. Most commercial hardware does not qualify.

Hardware — not every Android SoC can host an LLM.

Hard floor (below this, on-device agent is off the table)

Item | Minimum | Why
ARM architecture | ARMv8-A 64-bit | 32-bit will not run
dotprod (sdot / udot, ARMv8.2-A) | Required | Core of INT8 quantized matmul. Without it, INT8 is slower than FP32. This is the real go/no-go line.
RAM | ≥3 GB (1.7–2B model) · ≥4 GB (3B model) | Weights + KV cache + Android + app
Storage | ≥8 GB (leave 4 GB for model + OTA) | Q4 weights are 1–2.5 GB
Android version | Android 10+, 64-bit userspace | NDK inference libs are arm64-v8a only
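
The "Weights + KV cache + Android + app" budget in the RAM row is worth making concrete, because the KV cache is the part people forget. A back-of-envelope sketch; the Phi-3.5-mini shape used here (32 layers, 32 KV heads, head_dim 96, FP16 cache) is an assumption to verify against the model's config.json:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache size: K and V tensors, per layer, per head, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Assumed Phi-3.5-mini shape; check config.json before relying on these numbers.
weights = 2.4                               # Q4_K_M file size from the table
kv = kv_cache_gb(32, 32, 96, ctx_len=2048)  # ~0.8 GB at 2k context
print(f"weights {weights:.1f} GB + KV {kv:.1f} GB = {weights + kv:.1f} GB before OS and app")
```

At a 4k context the cache doubles to ~1.6 GB, which is why a 3.8B model on a 4 GB device forces a short context window.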

Accelerators (nice to have, not required)

Feature | Speedup
ARMv8.2 FP16 (fphp + asimdhp) | 1.5–2× for mixed precision
ARMv8.6-A i8mm (Cortex-A715 / X3+) | 2–3× more on INT8 matmul
ARMv9 SVE2 | Higher throughput than NEON
NPU (Hexagon / MTK APU / RK NPU) | 3–10× on INT8, significantly lower power
GPU OpenCL / Vulkan (Adreno / Mali) | Moderate speedup

Explicitly not needed

  • Root.
  • GMS / Google Play Services.
  • AICore / Gemini Nano.
  • NNAPI (being deprecated; use vendor NPU SDKs directly).
  • Network connectivity.

Android SoCs in three tiers.

A rough map of what to expect from a given SoC family. Test on actual hardware before freezing the BOM — silicon revisions matter.

Tier A · smooth

Runs Phi-3.5-mini (3.8B) comfortably

Cortex-A76/A78 big cores + A55, ARMv8.2 dotprod + FP16, often i8mm, with NPU, 4–8 GB RAM

  • Qualcomm: SD 6 Gen 1, 7 Gen 1, 8-series, QCM6490 (IoT), QCM8550
  • MediaTek: Dimensity 700 / 800 / 900 / 9000, Helio G99
  • Rockchip: RK3588 / RK3588S (6 TOPS NPU — industrial / edge flagship)
  • Amlogic: A311D2, S905X4

Measured Phi-3.5-mini Q4_K_M: 6–15 tok/s on CPU, 20–40 tok/s on Hexagon / NPU.

Tier B · feasible

Fits Granite 3.2-2B or SmolLM2-1.7B

All-A55 (dotprod + FP16, no big cores) or A73+A53, 3–4 GB RAM, weak or no NPU

  • Qualcomm: QCM4290, SD 4 Gen 2, SD 680 / 685
  • MediaTek: Helio G85, G37
  • Rockchip: RK3566 / RK3568 (0.8 TOPS NPU)
  • UNISOC: T610 / T618 / T620 / T606 / T616 (global mid-tier workhorses)

2B Q4 on CPU: 3–7 tok/s. Phi-3.5-mini fits but 4 GB RAM frequently OOMs.

Tier C · skip

Do not ship an on-device LLM here

Cortex-A53 only (no dotprod) or older, or RAM < 3 GB

  • Qualcomm: SD 439 / 450 / 460 and older 4xx
  • MediaTek: MT6765 / 6762, Helio A22 / A25
  • UNISOC: SC9863A
  • Rockchip: RK3326, PX30, RK3399 (A72+A53 — transitional)
  • Allwinner: A133, A64, H616 — all A53, skip

It technically boots, but 1–2 tok/s is not an interactive experience. Route to host via adb / ssh / UART instead.

Plan three SKUs. Don't try to pick one model for everything.

Hardware lines span tiers. Map each tier to the model it can actually run, write the requirements into the BOM, and fall back to host-only on the lowest end.


SKU 1 · AI-Ready

High-end / current generation

  • ARMv8.2-A + dotprod + FP16
  • ≥4 GB RAM (8 GB recommended)
  • ≥8 GB eMMC / UFS
  • Android 11+
  • NPU (Hexagon / RK NPU)

Ships: Phi-3.5-mini (MIT) GGUF Q4_K_M

QCM6490 · SD 6 Gen 1 · Dimensity 700+ · RK3588 · A311D2


SKU 2 · Light AI

Mid-range

  • ARMv8.2-A + dotprod (no pure A53)
  • ≥3 GB RAM
  • ≥4 GB eMMC
  • Android 10+
  • NPU optional

Ships: IBM Granite 3.2-2B or SmolLM2-1.7B (Apache 2.0)

QCM4290 · SD 4 Gen 2 · Helio G85/G99 · UNISOC T618 · RK3566


SKU 3 · Host-only

Entry-level / legacy

Pure-A53 SoCs and devices with under 3 GB RAM do not get the on-device agent. The host owns the loop; the device streams logs over adb / ssh / UART. For "daily checks, remote diagnosis, failure reports" this is usually more than enough.

Ships: No local model. Uses alb host transports only.

The inference stack.

Layer | Pick | License | Note
Engine (default) | llama.cpp | MIT | ARM NEON / dotprod / i8mm tuning is mature. GGUF is the universal format. Qualcomm QNN fork in progress.
Engine (NPU-first) | ONNX Runtime Mobile | MIT | Hexagon / QNN EP is the most mature. Official Phi support.
Engine (PyTorch) | ExecuTorch | BSD | Meta's new mobile stack. Useful when you already live in PyTorch.
Engine (Google ecosystem) | LiteRT (ex-TFLite) | Apache 2.0 | Phi / Granite tooling is further along in llama.cpp. LiteRT is best for Gemma, which we are not shipping.

Recommended combo: llama.cpp + GGUF Q4_K_M. Runs everywhere with small code footprint. If a specific SKU has Hexagon and you want peak throughput, ship a separate ONNX Runtime + QNN variant for that one.
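
Per-SKU launch configuration can live in one small table in the agent runtime. A sketch, assuming llama.cpp's `llama-cli` binary and its standard `-m` / `-c` / `-t` flags; the model paths and context/thread choices are hypothetical, not the project's actual deployment layout:

```python
def llama_cli_args(tier: str) -> list:
    """Build a llama-cli argument list per SKU tier.
    Paths and ctx/thread values are illustrative assumptions."""
    models = {
        "A": ("/data/local/models/phi-3.5-mini-q4_k_m.gguf", 4096),   # SKU 1
        "B": ("/data/local/models/granite-3.2-2b-q4_k_m.gguf", 2048), # SKU 2
    }
    if tier not in models:
        raise ValueError("tier C is host-only: no local model")
    path, ctx = models[tier]
    return ["llama-cli", "-m", path, "-c", str(ctx), "-t", "4"]

print(" ".join(llama_cli_args("A")))
```

Keeping tier C as a hard error, rather than a tiny fallback model, enforces the "route to host" decision in code.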

Five-second field check.

Given an engineering sample, four commands decide whether on-device agent is viable:

adb shell cat /proc/cpuinfo | grep -E "asimddp|fphp|i8mm|sve"
adb shell cat /proc/meminfo | grep MemTotal
adb shell getprop ro.build.version.release    # need ≥ 10
adb shell getprop ro.product.cpu.abi          # need arm64-v8a
  • asimddp (= dotprod) — present means INT8 inference is on the table. Absent → tier C.
  • fphp + asimdhp — FP16 available, another 1.5–2× speedup.
  • i8mm — top-shelf (Cortex-A715 / X3+). Rare in shipping products today.
  • sve / sve2 — ARMv9. Effectively absent from Android devices right now.
  • MemTotal ≥ 3.5 GB → tier B starting point. ≥ 7 GB → tier A.
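
The decision rules above reduce to a few lines of logic. A sketch that maps the probe output to a tier; it only covers the RAM + dotprod axes, so core layout and NPU presence still need a manual check, and the sample Features string is illustrative:

```python
def classify_tier(cpu_features: str, mem_total_kb: int) -> str:
    """Map /proc/cpuinfo Features + MemTotal (kB) to a tier.
    Thresholds follow the note: no asimddp or < 3.5 GB -> C; >= 7 GB -> A."""
    feats = set(cpu_features.split())
    mem_gb = mem_total_kb / 1024 / 1024
    if "asimddp" not in feats or mem_gb < 3.5:
        return "C"  # host-only, no local model
    return "A" if mem_gb >= 7 else "B"

# Illustrative Features line for a dotprod + FP16 mid-tier part
features = "fp asimd aes pmull sha1 sha2 crc32 atomics fphp asimdhp asimddp"
print(classify_tier(features, 3_900_000))  # ~3.7 GB with dotprod -> B
```

A "B" result here is a starting point, not a verdict: freeze the BOM only after measuring tok/s on the actual sample.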
For a global commercial SKU with a 5–8 year lifetime: Phi-3.5-mini (MIT) + llama.cpp, on ARMv8.2 devices with ≥ 4 GB RAM. Squeeze down to Granite 3.2-2B (Apache 2.0) when RAM is tight. Below ARMv8.2 dotprod, don't bother — route to host.

Chinese and other languages are handled by the host-side LLM (Qwen2.5-3B, Claude, GPT). The device only speaks structured English. Clean division of labor, no language-training tax on the device fleet.

Related reading

The full Markdown version of this note (with verbatim source, tags, status) lives next to the HTML.
