android-llm-bridge
Design doc · M4+ reference

Picking the on-device model for a commercial Android product, worldwide.

Design note · 2026-04-21 · owner: sky · status: recommendation

Companion to the on-device agent design note. That page answers why and how the host talks to the device. This page answers which model, and what hardware it needs — for products shipped globally, with a 5–8 year lifetime, and legal teams that read licenses carefully.

TL;DR

Phi-3.5-mini (MIT) + llama.cpp, on ARMv8.2 devices with ≥4 GB RAM.

  • Default: Microsoft Phi-3.5-mini Q4_K_M (~2.4 GB). MIT license, strong English reasoning, tool calling works.
  • RAM-constrained (≤3 GB): IBM Granite 3.2-2B (Apache 2.0). Explicit patent grant, enterprise-friendly.
  • Hard floor: ARMv8.2-A dotprod + 3 GB RAM + arm64 Android 10+. Below that, don't ship on-device agent — route everything to host.
  • Avoid for 5–8 year global SKUs: Gemma (unilateral policy changes), Llama (700M MAU clause), Mistral non-commercial, Apple OpenELM (distribution restricted).

Four constraints that rule everything else out.

Global + commercial + hardware is stricter than most LLM demos ever face. These four, in order, decide what is even on the table.

  1. Licensing legal can greenlight at a glance. No "700M MAU and we renegotiate" (Llama), no unilaterally-updatable Prohibited Use Policy (Gemma), no "Built with X" branding obligation. Once a device ships, you cannot retroactively renegotiate terms that change under you.
  2. Hardware ceilings vary wildly. Entry-level devices have 2–4 GB RAM, mid-range 4–6, high-end 8+. SoCs span Snapdragon 8-series flagships to Allwinner A53 parts that sit orders of magnitude apart in effective compute. Industrial and kiosk builds usually have no GMS or AICore, which kicks Google's on-device stack out of contention.
  3. What the device actually has to do. English is enough (host translates). Structured output + tool calling is the core job. Chinese and other languages are the host's problem, not the device's.
  4. Product lifetime, not demo lifetime. Hardware lives 5–8 years in the field. Weights and licenses have to last that long. A model that depends on a vendor portal still being up in 2032 is already a risk.
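
Constraint 3 says the core job is structured output plus tool calling. A minimal sketch of what that means in practice: the device model emits a machine-parseable call, and the agent runtime validates it before executing. The schema here (field names `tool`, `args`, `request_id`, and the `read_logcat` tool) is illustrative, not the project's actual protocol.

```python
import json

# Hypothetical tool-call emission from the device model. Field names and the
# read_logcat tool are illustrative assumptions, not the project's real schema.
raw = """{
  "tool": "read_logcat",
  "args": {"tag": "BatteryService", "lines": 50},
  "request_id": "r-0017"
}"""

def parse_tool_call(text: str) -> dict:
    """Parse and minimally validate a model-emitted tool call."""
    call = json.loads(text)
    for key in ("tool", "args"):
        if key not in call:
            raise ValueError(f"missing field: {key}")
    if not isinstance(call["args"], dict):
        raise ValueError("args must be a JSON object")
    return call

call = parse_tool_call(raw)
print(call["tool"], call["args"]["lines"])
```

English-only structured output like this is exactly what lets the host own translation: the device never needs multilingual weights.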

Candidates worth shipping.

All rows below are MIT or Apache 2.0 — clean enough for a BOM review without a special legal cycle.

Model | Params | License | Cleanliness | Capability | Hardware fit
Phi-3.5-mini-instruct | 3.8B | MIT | ★★★★★ | Top-tier English reasoning at this size, tool calling | Q4_K_M ≈ 2.4 GB · 4 GB+ RAM
Phi-4-mini-instruct | 3.8B | MIT | ★★★★★ | Stronger than 3.5, native function calling | Same as above
IBM Granite 3.2-2B | 2B | Apache 2.0 | ★★★★★ | Enterprise-trained; structured output + tools are the focus | Q4 ≈ 1.3 GB · 3 GB RAM works
IBM Granite 3.2-8B | 8B | Apache 2.0 | ★★★★★ | Strong | Q4 ≈ 4.5 GB · 8 GB+ RAM
SmolLM2-1.7B-Instruct | 1.7B | Apache 2.0 | ★★★★★ | Modest, but fully open: weights + data + training code | Q4 ≈ 1.0 GB · 2–3 GB RAM
Qwen2.5-3B-Instruct | 3B | Apache 2.0 | ★★★★☆ | Excellent English, best-in-class tool calling | Q4 ≈ 1.9 GB · Alibaba origin raises an extra legal question in some markets
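
The file sizes in the table follow from one piece of arithmetic: parameter count times average bits per weight. A quick sanity-check sketch, assuming Q4_K_M averages roughly 4.8 bits/weight (the exact figure varies per model, and tokenizer/metadata add a few percent on top):

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: parameters x average bits per weight.
    Ignores tokenizer/metadata overhead (a few percent)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# ~4.8 bits/weight for Q4_K_M is an assumption; check your actual quant output.
for name, params in [("Phi-3.5-mini", 3.8), ("Granite 3.2-2B", 2.0), ("SmolLM2-1.7B", 1.7)]:
    print(f"{name}: ~{gguf_size_gb(params, 4.8):.1f} GB")
```

The estimates land within about 10% of the table's measured sizes, which is close enough for BOM-stage storage planning.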

Why these are off the table.

Model | License | Why not, for a 5–8 year global SKU
Gemma 3 family | Gemma Terms | Google reserves the right to update the Prohibited Use Policy. Unsafe over a multi-year lifetime.
Llama 3.2 1B / 3B | Llama Community License | 700M MAU clause + AUP + "Built with Llama" labeling: every legal pass is a fresh review.
Mistral Ministral 3B | Research-only | No commercial grant.
OpenELM (Apple) | Apple Sample Code License | Redistribution restricted; cannot be shipped inside an app.
Gemini Nano (AICore) | Google AICore Terms | Requires Pixel 8+ / AICore-blessed GMS device. Most commercial hardware does not qualify.

Hardware — not every Android SoC can host an LLM.

Hard floor (below this, on-device agent is off the table)

Item | Minimum | Why
ARM architecture | ARMv8-A 64-bit | 32-bit will not run
dotprod (sdot / udot, ARMv8.2-A) | Required | Core of INT8 quantized matmul. Without it, INT8 is slower than FP32. This is the real go/no-go line.
RAM | ≥3 GB (1.7–2B model) · ≥4 GB (3B model) | Weights + KV cache + Android + app
Storage | ≥8 GB (leave 4 GB for model + OTA) | Q4 weights are 1–2.5 GB
Android version | Android 10+, 64-bit userspace | NDK inference libs are arm64-v8a only
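
The "Weights + KV cache + Android + app" budget in the RAM row is worth making concrete, because the KV cache is the part people forget. A back-of-envelope sketch; the Phi-3.5-mini shape used here (32 layers, 32 KV heads, head_dim 96, FP16 cache) is an assumption to verify against the model's config.json:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache size: K and V tensors, per layer, per head, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Assumed Phi-3.5-mini shape; check config.json before relying on these numbers.
weights = 2.4                               # Q4_K_M file size from the table
kv = kv_cache_gb(32, 32, 96, ctx_len=2048)  # ~0.8 GB at 2k context
print(f"weights {weights:.1f} GB + KV {kv:.1f} GB = {weights + kv:.1f} GB before OS and app")
```

At a 4k context the cache doubles to ~1.6 GB, which is why a 3.8B model on a 4 GB device forces a short context window.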

Accelerators (nice to have, not required)

Feature | Speedup
ARMv8.2 FP16 (fphp + asimdhp) | 1.5–2× for mixed precision
ARMv8.6-A i8mm (Cortex-A715 / X3+) | 2–3× more on INT8 matmul
ARMv9 SVE2 | Higher throughput than NEON
NPU (Hexagon / MTK APU / RK NPU) | 3–10× on INT8, significantly lower power
GPU OpenCL / Vulkan (Adreno / Mali) | Moderate speedup

Explicitly not needed

  • Root.
  • GMS / Google Play Services.
  • AICore / Gemini Nano.
  • NNAPI (being deprecated; use vendor NPU SDKs directly).
  • Network connectivity.

Android SoCs in three tiers.

A rough map of what to expect from a given SoC family. Test on actual hardware before freezing the BOM — silicon revisions matter.

Tier A · smooth

Runs Phi-3.5-mini (3.8B) comfortably

Cortex-A76/A78 big cores + A55, ARMv8.2 dotprod + FP16, often i8mm, with NPU, 4–8 GB RAM

  • Qualcomm: SD 6 Gen 1, 7 Gen 1, 8-series, QCM6490 (IoT), QCM8550
  • MediaTek: Dimensity 700 / 800 / 900 / 9000, Helio G99
  • Rockchip: RK3588 / RK3588S (6 TOPS NPU — industrial / edge flagship)
  • Amlogic: A311D2, S905X4

Measured Phi-3.5-mini Q4_K_M: 6–15 tok/s on CPU, 20–40 tok/s on Hexagon / NPU.

Tier B · feasible

Fits Granite 3.2-2B or SmolLM2-1.7B

All-A55 (dotprod + FP16, no big cores) or A73+A53, 3–4 GB RAM, weak or no NPU

  • Qualcomm: QCM4290, SD 4 Gen 2, SD 680 / 685
  • MediaTek: Helio G85, G37
  • Rockchip: RK3566 / RK3568 (0.8 TOPS NPU)
  • UNISOC: T610 / T618 / T620 / T606 / T616 (global mid-tier workhorses)

2B Q4 on CPU: 3–7 tok/s. Phi-3.5-mini fits but 4 GB RAM frequently OOMs.

Tier C · skip

Do not ship an on-device LLM here

Cortex-A53 only (no dotprod) or older, or RAM < 3 GB

  • Qualcomm: SD 439 / 450 / 460 and older 4xx
  • MediaTek: MT6765 / 6762, Helio A22 / A25
  • UNISOC: SC9863A
  • Rockchip: RK3326, PX30, RK3399 (A72+A53 — transitional)
  • Allwinner: A133, A64, H616 — all A53, skip

It technically boots, but 1–2 tok/s is not an interactive experience. Route to host via adb / ssh / UART instead.

Plan three SKUs. Don't try to pick one model for everything.

Hardware lines span tiers. Map each tier to the model it can actually run, write the requirements into the BOM, and fall back to host-only on the lowest end.


SKU 1 · AI-Ready

High-end / current generation

  • ARMv8.2-A + dotprod + FP16
  • ≥4 GB RAM (8 GB recommended)
  • ≥8 GB eMMC / UFS
  • Android 11+
  • NPU (Hexagon / RK NPU)

Ships: Phi-3.5-mini (MIT) GGUF Q4_K_M

QCM6490 · SD 6 Gen 1 · Dimensity 700+ · RK3588 · A311D2


SKU 2 · Light AI

Mid-range

  • ARMv8.2-A + dotprod (no pure A53)
  • ≥3 GB RAM
  • ≥4 GB eMMC
  • Android 10+
  • NPU optional

Ships: IBM Granite 3.2-2B or SmolLM2-1.7B (Apache 2.0)

QCM4290 · SD 4 Gen 2 · Helio G85/G99 · UNISOC T618 · RK3566


SKU 3 · Host-only

Entry-level / legacy

Pure-A53 SoCs and devices with under 3 GB RAM do not get the on-device agent. The host owns the loop; the device streams logs over adb / ssh / UART. For "daily checks, remote diagnosis, failure reports" this is usually more than enough.

Ships: No local model. Uses alb host transports only.

The inference stack.

Layer | Pick | License | Note
Engine (default) | llama.cpp | MIT | ARM NEON / dotprod / i8mm tuning is mature. GGUF is the universal format. Qualcomm QNN fork in progress.
Engine (NPU-first) | ONNX Runtime Mobile | MIT | Hexagon / QNN EP is the most mature. Official Phi support.
Engine (PyTorch) | ExecuTorch | BSD | Meta's new mobile stack. Useful when you already live in PyTorch.
Engine (Google ecosystem) | LiteRT (ex-TFLite) | Apache 2.0 | Phi / Granite tooling is further along in llama.cpp. LiteRT is best for Gemma, which we are not shipping.

Recommended combo: llama.cpp + GGUF Q4_K_M. Runs everywhere with small code footprint. If a specific SKU has Hexagon and you want peak throughput, ship a separate ONNX Runtime + QNN variant for that one.
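
Per-SKU launch configuration can live in one small table in the agent runtime. A sketch, assuming llama.cpp's `llama-cli` binary and its standard `-m` / `-c` / `-t` flags; the model paths and context/thread choices are hypothetical, not the project's actual deployment layout:

```python
def llama_cli_args(tier: str) -> list:
    """Build a llama-cli argument list per SKU tier.
    Paths and ctx/thread values are illustrative assumptions."""
    models = {
        "A": ("/data/local/models/phi-3.5-mini-q4_k_m.gguf", 4096),   # SKU 1
        "B": ("/data/local/models/granite-3.2-2b-q4_k_m.gguf", 2048), # SKU 2
    }
    if tier not in models:
        raise ValueError("tier C is host-only: no local model")
    path, ctx = models[tier]
    return ["llama-cli", "-m", path, "-c", str(ctx), "-t", "4"]

print(" ".join(llama_cli_args("A")))
```

Keeping tier C as a hard error, rather than a tiny fallback model, enforces the "route to host" decision in code.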

Five-second field check.

Given an engineering sample, four commands decide whether on-device agent is viable:

adb shell cat /proc/cpuinfo | grep -E "asimddp|fphp|i8mm|sve"
adb shell cat /proc/meminfo | grep MemTotal
adb shell getprop ro.build.version.release    # need ≥ 10
adb shell getprop ro.product.cpu.abi          # need arm64-v8a
  • asimddp (= dotprod) — present means INT8 inference is on the table. Absent → tier C.
  • fphp + asimdhp — FP16 available, another 1.5–2× speedup.
  • i8mm — top-shelf (Cortex-A715 / X3+). Rare in shipping products today.
  • sve / sve2 — ARMv9. Effectively absent from Android devices right now.
  • MemTotal ≥ 3.5 GB → tier B starting point. ≥ 7 GB → tier A.
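
The decision rules above reduce to a few lines of logic. A sketch that maps the probe output to a tier; it only covers the RAM + dotprod axes, so core layout and NPU presence still need a manual check, and the sample Features string is illustrative:

```python
def classify_tier(cpu_features: str, mem_total_kb: int) -> str:
    """Map /proc/cpuinfo Features + MemTotal (kB) to a tier.
    Thresholds follow the note: no asimddp or < 3.5 GB -> C; >= 7 GB -> A."""
    feats = set(cpu_features.split())
    mem_gb = mem_total_kb / 1024 / 1024
    if "asimddp" not in feats or mem_gb < 3.5:
        return "C"  # host-only, no local model
    return "A" if mem_gb >= 7 else "B"

# Illustrative Features line for a dotprod + FP16 mid-tier part
features = "fp asimd aes pmull sha1 sha2 crc32 atomics fphp asimdhp asimddp"
print(classify_tier(features, 3_900_000))  # ~3.7 GB with dotprod -> B
```

A "B" result here is a starting point, not a verdict: freeze the BOM only after measuring tok/s on the actual sample.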
For a global commercial SKU with a 5–8 year lifetime: Phi-3.5-mini (MIT) + llama.cpp, on ARMv8.2 devices with ≥ 4 GB RAM. Squeeze down to Granite 3.2-2B (Apache 2.0) when RAM is tight. Below ARMv8.2 dotprod, don't bother — route to host.

Chinese and other languages are handled by the host-side LLM (Qwen2.5-3B, Claude, GPT). The device only speaks structured English. Clean division of labor, no language-training tax on the device fleet.

Related reading

The full Markdown version of this note (with verbatim source, tags, status) lives next to the HTML.
