Apple's Hybrid Intelligence Architecture
Combining local model inference with secure cloud computation
I wrote an iPadOS/iOS app, “Mark Chat”, a few months ago (https://apps.apple.com/us/app/markchat/id6747982917). I intend to add this app as an example to my Swift AI book (read online: https://leanpub.com/SwiftAI/read).
This app combines Apple’s local 3B-parameter system model with secure cloud-based inference as needed. Apple’s AI efforts have received some justified criticism for the slow rollout of features, but I believe Apple is on the right track.
When using Apple’s hybrid model system, what is computed locally? Here is a rough breakdown:
Processed On-Device (Local): These tasks are typically low-complexity, highly repetitive, context-dependent, and privacy-sensitive. They must be fast and available offline.
Writing Tools: Proofreading, tone adjustments (friendly, professional, concise).
Summarization: Notification summaries, message preview summaries.
Generation: Genmoji creation.
Siri Context: Understanding on-device data (e.g., “What’s on my calendar?”).
Photos: Intelligent search (after local indexing) and the “Clean up” tool.
Escalated to Private Cloud Compute (PCC): These tasks require more powerful generative models for higher-quality output, deeper reasoning, or broad-world knowledge.
Advanced Writing Tools: More complex “Rewrite” functions, Summary, Key Points, Lists, and Tables.
Mail: Full email summarization and Smart Replies.
Summarization (Broad): Summaries for Safari web pages and Notes audio recordings.
ChatGPT Integration: Any request explicitly routed to a third-party model (which requires user permission).
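To make the breakdown above concrete, here is a minimal sketch of the routing idea as a lookup table. This is only an illustration of the two-tier split described in the lists, not Apple's actual routing logic, which is not public in this form. The task labels, the Tier enum, and the route function are all hypothetical names I made up for the example.

```python
from enum import Enum


class Tier(Enum):
    ON_DEVICE = "on-device"
    PRIVATE_CLOUD = "private-cloud-compute"
    THIRD_PARTY = "third-party"


# Hypothetical task labels that mirror the breakdown in the lists above.
TASK_TIERS = {
    "proofread": Tier.ON_DEVICE,
    "tone_adjustment": Tier.ON_DEVICE,
    "notification_summary": Tier.ON_DEVICE,
    "genmoji": Tier.ON_DEVICE,
    "calendar_query": Tier.ON_DEVICE,
    "photo_search": Tier.ON_DEVICE,
    "complex_rewrite": Tier.PRIVATE_CLOUD,
    "email_summary": Tier.PRIVATE_CLOUD,
    "smart_reply": Tier.PRIVATE_CLOUD,
    "web_page_summary": Tier.PRIVATE_CLOUD,
    "chatgpt_request": Tier.THIRD_PARTY,
}


def route(task, online=True, user_consented=False):
    """Return the tier that would handle `task`, or None if it cannot run now.

    Assumption for this sketch: on-device tasks always run, cloud tasks need
    a network connection, and third-party tasks also need user permission.
    """
    tier = TASK_TIERS.get(task)
    if tier is None:
        raise ValueError(f"unknown task: {task}")
    if tier is Tier.ON_DEVICE:
        return tier
    if not online:
        return None  # cloud tiers are unavailable offline
    if tier is Tier.THIRD_PARTY and not user_consented:
        return None  # third-party requests require explicit permission
    return tier


if __name__ == "__main__":
    for task in ("proofread", "email_summary", "chatgpt_request"):
        print(task, "->", route(task, online=True, user_consented=False))
```

The point of the table is that the local tier is the only one that keeps working offline, which is why the fast, privacy-sensitive tasks sit there.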
Architecture of the ~3B Parameter On-Device Model
The default on-device model is a highly optimized foundation model with approximately 3 billion parameters. It is not a generic, off-the-shelf model but a custom-built LLM designed specifically for efficient inference on Apple Silicon’s Neural Engine.
Performance and Optimization: To meet the strict memory, power, and performance requirements of a mobile device, the model employs aggressive optimization techniques. These include a mixed 2-bit and 4-bit “low-bit palletization” strategy that averages 3.7 bits per weight. This quantization is key to fitting a 3B-parameter model into the device’s memory. On an iPhone 15 Pro (A17 Pro chip), the optimized model generates approximately 30 tokens per second.
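A quick back-of-the-envelope calculation shows why the low bit width matters. The sketch below uses only the figures quoted above (about 3 billion parameters, 3.7 bits per weight, about 30 tokens per second) and compares against a 16-bit baseline that I chose for illustration. It counts weight storage only; activations, the KV cache, and any lookup-table overhead are ignored, so treat the results as rough estimates.

```python
def weight_storage_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def generation_seconds(num_tokens, tokens_per_second):
    """Approximate time to generate `num_tokens` at a steady rate."""
    return num_tokens / tokens_per_second


NUM_PARAMS = 3e9          # "approximately 3 billion parameters"
QUANTIZED_BITS = 3.7      # average bits per weight quoted above
BASELINE_BITS = 16        # assumed 16-bit baseline for comparison
TOKENS_PER_SECOND = 30    # quoted rate on iPhone 15 Pro

if __name__ == "__main__":
    quantized = weight_storage_gb(NUM_PARAMS, QUANTIZED_BITS)
    baseline = weight_storage_gb(NUM_PARAMS, BASELINE_BITS)
    print(f"quantized weights: {quantized:.2f} GB")
    print(f"16-bit weights:    {baseline:.2f} GB")
    print(f"reduction factor:  {baseline / quantized:.1f}x")
    print(f"300-token reply:   {generation_seconds(300, TOKENS_PER_SECOND):.0f} s")
```

Under these assumptions the weights shrink from roughly 6 GB to roughly 1.4 GB, which is the difference between not fitting and fitting comfortably alongside other apps on a phone.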
Model Capabilities (Internal vs. External View): Apple’s internal human-evaluation benchmarks show this ~3B model outperforming larger, well-regarded open-source models like Mistral-7B, Gemma-7B, and Llama-3-8B on a variety of user-facing tasks.
Third-party developer benchmarks complement this picture. These independent tests suggest that on raw academic NLP benchmarks (like MMLU), the base on-device model may underperform similarly sized models like Phi-3 Mini. The two results do not contradict each other; together they clarify the model’s purpose. It is not a general-purpose, high-knowledge LLM; it is a highly tuned, task-oriented engine optimized for the specific functions of Apple Intelligence (summarization, tone adjustment, etc.).

