Google has announced the full release of Gemma 3n, its latest on-device AI model, delivering multimodal capabilities directly to smartphones and other edge devices. The AI model was first previewed last month.
“Building on this incredible momentum, we’re excited to announce the full release of Gemma 3n. While last month’s preview offered a glimpse, today unlocks the full power of this mobile-first architecture. Gemma 3n is designed for the developer community that helped shape Gemma. It’s supported by your favorite tools including Hugging Face Transformers, llama.cpp, Google AI Edge, Ollama, MLX, and many others, enabling you to fine-tune and deploy for your specific on-device applications with ease. This post is the developer deep dive: we’ll explore some of the innovations behind Gemma 3n, share new benchmark results, and show you how to start building today,” the company announced in a blog post.
Gemma 3n is built on a new architecture called MatFormer, short for Matryoshka Transformer. Google likens the design to Russian nesting dolls: the model contains smaller, fully functional sub-models nested inside larger ones, which lets developers scale performance up or down to match the available hardware. Gemma 3n currently ships in two primary versions: E2B, which runs in as little as 2GB of memory, and E4B, which needs roughly 3GB.
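For a sense of how the nested sizing plays out in practice, the following is a minimal sketch, assuming hypothetical Hugging Face repository ids for the two variants, of picking E2B or E4B based on a device's memory budget and then loading and prompting the chosen checkpoint with Transformers. Verify the exact model names and API details against the published model cards before relying on it.

```python
# Minimal sketch: choose a Gemma 3n variant by memory budget and load it with
# Hugging Face Transformers. The repo ids below are assumptions; check the
# actual repository names on huggingface.co before running.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

def pick_gemma_3n(free_accelerator_gb: float) -> str:
    # E2B targets roughly 2GB of memory, E4B roughly 3GB, per Google's figures.
    return "google/gemma-3n-E4B-it" if free_accelerator_gb >= 3 else "google/gemma-3n-E2B-it"

model_id = pick_gemma_3n(free_accelerator_gb=2.5)  # tight budget -> E2B

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

# The published Gemma 3n checkpoints are chat models, so prompts go through the chat template.
messages = [{"role": "user",
             "content": [{"type": "text", "text": "In one sentence, what is a Matryoshka Transformer?"}]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Because both variants would expose the same interface, moving between them is just a matter of swapping the model id to fit the target device.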
Despite raw parameter counts of 5 billion and 8 billion respectively, the two models have memory footprints comparable to much smaller 2B- and 4B-class models. Part of that efficiency comes from Per-Layer Embeddings (PLE), which lets a large share of the parameters be loaded and computed on the device's CPU, freeing up memory on the accelerator (GPU or TPU) for the core transformer weights. Gemma 3n also introduces KV Cache Sharing to speed up the handling of long audio and video inputs, which Google says can improve prefill performance, the initial processing of a long prompt before the first token is produced, by up to two times.
Gemma 3n’s multimodal capabilities are a key highlight. For speech-based applications, the model integrates a built-in audio encoder adapted from Google’s Universal Speech Model, allowing it to perform tasks such as speech-to-text and speech translation entirely on-device, with no internet connection. Initial evaluations show particularly strong results for translation between English and major European languages, including Spanish, French, Italian, and Portuguese. The audio encoder produces a token for roughly every 160 milliseconds of audio (about six per second), giving the model a fine-grained representation of the sound’s context.
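A hedged sketch of what on-device transcription might look like through the Transformers chat interface is shown below; the repository id, the "audio" message content type, and the local file name are assumptions for illustration and should be checked against the actual Gemma 3n model card.

```python
# Minimal sketch of speech-to-text plus translation with Gemma 3n via Transformers.
# The repo id, the "audio" content type, and the file path are assumptions based on
# the multimodal chat-message format; verify against the published model card.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "google/gemma-3n-E4B-it"  # hypothetical repo id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "meeting_clip.wav"},  # hypothetical local recording
            {"type": "text", "text": "Transcribe this audio, then translate it to Spanish."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```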
The model’s visual understanding is powered by MobileNet-V5, Google’s latest lightweight vision encoder, which can process video streams at up to 60 frames per second on devices such as the Google Pixel, enabling smooth, real-time video analysis directly on the device. Despite its smaller size and higher speed, MobileNet-V5 is reported to outperform earlier vision encoders in both speed and accuracy. Gemma 3n also supports text in over 140 languages and multimodal understanding of content in 35 languages, widening access to capable on-device AI globally.
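An image query might look like the following sketch using the Transformers "image-text-to-text" pipeline; the checkpoint name and image URL are placeholders rather than values confirmed in Google's post.

```python
# Minimal sketch of an image-plus-text query against a Gemma 3n checkpoint using
# the Hugging Face "image-text-to-text" pipeline. The repo id and image URL are
# hypothetical; substitute the checkpoint and input you actually use.
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E4B-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/frame.jpg"},  # hypothetical image
            {"type": "text", "text": "Describe what is happening in this frame."},
        ],
    }
]

result = pipe(text=messages, max_new_tokens=64)
print(result[0]["generated_text"][-1]["content"])
```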
Developers can readily access and integrate Gemma 3n using a range of popular tools and frameworks, including Hugging Face Transformers, Ollama, MLX, and llama.cpp. To further stimulate innovation, Google has launched the “Gemma 3n Impact Challenge,” inviting developers to create applications that leverage the model’s offline and multimodal capabilities, with a prize pool of $150,000 for winning entries. This opens up possibilities for AI-powered apps in remote areas where internet connectivity is unreliable or nonexistent, as well as in privacy-sensitive scenarios where transmitting data to cloud-based models is not viable.
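For a quick local-first example, the sketch below queries a Gemma 3n model served by Ollama through its Python client; the "gemma3n:e2b" tag is an assumption and should be confirmed against Ollama's model library after pulling the model.

```python
# Minimal sketch of querying a locally served Gemma 3n through the Ollama Python
# client. The "gemma3n:e2b" tag is assumed; check the Ollama library or `ollama list`
# for the exact tag once the model has been pulled.
import ollama

response = ollama.chat(
    model="gemma3n:e2b",
    messages=[{"role": "user", "content": "Draft a packing checklist for an off-grid field survey."}],
)
print(response["message"]["content"])
```

Because everything runs against a local server, a pattern like this keeps both the prompt and the response on the device, which is the kind of offline, privacy-sensitive use case the Impact Challenge is aimed at.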