Google Gemma 4: Open Models for Local Agentic AI
Google DeepMind has released Gemma 4, a new family of open models built from the same research behind Gemini 3. What makes this launch significant is that these models are designed to run locally on your own hardware — from smartphones to Raspberry Pi — with advanced agentic capabilities and a fully permissive Apache 2.0 license.
Four Models for Every Scenario
Gemma 4 ships in four sizes targeting different deployment scenarios:
- E2B (Effective 2B parameters): For mobile and IoT, runs with under 1.5GB of memory
- E4B (Effective 4B parameters): For edge devices with native audio and visual input
- 26B MoE (Mixture of Experts): For workstations balancing performance and efficiency
- 31B Dense: The most powerful variant, ranked #3 globally among open models
The 31B model ranks third on the Arena AI leaderboard, outperforming models with 20 times its parameter count.
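A quick back-of-envelope calculation shows why these sizes map to those deployment targets. The sketch below assumes roughly 4-bit quantized weights (a common local-inference setting, not something stated in the announcement) and counts only the weights themselves; KV cache and activations add more on top:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate memory needed to hold just the model weights,
    in GB (1 GB = 1e9 bytes). Ignores KV cache and activations."""
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

# Rough weight footprints at 4-bit quantization:
for name, params in [("E2B", 2), ("E4B", 4), ("26B MoE", 26), ("31B", 31)]:
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB")
```

At 4 bits per weight, the 2B model needs about 1.0 GB for weights, consistent with the sub-1.5GB figure above, and the 31B model's roughly 15.5 GB explains why a 16GB GPU is the stated floor.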
Multimodal by Default
All four models process images and video natively. The smaller E2B and E4B variants go further with native audio input, enabling real-time speech understanding directly on device — no internet connection required.
Context windows reach 128K tokens for smaller models and 256K tokens for the larger ones, with support for over 140 languages.
Agentic Capabilities: AI Agents on Your Device
The standout feature of Gemma 4 is Agent Skills — autonomous workflows running entirely on-device. These enable:
- Native function calling to interact with tools and APIs
- Structured JSON output for reliable production applications
- Multi-step planning and autonomous action execution
- External knowledge access like querying Wikipedia
- Interactive content generation including summaries and flashcards
This means you can build AI agents that run entirely on your hardware without sending data to the cloud.
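The function-calling loop behind such an agent can be sketched in a few lines. The exact JSON shape Gemma 4 emits is runtime-specific, so the `{"tool": ..., "arguments": ...}` format, the `get_weather` stub, and the tool registry below are illustrative assumptions, not the model's documented output:

```python
import json

def get_weather(city: str) -> str:
    # Stub standing in for a real local API, sensor, or database lookup.
    return f"Sunny in {city}"

# Hypothetical tool registry: the model is prompted with descriptions
# of these tools and asked to reply with a structured JSON tool call.
TOOLS = {"get_weather": get_weather}

def dispatch(model_reply: str) -> str:
    """Parse a JSON tool call like {"tool": ..., "arguments": {...}}
    and execute the matching local function."""
    call = json.loads(model_reply)
    fn = TOOLS[call["tool"]]
    return fn(**call["arguments"])

# A structured reply from the model might look like this:
reply = '{"tool": "get_weather", "arguments": {"city": "Dubai"}}'
print(dispatch(reply))  # Sunny in Dubai
```

Because parsing, dispatch, and the tools themselves all live in your process, no user data ever leaves the device.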
Edge Performance Numbers
The performance figures are impressive for locally running models:
- Mobile: Processes 4,000 input tokens across 2 distinct skills in under 3 seconds
- Raspberry Pi 5: 133 tokens/second prefill, 7.6 tokens/second decode
- Platforms: Android, iOS, Windows, Linux, macOS (Metal), WebGPU, Qualcomm IQ8 NPU
The 31B model runs at full speed on a single GPU with 16GB of VRAM, putting the most capable variant within reach of any developer with a modern workstation.
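The Raspberry Pi 5 figures above translate into concrete response times. This estimate treats prefill and decode as strictly sequential and ignores sampling overhead, which is a simplification:

```python
# Raspberry Pi 5 throughput figures quoted above.
PREFILL_TPS = 133.0  # prompt processing, tokens/second
DECODE_TPS = 7.6     # token generation, tokens/second

def response_time_s(prompt_tokens: int, output_tokens: int) -> float:
    """Rough end-to-end latency: prompt processing plus generation."""
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS

# E.g., a 1,000-token prompt answered with 100 tokens:
print(f"{response_time_s(1000, 100):.1f} s")  # 20.7 s
```

In other words, on a Pi the prompt is ingested quickly but generation dominates, so short, focused outputs are the key to responsive on-device agents.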
Apache 2.0: Full Freedom
Unlike some open models that come with commercial use restrictions or monthly active user caps, Gemma 4 ships under full Apache 2.0:
- No monthly active user limits
- No acceptable-use policy enforcement
- Full commercial and sovereign deployment freedom
- Free to modify and redistribute
This makes it an ideal choice for MENA businesses building local AI solutions while maintaining data sovereignty.
Framework Ecosystem
Gemma 4 is available immediately across a broad tool ecosystem:
- Hugging Face: Transformers, TRL, Transformers.js
- Local inference: llama.cpp, Ollama, LM Studio, MLX (Apple Silicon)
- Production: vLLM, SGLang, NVIDIA NIM, Baseten
- Edge: LiteRT-LM, Google AI Edge Gallery
- Fine-tuning: Unsloth, Keras, MaxText
What This Means for Developers
With support for over 140 languages, a fully permissive license, and the ability to run on modest hardware, Gemma 4 opens new possibilities:
- Offline-capable apps with native AI on mobile devices
- Local coding assistants as alternatives to expensive cloud APIs
- Enterprise solutions that keep data within the local network
- Multimodal chatbots that understand text, audio, and images
Bottom Line
Gemma 4 is not just an update to Google's open model family — it marks a meaningful shift toward running advanced AI on consumer hardware. With agentic capabilities, multimodal support, and broad language coverage, any developer can now build sophisticated AI applications without depending on costly cloud APIs.
Available now on Hugging Face and Google AI Studio for testing and deployment.