Google Gemma 4: Open Models for Local Agentic AI
Google DeepMind has released Gemma 4, a new family of open models built from the same research behind Gemini 3. What makes this launch significant is that these models are designed to run locally on your own hardware — from smartphones to Raspberry Pi — with advanced agentic capabilities and a fully permissive Apache 2.0 license.
Four Models for Every Scenario
Gemma 4 ships in four sizes targeting different deployment scenarios:
- E2B (Effective 2B parameters): For mobile and IoT, runs with under 1.5GB of memory
- E4B (Effective 4B parameters): For edge devices with native audio and visual input
- 26B MoE (Mixture of Experts): For workstations balancing performance and efficiency
- 31B Dense: The most powerful variant, ranked #3 globally among open models
The 31B model ranks third on the Arena AI leaderboard, outperforming models with 20 times its parameter count.
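A quick back-of-envelope calculation shows why these sizes map to those deployment targets. The sketch below assumes roughly 4-bit quantized weights (a common local-inference setting, not something stated in the announcement) and counts only the weights themselves; KV cache and activations add more on top:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate memory needed to hold just the model weights,
    in GB (1 GB = 1e9 bytes). Ignores KV cache and activations."""
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

# Rough weight footprints at 4-bit quantization:
for name, params in [("E2B", 2), ("E4B", 4), ("26B MoE", 26), ("31B", 31)]:
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB")
```

At 4 bits per weight, the 2B model needs about 1.0 GB for weights, consistent with the sub-1.5GB figure above, and the 31B model's roughly 15.5 GB explains why a 16GB GPU is the stated floor.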
Multimodal by Default
All four models process images and video natively. The smaller E2B and E4B variants go further with native audio input, enabling real-time speech understanding directly on device — no internet connection required.
Context windows reach 128K tokens for smaller models and 256K tokens for the larger ones, with support for over 140 languages.
Agentic Capabilities: AI Agents on Your Device
The standout feature of Gemma 4 is Agent Skills — autonomous workflows running entirely on-device. These enable:
- Native function calling to interact with tools and APIs
- Structured JSON output for reliable production applications
- Multi-step planning and autonomous action execution
- External knowledge access like querying Wikipedia
- Interactive content generation including summaries and flashcards
This means you can build AI agents that run entirely on your hardware without sending data to the cloud.
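The function-calling loop behind such an agent can be sketched in a few lines. The exact JSON shape Gemma 4 emits is runtime-specific, so the `{"tool": ..., "arguments": ...}` format, the `get_weather` stub, and the tool registry below are illustrative assumptions, not the model's documented output:

```python
import json

def get_weather(city: str) -> str:
    # Stub standing in for a real local API, sensor, or database lookup.
    return f"Sunny in {city}"

# Hypothetical tool registry: the model is prompted with descriptions
# of these tools and asked to reply with a structured JSON tool call.
TOOLS = {"get_weather": get_weather}

def dispatch(model_reply: str) -> str:
    """Parse a JSON tool call like {"tool": ..., "arguments": {...}}
    and execute the matching local function."""
    call = json.loads(model_reply)
    fn = TOOLS[call["tool"]]
    return fn(**call["arguments"])

# A structured reply from the model might look like this:
reply = '{"tool": "get_weather", "arguments": {"city": "Dubai"}}'
print(dispatch(reply))  # Sunny in Dubai
```

Because parsing, dispatch, and the tools themselves all live in your process, no user data ever leaves the device.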
Edge Performance Numbers
The performance figures are impressive for locally running models:
- Mobile: Processes 4,000 input tokens across 2 distinct skills in under 3 seconds
- Raspberry Pi 5: 133 tokens/second prefill, 7.6 tokens/second decode
- Platforms: Android, iOS, Windows, Linux, macOS (Metal), WebGPU, Qualcomm IQ8 NPU
The 31B model runs at full speed on a single GPU with 16GB of VRAM, putting the most capable variant within reach of any developer with a modern workstation.
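The Raspberry Pi 5 figures above translate into concrete response times. This estimate treats prefill and decode as strictly sequential and ignores sampling overhead, which is a simplification:

```python
# Raspberry Pi 5 throughput figures quoted above.
PREFILL_TPS = 133.0  # prompt processing, tokens/second
DECODE_TPS = 7.6     # token generation, tokens/second

def response_time_s(prompt_tokens: int, output_tokens: int) -> float:
    """Rough end-to-end latency: prompt processing plus generation."""
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS

# E.g., a 1,000-token prompt answered with 100 tokens:
print(f"{response_time_s(1000, 100):.1f} s")  # 20.7 s
```

In other words, on a Pi the prompt is ingested quickly but generation dominates, so short, focused outputs are the key to responsive on-device agents.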
Apache 2.0: Full Freedom
Unlike some open models that come with commercial use restrictions or monthly active user caps, Gemma 4 ships under full Apache 2.0:
- No monthly active user limits
- No acceptable-use policy enforcement
- Full commercial and sovereign deployment freedom
- Free to modify and redistribute
This makes it an ideal choice for MENA businesses building local AI solutions while maintaining data sovereignty.
Framework Ecosystem
Gemma 4 is available immediately across a broad tool ecosystem:
- Hugging Face: Transformers, TRL, Transformers.js
- Local inference: llama.cpp, Ollama, LM Studio, MLX (Apple Silicon)
- Production: vLLM, SGLang, NVIDIA NIM, Baseten
- Edge: LiteRT-LM, Google AI Edge Gallery
- Fine-tuning: Unsloth, Keras, MaxText
What This Means for Developers
With support for over 140 languages, a fully permissive license, and the ability to run on modest hardware, Gemma 4 opens new possibilities:
- Offline-capable apps with native AI on mobile devices
- Local coding assistants as alternatives to expensive cloud APIs
- Enterprise solutions that keep data within the local network
- Multimodal chatbots that understand text, audio, and images
Bottom Line
Gemma 4 is not just an update to Google's open model family — it marks a meaningful shift toward running advanced AI on consumer hardware. With agentic capabilities, multimodal support, and broad language coverage, any developer can now build sophisticated AI applications without depending on costly cloud APIs.
Available now on Hugging Face and Google AI Studio for testing and deployment.