writing/blog/2026/06
BlogJun 10, 2026·6 min read

Google Gemma 4 QAT: Run Frontier AI Locally in 2026

Google's Gemma 4 QAT cuts VRAM by 72%, enabling a 26B-parameter model on a 16GB laptop. Deploy locally with Ollama, llama.cpp, or vLLM — zero per-token cost.

On June 6, 2026, Google DeepMind quietly published Quantization-Aware Training (QAT) checkpoints for the entire Gemma 4 family. The result: a 26B-parameter multimodal model that fits inside 15 GB of RAM, and a 2B model that boots on a Raspberry Pi 5. Local AI just got a serious upgrade.

What Is QAT and Why Does It Beat Standard Quantization?

Post-Training Quantization (PTQ) compresses a finished model by rounding weights to lower-precision formats. It is fast to apply but introduces rounding errors that compound across layers, often degrading accuracy by 5–15% on reasoning benchmarks.

Quantization-Aware Training (QAT) takes the opposite approach: it simulates quantization during training, forcing the model to learn weights that tolerate INT4 arithmetic from the start. The model never "sees" float values that later get brutally rounded — it was trained under those constraints.

The practical result for Gemma 4: the 26B-A4B QAT variant scores 82.6% on MMLU Pro, 88.3% on AIME 2026, and 77.1% on LiveCodeBench — figures nearly identical to the FP16 baseline — while running in roughly 15 GB of VRAM.

Contrast that with a naive INT4 conversion of the same model: 70.2% accuracy vs. 85.6% with Unsloth's dynamic GGUFs built on top of Google's QAT checkpoints. That 15-point gap is the cost of skipping QAT.

Model Variants and Hardware Requirements

Google released QAT checkpoints for four model sizes:

ModelVRAM (QAT 4-bit)ContextFits On
E2B~1 GB (mobile)128KPhones, Raspberry Pi 5
E4B~5 GB128K8 GB laptops
26B-A4B~15 GB256K16 GB machines
31B~18 GB256K24 GB GPUs

The E2B in mobile quantization format drops to under 1 GB — small enough to embed in an Android app without streaming from a server.

Deployment Option 1: Ollama (Fastest Start)

Ollama handles model download, format conversion, and a local API in one command:

# Install Ollama
brew install ollama          # macOS
curl -fsSL https://ollama.com/install.sh | sh   # Linux
 
# Pull a QAT model
ollama pull gemma4:e4b-it-qat
ollama pull gemma4:26b-it-qat
 
# Run interactively
ollama run gemma4:26b-it-qat "Summarize the key benefits of QAT in two sentences."
 
# Verify the REST API is live
curl http://localhost:11434/api/tags

Ollama serves an OpenAI-compatible endpoint at localhost:11434/v1, so existing code using the OpenAI SDK works with a one-line base URL change.

Deployment Option 2: llama.cpp (Maximum Control)

For lower-level control over sampling, quantization format, and multimodal projection, use llama.cpp directly with Unsloth's dynamic GGUFs:

# Interactive chat
./llama.cpp/llama-cli \
  -hf unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL \
  --temp 1.0 --top-p 0.95 --top-k 64
 
# Local server with vision support
./llama.cpp/llama-server \
  --model gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf \
  --mmproj mmproj-BF16.gguf \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --port 8001 \
  --chat-template-kwargs '{"enable_thinking":true}'

Always use the UD-Q4_K_XL variant from Unsloth rather than raw Q4_0 GGUFs — the dynamic format preserves accuracy on outlier layers that standard INT4 rounds aggressively.

Deployment Option 3: vLLM (Production Servers)

For teams running Gemma 4 as an internal API endpoint:

vllm serve google/gemma-4-31B-it-qat-w4a16-ct \
  --max-model-len 32768 \
  --port 8000

Cap --max-model-len to your actual usage. The full 256K context window reserves a large KV cache that limits request concurrency — on a 24 GB GPU, 32K is a reasonable starting point for multi-user scenarios.

Deployment Option 4: LiteRT-LM (Android / Edge)

For mobile and edge deployments, Google's LiteRT-LM runtime handles the low-precision kernels transparently:

  1. Export the E2B model to the mobile quantization schema via the ai-edge-torch library.
  2. Bundle the .task file in your Android assets directory.
  3. The runtime auto-detects NPU availability (Qualcomm, MediaTek, Google Tensor) and routes inference accordingly.

The E2B model runs at roughly 2x the throughput of its FP16 counterpart on mobile NPUs, consuming 40–50% less memory.

Apple Silicon (MLX)

On Macs with M-series chips, use the MLX backend for optimized unified-memory inference:

pip install mlx-lm
mlx_lm.generate \
  --model mlx-community/gemma-4-26B-A4B-it-qat-4bit \
  --prompt "Explain kernel quantization in one paragraph."

The unified memory architecture means the GPU and CPU share the same physical RAM pool — a 16 GB M3 MacBook Pro can comfortably run the 26B model without any swapping.

Critical Developer Tips

Sampling settings matter. Google tuned the QAT checkpoints with temperature 1.0, top_p 0.95, top_k 64. Changing these (especially to greedy decoding) can shift output quality unexpectedly.

Avoid raw Q4_0 GGUFs. A naive INT4 conversion loses up to 15 accuracy points on reasoning tasks. Unsloth's UD-Q4_K_XL format applies dynamic quantization group sizes to outlier layers, recovering that gap.

Context budget is additive. The listed VRAM figures cover model weights only. Every additional 1K tokens of context adds KV cache on top — plan accordingly on 16 GB machines running the 26B model.

vLLM format vs. llama.cpp format. For vLLM/SGLang use the w4a16-ct (compressed-tensors) checkpoints from Google's HuggingFace org. For llama.cpp/Ollama use Unsloth's GGUF variants. They are not interchangeable.

Why Local AI Matters in 2026

The economics of inference are shifting. At scale, per-token API costs add up quickly: a product making 10 million API calls per day at $0.003 per 1K tokens spends roughly $30,000 monthly on inference alone. Running Gemma 4 26B locally eliminates that line item entirely.

For developers in MENA building products that handle sensitive data — financial records, medical summaries, legal documents — local inference also eliminates the cross-border data transfer question. Your data never leaves your infrastructure.

Conclusion

Google Gemma 4 QAT is the most practical local-inference upgrade of 2026. A 26B-parameter multimodal model now fits on the same 16 GB developer laptop that struggled with 7B models two years ago, without sacrificing accuracy. Whether you reach for Ollama for quick iteration, llama.cpp for production control, vLLM for team APIs, or LiteRT-LM for mobile — the path to zero-per-token AI is now straightforward.

The QAT checkpoints are available on Hugging Face under the Google DeepMind org. Unsloth's dynamic GGUFs are the recommended starting point for llama.cpp deployments.