PixelRAG: UC Berkeley's Visual RAG Beats Text Parsers With 18% Accuracy Gain and 10× Cost Reduction

Researchers from UC Berkeley, Princeton, and EPFL have introduced PixelRAG, a retrieval-augmented generation system that reads web pages the way humans do, by looking at them, instead of stripping their HTML into plain text. Across six benchmarks and an index of 30 million screenshot tiles covering all of Wikipedia, PixelRAG delivers up to an 18.1% accuracy improvement over text-based RAG while cutting token costs by up to 10 times.

Key Highlights

PixelRAG renders pages as screenshots and indexes visual tiles instead of parsed text
Up to 18.1% accuracy improvement over text-based RAG on six benchmarks
Up to 10× lower token costs compared to legacy RAG pipelines
2–4× cheaper than Google Search while maintaining higher accuracy
All 8.28 million Wikipedia pages pre-indexed and served through a publicly accessible API at api.pixelrag.ai
Fully open-source on GitHub at github.com/StarTrail-org/PixelRAG

The Problem With Text Parsing

Traditional RAG pipelines convert HTML pages into text before indexing — a step that systematically destroys information. Tables get flattened, visual layouts collapse, charts become invisible, and the spatial relationships between elements that carry meaning vanish entirely. In enterprise deployments, this silent degradation of source content is frequently the root cause of wrong answers from AI agents.

The PixelRAG team, advised by Matei Zaharia (Databricks CTO and Apache Spark co-creator) alongside other advisors from Berkeley's BAIR lab and NLP Group, quantified this problem at Wikipedia scale: the majority of incorrect answers in standard RAG systems trace directly to information lost during HTML-to-text conversion.

How PixelRAG Works

PixelRAG takes a fundamentally different approach. Rather than parsing markup, the pipeline:

Renders pages as screenshots using Playwright at an 875-pixel viewport
Slices each page into 1024-pixel-tall tiles for fine-grained retrieval
Embeds tiles using Qwen3-VL-Embedding-2B, a vision-language model fine-tuned via LoRA on screenshot data
Indexes tiles in a FAISS approximate nearest-neighbor index (approximately 217 GB for all of Wikipedia)
Feeds retrieved tiles directly to a vision-language model reader that simultaneously processes visual layout and textual content
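
The slicing step above can be illustrated with a minimal sketch. It only computes the crop boxes for a full-page screenshot. The 875 and 1024 figures come from the article; treating 875 as the viewport width, leaving the last tile shorter rather than padding it, and using no overlap between tiles are all assumptions. The names Tile and slice_into_tiles are hypothetical and are not taken from the PixelRAG codebase.

```python
from dataclasses import dataclass

VIEWPORT_WIDTH = 875  # from the article; assumed here to be the width in pixels
TILE_HEIGHT = 1024    # from the article


@dataclass(frozen=True)
class Tile:
    """Crop box for one tile, in screenshot pixel coordinates."""
    index: int
    left: int
    top: int
    right: int
    bottom: int


def slice_into_tiles(page_height, width=VIEWPORT_WIDTH, tile_height=TILE_HEIGHT):
    """Return crop boxes that cover a full-page screenshot from top to bottom.

    Assumption: tiles do not overlap and the final tile may be shorter
    than tile_height. The real pipeline may pad or overlap instead.
    """
    if page_height <= 0:
        return []
    tiles = []
    top = 0
    while top < page_height:
        bottom = min(top + tile_height, page_height)
        tiles.append(Tile(len(tiles), 0, top, width, bottom))
        top = bottom
    return tiles


# A 2500-pixel-tall page yields two full tiles and one partial tile.
for tile in slice_into_tiles(2500):
    print(tile)
```

Each box could then be passed to an image library's crop function before embedding; that step is omitted here to keep the sketch dependency-free.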

Both natural-language text queries and image queries are supported — an agent can search with a diagram, a screenshot crop, or plain English.
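
One way to picture why text and image queries can share an index: if both are embedded into the same vector space as the tiles, retrieval reduces to a nearest-neighbor search. The sketch below uses exact cosine similarity in plain Python as a stand-in for the FAISS approximate search the article describes. The three-dimensional vectors are made-up toy values, not real model outputs, and the function names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query_vec, tile_vecs, k=3):
    """Rank tiles by cosine similarity to the query.

    This is exact brute-force search, used here only for illustration;
    the article says the real system uses an approximate FAISS index.
    """
    q = normalize(query_vec)
    scored = []
    for tile_id, vec in tile_vecs.items():
        t = normalize(vec)
        score = sum(a * b for a, b in zip(q, t))
        scored.append((score, tile_id))
    # Highest score first; tile id breaks ties deterministically.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(tile_id, score) for score, tile_id in scored[:k]]


# Toy embeddings (assumed values). Real ones would come from the embedding model.
tiles = {
    "page1#tile0": [0.9, 0.1, 0.0],
    "page1#tile1": [0.1, 0.9, 0.1],
    "page2#tile0": [0.0, 0.2, 0.9],
}
text_query_vec = [1.0, 0.0, 0.0]   # stands in for an embedded plain-English question
image_query_vec = [0.0, 0.1, 1.0]  # stands in for an embedded screenshot crop

print(top_k(text_query_vec, tiles, k=1))
print(top_k(image_query_vec, tiles, k=1))
```

The search code never needs to know which modality produced the query vector, which is the property that lets one index serve both kinds of query.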

Benchmark Results

Tested across six retrieval benchmarks at scale:

Up to 18.1% accuracy improvement over text-based baselines
Up to 10× token cost reduction versus legacy RAG pipelines
2–4× lower costs than Google Search with better accuracy

Scale and Availability

The team pre-indexed all 8.28 million Wikipedia pages, building a FAISS index of approximately 217 GB. A hosted API endpoint is live at api.pixelrag.ai. The full framework — fine-tuned Qwen3-VL-Embedding-2B, Playwright rendering pipeline, FAISS indexer, and retrieval server — is open source on GitHub. The system runs on PyTorch 2.9.1, Transformers 4.57.1, and cuDNN 9.20, and supports incremental index updates without requiring a full re-index.
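
As a rough scale check on these figures, the short calculation below derives the average number of tiles per page and the index storage per tile. It uses only the numbers reported in the article (30 million tiles, 8.28 million pages, roughly 217 GB) and assumes decimal gigabytes.

```python
PAGES = 8_280_000          # Wikipedia pages, from the article
TILES = 30_000_000         # screenshot tiles, from the article
INDEX_BYTES = 217 * 10**9  # index size, assuming 1 GB = 10^9 bytes

tiles_per_page = TILES / PAGES
bytes_per_tile = INDEX_BYTES / TILES

print(f"{tiles_per_page:.2f} tiles per page on average")   # 3.62
print(f"{bytes_per_tile:.0f} bytes of index per tile")     # 7233
```

That works out to about 7.2 KB of index per tile. The article does not state the embedding dimension, numeric precision, or any index compression, so this is only an order-of-magnitude figure.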

Why This Matters for AI Agents

For developers building AI agents that browse or search web content, PixelRAG addresses one of the most persistent quality gaps: agents silently failing to find or correctly interpret information that was visually structured at the source. Product comparison tables, financial reports with structured columns, mixed-language pages with Arabic RTL layouts, and infographic-heavy documentation all survive the PixelRAG pipeline with their structure intact.

The cost angle is equally significant. With token costs up to 10× lower than traditional RAG and costs 2–4× lower than Google Search, PixelRAG makes visual-native retrieval economically viable for high-volume agent workloads.

Research Background

The paper, "PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation," is authored by Yichuan Wang, Zhifei Li, Zirui Wang, Paul Teiletche, and Lesheng Jin, with advising from Matei Zaharia, Joseph E. Gonzalez, and Sewon Min. The team spans Berkeley SkyLab, BAIR (Berkeley AI Research), the Berkeley NLP Group, Princeton, and EPFL.

Source: VentureBeat