Inception Labs Launches Mercury 2: The Fastest Reasoning LLM Built on Diffusion Architecture

Inception Labs has launched Mercury 2, which it describes as the world's first reasoning language model built on a diffusion architecture, claiming it runs more than five times faster than leading speed-optimized LLMs while costing dramatically less to run.
Key Highlights
- Mercury 2 achieves 1,009 tokens per second output throughput, compared to 89 tokens/s for Claude 4.5 Haiku and 71 tokens/s for GPT-5 Mini
- End-to-end latency of just 1.7 seconds, versus 14.4 seconds for Gemini 3 Flash and 23.4 seconds for Claude 4.5 Haiku with reasoning
- Pricing starts at $0.25 per million input tokens and $0.75 per million output tokens — up to 75% cheaper than competitors
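Using only the figures quoted above, the practical impact can be sketched with a back-of-the-envelope calculation (the 10k-input / 2k-output request size is an illustrative workload, not from the announcement):

```python
# Back-of-the-envelope check using only the figures quoted above.

MERCURY_INPUT_PER_M = 0.25   # $ per million input tokens
MERCURY_OUTPUT_PER_M = 0.75  # $ per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for one request at Mercury 2's listed rates."""
    return (input_tokens / 1e6) * MERCURY_INPUT_PER_M + \
           (output_tokens / 1e6) * MERCURY_OUTPUT_PER_M

def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream `output_tokens` at a given throughput."""
    return output_tokens / tokens_per_second

# A hypothetical 10k-in / 2k-out request costs a fraction of a cent:
cost = request_cost(10_000, 2_000)               # 0.0025 + 0.0015 = $0.004
# Emitting 2,000 tokens at the quoted throughputs:
mercury_time = generation_seconds(2_000, 1_009)  # ~2.0 s
haiku_time = generation_seconds(2_000, 89)       # ~22.5 s

print(f"cost=${cost:.4f}  mercury={mercury_time:.1f}s  haiku={haiku_time:.1f}s")
```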
A New Architecture for Language Models
Unlike traditional autoregressive LLMs that generate text one token at a time, Mercury 2 uses a diffusion-based approach that refines multiple text blocks simultaneously. The concept is similar to how image diffusion models work: instead of writing word by word, the model works more like an editor revising an entire draft at once.
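The decoding-order difference can be shown with a toy script. This is a conceptual illustration under stated assumptions, not Inception's actual algorithm: a real diffusion LM denoises learned token distributions, whereas here the "denoising" simply reveals a fixed target sequence several positions at a time.

```python
import math
import random

# Toy illustration of decoding order (NOT Inception's actual algorithm):
# an autoregressive decoder commits one token per step, left to right,
# while a diffusion-style decoder starts from a fully masked draft and
# fills in several positions per refinement pass over the whole sequence.

random.seed(0)
TARGET = ["the", "model", "revises", "the", "whole", "draft", "at", "once"]

def autoregressive_decode(target):
    """One token per step: step count equals sequence length."""
    draft, steps = [], 0
    for token in target:
        draft.append(token)  # commit left-to-right, never revisit
        steps += 1
    return draft, steps

def diffusion_decode(target, per_step=3):
    """Unmask up to `per_step` random positions per refinement pass."""
    draft = ["[MASK]"] * len(target)
    steps = 0
    while "[MASK]" in draft:
        masked = [i for i, t in enumerate(draft) if t == "[MASK]"]
        for i in random.sample(masked, min(per_step, len(masked))):
            draft[i] = target[i]  # real models denoise toward a prediction
        steps += 1
    return draft, steps

_, ar_steps = autoregressive_decode(TARGET)    # 8 sequential steps
_, diff_steps = diffusion_decode(TARGET)       # ceil(8 / 3) = 3 passes
print(ar_steps, diff_steps)
```

The speed claim rests on exactly this shape: fewer, wider steps that keep the accelerator busy, rather than a long chain of one-token-at-a-time decisions.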
Founded by professors from Stanford, Cornell, and UCLA who pioneered foundational diffusion research, Inception Labs has successfully commercialized this architecture for text generation. Mercury 2 extends their earlier work into production-grade reasoning.
Benchmark Performance
On standard reasoning benchmarks, Mercury 2 shows competitive results:
- AIME (Math): 91 — outperforming both Gemini 3 Flash (78) and Claude 4.5 Haiku (84)
- GPQA Diamond (Science): 74
- LCB (Code): 67
- SciCode: 38
While Mercury 2 does not yet match frontier models like Claude 4.6 or GPT-5.3 on every benchmark, it shifts the price-performance and latency-quality tradeoffs in AI inference.
Technical Specifications
Mercury 2 supports a 128K context window, tool usage, and JSON output. The model is available through an OpenAI-compatible API, making integration straightforward for developers already working with existing LLM toolchains.
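Because the API is OpenAI-compatible, a request is an ordinary chat-completions payload. The sketch below builds one with only the standard library; the base URL and model identifier are illustrative guesses, not documented values, so check Inception's API reference for the real ones.

```python
import json
import urllib.request

# Hypothetical endpoint and model id -- confirm against Inception's docs.
BASE_URL = "https://api.inceptionlabs.ai/v1"
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "mercury-2",  # illustrative model id
    "messages": [
        {"role": "user",
         "content": "Return a one-line summary of diffusion LLMs as JSON."}
    ],
    # JSON output mode, per the listed specs
    "response_format": {"type": "json_object"},
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# Actually sending requires a real key:
#   body = urllib.request.urlopen(req).read()
print(req.full_url)
```

Since the request shape matches OpenAI's chat-completions format, existing SDKs should also work by pointing their base URL at Inception's endpoint.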
Impact on the Industry
The implications are significant for real-time AI applications. Agent loops, voice interfaces, search systems, and coding assistants all benefit from faster inference. At current pricing, Mercury 2 could enable use cases that were previously too expensive or too slow to deploy at scale.
Andrej Karpathy, the former OpenAI researcher and Tesla AI director, is among the investors in Inception Labs — a sign that the diffusion approach to language modeling is gaining serious credibility in the AI research community.
What's Next
Mercury 2 is available today via the Inception API. If diffusion-based models continue to close the quality gap with frontier autoregressive models, they could fundamentally reshape the economics of large language model deployment.
Source: Business Wire