Moonshot AI Introduces Attention Residuals, a Major Rethink of Transformer Architecture

By AI Bot

Moonshot AI, the Chinese company behind the Kimi chatbot and LLM family, has introduced a new architectural technique called Attention Residuals (AttnRes) that rethinks one of the most fundamental building blocks of the Transformer — the residual connection. The research, published on March 15, 2026, has already drawn widespread attention across the AI community, with Elon Musk calling it "impressive work."

What Are Attention Residuals?

Since the introduction of the Transformer in 2017, residual connections have been the standard mechanism for passing information between layers. Each layer simply adds its output to the running total from all previous layers — a fixed, uniform accumulation that treats every prior layer equally.

The problem, as Moonshot AI researchers explain, is that this approach causes hidden-state magnitudes to grow with depth, progressively diluting the contribution of early layers. Important signals from layer 2 or layer 20 get buried under dozens of subsequent additions — a phenomenon they call PreNorm dilution.
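The dilution effect can be seen in a toy numpy sketch (ours, not Moonshot's code): each layer adds its output to a running hidden state, so the state's magnitude grows with depth while any single early layer's share of it shrinks.

```python
import numpy as np

# Toy illustration of standard residual accumulation: the hidden state
# is the uniform sum of all layer outputs, so its norm grows with depth
# and an early layer's contribution is progressively diluted.
rng = np.random.default_rng(0)
d, depth = 64, 48

hidden = rng.standard_normal(d)
layer2_output = None
for layer in range(depth):
    layer_out = rng.standard_normal(d)  # stand-in for an attention/MLP block
    if layer == 2:
        layer2_output = layer_out
    hidden = hidden + layer_out         # fixed, uniform accumulation

# Ratio of layer 2's norm to the final hidden-state norm: small,
# and it keeps shrinking as depth increases.
print(np.linalg.norm(layer2_output) / np.linalg.norm(hidden))
```

With 48 layers the ratio is already well under a half; doubling the depth shrinks it further, which is the dilution the researchers describe.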

Attention Residuals replace this fixed accumulation with a lightweight depth-wise attention mechanism. Instead of blindly adding all previous outputs, each layer uses softmax attention to selectively retrieve information from the layers that matter most. Think of it as giving the model the ability to "look back" and decide which earlier layers contain the most relevant information for the current computation.
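The mechanism described above can be sketched as follows. This is a minimal illustration of the idea, with function and variable names of our own choosing, not Moonshot's released implementation:

```python
import numpy as np

# Hypothetical sketch of depth-wise attention: instead of summing all
# previous layer outputs, the current layer scores each earlier output
# with softmax attention and retrieves a weighted combination.
def depthwise_attention(query, past_outputs):
    """query: (d,) current layer's state; past_outputs: (L, d) earlier layers."""
    d = query.shape[0]
    scores = past_outputs @ query / np.sqrt(d)   # one score per earlier layer
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over the depth axis
    return weights @ past_outputs                # selective retrieval

rng = np.random.default_rng(0)
past = rng.standard_normal((12, 64))            # outputs of 12 earlier layers
q = past[3] + 0.1 * rng.standard_normal(64)     # query resembling layer 3
out = depthwise_attention(q, past)              # out is dominated by layer 3
```

Because the query resembles layer 3's output, the softmax concentrates its weight there, so the retrieved vector is close to layer 3's output rather than a uniform average of all twelve layers.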

Key Highlights

  • Drop-in replacement for standard residual connections — no changes to the rest of the architecture required
  • Block AttnRes partitions layers into compressed blocks, reducing depth-wise memory complexity from O(Ld) to O(Nd), where L is the number of layers, N ≪ L the number of blocks, and d the hidden dimension — making it practical at scale
  • Only 2% parameter overhead — minimal impact on model size
  • Integrated into Kimi Linear, Moonshot AI's mixture-of-experts architecture with 48 billion total parameters and 3 billion activated parameters
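The blockwise memory saving can be illustrated in a few lines. Note that compressing each block by averaging is our assumption for the sketch; the paper may use a learned compression:

```python
import numpy as np

# Illustrative sketch of Block AttnRes's memory reduction: instead of
# caching all L layer outputs (L*d floats), keep one summary vector per
# block of layers, shrinking the depth-wise cache to N*d floats.
L, d, block_size = 48, 64, 8
N = L // block_size                              # N = 6 blocks

rng = np.random.default_rng(0)
layer_outputs = rng.standard_normal((L, d))      # full cache: O(L*d)
blocks = layer_outputs.reshape(N, block_size, d).mean(axis=1)  # cache: O(N*d)

print(layer_outputs.size, "->", blocks.size)     # 3072 -> 384
```

Here the cache shrinks by a factor of block_size (8x), and later layers would attend over the N block summaries rather than all L raw outputs.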

Benchmark Results

Pre-trained on 1.4 trillion tokens, the AttnRes-enhanced Kimi Linear model showed consistent improvements across major benchmarks:

| Benchmark | Before | After | Gain |
| --- | --- | --- | --- |
| MMLU | 73.5 | 74.6 | +1.1 |
| GPQA-Diamond | 36.9 | 44.4 | +7.5 |
| BBH | 76.3 | 78.0 | +1.7 |
| Math | 53.5 | 57.1 | +3.6 |
| HumanEval | 59.1 | 62.2 | +3.1 |
| MBPP | 72.0 | 73.9 | +1.9 |

The most striking improvement came on GPQA-Diamond, a graduate-level science reasoning benchmark, where scores jumped by 7.5 points. The Math benchmark and the programming benchmarks HumanEval and MBPP also saw meaningful gains, suggesting that selective layer retrieval particularly benefits complex reasoning tasks.

Why It Matters

Residual connections have remained essentially unchanged since their introduction in ResNet in 2015 and their adoption by the original Transformer paper in 2017. AttnRes is one of the first successful attempts to fundamentally rethink how information flows between layers in deep networks.

The approach is especially significant because it achieves these gains without brute-force scaling. While the industry trend has been to train larger models on more data, Attention Residuals show that architectural innovations can still unlock meaningful performance improvements with negligible cost.

Open Source and Industry Reaction

Moonshot AI has released the code on GitHub, making AttnRes available for the broader research community to build upon. Because it is a drop-in replacement for standard residual connections, the technique could potentially be adopted by any Transformer-based model, from language models to vision transformers.

The research has generated significant buzz on social media, with AI researchers and engineers noting the elegance of treating depth as a sequence dimension — drawing a parallel between how attention operates across tokens in a sequence and how AttnRes operates across layers in a network.

What Comes Next

With open-source code available, the next question is whether major labs will adopt AttnRes in their next-generation models. The minimal overhead and consistent gains make it an attractive addition, particularly for models pushing into deeper architectures where PreNorm dilution becomes more pronounced.


Source: Moonshot AI — Attention Residuals

