NVIDIA Nemotron TwoTower: Diffusion LLM Guide

For a decade, large language models have decoded text the same way: one token at a time, left to right. That single design choice is also the single biggest bottleneck in generation speed. On July 1, 2026, NVIDIA released Nemotron-Labs-TwoTower, an open-weight diffusion language model that breaks the one-token-at-a-time rule and generates text 2.42x faster while keeping 98.7% of a strong autoregressive model's benchmark quality.

This is one of the first production-grade, open-weight diffusion LLMs from a major lab. Here is what it is, why it matters, and how to run it.

The bottleneck TwoTower attacks

Autoregressive (AR) models — GPT, Claude, Llama, and almost everything you use — predict the next token conditioned on all previous tokens. Each token requires a full forward pass, and you cannot start token 100 until token 99 exists. Generation is inherently sequential, so throughput is capped no matter how much hardware you throw at it.

Diffusion language models take a different route. Instead of decoding left to right, they start with a block of masked placeholder tokens and iteratively refine them in parallel, committing several tokens per step. The idea has been promising in research for a while, but quality consistently lagged behind AR models. TwoTower is notable because it closes almost the entire gap.

How the two towers work

The clever part is in the name. TwoTower does not train a diffusion model from scratch — it reuses an existing, already-trained autoregressive model.

Context tower (frozen): an off-the-shelf Nemotron-3-Nano-30B-A3B backbone, pretrained on 25 trillion tokens. Its weights are never touched.
Denoiser tower (trained): a second copy that learns to turn masked noise into clean tokens, guided at every layer by the frozen context tower through cross-attention.
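To make "guided through cross-attention" a little more concrete, here is a minimal single-head cross-attention sketch in plain Python. It is a toy under stated assumptions: the function name `cross_attention`, the lack of learned projections, and the two-dimensional vectors are illustrative choices, not details taken from the TwoTower code.

```python
import math

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention over plain lists.

    queries stand in for denoiser hidden states; keys and values stand in
    for the frozen context tower's hidden states at the same layer.
    Toy version: no learned projections, no batching.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

context_states = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical context-tower states
denoiser_states = [[4.0, 0.0]]              # hypothetical denoiser query
mixed = cross_attention(denoiser_states, context_states, context_states)
print(mixed)
```

The denoiser query points toward the first context vector, so the output leans toward it. In the real model this read would happen at every layer, with the context tower's weights left untouched.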

Because only the denoiser is trained, the whole model reaches AR-level quality after just 2.1 trillion tokens of training, about 8% of the 25T the backbone originally consumed. You get diffusion speed without re-paying the full pretraining bill.

Each tower is a hybrid stack of 52 layers: 23 Mamba-2 layers, 6 self-attention layers, and 23 mixture-of-experts (MoE) layers. The MoE config routes to 6 of 128 experts plus 2 shared experts per token, so despite roughly 60B total parameters across both towers, only about 3B parameters are active per token per tower. A tiny time-conditioning module (around 1.5M parameters) tells the denoiser which diffusion step it is on.
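The "6 of 128 experts plus 2 shared" routing can be sketched in a few lines. This is a simplified illustration, not the model's router: the softmax over the selected logits, the random logits, and the name `route_token` are assumptions made for the example.

```python
import math
import random

NUM_EXPERTS = 128   # routed experts per MoE layer (from the article)
TOP_K = 6           # routed experts chosen per token (from the article)
NUM_SHARED = 2      # shared experts that run for every token (from the article)

def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts by logit and softmax-normalize their weights.

    Assumption: weights are a softmax over the selected logits only.
    """
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    top = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - top) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
routed = route_token(logits)
experts_run = len(routed) + NUM_SHARED
print(f"experts run for this token: {experts_run} of {NUM_EXPERTS + NUM_SHARED}")
```

Only a small slice of the expert parameters does work for any given token, which is how a tower with tens of billions of parameters ends up with about 3B active.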

Block-wise decoding, step by step

Generation happens block by block rather than token by token:

A block of S positions (default 16) starts as [MASK] tokens.
The denoiser runs T refinement steps (default 16) over that block.
Within a block, attention is bidirectional; past blocks are attended to causally.
Layer-aligned cross-attention pulls context from the frozen tower at every layer.
High-confidence tokens are committed early — often several per step — instead of one.
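The attention pattern described in the steps above (bidirectional inside a block, causal across blocks) can be sketched as a boolean mask. This is an illustrative toy: the helper name `block_attention_mask` is hypothetical, and the way the release actually builds its masks is not shown in this article.

```python
def block_attention_mask(seq_len, block_size=16):
    """mask[i][j] is True when position i may attend to position j.

    Positions in the same block see each other in both directions;
    earlier blocks are visible, later blocks are hidden.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Small example: 6 positions in blocks of 3, printed as a grid.
mask = block_attention_mask(seq_len=6, block_size=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the first three rows see only their own block, while the last three rows see both blocks.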

Committing many tokens per step is where the 2.42x wall-clock speedup comes from. The confidence_threshold knob trades quality for speed: a lower threshold commits more tokens per step for faster output, while a higher one keeps refining for higher fidelity.
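A toy simulation shows how the threshold changes the number of refinement steps a block needs. Everything here is a stand-in: `decode_block` and `propose` are hypothetical names, the confidences are random numbers rather than model probabilities, and the rule of always committing the single best guess when nothing clears the threshold is an assumption, not a documented behavior of the release.

```python
import random

MASK = None  # stand-in for the [MASK] token

def decode_block(propose, block_size=16, max_steps=16, confidence_threshold=0.8):
    """Refine one block and return (tokens, steps_used).

    propose(position) -> (token, confidence) is a hypothetical stand-in
    for one denoiser prediction at a masked position.
    """
    block = [MASK] * block_size
    steps_used = 0
    while MASK in block and steps_used < max_steps:
        steps_used += 1
        guesses = {pos: propose(pos) for pos, tok in enumerate(block) if tok is MASK}
        confident = [pos for pos, (_, conf) in guesses.items()
                     if conf >= confidence_threshold]
        if not confident:
            # Assumption: commit the best guess so every step makes progress.
            confident = [max(guesses, key=lambda pos: guesses[pos][1])]
        for pos in confident:
            block[pos] = guesses[pos][0]
    for pos, tok in enumerate(block):
        if tok is MASK:  # out of steps: fill whatever is still masked
            block[pos] = propose(pos)[0]
    return block, steps_used

rng = random.Random(0)

def mock_propose(position):
    # Random confidences stand in for real model probabilities.
    return f"tok{position}", rng.random()

for threshold in (0.5, 0.8, 0.95):
    _, steps = decode_block(mock_propose, confidence_threshold=threshold)
    print(f"threshold={threshold}: {steps} steps for 16 tokens")
```

One-token-per-step decoding would always need 16 steps for the same block, so any run that finishes in fewer steps is where the saving comes from. The per-step cost of the real two-tower model is not modeled here.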

The numbers that matter

At default settings (confidence 0.8, block size 16), TwoTower retains 98.7% of the AR baseline's aggregate quality. The per-task breakdown:

Task	AR baseline	TwoTower
MMLU (5-shot)	78.56	78.24
ARC-Challenge	91.72	92.66
HumanEval	79.27	75.58
GSM8K (8-shot)	92.49	90.14
MATH-500	84.40	80.60

The trade is roughly 1.3% of aggregate quality for 2.42x speed. General knowledge (MMLU, down 0.32 points) barely moves and ARC-Challenge actually improves by 0.94 points; code (HumanEval, down 3.69) and hard math (MATH-500, down 3.80) take the biggest hit, which is expected, since those tasks are the least forgiving of a token committed too early.
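The per-task gaps are easy to recompute from the table. The short script below does only that arithmetic. Note that the plain mean over these five tasks comes out near 97.8%, slightly below the reported 98.7% aggregate, which suggests the aggregate covers a broader benchmark suite than this table; that is an inference, not something the sources here state.

```python
# (AR baseline, TwoTower) scores copied from the table above.
scores = {
    "MMLU (5-shot)": (78.56, 78.24),
    "ARC-Challenge": (91.72, 92.66),
    "HumanEval": (79.27, 75.58),
    "GSM8K (8-shot)": (92.49, 90.14),
    "MATH-500": (84.40, 80.60),
}

for task, (ar, two_tower) in scores.items():
    print(f"{task:15s} delta={two_tower - ar:+.2f}  retained={two_tower / ar:.1%}")

# Unweighted mean over just these five tasks.
ar_mean = sum(ar for ar, _ in scores.values()) / len(scores)
tt_mean = sum(tt for _, tt in scores.values()) / len(scores)
five_task_retention = tt_mean / ar_mean
print(f"five-task mean retention: {five_task_retention:.1%}")
```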

Running it

TwoTower ships under the NVIDIA Nemotron Open Model License (commercial use permitted) on Hugging Face. The full two-tower model needs 2 GPUs at roughly 59GB each in BF16; an AR-only mode runs on a single 80GB GPU.
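The per-GPU figure is in the same ballpark as a weights-only estimate. The sketch below assumes about 30B parameters per tower (half of the roughly 60B total quoted earlier) and counts weights only; activations, caches, and the exact parameter count are ignored, so treat it as a sanity check rather than a sizing guide.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Weights-only footprint in decimal gigabytes (BF16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9

params_per_tower = 30e9  # assumption: about half of the ~60B total
print(f"{weight_memory_gb(params_per_tower):.0f} GB of weights per tower")
```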

Load the model and place each tower on its own device:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
 
model_name = "nvidia/Nemotron-Labs-TwoTower-30B-A3B-Base-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
 
# One tower per GPU
model.place_towers_on_devices("cuda:0", "cuda:1")
model.eval()

Now generate with parallel block-wise diffusion decoding:

prompt = "France is a country "
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
 
outputs = model.generate_mask_diffusion(
    inputs["input_ids"],
    max_new_tokens=128,
    block_size=16,          # positions refined together
    steps_per_block=16,     # refinement iterations
    mask_token_id=3,
    temperature=0.1,
    confidence_threshold=0.8,  # lower = faster, higher = safer
    eos_token_id=tokenizer.eos_token_id,
)
 
# Decode only the newly generated tokens, skipping the prompt
text = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(text)

The release exposes three inference modes so you can benchmark the trade-offs on your own hardware:

generate_mask_diffusion() — the fast, parallel block-wise path.
generate_mock_ar() — two-tower, one token per step (a fair speed baseline).
generate_ar() — the frozen context tower alone, single-GPU.
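To compare the three modes on your own hardware, a small timing helper is enough. The sketch below is deliberately generic: `tokens_per_second` is a hypothetical helper, and because this article does not show the argument lists for generate_mock_ar() and generate_ar(), it takes any zero-argument callable instead of assuming them. A fake generator stands in for the model so the snippet runs anywhere.

```python
import time

def tokens_per_second(generate_fn, prompt_length, runs=3):
    """Average decode throughput of generate_fn over a few runs.

    generate_fn() must return a sequence of token ids that includes the
    prompt; prompt_length is subtracted so only new tokens are counted.
    On a GPU, synchronize the device inside generate_fn before returning,
    otherwise the timer can stop before the work has finished.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        output_ids = generate_fn()
        elapsed = time.perf_counter() - start
        rates.append((len(output_ids) - prompt_length) / elapsed)
    return sum(rates) / len(rates)

def fake_generate():
    # Stand-in for a real model call: 10 prompt tokens plus 128 new ones.
    time.sleep(0.01)
    return list(range(10 + 128))

print(f"{tokens_per_second(fake_generate, prompt_length=10):.0f} tokens/s")
```

With the real model, you would wrap each mode in a lambda that returns the first output row, for example the generate_mask_diffusion() call shown earlier, and compare the resulting numbers with identical prompts and max_new_tokens.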

A practical tuning tip: raise confidence_threshold toward 0.9 and increase steps_per_block for code and math, where an early wrong token is costly. For chat, summaries, and drafting, lower thresholds unlock most of the speed with little visible loss.

Why this matters for builders in MENA

Two forces make a fast, open-weight model like this strategically interesting for teams in Tunisia, Saudi Arabia, and the wider region.

First, inference, not training, is the recurring bill. If the 2.42x speedup carries over to throughput on the same GPUs, the same served volume fits on fewer accelerators, a direct answer to the GPU scarcity and power constraints that regional data centers face. Diffusion decoding turns spare compute into speed rather than needing more silicon.

Second, open weights mean self-hosting is a real fallback. With frontier hosted models increasingly gated by export controls and customer-by-customer approvals, a permissively licensed model you can download, audit, and run inside your own trust boundary is no longer a nice-to-have — it is a resilience strategy. TwoTower is a serious open-weight option that happens to be fast.

The bigger shift

TwoTower will not replace your production AR endpoint tomorrow. It needs two GPUs, code and math quality dip slightly, and the tooling ecosystem around diffusion LLMs is young. But it is a proof point that matters: you can bolt a trained diffusion decoder onto a frozen, already-pretrained backbone and buy a 2.42x speedup for about 1.3% of aggregate quality, all without retraining the expensive part.

If the next generation of open models ships an AR checkpoint and a matching diffusion denoiser, "how fast can you decode" stops being a fixed property of the architecture and becomes a dial you turn per workload. That is the interesting future TwoTower points at.


Sources: MarkTechPost, arXiv 2606.26493, Hugging Face model card, NVIDIA Nemotron 3.