At Computex 2026 in Taipei, Jensen Huang walked on stage and announced what is shaping up to be the most significant open-weights model release of the year: Nemotron 3 Ultra. With 550 billion parameters and a design purpose-built for agentic AI workflows, NVIDIA is signaling that it is no longer just a chip company; it now positions itself as a full-stack AI platform company.
This guide covers everything developers need to know: the architecture, the benchmarks, how to access the model, and how to build agent pipelines on top of it.
What Is Nemotron 3 Ultra?
Nemotron 3 Ultra is the flagship model in NVIDIA's open Nemotron 3 family. It ships with:
- 550 billion total parameters, of which only 55 billion activate per token (mixture-of-experts efficiency)
- Hybrid Mamba-Transformer architecture — combining selective state-space layers with standard attention blocks
- 1 million token context window — natively supported, without additional cost
- NVFP4 training and quantization — enabling both high-fidelity BF16 and memory-efficient 4-bit deployments
- Fully open weights, training data, and code — releasing this week on Hugging Face and NGC
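As a back-of-the-envelope check on the figures in this list, the sketch below estimates weight storage at the two precisions mentioned. It counts weights only (no KV cache, activations, runtime overhead, or the per-block scale factors a 4-bit format carries), and the helper name `weight_memory_gb` is illustrative rather than part of any NVIDIA tooling.

```python
# Rough weight-only memory estimate; ignores KV cache, activations, and runtime overhead.
TOTAL_PARAMS = 550e9    # total parameters (from the list above)
ACTIVE_PARAMS = 55e9    # parameters activated per token (from the list above)


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)
fp4_gb = weight_memory_gb(TOTAL_PARAMS, 4)

print(f"Active fraction per token: {active_fraction:.0%}")
print(f"BF16 weights: ~{bf16_gb:,.0f} GB")
print(f"4-bit weights: ~{fp4_gb:,.0f} GB")
```

Only about a tenth of the parameters do work on any given token, but in a typical mixture-of-experts deployment every expert still has to be resident in memory, so the full weight footprint is what drives hardware sizing.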
Jensen Huang summarized the intent: "We're dedicated to building open models for the world, so you can take all of it, add to it, make it even better, make it yours."
A New Benchmark Ceiling for Open Models
NVIDIA partnered with Artificial Analysis to evaluate Nemotron 3 Ultra before launch. The results establish it as the most capable US open-weights model available today:
- A score of 48 on the Artificial Analysis Intelligence Index, topping every other US open-weights model
- More than 300 output tokens per second on Hopper-class hardware
- 5x higher throughput than Nemotron 3 Super on equivalent hardware
- Roughly 30% lower cost-per-inference versus leading open alternatives
To put those numbers in context: running a 1M-token agentic workflow that might cost several dollars on a proprietary API becomes dramatically cheaper on self-hosted Ultra deployments. For enterprises with high-volume inference needs, that gap compounds quickly.
The Nemotron 3 Family at a Glance
NVIDIA designed the three-tier family so teams can match compute budget to task complexity:
| Model | Parameters | Active Params | Best For |
|---|---|---|---|
| Nano Omni | 8B | 8B | Edge agents, mobile, real-time use |
| Super | 120B | ~25B | Mid-range enterprise, cost-sensitive |
| Ultra | 550B | 55B | Maximum reasoning, complex planning |
The family shares a common API surface and weight format, so migrating between tiers is a configuration change, not a rewrite.
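One way to picture "match compute budget to task complexity" is a small lookup over the table above. This is only a sketch: `Tier`, `TIERS`, and `pick_tier` are hypothetical names, the numbers are copied from the table (Super's active count is approximate), and a real selection would also weigh latency, memory, and task quality.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    total_params_b: float   # total parameters, in billions
    active_params_b: float  # parameters active per token, in billions


# Values copied from the table above; Super's active count is approximate (~25B).
TIERS = [
    Tier("Nano Omni", 8, 8),
    Tier("Super", 120, 25),
    Tier("Ultra", 550, 55),
]


def pick_tier(max_active_params_b: float) -> Tier:
    """Return the largest tier whose active parameter count fits the budget."""
    candidates = [t for t in TIERS if t.active_params_b <= max_active_params_b]
    if not candidates:
        raise ValueError("No tier fits the given active-parameter budget")
    return max(candidates, key=lambda t: t.active_params_b)


print(pick_tier(30).name)
```

Because the tiers are described as sharing an API surface, the result of a lookup like this would feed a single configuration value (the model name) rather than a separate code path per tier.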
Architecture Deep Dive: Why MoE + Mamba?
The latent mixture-of-experts design is the key to Ultra's economics. Rather than activating all 550 billion parameters for every token, the model routes each token to the most relevant subset of expert layers. The result: a model that reasons at frontier quality while paying the compute cost of a much smaller model at inference time.
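The routing idea can be sketched with generic top-k gating. This is a toy illustration of mixture-of-experts routing in general, not NVIDIA's latent MoE design: the scalar "experts", the random router scores, and the function names are all assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k):
    """Pick the top_k experts for one token and renormalize their gate weights."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


def moe_forward(x, experts, router_scores, top_k):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_token(router_scores, top_k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Toy setup: 10 scalar "experts" with 1 active per token, loosely mirroring a 10% active ratio.
random.seed(0)
experts = [lambda x, scale=s: scale * x for s in range(1, 11)]
router_scores = [random.gauss(0.0, 1.0) for _ in experts]
gates = route_token(router_scores, top_k=1)
print(gates)
print(moe_forward(2.0, experts, router_scores, top_k=1))
```

The point of the sketch is that `moe_forward` calls only the experts named in `gates`; the other nine are never evaluated, which is where the inference-time savings come from.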
The Mamba layers (selective state-space models) handle long-range dependencies more efficiently than full self-attention on very long sequences. At 1M context, this matters enormously — Transformer-only models struggle with quadratic attention costs at that scale, while Mamba's near-linear recurrence keeps memory and latency tractable.
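To make the scaling argument concrete, here is a toy input-dependent linear recurrence next to a count of the query-key pairs full self-attention must score. It is a simplified illustration in the spirit of a selective state-space layer, not Mamba itself; the per-channel sigmoid decay and the function names are assumptions chosen for brevity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def selective_scan(inputs, state_size=4):
    """Toy recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * x_t, with a_t depending on x_t.

    The state has a fixed size regardless of sequence length, so memory stays
    constant and total work grows linearly with the number of tokens.
    """
    state = [0.0] * state_size
    outputs = []
    for x in inputs:
        new_state = []
        for channel, h in enumerate(state):
            # Input-dependent decay in (0, 1): the "selective" part of this toy model.
            decay = sigmoid(x + channel)
            new_state.append(decay * h + (1.0 - decay) * x)
        state = new_state
        outputs.append(sum(state) / state_size)
    return outputs, state


def attention_pairs(seq_len):
    """Number of query-key pairs full self-attention scores for one head."""
    return seq_len * seq_len


outputs, final_state = selective_scan([0.5, -1.0, 2.0, 0.0])
print(len(final_state))
print(attention_pairs(1_000_000))
```

At one million tokens, full attention scores 10^12 pairs per head per layer, while the recurrence above still carries only `state_size` numbers from one token to the next.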
The hybrid design marries Mamba's efficiency for long contexts with Transformer attention's superiority on complex reasoning tasks. NVIDIA calls this the architecture that enables "agentic AI at planetary scale."
Getting Started: Accessing the Model
Nemotron 3 Ultra will be available through multiple channels this week:
Via NVIDIA API Catalog (fastest path):
pip install openai  # Nemotron uses the OpenAI-compatible API spec

from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY"
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-ultra-550b-instruct",
    messages=[
        {"role": "system", "content": "You are an expert software architect."},
        {"role": "user", "content": "Design a fault-tolerant microservices architecture for an e-commerce platform."}
    ],
    temperature=0.2,
    max_tokens=4096
)

print(response.choices[0].message.content)

Via Hugging Face (self-hosted):
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "nvidia/Nemotron-3-Ultra-550B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Requires a multi-GPU setup. BF16 weights alone come to roughly 1.1 TB
# (550B parameters x 2 bytes), which exceeds the 640 GB of a single 8x H100 (80 GB) node,
# so plan for multiple nodes or a quantized checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

For NVFP4 quantized inference, NVIDIA's TensorRT-LLM engine provides the optimal path and is documented in the NGC model card.
Building Agentic Workflows
Nemotron 3 Ultra was benchmarked and optimized specifically for multi-step agentic tasks — the workflows where the model must plan, use tools, evaluate intermediate results, and iterate to a final answer.
The NVIDIA Agent Toolkit provides the production runtime for this:
from nvidia_agent_toolkit import AgentRuntime, Tool

@Tool.define(description="Search the web for real-time information")
def web_search(query: str) -> str:
    # your implementation
    ...

agent = AgentRuntime(
    model="nvidia/nemotron-3-ultra-550b-instruct",
    tools=[web_search],
    context_window=1_000_000,
    max_iterations=20
)

result = agent.run(
    "Research the latest NVIDIA GPU pricing trends and produce a cost-benefit analysis for upgrading our inference cluster."
)

The toolkit handles tool-call parsing, iteration management, and context compression automatically, so Ultra's 1M-token window is used efficiently across long multi-turn agent sessions.
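For readers who want to see what "iteration management" amounts to, here is a minimal, framework-free sketch of a plan, call tool, observe, repeat loop. It is a generic illustration and not the Agent Toolkit's internals: `run_agent`, `stub_model`, `fake_web_search`, and the action dictionary format are all hypothetical, and the stub stands in for a real model call.

```python
from typing import Callable, Dict, List


def run_agent(
    task: str,
    model_step: Callable[[List[dict]], dict],
    tools: Dict[str, Callable[[str], str]],
    max_iterations: int = 20,
) -> str:
    """Loop: ask the model for an action, run the requested tool, feed the result back."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_iterations):
        action = model_step(history)
        if action["type"] == "final":
            return action["content"]
        tool = tools.get(action["tool"])
        if tool is None:
            observation = f"Unknown tool: {action['tool']}"
        else:
            observation = tool(action["input"])
        history.append({"role": "assistant", "content": f"call {action['tool']}({action['input']!r})"})
        history.append({"role": "tool", "content": observation})
    return "Stopped: iteration limit reached"


# Stub model (assumption): requests one search, then answers using the observation.
def stub_model(history: List[dict]) -> dict:
    if history[-1]["role"] == "tool":
        return {"type": "final", "content": f"Summary based on: {history[-1]['content']}"}
    return {"type": "tool_call", "tool": "web_search", "input": "GPU pricing trends"}


def fake_web_search(query: str) -> str:
    return f"(placeholder results for '{query}')"


print(run_agent("Analyze GPU pricing", stub_model, {"web_search": fake_web_search}))
```

The `max_iterations` cap plays the same role as the parameter of that name in the example above: it bounds how long the loop can run if the model never produces a final answer.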
Practical Agentic Use Cases
The 1M context window combined with Ultra's reasoning depth unlocks several high-value enterprise patterns:
Codebase-wide analysis: Feed an entire repository (200K+ tokens of code) into context and ask Ultra to identify security vulnerabilities, refactoring opportunities, or architectural inconsistencies in a single pass.
Long document synthesis: Legal contracts, research corpora, and financial filings that previously required chunking and RAG can now be reasoned over holistically.
Multi-step research agents: A self-directed research loop that searches, reads, synthesizes, and produces structured reports with minimal human checkpoints.
Autonomous code generation: Generate, execute, debug, and iterate on code within a single context window — exactly the use case NVIDIA optimized the training signal for.
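As an illustration of the codebase-wide analysis pattern above, the sketch below packs a repository into a single prompt while tracking an estimated token count. The four-characters-per-token ratio is a rough assumption (a real pipeline would use the model's tokenizer), and `pack_repository` and the `### FILE:` separator are hypothetical choices, not a prescribed format.

```python
import os
import tempfile

CHARS_PER_TOKEN = 4  # rough heuristic (assumption); use the model's tokenizer for real counts


def pack_repository(root, extensions=(".py", ".md"), token_budget=1_000_000):
    """Concatenate matching files under root into one prompt, stopping at the budget."""
    parts, used_tokens = [], 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                text = handle.read()
            section = f"### FILE: {os.path.relpath(path, root)}\n{text}\n"
            cost = len(section) // CHARS_PER_TOKEN + 1
            if used_tokens + cost > token_budget:
                # Budget reached; remaining files are skipped.
                return "".join(parts), used_tokens
            parts.append(section)
            used_tokens += cost
    return "".join(parts), used_tokens


# Demo on a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, "app.py"), "w", encoding="utf-8") as f:
        f.write("print('hello')\n")
    prompt, tokens = pack_repository(repo)
    print(tokens, "estimated tokens")
    print(prompt)
```

The returned string would become the user message in a chat request, followed by the actual question (for example, a request to list security vulnerabilities by file).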
Local Deployment: RTX Spark and DGX Options
For teams that need on-premises inference, NVIDIA announced two deployment paths at Computex:
DGX Spark — A compact desktop supercomputer designed for developers. Runs the full Nemotron 3 family. Targeted at research teams and individual power users who need local, private inference.
RTX Spark (with MediaTek + Microsoft) — A consumer-grade AI PC chip delivering 1 petaflop of AI performance in slim laptops. Runs the Nano and Super tiers locally. Ultra requires server-class hardware.
For cloud deployment, all major providers (AWS, Azure, GCP, OCI) will support Nemotron 3 Ultra through their AI marketplace integrations.
Why This Release Changes the Open-Source AI Landscape
The gap between closed frontier models and the best open-weights alternatives has been shrinking for two years. Nemotron 3 Ultra may be the model that closes it for agentic reasoning workloads.
Three factors make this release stand out:
- Fully open: weights, training data, and code, not just the weights. This enables fine-tuning, post-training, and architectural modification at a level competitors do not offer.
- Enterprise-grade throughput: 300+ tokens/sec and 5x inference speedup make production deployment viable without the GPU count that comparable models require.
- NVIDIA ecosystem integration: Native integration with TensorRT-LLM, NIM microservices, the Agent Toolkit, and RTX hardware means Ultra benefits from years of NVIDIA optimization that pure research releases lack.
Over 50 million downloads of Nemotron 3 models were recorded in the 12 months leading up to this launch — indicating a developer community already invested in the ecosystem.
Conclusion
Nemotron 3 Ultra is not a research paper or a limited preview — it is a production-ready, fully open model that arrives this week with a complete deployment stack. For developers building agentic AI applications, the combination of 1M context, MoE efficiency, and NVIDIA's inference infrastructure is a compelling alternative to closed APIs.
Whether you are running agents on a DGX cluster or experimenting with the API catalog, Nemotron 3 Ultra deserves a place in your model evaluation pipeline starting today.