
NVIDIA Nemotron 3 Ultra is the most intelligent open-weights large language model released by a US AI lab as of June 2026. If you need the reasoning depth of a frontier closed model with the flexibility and cost advantages of an open architecture, Nemotron 3 Ultra is currently the strongest US-built option available.
Released on June 4, 2026, NVIDIA Nemotron 3 Ultra sets a new standard for what open-source AI can deliver — combining a 47.7 score on the Artificial Analysis Intelligence Index with inference speeds exceeding 400 output tokens per second. That combination — high intelligence and fast inference — is exactly what developers building production agentic pipelines have been waiting for.
What Is NVIDIA Nemotron 3 Ultra?
Definition: NVIDIA Nemotron 3 Ultra is a Mixture-of-Experts (MoE) open-weights large language model with approximately 550 billion total parameters and 55 billion active parameters at inference time.
As the largest and most capable model in NVIDIA’s Nemotron 3 family, it is designed to deliver frontier-class intelligence while remaining deployable at competitive speeds thanks to its sparse MoE architecture. Unlike dense models where all parameters activate for every token, Nemotron 3 Ultra routes each input through only a fraction of its total parameter count, enabling a dramatically better intelligence-per-compute ratio.
NVIDIA recommends the NVFP4 quantized weights for inference. Third-party benchmarking by Artificial Analysis confirmed that this quantization introduces minimal intelligence loss — the BF16 version scores 48.2 on the Intelligence Index vs. the NVFP4 score of 47.7, a difference small enough to be negligible in most real-world workloads.
Architecture at a Glance
- Total parameters: ~550 billion
- Active parameters: ~55 billion (MoE sparse activation)
- Recommended inference weights: NVFP4
- Non-hallucination score (AA-Omniscience): 71%
- GDPval-AA Elo (instruction following): 1,378
- Context length: Available at release; optimized for multi-turn agentic use cases
Key Benchmarks and Performance Data
Intelligence Index Rankings
On the Artificial Analysis Intelligence Index — a composite benchmark covering reasoning, coding, science, and instruction following — NVIDIA Nemotron 3 Ultra scores 47.7, placing it clearly ahead of every other US open-weights model:
| Model | Intelligence Index Score |
|---|---|
| NVIDIA Nemotron 3 Ultra | 47.7 |
| Gemma 4 31B | 39.2 |
| NVIDIA Nemotron 3 Super | 36.0 |
| gpt-oss-120b | 33.3 |
| Kimi K2.6 (Chinese, leading frontier) | 53.9 |
This gap is substantial. Nemotron 3 Ultra outperforms the next closest US open-weights model, Gemma 4 31B, by 8.5 points — a margin that translates to meaningfully better reasoning, multi-step instruction following, and factual precision in practice.
One nuance worth noting: Gemma 4 31B edges ahead of NVIDIA Nemotron 3 Ultra by approximately one point on the Coding Index specifically (a composite of Terminal-Bench Hard and SciCode). For pure code generation tasks, both models are worth evaluating side by side.
Agentic Task Performance
Artificial Analysis evaluated NVIDIA Nemotron 3 Ultra on Terminal-Bench v2.1 under restricted turn-budget conditions — a methodology designed to mirror real-world agent deployments where latency and token efficiency matter as much as raw accuracy.
The key finding: Nemotron 3 Ultra sits on the Pareto frontier for performance-versus-time. Across all four tested turn limits (10, 20, 50, and 100 turns), it completed tasks faster than every competing model while scoring competitively on benchmark accuracy. This places it in a unique position — it is not merely the smartest US open-weights model, it is the one most deployable at production speed.
The test methodology also accounted for fairness: the model was told its turn budget upfront (with a precise definition of what constitutes a “turn”) and received warnings as it approached the limit. This eliminates the bias that typically disadvantages models unaware of budget constraints.
Why Inference Speed Matters as Much as Intelligence
Question: Why does a 400+ tokens-per-second inference speed matter if the model is already the most intelligent US open-weights model available?
Direct answer: Because in agentic AI workloads, speed directly controls the cost and latency of every task — and a slower, smarter model often produces worse real-world outcomes than a faster model of similar capability.
Consider the math: an agentic coding pipeline that requires 50 turns per task at 100 tokens per turn needs 5,000 output tokens per task completion. At 100 tokens/second, that takes 50 seconds per task. At 400+ tokens/second, it takes under 13 seconds — a 4x throughput improvement with identical intelligence.
NVIDIA Nemotron 3 Ultra achieves this by combining its MoE sparse architecture (only 55B active parameters per forward pass) with inference optimization techniques developed by NVIDIA for its own hardware. The result is a model that, as measured on BlackBox AI pre-release infrastructure, runs slightly faster than gpt-oss-120b despite being more than four times larger in total parameter count — while delivering significantly higher benchmark scores.
This positions Nemotron 3 Ultra at a point on the speed-intelligence Pareto frontier that no US open-weights model has previously reached.
NVIDIA Nemotron 3 Ultra vs. Competing Open Weights Models
How does NVIDIA Nemotron 3 Ultra compare to its closest US and international open-weights rivals across the dimensions that matter most for enterprise and developer adoption?
| Feature | Nemotron 3 Ultra | Gemma 4 31B | Nemotron 3 Super | gpt-oss-120b | Kimi K2.6 |
|---|---|---|---|---|---|
| Intelligence Index | 47.7 | 39.2 | 36.0 | 33.3 | 53.9 |
| Inference Speed | 400+ tok/s | Not benchmarked | High | ~400 tok/s | Not benchmarked |
| Total Parameters | ~550B | 31B | Smaller | 120B | Undisclosed |
| Active Parameters | ~55B (MoE) | 31B (dense) | Smaller | 120B (dense) | MoE |
| Coding Index | Strong (2nd) | Leading US | Below average | Below average | N/A |
| Agentic Benchmark | Pareto frontier | Not evaluated | Competitive | Below Nemotron Ultra | Not evaluated |
| Non-hallucination | 71% | Not reported | Lower | Not reported | Not reported |
| Origin | US (NVIDIA) | US (Google) | US (NVIDIA) | US (OpenAI) | China (Moonshot AI) |
| Weights availability | Open | Open | Open | Open | Open |
Key takeaway: For most enterprise use cases — especially those involving agentic pipelines, multi-step reasoning, or production-speed inference requirements — NVIDIA Nemotron 3 Ultra is the clear US open-weights choice. The only scenario where a competitor holds a clear edge is narrow coding benchmarks (Gemma 4 31B) or overall frontier intelligence where Chinese open models currently lead.
Who Should Use NVIDIA Nemotron 3 Ultra?
NVIDIA Nemotron 3 Ultra is best suited for teams and use cases where the combination of high intelligence, fast inference, and open weights provides strategic advantage. Specifically:
- Enterprise AI teams building multi-step agentic workflows who need low latency without sacrificing reasoning quality
- AI infrastructure teams running on NVIDIA hardware who want to take full advantage of NVFP4 inference optimization
- Researchers and developers who require a top-tier reasoning model but need the flexibility to self-host, fine-tune, or audit model weights
- Organizations with data sovereignty concerns that prevent use of closed-source APIs like GPT-4o or Claude
- Product teams building coding assistants that need strong general intelligence with competitive (though not leading) coding performance
- Cost-conscious ML teams seeking frontier-class model intelligence without the per-token pricing of proprietary closed APIs
- Agentic framework developers testing model capability under turn and token budget constraints, where speed-to-solution is a primary metric
If your primary workload is pure code generation and you are less sensitive to inference speed, Gemma 4 31B is worth benchmarking alongside Nemotron 3 Ultra before committing. For all other use cases, NVIDIA Nemotron 3 Ultra is the strongest open-weights option currently available from a US lab.
Limitations and Honest Caveats
No model review is complete without addressing where NVIDIA Nemotron 3 Ultra falls short.
It Does Not Lead the Global Open Weights Frontier
Chinese open-weights models currently hold the overall intelligence lead. Kimi K2.6 from Moonshot AI scores 53.9 on the Artificial Analysis Intelligence Index — 6.2 points ahead of Nemotron 3 Ultra. For organizations with no procurement restrictions on international model providers, the Chinese-led open weights frontier currently offers higher raw intelligence. The question of whether that gap justifies the tradeoffs in compliance, data handling, and vendor trust is a decision each organization must make for itself.
Coding Performance Is Not the Best Available
Despite its overall strength, NVIDIA Nemotron 3 Ultra does not lead on the Coding Index. Gemma 4 31B scores approximately one point higher on a composite of Terminal-Bench Hard and SciCode. This is a narrow gap and may not translate to real-world code quality differences for most tasks, but it is a measurable one.
Graduate-Level Physics Remains a Hard Problem
Performance on CritPt — a benchmark for graduate-level physics research problems — sits at 3%, matching Nemotron 3 Super rather than improving on it. This suggests that extremely deep domain-specific scientific reasoning remains a genuine limitation, consistent with other frontier models.
Verbosity Is Moderate, Not Minimal
Nemotron 3 Ultra uses roughly 1 million fewer output tokens than Nemotron 3 Super to run the full Intelligence Index suite, placing it centrally among peers on verbosity. For applications where extremely concise outputs are critical (e.g., mobile inference, low-bandwidth deployments), this is worth profiling against your specific workload.
How to Access NVIDIA Nemotron 3 Ultra
Question: How can I start using NVIDIA Nemotron 3 Ultra?
Direct answer: NVIDIA Nemotron 3 Ultra is available as open weights, with initial inference access provided through BlackBox AI ahead of the public release. NVIDIA recommends the NVFP4 weights for production inference.
Access channels at launch include:
- BlackBox AI — the first provider to offer hosted inference, with measured speeds above 400 tokens/second
- NVIDIA NGC (NVIDIA GPU Cloud) — NVIDIA’s own model hub for downloading weights and accessing deployment documentation
- Hugging Face — standard open-weights distribution channel for community access and fine-tuning workflows
- Self-hosted deployment — fully possible with sufficient GPU infrastructure; NVFP4 weights are optimized for NVIDIA H100/H200-class hardware
Because the weights are fully open, NVIDIA Nemotron 3 Ultra can be fine-tuned on proprietary datasets, deployed behind private infrastructure, and integrated into custom inference pipelines without ongoing API dependency.
The Bigger Picture: US vs. Chinese Open Weights Frontier
The release of NVIDIA Nemotron 3 Ultra is significant beyond the benchmark numbers — it represents a meaningful moment in the competitive dynamics of open-weights AI development.
For most of 2024 and 2025, the narrative around open-weights models was dominated by Meta’s Llama family and, increasingly, Chinese labs like DeepSeek, Alibaba (Qwen), and Moonshot AI (Kimi). Those Chinese-led models currently still hold the top spot on the Artificial Analysis Intelligence Index — Kimi K2.6 at 53.9 remains the open-weights frontier leader as of this writing.
But NVIDIA Nemotron 3 Ultra changes the conversation. It is the first US open-weights model to close meaningfully on the Chinese frontier, reaching 47.7 — a score that would have represented the leading open-weights model globally just eighteen months ago. More importantly, it achieves this while excelling on inference speed, a dimension where US labs have historically struggled relative to their intelligence scores.
This matters for enterprise AI adoption. The availability of a high-intelligence, fast, open-weights US model reduces the pressure to choose between compliance-friendly providers and frontier-level capability. For the first time in a while, those two requirements are available in the same package from a US lab.
Whether NVIDIA can close the remaining ~6-point gap to the Chinese open-weights frontier — and at what model size and cost — will be one of the more interesting storylines to track through the rest of 2026.
Conclusion
NVIDIA Nemotron 3 Ultra is the most capable open-weights large language model released by a US laboratory, and its combination of high intelligence and leading inference speed makes it genuinely production-ready for agentic AI workloads in a way that previous US open models were not.
Its 47.7 score on the Artificial Analysis Intelligence Index, 400+ tokens-per-second inference speed, and Pareto-frontier performance on agentic benchmarks represent a significant step forward — both for NVIDIA’s model portfolio and for the US open-weights AI ecosystem as a whole.
For teams evaluating open-weights models in 2026, NVIDIA Nemotron 3 Ultra should be the default starting point if you are building anything beyond basic text generation. Its limitations are real but narrow: slightly behind Gemma 4 31B on coding benchmarks, and still behind the Chinese open-weights frontier on raw intelligence. For everything else, it currently leads.
Frequently Asked Questions
What makes NVIDIA Nemotron 3 Ultra different from Nemotron 3 Super? Nemotron 3 Ultra is significantly larger (~550B total parameters vs. the smaller Super), scores 11.7 points higher on the Artificial Analysis Intelligence Index (47.7 vs. 36.0), and is optimized for faster inference while also using fewer output tokens to complete benchmark tasks.
Is NVIDIA Nemotron 3 Ultra fully open source? The model weights are fully open, meaning they can be downloaded, self-hosted, and fine-tuned. “Open weights” is technically distinct from “open source” (the training data and full pipeline may not be public), but for deployment purposes the distinction rarely matters — the weights are freely available.
How does NVIDIA Nemotron 3 Ultra compare to GPT-4o or Claude? Nemotron 3 Ultra is an open-weights model and is not directly benchmarked against closed proprietary models like GPT-4o or Claude in this analysis. Its 47.7 Intelligence Index score places it well within the range of capable frontier models, though closed-source leaders typically score higher on composite benchmarks.
What hardware do I need to run NVIDIA Nemotron 3 Ultra? Given the ~550B total parameter count, self-hosting requires substantial GPU infrastructure (multiple H100 or H200 GPUs). NVIDIA’s NVFP4 quantization significantly reduces memory requirements compared to BF16, making deployment more accessible without meaningful loss in intelligence.