Demystifying NVIDIA Nemotron 3 Super

120.6B total / 12.7B active parameters. Hybrid Mamba-2 + Transformer + LatentMoE. 1M token context window. Runs on a single H100-80GB at 4-bit quantization — available today on E2E Networks.

NVIDIA dropped Nemotron-3-Super at GTC on March 11, 2026. It's a 120B open hybrid reasoning MoE model that leads its size class on AIME 2025, SWE-Bench Verified, and Terminal Bench — while hitting 478 tok/s on B200, which is 7.5× the throughput of Qwen3.5-122B. This guide covers the architecture, what the hardware requirements actually are, and how to get it running.

Free Credits Inside

Get ₹2,000 free credits to test your AI workloads

Sign up and complete ID verification to unlock free credits. Deploy on NVIDIA H200, H100, and L40S GPUs—no commitment required.

Architecture

Nemotron Super is not a standard transformer. It layers three distinct mechanisms:

1. Mamba-2 (SSM) layers: linear-time sequence modeling. Unlike attention, SSMs don't scale quadratically with context length. This is why the 1M token context window is computationally realistic and not just a marketing claim. SSM layers keep no KV cache; each one carries a fixed-size recurrent state forward, so extending the context doesn't grow their memory footprint the way a KV cache grows linearly in a standard transformer. Only the interleaved attention layers still hold a KV cache, so the OOM risk at 1M comes mostly from compute overhead rather than cache size.

2. Standard Transformer attention layers: interspersed with Mamba-2 blocks for local pattern capture and cross-token reasoning.

3. LatentMoE (Mixture of Experts): 512 total experts, 22 active per token. This is why the model is 120.6B total but only 12.7B active at any token step. The router selects 22 experts per token; everything else sits idle. Inference cost scales with active params, not total.
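To put the active-versus-total split in numbers, here is a small sketch using only the figures quoted above. The helper name is illustrative and not part of any NVIDIA tooling.

```python
# Figures quoted in this article; the helper name is illustrative.
TOTAL_PARAMS_B = 120.6   # total parameters, in billions
ACTIVE_PARAMS_B = 12.7   # parameters active per token, in billions
TOTAL_EXPERTS = 512
ACTIVE_EXPERTS = 22


def fraction_pct(part: float, whole: float) -> float:
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


active_param_pct = fraction_pct(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
active_expert_pct = fraction_pct(ACTIVE_EXPERTS, TOTAL_EXPERTS)

print(f"Active parameters per token: {active_param_pct:.1f}% of the model")
print(f"Experts routed per token:    {active_expert_pct:.1f}% of the pool")
```

The parameter share comes out higher than the expert share, presumably because the non-expert weights (attention, Mamba-2, embeddings) run for every token regardless of routing.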

Multi-Token Prediction (MTP) — predicts multiple future tokens simultaneously during training. Improves sample efficiency and reasoning quality on multi-step tasks.

The model uses NoPE (No Positional Embeddings) — extending context only requires changing max_position_embeddings. No YaRN rope scaling needed.
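The memory argument behind the Mamba-2 layers can be made concrete with a rough sketch. Every layer count and dimension below is a hypothetical placeholder, not Nemotron 3 Super's real configuration; the point is only the shape of the growth: a KV cache scales with sequence length, a recurrent state does not.

```python
# All layer counts and dimensions here are hypothetical placeholders,
# chosen only to show how the two kinds of memory scale.
GIB = 1024 ** 3


def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys plus values for every cached token in every attention layer."""
    return 2 * seq_len * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, state_elems_per_layer, bytes_per_elem=2):
    """Recurrent state: fixed per layer, independent of sequence length."""
    return n_ssm_layers * state_elems_per_layer * bytes_per_elem


for seq_len in (262_144, 1_048_576):
    # A pure transformer with 48 attention layers (hypothetical).
    pure = kv_cache_bytes(seq_len, n_attn_layers=48, n_kv_heads=8, head_dim=128)
    # A hybrid with 6 attention layers and 42 SSM layers (hypothetical).
    hybrid = (
        kv_cache_bytes(seq_len, n_attn_layers=6, n_kv_heads=8, head_dim=128)
        + ssm_state_bytes(n_ssm_layers=42, state_elems_per_layer=4_000_000)
    )
    print(f"{seq_len:>9} tokens: pure attention {pure / GIB:6.1f} GiB, "
          f"hybrid {hybrid / GIB:6.1f} GiB")
```

Under these made-up dimensions the hybrid's cache is a fraction of the pure-attention cache at both context lengths, because only the attention layers contribute a term that grows with `seq_len`.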

Training Details

Attribute	Value
Pre-training tokens	25 trillion
Pre-training data cutoff	December 2025
Post-training / RLHF cutoff	February 2026
Native training precision	NVFP4 (Blackwell-native, first in family)
Languages	20
Context (default / max)	262,144 / 1,048,576 tokens

NVFP4 is a 4-bit floating point format introduced with Blackwell. Nemotron Super is the first model in the family trained natively in NVFP4 rather than quantized after training. On B200/GB200 this enables 4× the throughput vs Hopper FP8. On H100/H200, you fall back to FP8 or quantized GGUF, because NVFP4 kernel acceleration is Blackwell-only.

Inference Settings

NVIDIA's official recommendations:

# General chat / instruction
temperature = 1.0
top_p = 1.0

# Tool calling
temperature = 0.6
top_p = 0.95

For max_new_tokens: set between 32,768 and 262,144 for standard workloads. Setting context to 1M is possible but may trigger CUDA OOM on a single H100 — 262,144 is the safe ceiling for single-GPU deployment.
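One way to keep these settings consistent across services is a small preset table. This is a sketch: the preset names, the helper, and the choice to reject values above 262,144 are assumptions layered on top of the recommendations above.

```python
# Sampling values are the recommendations quoted above; the preset names
# and this helper are illustrative, not an official API.
SAMPLING_PRESETS = {
    "chat": {"temperature": 1.0, "top_p": 1.0},
    "tool_calling": {"temperature": 0.6, "top_p": 0.95},
}
SINGLE_GPU_CEILING = 262_144  # safe ceiling for single-GPU deployment


def request_params(task: str, max_tokens: int = 32_768) -> dict:
    """Build sampling parameters for a request, enforcing the single-GPU ceiling."""
    if task not in SAMPLING_PRESETS:
        raise ValueError(f"unknown task: {task!r}")
    if not 0 < max_tokens <= SINGLE_GPU_CEILING:
        raise ValueError(f"max_tokens must be in 1..{SINGLE_GPU_CEILING}")
    return {**SAMPLING_PRESETS[task], "max_tokens": max_tokens}


print(request_params("tool_calling"))
```

The returned dict uses the standard `temperature`, `top_p`, and `max_tokens` names, so it can be unpacked into an OpenAI-compatible chat completion call.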

The model uses <think> (token ID 12) and </think> (token ID 13) for reasoning traces. In llama.cpp, pass --special --verbose-prompt to see these tokens in output.
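When a runtime returns the reasoning trace inline rather than in a separate field, the two parts can be split on the tags. A minimal sketch, assuming at most one `<think>` block per response; the function name is illustrative.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer) from raw model output.

    If no <think> block is present, reasoning is an empty string.
    """
    match = THINK_RE.search(raw)
    if match is None:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2 + 2 is 4</think>\nThe answer is 4.")
print(reasoning)
print(answer)
```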

Hardware Requirements

Most write-ups get this wrong. The actual breakdown by quantization:

Quantization	VRAM Required	Hardware
NVFP4 (native)	~32–40GB	B200 / GB200 only
4-bit GGUF (Q4)	~64–72GB	Single H100-80GB or H200 ✅
8-bit	~128GB	2× H100-80GB or single H200
BF16 (full precision)	~240GB+	8× H100-80GB
BF16 LoRA fine-tuning	256GB VRAM	Multi-GPU
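The GGUF, 8-bit, and BF16 rows can be sanity-checked with weights-only arithmetic: total parameters times bits per parameter. This sketch ignores KV cache, activations, and quantization metadata, which is why the table's figures sit somewhat above these numbers.

```python
TOTAL_PARAMS = 120.6e9  # total parameter count quoted in this article


def weight_gb(bits_per_param: float, total_params: float = TOTAL_PARAMS) -> float:
    """Approximate weight storage in GB (10^9 bytes); weights only, no runtime overhead."""
    return total_params * bits_per_param / 8 / 1e9


for name, bits in (("4-bit", 4), ("8-bit", 8), ("BF16", 16)):
    print(f"{name:>5}: ~{weight_gb(bits):.0f} GB of weights")
```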

The practical entry point is a single H100-80GB running Q4_K_XL GGUF. If you're already running Llama 3.1 70B on a single H100, Nemotron Super at 4-bit is a direct hardware swap. E2E Networks has H100-80GB nodes available on-demand — you can spin one up and have this model running in under 30 minutes.

⚠️ Don't set --ctx-size to 1M on a single H100. Use 262144 as the ceiling. Only push beyond that on H200 or a multi-GPU setup.

Running on E2E Networks

If you're on an E2E Networks GPU node, here's the fastest path to get Nemotron Super running.

llama.cpp / llama-server (single H100 — recommended starting point)

Build llama.cpp:

apt-get update && apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
--target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

Run inference (chat):

./llama.cpp/llama-cli \
-hf unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_XL \
--ctx-size 16384 \
--temp 1.0 --top-p 1.0

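Deploy as OpenAI-compatible server (the --model path below assumes the GGUF shards have already been downloaded to that local directory):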

./llama.cpp/llama-server \
--model unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/UD-Q4_K_XL/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf \
--alias "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B" \
--prio 3 \
--min-p 0.01 \
--temp 0.6 --top-p 0.95 \
--ctx-size 16384 \
--port 8001

Client:

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")

completion = client.chat.completions.create(
    model="unsloth/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages=[{"role": "user", "content": "Your prompt here"}],
)
message = completion.choices[0].message
# reasoning_content is a server-side extension field, so read it defensively
print(getattr(message, "reasoning_content", None))  # <think> trace, if returned
print(message.content)  # final response
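A model this size takes a while to load, so a client started right after llama-server may hit a server that is not ready yet. The sketch below polls llama-server's `/health` endpoint, which returns 200 once the model is loaded, using only the standard library. The function name, timeouts, and polling interval are assumptions.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str = "http://127.0.0.1:8001",
                     timeout_s: float = 600.0,
                     poll_s: float = 5.0) -> bool:
    """Poll GET {base_url}/health until it returns 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Not listening yet, or a non-200 reply while the model loads.
            pass
        time.sleep(poll_s)
    return False
```

Calling `wait_until_ready()` before creating the OpenAI client gives a clear ready or timed-out signal instead of a connection error on the first request.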

vLLM (multi-GPU / production throughput)

For multi-GPU H100 nodes on E2E Networks, vLLM with tensor parallelism is the cleaner production path:

pip install vllm

from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    # 2× H100 covers 8-bit weights; the BF16 checkpoint needs more GPUs
    # (see Hardware Requirements)
    tensor_parallel_size=2,
    max_model_len=32768,
)

sampling_params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=4096)
outputs = llm.generate(["Your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)

vLLM handles continuous batching and paged attention out of the box. Best choice for multi-tenant inference serving where you need to maximize throughput per GPU-hour.

TensorRT-LLM (maximum throughput on Hopper)

Highest raw throughput on H100/H200. Requires building a TRT-LLM engine for your specific GPU configuration — more setup overhead but worth it for sustained high QPS in production.

⚠️ TRT-LLM 1.3.0 engines are not compatible with Triton Inference Server 1.1.0. Pin your versions explicitly before spending time on engine builds.

SGLang

Competitive with vLLM for structured generation and multi-turn workloads. RadixAttention gives it an edge when prompt prefixes are frequently reused — long system prompts shared across many requests.

Benchmarks

Benchmark	Nemotron Super	GPT-OSS-120B	Qwen3.5-122B
AIME 2025	Best in class	—	—
Terminal Bench	Best in class	—	—
SWE-Bench Verified	Best in class	—	—
Throughput (tok/s, B200)	478	~217	~64

7.5× throughput over Qwen3.5-122B is a real architectural advantage from the MoE routing and SSM layers — not cherry-picked conditions. For a production endpoint, that gap directly translates to cost per token.
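The link between throughput and cost per token is simple arithmetic. The sketch below uses the table's tok/s figures and a placeholder hourly GPU price; the price is hypothetical, and treating the quoted tok/s as sustained throughput is an assumption.

```python
# tok/s on B200, as quoted in the table above (two of them approximate).
THROUGHPUT_TOK_S = {
    "Nemotron Super": 478,
    "GPT-OSS-120B": 217,
    "Qwen3.5-122B": 64,
}
HYPOTHETICAL_GPU_PRICE_PER_HOUR = 5.00  # placeholder, not a real price


def speedup(model: str, baseline: str) -> float:
    """Throughput ratio of model over baseline."""
    return THROUGHPUT_TOK_S[model] / THROUGHPUT_TOK_S[baseline]


def cost_per_million_tokens(gpu_price_per_hour: float, tok_per_s: float) -> float:
    """Cost to generate 1M tokens at a sustained tok/s rate."""
    tokens_per_hour = tok_per_s * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


print(f"Speedup vs Qwen3.5-122B: {speedup('Nemotron Super', 'Qwen3.5-122B'):.1f}x")
for name, tps in THROUGHPUT_TOK_S.items():
    cost = cost_per_million_tokens(HYPOTHETICAL_GPU_PRICE_PER_HOUR, tps)
    print(f"{name}: {cost:.2f} per 1M tokens at the placeholder rate")
```

Whatever hourly price is plugged in, the cost ratio between two models equals the inverse of their throughput ratio.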

Nano vs Super — Which One to Run

	Nano (31.6B / 3.6B active)	Super (120.6B / 12.7B active)
Min GPU (4-bit)	Single A100-80GB	Single H100-80GB
Min GPU (BF16)	Single H100-80GB	8× H100-80GB
LatentMoE	❌	✅
MTP	❌	✅
Context window	1M	1M
API price (input/output)	$0.06 / $0.24 per 1M tokens	$0.10 / $0.50 per 1M tokens

If you're on an A100-80GB node on E2E Networks, Nano is the right call — it's built for that hardware tier and gives you 1M context at significantly lower cost. Super needs H100 minimum.
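For hosted API usage, the price row in the table translates into a per-workload cost as follows. The token counts in the example are made up for illustration.

```python
# (input price, output price) in dollars per 1M tokens, from the table above.
API_PRICES = {
    "Nano": (0.06, 0.24),
    "Super": (0.10, 0.50),
}


def api_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a workload at the listed per-1M-token prices."""
    in_price, out_price = API_PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Example workload (made up): 10M input tokens, 2M output tokens.
for model in API_PRICES:
    print(f"{model}: ${api_cost(model, 10_000_000, 2_000_000):.2f}")
```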

Fine-tuning

Unsloth supports both Nano and Super. Constraints for Super:

Router-layer fine-tuning is disabled by default for stability.
BF16 LoRA requires 256GB VRAM minimum — 4× H100-80GB or 2× H200.
For multi-GPU, add device_map="balanced" to distribute layers evenly.

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    max_seq_length=32768,
    load_in_4bit=False,
    device_map="balanced",
)

E2E Networks multi-GPU H100 nodes give you the VRAM headroom to run LoRA fine-tuning on Super without needing to break the bank on reserved cloud instances.
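Before booking a node, a quick check that a GPU configuration clears the 256GB BF16 LoRA floor can save a failed run. A sketch: the helper name is illustrative, and it compares raw VRAM totals only, without accounting for per-GPU overhead or uneven layer placement.

```python
GPU_VRAM_GB = {"H100-80GB": 80, "H200": 141}
LORA_BF16_MIN_GB = 256  # minimum quoted above for BF16 LoRA on Super


def fits_lora(gpu: str, count: int, required_gb: int = LORA_BF16_MIN_GB) -> bool:
    """True if the node's combined VRAM meets the requirement."""
    return GPU_VRAM_GB[gpu] * count >= required_gb


for gpu, count in (("H100-80GB", 2), ("H100-80GB", 4), ("H200", 2)):
    status = "ok" if fits_lora(gpu, count) else "too small"
    print(f"{count}x {gpu}: {GPU_VRAM_GB[gpu] * count} GB -> {status}")
```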

License

NVIDIA Nemotron Open Model License. Commercially permissive, attribution required. Patent termination clause activates if you assert patent claims against NVIDIA. Not Apache 2.0 — read the full terms before shipping a product on top of it.

Summary

Nemotron Super's 12.7B active params out of 120.6B total means near-full-model reasoning quality at a fraction of the per-token compute cost. The SSM hybrid is what makes 1M context viable without KV cache explosion. The throughput numbers reflect real architectural wins.

The fastest path to running it: spin up a single H100-80GB node on E2E Networks, pull the Q4_K_XL GGUF from Unsloth's HF repo, serve via llama-server. You're looking at a 120B model with best-in-class reasoning benchmarks running in under 30 minutes from a cold start.

Sources: NVIDIA GTC 2026, Unsloth documentation, Together AI technical blog, HuggingFace model card (nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16)
