120.6B total / 12.7B active parameters. Hybrid Mamba-2 + Transformer + LatentMoE. 1M token context window. Runs on a single H100-80GB at 4-bit quantization — available today on E2E Networks.
NVIDIA dropped Nemotron-3-Super at GTC on March 11, 2026. It's a 120B open hybrid reasoning MoE model that leads its size class on AIME 2025, SWE-Bench Verified, and Terminal Bench — while hitting 478 tok/s on B200, which is 7.5× the throughput of Qwen3.5-122B. This guide covers the architecture, what the hardware requirements actually are, and how to get it running.
Get ₹2,000 free credits to test your AI workloads
Sign up and complete ID verification to unlock free credits. Deploy on NVIDIA H200, H100, and L40S GPUs—no commitment required.
Architecture
Nemotron Super is not a standard transformer. It layers three distinct mechanisms:
1. Mamba-2 (SSM) layers: linear-time sequence modeling. Unlike attention, SSMs don't scale quadratically with context length. This is why the 1M token context window is computationally realistic and not just a marketing claim. Each Mamba-2 layer carries a fixed-size recurrent state forward instead of a KV cache, so its memory does not grow with context length. Only the interspersed attention layers keep a KV cache that grows linearly, which is far less cache than a standard all-attention transformer would need. The OOM risk at 1M therefore comes mainly from that remaining attention cache and from activation buffers, not from the SSM layers.
2. Standard Transformer attention layers — interspersed with Mamba-2 blocks for local pattern capture and cross-token reasoning.
3. LatentMoE (Mixture of Experts): 512 total experts, 22 active per token. This is why the model has 120.6B total parameters but only 12.7B active at any token step. The router selects 22 experts per token; the rest sit idle for that token. Inference compute scales with active params, not total, although all 120.6B parameters still have to fit in memory.
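The routing step in item 3 can be sketched in plain Python. This is a minimal illustration of generic top-k expert routing, assuming a softmax router with renormalized gate weights. The article does not describe how LatentMoE's router is implemented, so `route_token` and the random logits are hypothetical.

```python
import math
import random

NUM_EXPERTS = 512   # total experts, from the text above
TOP_K = 22          # experts active per token, from the text above

def route_token(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and return {expert_id: gate_weight}."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(z - peak) for z in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top-k experts; the rest do no work for this token.
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    # Renormalize so the selected gate weights sum to 1.
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
gates = route_token(logits)
print(len(gates), "of", NUM_EXPERTS, "experts active")
```

Only 22 of 512 experts (about 4.3%) run per token in this sketch. The active-parameter share quoted above (12.7B of 120.6B, about 10.5%) is higher, presumably because the non-expert layers run for every token.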
Multi-Token Prediction (MTP): alongside these three mechanisms, the model is trained to predict multiple future tokens simultaneously. This improves sample efficiency and reasoning quality on multi-step tasks.
The model uses NoPE (No Positional Embeddings) — extending context only requires changing max_position_embeddings. No YaRN rope scaling needed.
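The memory argument behind the hybrid design can be made concrete with the standard KV cache size formula. The sketch below compares a cache that grows with context length against a fixed-size recurrent state. All layer counts and dimensions are hypothetical placeholders chosen for illustration, not Nemotron Super's published configuration.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: keys and values for every attention layer and every token."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_ssm_layers, state_elems_per_layer, bytes_per_elem=2):
    """Recurrent state size: fixed per layer, independent of sequence length."""
    return n_ssm_layers * state_elems_per_layer * bytes_per_elem

# Hypothetical shapes for illustration only; not the model's real config.
ATTN_LAYERS, KV_HEADS, HEAD_DIM = 8, 8, 128
SSM_LAYERS, STATE_ELEMS = 40, 1_000_000

for ctx in (262_144, 1_048_576):
    kv_gib = kv_cache_bytes(ctx, ATTN_LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    ssm_gib = ssm_state_bytes(SSM_LAYERS, STATE_ELEMS) / 2**30
    print(f"ctx={ctx:>9,}  kv_cache={kv_gib:6.2f} GiB  ssm_state={ssm_gib:.3f} GiB")
```

Quadrupling the context quadruples the attention cache while the SSM state stays constant, so a model with fewer attention layers has less memory that scales with context.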
Training Details
| Attribute | Value |
|---|---|
| Pre-training tokens | 25 trillion |
| Pre-training data cutoff | December 2025 |
| Post-training / RLHF cutoff | February 2026 |
| Native training precision | NVFP4 (Blackwell-native, first in family) |
| Languages | 20 |
| Context (default / max) | 262,144 / 1,048,576 tokens |
NVFP4 is a 4-bit floating point format introduced with Blackwell. Nemotron Super is the first model in the Nemotron family trained natively in NVFP4 rather than quantized after training. On B200/GB200 this enables 4× the throughput of Hopper FP8. On H100/H200, you fall back to FP8 or quantized GGUF, because NVFP4 kernel acceleration is Blackwell-only.
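To make the idea of a 4-bit floating point format concrete, here is a toy block quantizer. It assumes the commonly described layout of 4-bit E2M1 values that share one scale across a small block. Storing the scale as a plain Python float is a simplification, and `quantize_block` is a hypothetical helper, not an NVIDIA API or the training recipe.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_block(block):
    """Quantize one block of floats to signed E2M1 values plus a shared scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block), 1.0
    scale = amax / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    codes = []
    for x in block:
        # Round the scaled magnitude to the nearest representable grid value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(mag if x >= 0 else -mag)
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate values from the 4-bit codes and the block scale."""
    return [c * scale for c in codes]

codes, scale = quantize_block([0.0, 0.1, -0.25, 3.0])
print(codes, scale)
print(dequantize_block(codes, scale))
```

With only eight magnitudes per sign, small values inside a block can round to zero (0.1 does here), so the per-block scale largely determines accuracy.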
Inference Settings
NVIDIA's official recommendations:
# General chat / instruction
temperature = 1.0
top_p = 1.0
# Tool calling
temperature = 0.6
top_p = 0.95
For max_new_tokens: set between 32,768 and 262,144 for standard workloads. Context length is a separate setting: extending it to 1M is possible but may trigger CUDA OOM on a single H100, so 262,144 is the safe ceiling for single-GPU deployment.
The model uses <think> (token ID 12) and </think> (token ID 13) for reasoning traces. In llama.cpp, pass --special --verbose-prompt to see these tokens in output.
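If you work with raw completions rather than a server that separates the trace for you, the reasoning block can be split off with plain string handling. This is a minimal sketch that assumes at most one `<think>...</think>` block per response; `split_reasoning` is a hypothetical helper name.

```python
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def split_reasoning(text):
    """Return (reasoning, answer) from raw model output containing a <think> block."""
    before, open_tag, rest = text.partition(THINK_OPEN)
    if not open_tag:
        return "", text.strip()        # no reasoning trace present
    reasoning, close_tag, answer = rest.partition(THINK_CLOSE)
    if not close_tag:
        return reasoning.strip(), ""   # trace was cut off before </think>
    return reasoning.strip(), (before + answer).strip()

raw = "<think>2 + 2 is 4</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

The truncated case matters in practice: if max_new_tokens is too small, generation can stop inside the trace and leave no final answer.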
Hardware Requirements
Most write-ups get this wrong. The actual breakdown by quantization:
| Quantization | VRAM Required | Hardware |
|---|---|---|
| NVFP4 (native) | ~60GB+ (120.6B params × 4 bits, weights only) | B200 / GB200 only |
| 4-bit GGUF (Q4) | ~64–72GB | Single H100-80GB or H200 ✅ |
| 8-bit | ~128GB | 2× H100-80GB or single H200 |
| BF16 (full precision) | ~240GB+ | 8× H100-80GB |
| BF16 LoRA fine-tuning | 256GB VRAM | Multi-GPU |
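The VRAM figures in the table can be sanity-checked with weight-only arithmetic: parameter count times bits per parameter. The sketch below gives lower bounds only. It ignores the KV cache, activations, and quantization metadata, and `weight_gb` is a hypothetical helper.

```python
TOTAL_PARAMS = 120.6e9  # total parameter count, from the article

def weight_gb(bits_per_param, params=TOTAL_PARAMS):
    """Weight-only memory in GB (10^9 bytes); excludes KV cache and activations."""
    return params * bits_per_param / 8 / 1e9

for name, bits in (("4-bit", 4), ("8-bit", 8), ("BF16", 16)):
    print(f"{name:>5}: {weight_gb(bits):6.1f} GB of weights")
```

The 4-bit GGUF, 8-bit, and BF16 rows are consistent with these lower bounds plus runtime overhead.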
The practical entry point is a single H100-80GB running Q4_K_XL GGUF. If you're already running Llama 3.1 70B on a single H100, Nemotron Super at 4-bit is a direct hardware swap. E2E Networks has H100-80GB nodes available on-demand — you can spin one up and have this model running in under 30 minutes.
⚠️ Don't set --ctx-size to 1M on a single H100. Use 262144 as the ceiling. Only push beyond that on H200 or a multi-GPU setup.
Running on E2E Networks
If you're on an E2E Networks GPU node, here's the fastest path to get Nemotron Super running.
llama.cpp / llama-server (single H100 — recommended starting point)
Build llama.cpp:
# Install build dependencies
apt-get update && apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
# Fetch the source and configure a static CUDA build
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
# Compile only the binaries used in this guide
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
# Copy the binaries next to the source tree for shorter paths
cp llama.cpp/build/bin/llama-* llama.cpp
Run inference (chat):
./llama.cpp/llama-cli \
-hf unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF:UD-Q4_K_XL \
--ctx-size 16384 \
--temp 1.0 --top-p 1.0
Deploy as OpenAI-compatible server:
# Assumes the GGUF shards already exist at this local path (downloaded from the Unsloth HF repo).
./llama.cpp/llama-server \
    --model unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF/UD-Q4_K_XL/NVIDIA-Nemotron-3-Super-120B-A12B-UD-Q4_K_XL-00001-of-00003.gguf \
    --alias "unsloth/NVIDIA-Nemotron-3-Super-120B-A12B" \
    --prio 3 \
    --min-p 0.01 \
    --temp 0.6 --top-p 0.95 \
    --ctx-size 16384 \
    --port 8001
Client:
from openai import OpenAI

# llama-server does not check the API key, but the client requires a non-empty value.
client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")
completion = client.chat.completions.create(
    model="unsloth/NVIDIA-Nemotron-3-Super-120B-A12B",
    messages=[{"role": "user", "content": "Your prompt here"}],
)
message = completion.choices[0].message
# reasoning_content is a server-side extension; fall back to None if it is absent.
print(getattr(message, "reasoning_content", None))  # <think> trace
print(message.content)  # final response
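If you would rather not install the `openai` package, the same endpoint can be called with the standard library. The sketch below builds a request for the OpenAI-compatible `/v1/chat/completions` route and applies the sampling presets quoted in the Inference Settings section. `PRESETS`, `build_request`, and `send` are hypothetical names, and the port and model alias are assumed to match the server command above.

```python
import json
import urllib.request

# Sampling presets quoted in the Inference Settings section.
PRESETS = {
    "chat": {"temperature": 1.0, "top_p": 1.0},
    "tools": {"temperature": 0.6, "top_p": 0.95},
}

def build_request(prompt, mode="chat", base_url="http://127.0.0.1:8001/v1",
                  model="unsloth/NVIDIA-Nemotron-3-Super-120B-A12B"):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **PRESETS[mode],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request, timeout=600):
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)

if __name__ == "__main__":
    req = build_request("Your prompt here", mode="tools")
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # reply = send(req)  # uncomment once llama-server is listening on port 8001
```

Building the request separately from sending it lets you inspect the payload before a long generation starts.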
vLLM (multi-GPU / production throughput)
For multi-GPU H100 nodes on E2E Networks, vLLM with tensor parallelism is the cleaner production path:
pip install vllm
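from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    # Number of GPUs to shard across. Size this to the checkpoint you load:
    # per the hardware table, 2× H100 suits 8-bit weights, while BF16 needs ~240GB+.
    tensor_parallel_size=2,
    max_model_len=32768,
)
# General chat settings from NVIDIA's recommendations above.
sampling_params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=4096)
outputs = llm.generate(["Your prompt here"], sampling_params)
print(outputs[0].outputs[0].text)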
vLLM handles continuous batching and paged attention out of the box. Best choice for multi-tenant inference serving where you need to maximize throughput per GPU-hour.
TensorRT-LLM (maximum throughput on Hopper)
Highest raw throughput on H100/H200. Requires building a TRT-LLM engine for your specific GPU configuration — more setup overhead but worth it for sustained high QPS in production.
⚠️ TRT-LLM 1.3.0 engines are not compatible with Triton Inference Server 1.1.0. Pin your versions explicitly before spending time on engine builds.
SGLang
Competitive with vLLM for structured generation and multi-turn workloads. RadixAttention gives it an edge when prompt prefixes are frequently reused — long system prompts shared across many requests.
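The benefit of prefix reuse can be illustrated with a toy trie over token IDs. This is not SGLang's implementation, only a sketch of the lookup idea. The `PrefixCache` class and the token IDs are hypothetical, and a real system would store cached attention state at each node rather than an empty dict.

```python
class PrefixCache:
    """Toy trie over token IDs that reports how much of a prompt is already cached."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a processed prompt so later requests can share its prefix."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

cache = PrefixCache()
system_prompt = [101, 7, 7, 42, 9]        # hypothetical token IDs
cache.insert(system_prompt + [500, 501])  # first request
hit = cache.match_len(system_prompt + [600, 601, 602])
print(f"{hit} of 8 prompt tokens can reuse cached state")
```

The second request shares only the system prompt with the first, so 5 of its 8 tokens are cache hits and only the remaining 3 need fresh prefill.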
Benchmarks
| Benchmark | Nemotron Super | GPT-OSS-120B | Qwen3.5-122B |
|---|---|---|---|
| AIME 2025 | Best in class | — | — |
| Terminal Bench | Best in class | — | — |
| SWE-Bench Verified | Best in class | — | — |
| Throughput (tok/s, B200) | 478 | ~217 | ~64 |
7.5× throughput over Qwen3.5-122B is a real architectural advantage from the MoE routing and SSM layers — not cherry-picked conditions. For a production endpoint, that gap directly translates to cost per token.
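The link between throughput and cost per token is simple arithmetic. The sketch below uses the throughput figures from the table, two of which are approximate. The hourly GPU price is a hypothetical placeholder, and full GPU utilization is assumed.

```python
GPU_PRICE_PER_HOUR = 4.00  # hypothetical hourly GPU price, arbitrary currency units

# Throughput figures (tokens/second on B200) from the benchmark table.
THROUGHPUT = {"Nemotron Super": 478, "GPT-OSS-120B": 217, "Qwen3.5-122B": 64}

def cost_per_million_tokens(tok_per_s, price_per_hour=GPU_PRICE_PER_HOUR):
    """Cost of generating one million tokens on a fully utilized GPU."""
    tokens_per_hour = tok_per_s * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

costs = {name: cost_per_million_tokens(tps) for name, tps in THROUGHPUT.items()}
for name, cost in costs.items():
    print(f"{name:<15} {cost:6.2f} per 1M tokens")
ratio = costs["Qwen3.5-122B"] / costs["Nemotron Super"]
print(f"cost ratio: {ratio:.2f}x")
```

The price cancels out of the ratio, so the relative cost gap equals the throughput gap (478 / 64, about 7.5×) whatever the GPU actually costs.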
Nano vs Super — Which One to Run
| | Nano (31.6B / 3.6B active) | Super (120.6B / 12.7B active) |
|---|---|---|
| Min GPU (4-bit) | Single A100-80GB | Single H100-80GB |
| Min GPU (BF16) | Single H100-80GB | 8× H100-80GB |
| LatentMoE | ❌ | ✅ |
| MTP | ❌ | ✅ |
| Context window | 1M | 1M |
| API price (input/output) | 0.24 per 1M | 0.50 per 1M |
If you're on an A100-80GB node on E2E Networks, Nano is the right call — it's built for that hardware tier and gives you 1M context at significantly lower cost. Super needs H100 minimum.
Fine-tuning
Unsloth supports both Nano and Super. Constraints for Super:
- Router-layer fine-tuning is disabled by default for stability.
- BF16 LoRA requires 256GB VRAM minimum — 4× H100-80GB or 2× H200.
- For multi-GPU, add device_map="balanced" to distribute layers evenly.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="nvidia/NVIDIA-Nemotron-3-Super-120B-A12B",
    max_seq_length=32768,
    load_in_4bit=False,     # keep 16-bit weights for BF16 LoRA
    device_map="balanced",  # spread layers evenly across all visible GPUs
)
E2E Networks multi-GPU H100 nodes give you the VRAM headroom to run LoRA fine-tuning on Super without needing to break the bank on reserved cloud instances.
License
NVIDIA Nemotron Open Model License. Commercially permissive, attribution required. Patent termination clause activates if you assert patent claims against NVIDIA. Not Apache 2.0 — read the full terms before shipping a product on top of it.
Summary
Nemotron Super's 12.7B active params out of 120.6B total means near-full-model reasoning quality at a fraction of the per-token compute cost. The SSM hybrid is what makes 1M context viable without KV cache explosion. The throughput numbers reflect real architectural wins.
The fastest path to running it: spin up a single H100-80GB node on E2E Networks, pull the Q4_K_XL GGUF from Unsloth's HF repo, serve via llama-server. You're looking at a 120B model with best-in-class reasoning benchmarks running in under 30 minutes from a cold start.
Sources: NVIDIA GTC 2026, Unsloth documentation, Together AI technical blog, HuggingFace model card (nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-Base-BF16)


