NVIDIA A30 vs V100 for LLM Inference: vLLM, TGI, TensorRT-LLM Benchmark (7B–70B Models)

E2E Networks

Content Team @ E2E Networks

March 24, 2026·41 min read

Overview

As LLMs move from research artifacts into production systems, the infrastructure question becomes as critical as the model question. Choosing the wrong GPU or serving stack doesn't just cost money — it determines which models you can run at all, how fast you can serve them, and whether your latency targets are even achievable.

Most benchmarks compare GPUs in isolation or focus on training efficiency. What matters in real deployment is the full picture: which stacks actually run on your hardware, how they compare in throughput and latency under concurrent load, and whether the cost tradeoffs justify the upgrade path.

This report benchmarks seven LLMs ranging from 7B to 70B parameters across four serving stacks (vLLM, TGI, TensorRT-LLM, and Triton) on two GPU clusters: 4x NVIDIA A30 (24 GB) and 4x Tesla V100 (32 GB). The objective is to produce a complete, engineering-grade reference for teams deciding between these platforms for LLM inference workloads.

Key Findings

  • vLLM is the only stack that runs on both A30 and V100
  • A30 outperforms V100 by 24–35% in throughput despite less VRAM per GPU
  • Stack ranking on A30: vLLM > TRT-LLM > TGI — vLLM leads TRT-LLM by 8–16% and TGI by 25–41%
  • A30 is 18–34% cheaper per million tokens than V100
  • BF16 offers negligible benefit over FP16 on A30 (about 1% or less at 4 GPUs), and is completely unavailable on V100
  • Scaling efficiency is 61–74% from 1 to 4 GPUs over PCIe
  • vLLM TTFT improves dramatically with more GPUs (54ms → 23ms for 7B on 1→4 GPU); TGI stays flat at ~41ms

Quick Reference: A30 vs V100 at a Glance

Metric | A30 (4 GPU, vLLM) | V100 (4 GPU, vLLM) | Winner
Best throughput (7B model) | 526.4 tok/s | 400.2 tok/s | A30 +32%
Best throughput (13B model) | 309.7 tok/s | 228.8 tok/s | A30 +35%
p90 latency (7B model) | 1,951 ms | 2,561 ms | A30 −24%
Cost per 1M tokens (7B) | INR 190 | INR 278 | A30 −32%
Serving stacks supported | 3 (vLLM, TGI, TRT-LLM) | 1 (vLLM only) | A30
BF16 support | Yes | No | A30
Max model size (4 GPU) | 14B (vLLM) / 32B (TGI) | 32B | V100 (for 32B)
Power efficiency | 4.12 tok/s/W | 3.41 tok/s/W | A30 +21%

1. Test Environment

Hardware

Spec | 4x A30 | 4x V100
GPU | NVIDIA A30 | Tesla V100-PCIE
Architecture | Ampere (sm_80) | Volta (sm_70)
VRAM per GPU | 24 GB | 32 GB
Total VRAM | 96 GB | 128 GB
BF16 Support | Yes | No
Interconnect | PCIe | PCIe
Hourly Cost | INR 360/hr | INR 400/hr
Driver | ~550.x (approx)¹ | 580.126.09

¹ A30 server was decommissioned before exact driver version was recorded. Version estimated from CUDA compatibility matrix.


Software Stack

Component | Image | Digest (sha256) | Pulled
OS | Ubuntu 22.04 LTS | — | —
vLLM | vllm/vllm-openai:latest | 2296a2a7e1ce… | 2026-03-07
TGI | ghcr.io/huggingface/text-generation-inference:latest | e6b0af6e0bf6… | 2026-01-08
TensorRT-LLM | nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc6 | 6f7842605abc… | 2026-02-27
Triton | nvcr.io/nvidia/tritonserver:26.02-trtllm-python-py3 | e29ed3221ac3… | 2026-02-14

Note: :latest tags were pulled on the dates shown. Full digests are recorded in Appendix B for reproducibility.


Models Tested

Model | Parameters | FP16 Size (approx)
meta-llama/Llama-2-7b-hf | 7B | ~14 GB
mistralai/Mistral-7B-v0.1 | 7B | ~14 GB
deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 8B | ~16 GB
meta-llama/Llama-2-13b-hf | 13B | ~26 GB
deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 14B | ~28 GB
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 32B | ~64 GB
meta-llama/Llama-2-70b-hf | 70B | ~140 GB

Benchmark Configuration

Parameter | Value
Prompt length | 128 tokens
Generation length | 256 tokens
Batch size (concurrent clients) | 4
Warmup | 30 seconds
Measurement window | 120 seconds
Sampling | Greedy (temperature=0.0)
GPU monitoring | nvidia-smi at 1s intervals


2. Serving Stack Compatibility Matrix

A critical finding before any throughput numbers: most serving stacks no longer support V100. This is not a configuration issue — it's an architectural split driven by the industry's shift to Ampere as the minimum supported compute capability.

Stack | A30 (Ampere) | V100 (Volta) | Notes
vLLM | WORKS | WORKS | Only stack supporting both architectures
TGI | WORKS | FAILS | CUDA error: no kernel image for sm_70 (compiled for Ampere+ only)
TensorRT-LLM | WORKS | FAILS | Container prints "GPU not supported", engine build crashes
Triton | PARTIAL | FAILS | A30: version mismatch with TRT engines. V100: "No supported GPU detected"

Triton on A30 failed due to a version mismatch: engines were compiled with TensorRT-LLM 1.3.0rc6, but the Triton 26.02 container only supports TRT-LLM backend v1.1.0. The compiled engines could not be deserialized, resulting in a C++ crash. Rebuilding engines with a matching compiler version would resolve this but was not attempted due to the 12-hour build time.

Implication: Organizations running V100 infrastructure are effectively locked into vLLM for LLM inference. TGI, TensorRT-LLM, and Triton have all moved to Ampere-or-newer as their minimum supported architecture. For V100 deployments, stack diversity is no longer an option: everything depends on a single serving stack, vLLM.

[Figure: Serving Stack Compatibility Matrix]


3. Throughput Results

Understanding the Empty Cells

Before reading the tables, here's why certain cells are empty:

Symbol | Meaning | Explanation
OOM | Out of Memory | Model weights exceed available VRAM. E.g., 13B FP16 (~26 GB) cannot fit on 1x A30 (24 GB).
crash | Engine Crash | vLLM v1 engine initialization bug on Volta (V100) for DeepSeek models at low GPU counts. The same models work at higher GPU counts via a different tensor-parallel code path.
— | Not attempted / OOM | Configuration was skipped because the model cannot fit. vLLM's peak usage reaches ~22–24 GB per GPU once its KV cache is pre-allocated, so even the 8B model fails on 1x A30 with vLLM (but succeeds with TGI's lighter memory footprint).

Why vLLM shows more "—" than TGI on A30: vLLM pre-allocates nearly all VRAM for its PagedAttention KV cache (22.6–24.1 GB per GPU). On A30's 24 GB GPUs, this leaves almost zero headroom after loading even a 7B model. TGI uses a more conservative memory strategy (~21 GB per GPU), allowing it to run models on fewer GPUs where vLLM cannot.


A30 vLLM Throughput: FP16 vs BF16 Across 1–4 GPUs

Model | Precision | 1 GPU (tok/s) | 2 GPUs (tok/s) | 4 GPUs (tok/s)
Llama-2-7B | FP16 | 215.9 | 341.6 | 526.4
Llama-2-7B | BF16 | 215.6 | 340.7 | 526.8
Mistral-7B | FP16 | 199.4 | 324.1 | 497.8
Mistral-7B | BF16 | 199.4 | 323.8 | 497.1
DeepSeek-8B | FP16 | — (OOM) | 305.7 | 466.2
DeepSeek-8B | BF16 | — (OOM) | 306.3 | 463.4
Llama-2-13B | FP16 | — (OOM) | 188.0 | 309.7
Llama-2-13B | BF16 | — (OOM) | 187.4 | 309.8
DeepSeek-14B | FP16 | — (OOM) | — (OOM) | 267.8
DeepSeek-14B | BF16 | — (OOM) | — (OOM) | 267.7
DeepSeek-32B | FP16 | — (OOM) | — (OOM) | — (OOM)
DeepSeek-32B | BF16 | — (OOM) | — (OOM) | OOM Crash
Llama-2-70B | FP16/BF16 | — (OOM) | — (OOM) | — (OOM)

Why the 8B model can't run on 1 A30 GPU with vLLM (and 7B barely does): vLLM pre-allocates a large contiguous KV cache at startup. A 7B model in FP16 uses ~14 GB for weights, and vLLM reserves roughly another 9 GB for the KV cache, totaling ~23 GB. On the 24 GB A30 this leaves under 1 GB of headroom, so any additional allocation triggers OOM. Only the two smallest models (Llama-2-7B and Mistral-7B) squeezed through on a single GPU; DeepSeek-8B, at ~16 GB of weights, did not.

Why 13B needs at least 2 GPUs and 14B needs 4 (vLLM): Llama-2-13B in FP16 requires ~26 GB just for weights, exceeding a single A30's 24 GB. On 2 GPUs (48 GB total), the 13B model fits (~13 GB per GPU for weights + ~10 GB KV cache). The 14B model is slightly larger and needed all 4 GPUs with vLLM on A30.

Why 32B/70B fail even on 4 GPUs: DeepSeek-32B requires ~64 GB for weights. 4x A30 = 96 GB total, but vLLM's per-GPU overhead leaves insufficient room. The 70B model needs ~140 GB — no configuration on A30 (96 GB max) can accommodate it.
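The weight arithmetic behind these notes can be sketched as follows. This is a rough lower bound only, assuming 2 bytes per parameter (FP16/BF16) and an even split of weights under tensor parallelism; the function name is illustrative, and actual usage is higher once the KV cache and runtime overhead are added, which is why some configurations that pass this check still OOM.

```python
def weights_gb_per_gpu(params_billion: float, n_gpus: int, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint per GPU in GB (assumes even tensor-parallel sharding)."""
    return params_billion * bytes_per_param / n_gpus

A30_VRAM_GB = 24  # per-GPU VRAM from the hardware table

for name, params in [("Llama-2-7B", 7), ("Llama-2-13B", 13),
                     ("DeepSeek-32B", 32), ("Llama-2-70B", 70)]:
    for n in (1, 2, 4):
        w = weights_gb_per_gpu(params, n)
        left = A30_VRAM_GB - w
        if left <= 0:
            status = "weights alone exceed VRAM"
        else:
            status = f"{left:.1f} GB left for KV cache and overhead"
        print(f"{name} on {n}x A30: {w:.1f} GB weights per GPU, {status}")
```

Passing this check is necessary but not sufficient: DeepSeek-32B leaves 8 GB per GPU on 4x A30 by this estimate, yet still ran out of memory under vLLM in our runs.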


A30 TGI Throughput: FP16 vs BF16 Across 1–4 GPUs

Model | Precision | 1 GPU (tok/s) | 2 GPUs (tok/s) | 4 GPUs (tok/s)
Llama-2-7B | FP16 | 193.2 | 273.5 | 374.3
Llama-2-7B | BF16 | 193.6 | 269.6 | 374.4
Mistral-7B | FP16 | 179.4 | 254.9 | 359.2
Mistral-7B | BF16 | 182.1 | 262.0 | 363.2
DeepSeek-8B | FP16 | 169.8 | 242.6 | 340.1
DeepSeek-8B | BF16 | 172.8 | 247.8 | 341.1
Llama-2-13B | FP16 | — (OOM) | 163.8 | 245.2
Llama-2-13B | BF16 | — (OOM) | 164.0 | 243.6
DeepSeek-14B | FP16 | — (OOM) | 148.5 | 214.9
DeepSeek-14B | BF16 | — (OOM) | 148.2 | 215.3
DeepSeek-32B | FP16 | — (OOM) | — (OOM) | 120.8
DeepSeek-32B | BF16 | — (OOM) | — (OOM) | 120.7
Llama-2-70B | FP16/BF16 | — (OOM) | — (OOM) | — (OOM)

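TGI runs the 8B model on 1 GPU where vLLM cannot. TGI's peak footprint is ~21 GB per GPU versus vLLM's ~23 GB, giving it just enough headroom to load DeepSeek-8B on a single 24 GB A30 (vLLM manages only the two 7B models there). TGI also runs DeepSeek-14B on 2 GPUs and DeepSeek-32B on 4 GPUs, both of which OOM under vLLM. However, 13B+ models still exceed single-GPU capacity on both stacks.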


A30 TensorRT-LLM Throughput: FP16 and BF16 (4 GPUs only)

Tested using trtllm-serve (TRT-LLM 1.3.0rc6, PyTorch backend). Engine build and serving handled automatically from HuggingFace checkpoints. Only 4-GPU configs attempted — the PyTorch backend requires all GPUs to be specified upfront.

Model | FP16 (tok/s) | BF16 (tok/s)
Llama-2-7B | 453.0 | 451.9
Mistral-7B | 438.2 | 438.0
DeepSeek-8B | 419.2 | 420.1
Llama-2-13B | 278.0 | 278.6
DeepSeek-14B | 248.4 | 248.3

Note: --backend pytorch mode does not fully compile optimized TensorRT engines. The dedicated TRT engine compilation path (manual build) would likely close the gap with vLLM further but was out of scope here.


A30 Three-Stack Head-to-Head (4 GPUs, FP16)

Model | vLLM (tok/s) | TRT-LLM (tok/s) | TGI (tok/s) | Fastest
Llama-2-7B | 526.4 | 453.0 | 374.3 | vLLM (+16% over TRT)
Mistral-7B | 497.8 | 438.2 | 359.2 | vLLM (+14% over TRT)
DeepSeek-8B | 466.2 | 419.2 | 340.1 | vLLM (+11% over TRT)
Llama-2-13B | 309.7 | 278.0 | 245.2 | vLLM (+11% over TRT)
DeepSeek-14B | 267.8 | 248.4 | 214.9 | vLLM (+8% over TRT)

Ranking: vLLM > TRT-LLM > TGI on all models.

vLLM leads TRT-LLM by 8–16% and TGI by 25–41%. The gap between vLLM and TRT-LLM narrows for larger models (16% at 7B → 8% at 14B), suggesting compute-bound workloads benefit less from vLLM's memory management advantages. TRT-LLM's PyTorch backend sits between the two: faster than TGI by 13–23%, but slower than vLLM's PagedAttention + continuous batching.
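The vLLM lead percentages quoted above follow directly from the head-to-head table. A minimal sketch of the calculation, with the throughput values copied from that table (the dictionary layout and function name are illustrative):

```python
# Throughput in tok/s at 4x A30, FP16, copied from the head-to-head table
results = {
    "Llama-2-7B":   {"vllm": 526.4, "trt": 453.0, "tgi": 374.3},
    "Mistral-7B":   {"vllm": 497.8, "trt": 438.2, "tgi": 359.2},
    "DeepSeek-8B":  {"vllm": 466.2, "trt": 419.2, "tgi": 340.1},
    "Llama-2-13B":  {"vllm": 309.7, "trt": 278.0, "tgi": 245.2},
    "DeepSeek-14B": {"vllm": 267.8, "trt": 248.4, "tgi": 214.9},
}

def lead_pct(fast: float, slow: float) -> float:
    """Percentage by which `fast` exceeds `slow`."""
    return (fast / slow - 1.0) * 100.0

vs_trt = [lead_pct(r["vllm"], r["trt"]) for r in results.values()]
vs_tgi = [lead_pct(r["vllm"], r["tgi"]) for r in results.values()]

print(f"vLLM over TRT-LLM: {min(vs_trt):.0f}% to {max(vs_trt):.0f}%")
print(f"vLLM over TGI:     {min(vs_tgi):.0f}% to {max(vs_tgi):.0f}%")
```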

[Figure: vLLM vs TGI Throughput on 4x A30]


V100 LLM Inference Throughput: vLLM Only (FP16, tokens/sec)

Only vLLM supports V100. TGI, TensorRT-LLM, and Triton's latest versions reject V100 hardware. BF16 is not supported on V100 (Volta architecture, compute capability 7.0 — BF16 requires 8.0+).

Model | 1 GPU | 2 GPUs | 4 GPUs
Llama-2-7B | 153.8 | 250.0 | 400.2
Mistral-7B | 152.8 | 239.1 | 388.8
DeepSeek-8B | crash | 228.5 | 374.5
Llama-2-13B | 77.4 | 136.9 | 228.8
DeepSeek-14B | crash | 133.7 | 210.0
DeepSeek-32B | OOM | crash | 110.0
Llama-2-70B | OOM | OOM | OOM

V100-specific notes:

  • 7B models run on 1 GPU with room to spare: V100's 32 GB leaves vLLM far more KV cache headroom than A30's 24 GB. Llama-2-13B also fits on a single V100, which it cannot do on A30.
  • DeepSeek-8B and 14B crash on 1 GPU due to a vLLM v1 engine initialization bug specific to DeepSeek's grouped-query attention on Volta (sm_70). The same models work on 2+ GPUs because tensor parallelism uses a different multi-process code path that avoids the buggy kernel.
  • DeepSeek-32B crashes on 2 GPUs — same vLLM engine bug, not a VRAM issue (2×32 GB = 64 GB is enough for 32B weights).
  • Llama-2-70B OOMs everywhere — 70B in FP16 needs ~140 GB, 4×V100 = 128 GB. Neither platform can accommodate it.
  • Llama-2-13B on 1 GPU shows 77.4 tok/s with 13.2 s p90 latency. The model barely fits in 32 GB and is compute-bound on a single V100. Usable for batch processing but not interactive inference.

A30 vs V100 Cross-Hardware Throughput Comparison (vLLM FP16, 4 GPUs)

Model | A30 (tok/s) | V100 (tok/s) | V100/A30 Ratio | Winner
Llama-2-7B | 526.4 | 400.2 | 0.76x | A30 by 32%
Mistral-7B | 497.8 | 388.8 | 0.78x | A30 by 28%
DeepSeek-8B | 466.2 | 374.5 | 0.80x | A30 by 24%
Llama-2-13B | 309.7 | 228.8 | 0.74x | A30 by 35%
DeepSeek-14B | 267.8 | 210.0 | 0.78x | A30 by 28%
DeepSeek-32B | — | 110.0 | — | V100 (A30 OOM)

A30 is 24–35% faster than V100 on every comparable model, despite having 8 GB less VRAM per GPU. Three factors likely explain this (none was isolated in these tests):

  1. Ampere Tensor Cores (3rd gen) are architecturally superior to Volta's 1st-gen Tensor Cores, with higher FP16 FLOPS per clock cycle.
  2. Memory bandwidth efficiency: A30's HBM2e delivers better effective bandwidth than V100's HBM2, even though raw bandwidth specs are similar.
  3. CUDA kernel optimization: Modern frameworks (vLLM, PyTorch) have Ampere-specific kernel paths that exploit sm_80 features such as asynchronous memory copy, along with attention kernels such as FlashAttention-2 that target Ampere and newer GPUs.

The one exception: DeepSeek-32B. With vLLM, only V100 can run it (on 4 GPUs), since 4×32 GB = 128 GB comfortably exceeds the ~64 GB needed. A30's 96 GB total is technically sufficient for the weights, but vLLM's KV cache overhead pushes it over the edge. TGI does run the 32B model on 4x A30 (see the A30 TGI table above).

[Figure: LLM Inference Throughput: A30 vs V100]


4. Latency Results (p50 / p90 / p99, milliseconds)

Lower is better. p50 = median, p90 = 90th percentile, p99 = 99th percentile. All values in milliseconds.

Note on latency variance: p50 and p90 are within 10 ms of each other across all runs, and p99 is within 20 ms of p90. This extremely tight distribution is due to greedy decoding (temperature=0.0) with a fixed prompt — every request does essentially identical compute work. In production with variable prompts, temperature > 0, and mixed batch sizes, expect significantly wider p50-p99 spreads.
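For readers reproducing the percentile columns, here is a minimal sketch using the nearest-rank definition. The report does not state which percentile method the benchmark harness used, so the method is an assumption, and the sample latencies are hypothetical values for illustration, not measured data.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-request end-to-end latencies in milliseconds
latencies_ms = [1944, 1946, 1943, 1951, 1945, 1960, 1944, 1947, 1942, 1949]

p50, p90, p99 = (percentile(latencies_ms, p) for p in (50, 90, 99))
print(f"p50={p50} ms  p90={p90} ms  p99={p99} ms  spread={p99 - p50} ms")
```

With only a handful of requests per window, p99 is effectively the slowest request, which is worth keeping in mind when reading the p99 columns.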

[Figure: Inference Latency Heatmap: A30 vs V100]


A30 vLLM Inference Latency: p50, p90, p99 at 4 GPUs

Model | p50 | p90 | p99 | Spread (p99-p50) | Interpretation
Llama-2-7B | 1,944 | 1,951 | 1,960 | 16 ms | ~2 s, good for interactive
Mistral-7B | 2,057 | 2,062 | 2,065 | 8 ms | ~2 s, good for interactive
DeepSeek-8B | 2,196 | 2,201 | 2,206 | 10 ms | ~2.2 s, acceptable for interactive
Llama-2-13B | 3,307 | 3,311 | 3,315 | 8 ms | ~3.3 s, borderline for interactive
DeepSeek-14B | 3,823 | 3,832 | 3,837 | 14 ms | ~3.8 s, batch processing recommended

A30 vLLM Latency Scaling: 1 GPU to 4 GPUs

Model | 1 GPU (p50 / p90 / p99) | 2 GPUs (p50 / p90 / p99) | 4 GPUs (p50 / p90 / p99)
Llama-2-7B | 4,747 / 4,749 / 4,750 | 2,998 / 2,999 / 3,005 | 1,944 / 1,951 / 1,960
Mistral-7B | 5,135 / 5,137 / 5,138 | 3,159 / 3,163 / 3,164 | 2,057 / 2,062 / 2,065
DeepSeek-8B | OOM | 3,347 / 3,351 / 3,355 | 2,196 / 2,201 / 2,206
Llama-2-13B | OOM | 5,454 / 5,458 / 5,461 | 3,307 / 3,311 / 3,315
DeepSeek-14B | OOM | OOM | 3,823 / 3,832 / 3,837

A30 TGI Latency vs vLLM: p50, p90, p99 at 4 GPUs

Model | p50 | p90 | p99 | vs vLLM p90
Llama-2-7B | 2,746 | 2,750 | 2,754 | TGI 41% slower
Mistral-7B | 2,861 | 2,867 | 2,872 | TGI 39% slower
DeepSeek-8B | 3,019 | 3,025 | 3,031 | TGI 37% slower
Llama-2-13B | 4,202 | 4,208 | 4,213 | TGI 27% slower
DeepSeek-14B | 4,789 | 4,797 | 4,803 | TGI 25% slower
DeepSeek-32B | 8,491 | 8,499 | 8,507 | vLLM OOM

V100 vLLM Latency Scaling: 1 GPU to 4 GPUs

Model | 1 GPU (p50 / p90 / p99) | 2 GPUs (p50 / p90 / p99) | 4 GPUs (p50 / p90 / p99)
Llama-2-7B | 6,658 / 6,659 / 6,669 | 4,096 / 4,097 / 4,107 | 2,559 / 2,561 / 2,573
Mistral-7B | 6,706 / 6,708 / 6,718 | 4,284 / 4,287 / 4,296 | 2,634 / 2,635 / 2,644
DeepSeek-8B | crash | 4,482 / 4,486 / 4,490 | 2,734 / 2,736 / 2,744
Llama-2-13B | 13,228 / 13,231 / 13,237 | 7,482 / 7,484 / 7,494 | 4,475 / 4,478 / 4,487
DeepSeek-14B | crash | 7,616 / 7,619 / 8,367 | 4,875 / 4,877 / 4,885
DeepSeek-32B | OOM | crash | 9,310 / 9,316 / 9,326

Anomaly: DeepSeek-14B on 2x V100 shows p99=8,367 ms vs p90=7,619 ms — a 748 ms gap. All other configs have < 20 ms p99-p90 spread. This suggests an occasional long-tail stall, possibly from KV cache reallocation at near-full VRAM (31,542 MB used out of 32,768 MB).

Key latency observations:

  • V100 latency is 24–35% higher than A30 at equivalent configurations.
  • Llama-2-13B on 1x V100 has a 13.2-second p90 — effectively unusable for real-time serving. With 4 GPUs it drops to 4.5s, still only suitable for batch/async workloads.
  • For interactive use (< 3s p90), 7B/8B models need 4 GPUs on either platform. Llama-2-7B on 2x A30 sits right at the limit (2,999 ms), and 13B+ models exceed 3 s in every configuration tested.
  • 32B models have 8–9s latency even on 4 GPUs — only viable for offline/batch inference on both platforms.
  • Latency variance is negligible (p50-to-p99 spread under 1% in all but one configuration) under our controlled test conditions. Production deployments with variable inputs will see much higher variance.

5. Time to First Token (TTFT)

TTFT measures how quickly the first token is returned after a request is sent, which makes it the most important latency metric for interactive/chat applications. The end-to-end latencies in Section 4 are dominated by generation time, not prefill; TTFT isolates the prefill cost. Measured via streaming responses on the A30 v2 benchmark run.
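As a rough illustration of how TTFT can be taken from a streaming response, the sketch below times the arrival of the first chunk from any iterable. It is not the harness used for this report: `measure_ttft` and `fake_stream` are hypothetical names, and `fake_stream` stands in for a real streaming HTTP response so the example runs without a server.

```python
import time

def measure_ttft(stream):
    """Return (ttft_seconds, total_seconds, n_chunks) for an iterable of streamed chunks."""
    start = time.perf_counter()
    ttft = None
    n_chunks = 0
    for _ in stream:
        if ttft is None:
            # First chunk arrived: time elapsed so far is the time to first token
            ttft = time.perf_counter() - start
        n_chunks += 1
    total = time.perf_counter() - start
    return ttft, total, n_chunks

def fake_stream(prefill_s=0.05, per_token_s=0.005, n_tokens=8):
    """Hypothetical stand-in for a streaming response: one prefill delay, then tokens."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"

ttft, total, n = measure_ttft(fake_stream())
print(f"TTFT {ttft * 1000:.0f} ms, end-to-end {total * 1000:.0f} ms, {n} chunks")
```

Against a real server, the timer should start when the request is sent, so that network and queueing time are included in TTFT.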

A30 — vLLM TTFT (FP16, milliseconds)

Model | 1 GPU | 2 GPUs | 4 GPUs
Llama-2-7B | 54 | 34 | 23
Mistral-7B | 58 | 36 | 24
DeepSeek-8B | — | 38 | 25
Llama-2-13B | — | 62 | 38
DeepSeek-14B | — | — | 43

A30 — TGI TTFT (FP16, milliseconds)

Model | 1 GPU | 2 GPUs | 4 GPUs
Llama-2-7B | 42 | 41 | 41
Mistral-7B | 42 | 41 | 41
DeepSeek-8B | 50 | 41 | 41
Llama-2-13B | — | 42 | 42
DeepSeek-14B | — | 51 | 50

A30 — TensorRT-LLM TTFT (FP16, 4 GPUs only, milliseconds)

Model | 4 GPUs
Llama-2-7B | 63
Mistral-7B | 62
DeepSeek-8B | 63
Llama-2-13B | 71
DeepSeek-14B | 90

Key TTFT observations:

  • vLLM TTFT scales dramatically with more GPUs — 54ms → 23ms for Llama-2-7B (1→4 GPU). Tensor parallelism splits the prefill computation, directly reducing time to first token.
  • TGI TTFT stays flat at ~41ms regardless of GPU count. TGI's prefill doesn't scale with parallelism, suggesting a fixed scheduling overhead.
  • At 4 GPUs, vLLM wins on TTFT (23ms vs 41ms TGI vs 63ms TRT-LLM for 7B). vLLM's prefill is 2.7x faster than TRT-LLM's PyTorch backend — a surprising reversal given TRT-LLM's reputation for optimized inference.
  • All TTFT values are at or under 90 ms, well within the acceptable range for interactive applications. End-to-end latency (2–10 seconds) is dominated by generation, not prefill.
  • BF16 TTFT is identical to FP16 (< 1ms difference), consistent with throughput findings.

6. VRAM Utilization

A30 (24 GB per GPU)

Stack | Peak VRAM per GPU | Notes
vLLM | 22,639 – 24,091 MB | Uses 93–100% of available VRAM. Aggressive memory pre-allocation.
TGI | 21,061 – 21,721 MB | Uses 87–90%. More conservative, allows single-GPU 7B runs.

V100 (32 GB per GPU)

Config | Peak VRAM per GPU | Notes
7B models | 30,078 – 31,010 MB | 92–95% utilization
13B models | 30,036 – 31,288 MB | 92–96% utilization
14B models | 31,542 – 31,760 MB | 97% utilization
32B (4 GPU) | 32,342 MB | 99%, running at the absolute limit

vLLM pre-allocates nearly all available VRAM for KV cache, which maximizes throughput but means VRAM headroom is minimal. The 32B model on V100 uses 32,342 out of 32,768 MB — just 426 MB of headroom. Any production variation in prompt length could cause OOM events.
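Peak figures like these can be derived from 1-second `nvidia-smi` samples. The sketch below assumes the samples were collected with `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits -l 1` and saved as text; the exact command used for this report was not recorded here, and the sample rows are hypothetical.

```python
from collections import defaultdict

def peak_memory_mb(csv_text):
    """Return {gpu_index: peak memory.used in MB} from 'index, memory.used' CSV rows."""
    peaks = defaultdict(int)
    for line in csv_text.strip().splitlines():
        idx, used = (field.strip() for field in line.split(","))
        peaks[int(idx)] = max(peaks[int(idx)], int(used))
    return dict(peaks)

# Hypothetical samples for two GPUs over two polling intervals
sample = """\
0, 31010
1, 30950
0, 32342
1, 32100
"""

V100_TOTAL_MB = 32768  # total per-GPU memory used for the utilization figures above

for gpu, peak in sorted(peak_memory_mb(sample).items()):
    headroom = V100_TOTAL_MB - peak
    print(f"GPU {gpu}: peak {peak} MB ({peak / V100_TOTAL_MB:.0%}), headroom {headroom} MB")
```

Because sampling is once per second, short allocation spikes between samples will not show up in the peak.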


7. Scaling Efficiency

Scaling efficiency = throughput_4GPU / (4 × throughput_1GPU) × 100%

Model | A30 (vLLM FP16) | V100 (vLLM FP16)
Llama-2-7B | 61.0% | 65.0%
Mistral-7B | 62.4% | 63.6%
Llama-2-13B | — | 73.9%

Key observations:

  • Scaling efficiency is 61–74%, meaning significant overhead from tensor parallelism over PCIe.
  • Larger models (13B) scale better than smaller ones (7B) because the compute-to-communication ratio improves.
  • V100 shows slightly better scaling than A30, possibly due to V100's higher per-GPU VRAM reducing memory pressure during tensor-parallel communication.
  • NVLink would significantly improve these numbers — our test used PCIe interconnects only.
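The scaling-efficiency formula at the top of this section can be applied directly to the throughput tables in Section 3. A minimal sketch (the function name is illustrative):

```python
def scaling_efficiency(tput_n_gpus: float, tput_1_gpu: float, n_gpus: int) -> float:
    """Measured N-GPU throughput as a percentage of ideal linear scaling from 1 GPU."""
    return tput_n_gpus / (n_gpus * tput_1_gpu) * 100.0

# Throughput values (tok/s) copied from the vLLM FP16 tables in Section 3
a30_llama7b = scaling_efficiency(526.4, 215.9, 4)
a30_mistral = scaling_efficiency(497.8, 199.4, 4)
v100_llama13b = scaling_efficiency(228.8, 77.4, 4)

print(f"A30  Llama-2-7B:  {a30_llama7b:.1f}%")
print(f"A30  Mistral-7B:  {a30_mistral:.1f}%")
print(f"V100 Llama-2-13B: {v100_llama13b:.1f}%")
```

Models that cannot run on 1 GPU (for example Llama-2-13B on A30) have no single-GPU baseline, which is why their cells in the table are empty.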

[Figure: Multi-GPU Scaling Efficiency: A30 vs V100]


8. FP16 vs BF16

A30 Results (Ampere — supports both)

Model | Stack | 4 GPU FP16 (tok/s) | 4 GPU BF16 (tok/s) | Difference
Llama-2-7B | vLLM | 526.4 | 526.8 | +0.1%
Mistral-7B | vLLM | 497.8 | 497.1 | -0.1%
DeepSeek-8B | vLLM | 466.2 | 463.4 | -0.6%
Llama-2-7B | TGI | 374.3 | 374.4 | +0.0%
Mistral-7B | TGI | 359.2 | 363.2 | +1.1%

BF16 provides no meaningful throughput advantage over FP16 on these models and hardware. The 4-GPU differences are within noise (at most 1.1%). BF16's theoretical advantage (wider dynamic range) does not translate to measurable throughput gains for inference at this scale.

V100 (Volta — BF16 NOT supported)

All BF16 runs on V100 failed with:

ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0.
Your Tesla V100-PCIE-32GB GPU has compute capability 7.0.

This is a hardware limitation of the Volta architecture. V100 supports FP16 (via Tensor Cores) and FP32, but not BF16. If you are porting workflows that rely on BF16 (common in newer model releases), V100 will require explicit FP16 conversion or will fail at runtime.
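One way to avoid this runtime failure is to choose the dtype from the GPU's compute capability before launching the server. The sketch below is a minimal illustration: `pick_dtype` is a hypothetical helper, and the 8.0 threshold comes from the error message above. With PyTorch installed, the capability tuple could be obtained from `torch.cuda.get_device_capability()`; it is hard-coded here so the example runs without a GPU.

```python
def pick_dtype(compute_capability):
    """Return 'bfloat16' on compute capability 8.0+ (Ampere or newer), else 'float16'."""
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"

# Compute capabilities from the hardware table: A30 is sm_80, V100 is sm_70
for gpu, cc in {"A30": (8, 0), "V100": (7, 0)}.items():
    print(f"{gpu} (sm_{cc[0]}{cc[1]}): use dtype {pick_dtype(cc)}")
```

The returned string could then be passed to the serving stack's dtype option (for vLLM, the `--dtype` argument). Given the FP16 vs BF16 results above, simply forcing FP16 on both platforms is an equally reasonable choice.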


9. Cost Analysis

Cost per 1M tokens = (hourly_rate × runtime_seconds / 3600) / (tokens_generated / 1,000,000)

Best Configurations (4 GPUs, FP16)

Model | A30 vLLM | A30 TGI | V100 vLLM | A30 vs V100 Savings
Llama-2-7B | INR 190 | INR 267 | INR 278 | A30 saves 32%
Mistral-7B | INR 201 | INR 278 | INR 286 | A30 saves 30%
DeepSeek-8B | INR 215 | INR 294 | INR 297 | A30 saves 28%
Llama-2-13B | INR 323 | INR 408 | INR 486 | A30 saves 34%
DeepSeek-14B | INR 374 | INR 465 | INR 529 | A30 saves 29%
DeepSeek-32B | — | INR 828 | INR 1,011 | A30 saves 18%

A30 is the clear cost winner — 18–34% cheaper per million tokens than V100, while also being faster. The cost advantage comes from two compounding factors:

  1. A30 has a lower hourly rate (INR 360 vs 400)
  2. A30 generates tokens faster (Ampere efficiency)

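DeepSeek-32B is the one model where V100 has any cost relevance: A30 cannot run it on vLLM, so V100 wins by default if vLLM is a requirement. On TGI, 4x A30 still serves the model at INR 828 per million tokens, 18% below V100's INR 1,011.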
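Because tokens generated equal throughput multiplied by runtime, the cost formula at the top of this section reduces to hourly rate divided by tokens per hour. A minimal sketch, using the hourly rates and 4-GPU vLLM throughput figures from this report (the function name is illustrative):

```python
def cost_per_million_tokens(hourly_rate_inr: float, tokens_per_second: float) -> float:
    """INR per 1M generated tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_inr / tokens_per_hour * 1_000_000

a30 = cost_per_million_tokens(360, 526.4)   # 4x A30, Llama-2-7B, vLLM
v100 = cost_per_million_tokens(400, 400.2)  # 4x V100, Llama-2-7B, vLLM

print(f"A30:  INR {a30:.0f} per 1M tokens")
print(f"V100: INR {v100:.0f} per 1M tokens")
print(f"A30 saves {(1 - a30 / v100) * 100:.0f}%")
```

This assumes the cluster is fully busy for every billed hour; idle time raises the effective cost per token on both platforms.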

[Figure: Cost per Million Tokens: A30 vs V100]


10. Power Efficiency

Tokens per second per watt (higher is better):

Model | GPU Count | A30 vLLM | V100 vLLM
Llama-2-7B | 1 GPU | 1.32 tok/s/W | 1.00 tok/s/W
Llama-2-7B | 4 GPUs | 4.12 tok/s/W | 3.41 tok/s/W
Mistral-7B | 4 GPUs | 3.90 tok/s/W | 3.41 tok/s/W
Llama-2-13B | 4 GPUs | 2.24 tok/s/W | 1.83 tok/s/W

A30 is more power-efficient than V100 in every configuration above: 21–22% for the Llama models at 4 GPUs, and 14–32% across the whole table. Both GPUs draw less power per GPU as more GPUs are added (workload distribution reduces per-GPU load), but A30's Ampere architecture extracts more useful compute per watt. For large-scale deployments where power costs are a factor, this compounds the cost advantage beyond the INR/hr rate difference.

[Figure: Power Efficiency: Tokens per Second per Watt]


11. GPU Utilization

Average GPU compute utilization during steady-state inference (higher = better hardware utilization):

A30

Model | GPU Count | vLLM | TGI
Llama-2-7B | 1 | 99.3% | 92.3%
Llama-2-7B | 2 | 98.3% | 89.1%
Llama-2-7B | 4 | 98.3% | 86.0%
Mistral-7B | 1 | 99.3% | 93.3%
Mistral-7B | 2 | 98.3% | 89.2%
Mistral-7B | 4 | 98.3% | 86.3%
DeepSeek-8B | 2 | 98.5% | 90.4%
DeepSeek-8B | 4 | 98.3% | 86.8%
Llama-2-13B | 2 | 99.4% | 92.9%
Llama-2-13B | 4 | 99.2% | 89.9%
DeepSeek-14B | 4 | 99.2% | 91.5%
DeepSeek-32B | 4 | — | 94.5%

Key observations:

  • vLLM saturates the GPU (98–99%) across all configs — its continuous batching + PagedAttention keeps compute units fully occupied.
  • TGI drops to 86% at 4 GPUs — tensor parallel overhead and TGI's batching strategy leave more idle cycles.
  • TGI utilization drops with more GPUs (93% → 86% from 1→4 GPU), while vLLM stays flat. This is consistent with vLLM's better multi-GPU scaling (Section 7).
  • DeepSeek-32B on TGI hits 94.5% — the largest model keeps TGI's batching more fully occupied, consistent with the narrowing TGI vs vLLM gap at larger model sizes.

12. Model-Specific Observations

Llama-2-7B

Best-performing model across both hardware platforms. Highest throughput (526 tok/s on A30, 400 tok/s on V100) and lowest cost. If you need maximum throughput for a general-purpose model, Llama-2-7B is the pick.

Mistral-7B

Performs within 3–6% of Llama-2-7B at 4 GPUs (5.4% lower on A30, 2.8% lower on V100) despite having a different architecture (sliding window attention). No significant advantage or disadvantage across the tested configurations.

DeepSeek-R1-Distill-Llama-8B

Slower than Llama-2-7B at 4 GPUs (11% less throughput on A30, 6% less on V100) despite a similar parameter count. The extra ~1B parameters likely account for much of the gap. Crashed on V100 single-GPU due to vLLM v1 engine compatibility issues with Volta's GQA kernel path.

Llama-2-13B

~40% slower than 7B models (as expected for ~2x parameters). Scaling efficiency is the best of all models (73.9% on V100), confirming that larger models benefit more from parallelism due to the improved compute-to-communication ratio.

DeepSeek-R1-Distill-Qwen-14B

Throughput is 8–14% below Llama-2-13B at 4 GPUs (267.8 vs 309.7 tok/s on A30, 210.0 vs 228.8 tok/s on V100), in line with its slightly larger parameter count. The Qwen architecture handles tensor parallelism well, producing comparable scaling behavior to Llama-class models.

DeepSeek-R1-Distill-Qwen-32B

Only runs on 4 GPUs (needs ~64 GB VRAM). Throughput is 110–121 tok/s. The 32B model pushes V100 to 99% VRAM utilization, leaving almost no headroom. This configuration should not be used for production workloads without careful OOM mitigation.

Llama-2-70B

Could not run on A30 (96 GB total) or V100 (128 GB total): a 70B model in FP16 requires ~140 GB, more than either cluster's total VRAM. OOM documented on both platforms across all GPU counts tested.


13. Documented Failures Summary

A30 (75 failures out of 125 total runs, 50 successes)

Failure Category | Count | Affected Models | Explanation
Triton version mismatch | 28 | All models × all configs | TRT-LLM 1.3.0rc6 engine incompatible with Triton 26.02's 1.1.0 backend
TensorRT-LLM engine build failure | 22 | 7B, 8B, 13B (all GPU/precision combos) | Engine compilation failed: these models lack pre-built TRT-LLM engine configs, and the manual engine build crashed
TGI launch failure | 18 | Llama-2-7B, Mistral-7B, DeepSeek-8B (all GPU/precision combos) | Container started but produced 0 tokens: model download or weight loading failure (no error notes captured by automation)
TensorRT-LLM OOM | 6 | 14B (4 GPU), 32B (4 GPU), 70B (4 GPU), FP16 and BF16 | Engine build OOM: model weights exceed available VRAM during TRT compilation
vLLM OOM | 1 | DeepSeek-32B BF16 (4 GPU) | 32B BF16 requires ~64 GB + KV cache, exceeds 96 GB total across 4×A30
Total | 75 | |

V100 (138 failures out of 152 total runs)

Failure Category | Count | Explanation
TGI unsupported | 38 | Latest TGI has no CUDA kernels for sm_70 (Volta)
TensorRT-LLM unsupported | 38 | Container rejects V100, engine build fails
Triton unsupported | 38 | Container rejects V100 architecture
BF16 unsupported | 19 | V100 compute capability 7.0 < 8.0 required
vLLM engine crash | 3 | DeepSeek-8B and DeepSeek-14B on 1 GPU, DeepSeek-32B on 2 GPUs
OOM | 2 | 70B exceeds 128 GB, 32B exceeds 32 GB (1 GPU)

The failure rate on V100 (90.8%) is not primarily due to hardware deficiency — it's an ecosystem shift. Three of the four stacks tested have dropped Volta support entirely. The practical implication is that V100 infrastructure is now a maintenance burden for LLM serving, not a viable long-term deployment platform.


14. Recommendations

For V100 Infrastructure

  • Use vLLM exclusively. It is the only serving stack that supports V100 in its latest release.
  • Avoid models larger than 14B unless you have 4 GPUs available. 32B is possible on 4×V100 but at 99% VRAM with no headroom.
  • Consider migrating to Ampere. The ecosystem is actively dropping Volta support, and the trend will accelerate.

For A30 Infrastructure

  • Use vLLM over TGI for 25–41% better throughput and 20–29% lower cost per token.
  • FP16 is sufficient — BF16 provides no measurable benefit for inference throughput.
  • Budget for 4 GPUs — scaling from 1 to 4 GPUs gives 2.4–3x throughput improvement (61–74% efficiency).

For New Deployments

  • A30 > V100 for LLM inference. A30 is faster, cheaper, more power-efficient, and has broader serving stack support.
  • 70B models are out of scope for both platforms — neither A30 (96 GB) nor V100 (128 GB) can accommodate Llama-2-70B's ~140 GB FP16 footprint.

Serving Stack Recommendations

  • vLLM: Best overall. Highest throughput, best TTFT scaling, broadest hardware support, active development.
  • TGI: Viable alternative on Ampere+, but 25–41% slower than vLLM. Simpler API. TTFT doesn't improve with more GPUs.
  • TensorRT-LLM: 8–16% slower than vLLM with PyTorch backend. Highest TTFT of the three stacks. V100 unsupported. Fully compiled TRT engines (manual build) may close the throughput gap with vLLM.
  • Triton: Version compatibility issues make it impractical without careful container version pinning.

15. Methodology Notes

  • Duration: 120-second measurement windows (after 30s warmup) to match real-world burst inference patterns.
  • Batch size: 4 concurrent async clients to simulate realistic multi-user load.
  • Prompt: Fixed 128-token prompt ("The " repeated) for reproducibility. Not representative of real prompt diversity.
  • Limitations: Single prompt length/generation length tested. Production workloads vary significantly. No quantization (INT8/FP8) was tested.
  • TTFT: Measured via streaming responses on A30 v2 run for all three stacks. See Section 5.
  • Reproducibility: All runs used deterministic sampling (temperature=0.0). Docker images pinned to latest at time of test. Raw CSV data available for independent analysis.
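The load pattern described above (4 concurrent clients, a warmup period, then a fixed measurement window) can be sketched with `asyncio`. This is not the harness used for the report: `fake_generate` is a hypothetical stand-in for one request to a serving stack, and the durations are shortened so the example finishes in well under a second.

```python
import asyncio
import time

async def fake_generate(prompt: str, max_tokens: int) -> int:
    """Hypothetical stand-in for one request; returns the number of generated tokens."""
    await asyncio.sleep(0.01)
    return max_tokens

async def client(measure_from: float, deadline: float, counter: dict) -> None:
    """Send requests back to back until the deadline, counting only post-warmup completions."""
    while time.monotonic() < deadline:
        n_tokens = await fake_generate("The " * 128, 256)  # prompt length is approximate
        if time.monotonic() >= measure_from:
            counter["tokens"] += n_tokens
            counter["requests"] += 1

async def run_benchmark(concurrency: int = 4, warmup_s: float = 30.0, window_s: float = 120.0):
    start = time.monotonic()
    measure_from = start + warmup_s
    deadline = measure_from + window_s
    counter = {"tokens": 0, "requests": 0}
    await asyncio.gather(*(client(measure_from, deadline, counter) for _ in range(concurrency)))
    return counter["tokens"] / window_s, counter["requests"]

# Shortened durations for illustration; the report used 30 s warmup and a 120 s window
tok_per_s, n_requests = asyncio.run(run_benchmark(warmup_s=0.05, window_s=0.2))
print(f"{n_requests} requests measured, {tok_per_s:.0f} tok/s")
```

A request that starts just before the deadline is still counted when it completes, so throughput is slightly overstated for short windows; over a 120-second window the effect is small.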

16. Limitations & Caveats

This section documents known limitations so that results are interpreted correctly and future work can address the gaps.


16.1 Statistical Rigor

  • A30 data is single-run. Each A30 configuration was benchmarked once. The A30 server was decommissioned before multi-run data could be collected. Observed p50-to-p90 spreads of < 0.5% suggest low variance, but no standard deviation can be reported.
  • V100 multi-run variance analysis is complete: 3 independent runs × 3 sequence lengths across all working configurations show a std/mean of about 0.6% or less, which supports the single-run methodology. See Section 17.

16.2 Sequence Length Coverage

  • Primary results use a single fixed configuration: 128 input tokens + 256 output tokens. This represents a "medium" workload but does not capture behavior at extremes.
  • V100 supplementary data (Section 17) adds short (32+64) and long (512+512) sequence benchmarks across 3 independent runs. Throughput degrades 2–21% from short to long sequences depending on model architecture.
  • Real-world prompts have highly variable lengths (1 token to 128K+ tokens for long-context models). Our fixed-length prompts do not capture KV cache pressure from long contexts.

16.3 Workload Realism

  • Synthetic prompts: All benchmarks use "The " repeated N times. Real prompts have diverse token distributions that can affect attention computation patterns.
  • Fixed concurrency: All tests use 4 concurrent clients. Production systems may see hundreds of concurrent requests with dynamic batching behaving differently at scale.
  • No streaming metrics for V100: V100 TTFT was not measured. A30 TTFT is available in Section 5.
  • Temperature=0: Greedy decoding. Sampling (temperature > 0) adds overhead that is not captured here.

16.4 Stack Coverage Gaps

  • V100 is vLLM-only. TGI, TensorRT-LLM, and Triton's latest containers reject Volta GPUs entirely. We did not test older container versions that may still support V100 — this is a deliberate choice to benchmark current-generation software.
  • No quantization tested. INT8, INT4 (GPTQ/AWQ), and FP8 quantization can dramatically change the throughput/VRAM tradeoffs. These were out of scope.
  • No speculative decoding, prefix caching, or chunked prefill — advanced vLLM features that can significantly boost real-world performance were not enabled.

16.5 Hardware Limitations

  • PCIe interconnect only. Both clusters use PCIe, not NVLink. Multi-GPU scaling results would differ significantly with NVLink (expected 10–20% better scaling efficiency).
  • No CPU/memory profiling. CPU utilization and system memory metrics were not systematically collected. GPU metrics come from nvidia-smi sampled at 1-second intervals, which may miss sub-second spikes.
  • Thermal throttling not controlled for. Extended benchmark runs on V100 may trigger thermal throttling, particularly on the 4-GPU configuration. No thermal monitoring beyond nvidia-smi power readings was performed.

16.6 Cost Model Simplification

Cost per 1M tokens is calculated using instance-level hourly rates (INR 360/hr for A30, INR 400/hr for V100) and measured throughput. This does not account for:

  • Server startup/model loading time (can be 2–10 minutes per model)
  • Idle time between requests in production
  • Network transfer costs
  • Storage costs for model weights

16.7 Missing Configurations

Gap | Reason | Impact
Llama-2-70B on V100 | OOM even at 4×32 GB (140 GB FP16 > 128 GB) | Cannot compare largest model on V100
Llama-2-70B BF16 on A30 | OOM at 4×24 GB (140 GB > 96 GB) | Only FP16 tested for 70B on A30
DeepSeek-8B @ 1 GPU (V100) | vLLM v1 engine crash (RuntimeError) | Missing single-GPU baseline for this model
DeepSeek-14B @ 1 GPU (V100) | vLLM v1 engine crash | Missing single-GPU baseline
DeepSeek-32B @ ≤2 GPUs (V100) | OOM / engine crash | Only 4-GPU data available
All TGI on V100 | Container built for sm_80+ only | No cross-stack comparison possible on V100
All TRT-LLM on V100 | Container rejects Volta GPUs | No TRT-LLM data on V100
All Triton on V100 | Container rejects Volta GPUs | No Triton data on V100
TRT-LLM large models on A30 | Engine build OOM for ≥13B models | TRT-LLM data on A30 stops at 14B (Section 3); no 32B or 70B results

17. V100 Supplementary: Multi-Run Variance & Sequence Length Analysis

126 benchmark runs: 3 independent runs × 3 sequence lengths × 14 model+GPU configurations. All V100 vLLM FP16.

17.1 Throughput Variance (mean ± std, tok/s)

| Model | GPUs | Short (32+64) | Medium (128+256) | Long (512+512) |
| --- | --- | --- | --- | --- |
| Llama-2-7B | 1 | 169.0 ± 0.6 | 153.7 ± 0.0 | 121.4 ± 0.0 |
| Llama-2-7B | 2 | 267.4 ± 0.0 | 249.5 ± 0.0 | 204.1 ± 0.0 |
| Llama-2-7B | 4 | 424.7 ± 0.2 | 400.2 ± 0.1 | 337.5 ± 0.2 |
| Mistral-7B | 1 | 155.6 ± 0.1 | 152.8 ± 0.0 | 142.9 ± 0.0 |
| Mistral-7B | 2 | 240.3 ± 0.0 | 239.0 ± 0.0 | 227.5 ± 0.0 |
| Mistral-7B | 4 | 390.0 ± 0.2 | 389.3 ± 0.0 | 369.2 ± 0.4 |
| DeepSeek-8B | 2 | 228.5 ± 1.3 | 228.6 ± 0.1 | 218.1 ± 0.0 |
| DeepSeek-8B | 4 | 373.2 ± 2.3 | 373.2 ± 0.5 | 355.3 ± 0.2 |
| Llama-2-13B | 1 | 82.6 ± 0.1 | 77.4 ± 0.0 | 64.8 ± 0.1 |
| Llama-2-13B | 2 | 144.2 ± 0.5 | 136.6 ± 0.0 | 115.0 ± 0.0 |
| Llama-2-13B | 4 | 239.6 ± 0.8 | 228.6 ± 0.1 | 196.6 ± 0.1 |
| DeepSeek-14B | 2 | 134.4 ± 0.5 | 134.5 ± 0.0 | 129.1 ± 0.1 |
| DeepSeek-14B | 4 | 210.3 ± 0.3 | 210.1 ± 0.0 | 201.3 ± 0.2 |
| DeepSeek-32B | 4 | 109.4 ± 0.0 | 110.0 ± 0.0 | 107.0 ± 0.0 |

Variance is negligible. Standard deviation is < 1% of the mean in all 42 combinations of model, GPU count, and sequence length. The maximum std observed is 2.3 tok/s (DeepSeek-8B, 4 GPUs, short sequences), just 0.6% of the mean. This indicates that vLLM produces highly reproducible results on V100 under controlled conditions. It also suggests, though it does not prove, that the single-run A30 data in Section 3 is reliable, since no multi-run validation was performed on A30.
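
The variance claim can be checked with a short calculation of the relative standard deviation (std divided by mean). This is a sketch, not the benchmark's own analysis script: the values are copied from the Short (32+64) column of the table in 17.1, which holds the largest std values, and the names `SHORT_RUNS` and `cv_percent` are chosen here for illustration.

```python
# (model, gpus, mean tok/s, std tok/s) for the Short (32+64) column in 17.1.
SHORT_RUNS = [
    ("Llama-2-7B", 1, 169.0, 0.6),
    ("Llama-2-7B", 2, 267.4, 0.0),
    ("Llama-2-7B", 4, 424.7, 0.2),
    ("Mistral-7B", 1, 155.6, 0.1),
    ("Mistral-7B", 2, 240.3, 0.0),
    ("Mistral-7B", 4, 390.0, 0.2),
    ("DeepSeek-8B", 2, 228.5, 1.3),
    ("DeepSeek-8B", 4, 373.2, 2.3),
    ("Llama-2-13B", 1, 82.6, 0.1),
    ("Llama-2-13B", 2, 144.2, 0.5),
    ("Llama-2-13B", 4, 239.6, 0.8),
    ("DeepSeek-14B", 2, 134.4, 0.5),
    ("DeepSeek-14B", 4, 210.3, 0.3),
    ("DeepSeek-32B", 4, 109.4, 0.0),
]


def cv_percent(mean: float, std: float) -> float:
    """Relative standard deviation (coefficient of variation) in percent."""
    return std / mean * 100.0


# Find the configuration with the largest relative spread.
worst = max(SHORT_RUNS, key=lambda r: cv_percent(r[2], r[3]))
worst_cv = cv_percent(worst[2], worst[3])
print(f"largest relative std: {worst[0]} @ {worst[1]} GPUs = {worst_cv:.2f}%")
```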

17.2 Throughput Degradation: Short to Long Sequences

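| Model | 4 GPUs Short | 4 GPUs Long | Drop |
| --- | --- | --- | --- |
| Llama-2-7B | 424.7 | 337.5 | −20.5% |
| Mistral-7B | 390.0 | 369.2 | −5.3% |
| DeepSeek-8B | 373.2 | 355.3 | −4.8% |
| Llama-2-13B | 239.6 | 196.6 | −17.9% |
| DeepSeek-14B | 210.3 | 201.3 | −4.3% |
| DeepSeek-32B | 109.4 | 107.0 | −2.2% |
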
  • Llama models degrade 18–21% — standard attention scales quadratically with sequence length.
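  • Mistral-7B drops only 5.3%. Its 4096-token sliding window is never reached at these lengths (at most 1,024 tokens per request), so the smaller drop more likely comes from grouped-query attention, which keeps the KV cache much smaller than Llama-2's multi-head attention does.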
  • DeepSeek models also degrade < 5% — their architecture handles longer contexts efficiently.
  • 32B drops only 2.2% — at this model size, compute is dominated by feed-forward layers rather than attention.
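
The Drop column can be reproduced directly from the short and long throughput means. The sketch below assumes the drop is the relative change from short to long, which matches every row of the table; the names `THROUGHPUT` and `drop_pct` are illustrative.

```python
# Mean throughput (tok/s) at 4 GPUs: (short, long), copied from the table in 17.2.
THROUGHPUT = {
    "Llama-2-7B": (424.7, 337.5),
    "Mistral-7B": (390.0, 369.2),
    "DeepSeek-8B": (373.2, 355.3),
    "Llama-2-13B": (239.6, 196.6),
    "DeepSeek-14B": (210.3, 201.3),
    "DeepSeek-32B": (109.4, 107.0),
}


def drop_pct(short_tps: float, long_tps: float) -> float:
    """Relative throughput change from short to long sequences, in percent."""
    return (long_tps - short_tps) / short_tps * 100.0


drops = {model: round(drop_pct(s, l), 1) for model, (s, l) in THROUGHPUT.items()}
for model, drop in drops.items():
    print(f"{model:<13} {drop:+.1f}%")
```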

17.3 Latency at Different Sequence Lengths (p50, ms, 4 GPUs)

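| Model | Short (32+64) | Medium (128+256) | Long (512+512) |
| --- | --- | --- | --- |
| Llama-2-7B | 603 | 2,559 | 6,068 |
| Mistral-7B | 656 | 2,630 | 5,547 |
| DeepSeek-8B | 684 | 2,744 | 5,763 |
| Llama-2-13B | 1,066 | 4,479 | 10,415 |
| DeepSeek-14B | 1,217 | 4,874 | 10,170 |
| DeepSeek-32B | 2,340 | 9,305 | 19,132 |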

For interactive use (< 2 s p50 latency on V100), only short sequences are viable, and only for models up to 14B (0.6–1.2 s); DeepSeek-32B already takes 2.3 s. Medium sequences push every model to 2.6–9.3 s. Long sequences are batch-only across the board.
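
To make the 2-second cutoff concrete, the sketch below filters the p50 values from the table in 17.3 against a latency budget. The 2,000 ms budget comes from the text; the names `P50_MS` and `within_budget` are hypothetical helpers, and a different budget can be passed in to test other targets.

```python
# p50 latency in ms at 4 GPUs on V100, copied from the table in 17.3.
P50_MS = {
    "Llama-2-7B": {"short": 603, "medium": 2559, "long": 6068},
    "Mistral-7B": {"short": 656, "medium": 2630, "long": 5547},
    "DeepSeek-8B": {"short": 684, "medium": 2744, "long": 5763},
    "Llama-2-13B": {"short": 1066, "medium": 4479, "long": 10415},
    "DeepSeek-14B": {"short": 1217, "medium": 4874, "long": 10170},
    "DeepSeek-32B": {"short": 2340, "medium": 9305, "long": 19132},
}


def within_budget(table: dict, budget_ms: float) -> dict:
    """Map each sequence length to the models whose p50 is under the budget."""
    return {
        seq: [model for model, row in table.items() if row[seq] < budget_ms]
        for seq in ("short", "medium", "long")
    }


interactive = within_budget(P50_MS, 2000)
for seq, models in interactive.items():
    print(f"{seq:<6} -> {', '.join(models) if models else 'none'}")
```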


Appendix A: Raw Data Files

| File | Description |
| --- | --- |
| metrics_ALL_STACKS_a30_final.csv | A30 v1 complete dataset (125 rows) |
| metrics_a30_v2.csv | A30 v2 TRT-LLM retry + TTFT data (48+ rows, includes ttft_p50/p90/p99 columns) |
| metrics_v100_final.csv | V100 complete dataset (152 rows) |
| metrics_v100_multirun.csv | V100 multi-run × multi-seq supplementary (126 rows) |

CSV Schema

run_id, stack, model, hardware, gpu_count, precision, batch_size,
prompt_len, gen_len, tokens_generated, runtime_sec, tokens_per_sec,
p50_ms, p90_ms, p99_ms, peak_vram_mb_per_gpu, avg_vram_mb_per_gpu,
gpu_util_pct_avg, power_w_avg, cpu_util_pct, run_start, run_end,
docker_image, notes, cost_per_1M_tokens
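
A possible way to load and sanity-check rows in this schema is sketched below, using only the Python standard library. Two things are assumptions rather than facts from the dataset: the sample row is made up for illustration, and the check assumes `tokens_per_sec` equals `tokens_generated / runtime_sec`, which the schema suggests but does not state.

```python
import csv
import io

COLUMNS = [
    "run_id", "stack", "model", "hardware", "gpu_count", "precision",
    "batch_size", "prompt_len", "gen_len", "tokens_generated", "runtime_sec",
    "tokens_per_sec", "p50_ms", "p90_ms", "p99_ms", "peak_vram_mb_per_gpu",
    "avg_vram_mb_per_gpu", "gpu_util_pct_avg", "power_w_avg", "cpu_util_pct",
    "run_start", "run_end", "docker_image", "notes", "cost_per_1M_tokens",
]

# Hypothetical row for illustration only; not taken from the published CSVs.
sample = {
    "run_id": "demo-001", "stack": "vllm", "model": "Llama-2-7B",
    "hardware": "V100", "gpu_count": 4, "tokens_generated": 8000,
    "runtime_sec": 20.0, "tokens_per_sec": 400.0,
}

# Write the sample to an in-memory CSV; unspecified columns are left empty.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=COLUMNS, restval="")
writer.writeheader()
writer.writerow(sample)
buf.seek(0)


def throughput_consistent(row: dict, tol: float = 0.01) -> bool:
    """Check the assumed relation tokens_per_sec == tokens_generated / runtime_sec."""
    derived = float(row["tokens_generated"]) / float(row["runtime_sec"])
    return abs(derived - float(row["tokens_per_sec"])) <= tol * derived


rows = list(csv.DictReader(buf))
print(all(throughput_consistent(r) for r in rows))
```

To run this against a real file, replace the in-memory buffer with `open("metrics_v100_final.csv", newline="")`.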


Appendix B: Environment Versions & Docker Digests

A30 Server (decommissioned — exact versions not recoverable):
GPU: 4x NVIDIA A30 (24GB)
Driver: NVIDIA 550.x (approximate — server deleted before exact version recorded)
CUDA: 12.x (approximate — bundled with driver)
Docker images: same registry tags as V100, pulled ~1 week earlier

V100 Server:
GPU: 4x Tesla V100-PCIE-32GB
Driver: NVIDIA 580.126.09
CUDA: 13.0 (driver) / containers use their bundled CUDA

Docker Image Digests (for exact reproducibility):
vllm/vllm-openai:latest
sha256:2296a2a7e1ce1dc59c6577ba5900f4e9910b76c4a0cb134833a8137f92404dfa
Pulled: 2026-03-07

ghcr.io/huggingface/text-generation-inference:latest
sha256:e6b0af6e0bf65337b84a19f15d74660c7892192f555fb0b68d3f3d62bf0c1e9a
Pulled: 2026-01-08

nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc6
sha256:6f7842605abc44cb0f119bdb12e34cab2b1e6a4e39d2af43f16af644600b9bdd
Pulled: 2026-02-27

nvcr.io/nvidia/tritonserver:26.02-trtllm-python-py3
sha256:e29ed3221ac3d3cc4128cf0ecf4f172db361bb4f474de146c160d370b211e679
Pulled: 2026-02-14


Conclusion

This benchmark establishes a clear, data-driven answer to one of the most common infrastructure questions in LLM deployment today: should you run on NVIDIA A30 or Tesla V100?

Across every dimension that matters for production LLM inference (tokens per second, p90 latency, cost per million tokens, power efficiency, and serving stack compatibility), the A30 wins decisively. It delivers 24–35% higher throughput, 26–35% lower latency, and 18–34% lower cost per million tokens than V100, while supporting three serving stacks versus V100's one. The only scenario where V100 remains relevant is running DeepSeek-32B: once vLLM's KV cache overhead is added to the model weights, the A30 cluster's 96 GB of total VRAM is not enough, while the V100 cluster's 128 GB is. This is a narrow edge case.

Beyond hardware, this benchmark reveals a more structural finding: the LLM serving ecosystem has moved on from Volta. TGI, TensorRT-LLM, and Triton have all dropped sm_70 support in their current container releases. Teams still running V100 infrastructure are locked into vLLM as their only option — with no cross-stack comparison, no BF16, and no path to TensorRT-LLM's potential throughput ceiling.

For teams choosing a GPU for multi-GPU LLM serving in 2026, the decision is straightforward:

  • Use A30 with vLLM for the best combination of throughput, cost per million tokens, and serving stack flexibility.
  • Use FP16 — BF16 adds no measurable inference throughput benefit on Ampere at this scale.
  • Plan for 4 GPUs — PCIe scaling plateaus at 61–74% efficiency, but the absolute throughput gain from 1 to 4 GPUs is still 2.4–3x.
  • Avoid V100 for new deployments — the ecosystem is actively deprecating Volta support and the gap will only widen.

Frequently Asked Questions

Is NVIDIA A30 better than V100 for LLM inference?

Yes, across every tested metric. The A30 delivers 24–35% higher tokens per second, 26–35% lower p90 latency, and 18–34% lower cost per million tokens than the V100 on comparable models. It also supports three serving stacks (vLLM, TGI, TensorRT-LLM) versus V100's one (vLLM only). The only exception is DeepSeek-32B, which only fits on 4×V100 (128 GB) due to vLLM's KV cache overhead on A30's 96 GB.

Does vLLM support V100 (Tesla Volta GPUs)?

Yes — vLLM is the only major LLM serving stack that still supports V100 in its current release. TGI, TensorRT-LLM, and Triton's latest containers all reject Volta (sm_70) architecture with CUDA kernel or container-level errors. V100 deployments are effectively locked into vLLM.

What is the cost per million tokens for LLM inference on A30 vs V100?

At 4 GPUs with vLLM (FP16), A30 costs approximately INR 190/million tokens for Llama-2-7B vs INR 278 on V100 — a 32% saving. For Llama-2-13B, A30 costs INR 323 vs INR 486 on V100 — a 34% saving. A30's advantage comes from both a lower hourly rate (INR 360 vs 400/hr) and higher throughput from its Ampere architecture.

Is BF16 faster than FP16 for LLM inference on A30?

No. Across all models and serving stacks tested on A30 (Ampere, sm_80), BF16 and FP16 throughput differ by less than 1.1% — well within measurement noise. BF16 is not available on V100 at all (requires compute capability 8.0+, V100 is 7.0). Use FP16 for maximum hardware compatibility with no throughput penalty.

What is the multi-GPU scaling efficiency for LLM inference?

On both A30 and V100 over PCIe, scaling efficiency from 1 to 4 GPUs is 61–74%. Larger models scale better than smaller ones — Llama-2-13B achieves 73.9% scaling efficiency on V100 vs 61% for Llama-2-7B. NVLink interconnects would significantly improve these numbers; both clusters in this benchmark use PCIe only.
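
Scaling efficiency here can be read as the measured multi-GPU throughput divided by ideal linear scaling of the single-GPU throughput. That definition is an assumption, but it is consistent with the report's own figures: 61–74% efficiency on 4 GPUs corresponds to the 2.4–3x speedup quoted in the conclusion. The sketch below uses the medium-sequence V100 means from Section 17.1, which come from the supplementary multi-run dataset and so need not match the headline percentages exactly. The function names are illustrative.

```python
def scaling_efficiency(tps_1gpu: float, tps_ngpu: float, n_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved on n_gpus."""
    return tps_ngpu / (n_gpus * tps_1gpu)


def speedup(efficiency: float, n_gpus: int) -> float:
    """Absolute speedup over one GPU implied by a given efficiency."""
    return efficiency * n_gpus


# V100, medium sequences (128+256), from Section 17.1.
eff_13b = scaling_efficiency(77.4, 228.6, 4)
eff_7b = scaling_efficiency(153.7, 400.2, 4)
print(f"Llama-2-13B: {eff_13b:.1%}, Llama-2-7B: {eff_7b:.1%}")

# The 61-74% range maps onto the quoted 2.4-3x speedup at 4 GPUs.
print(f"{speedup(0.61, 4):.2f}x to {speedup(0.74, 4):.2f}x")
```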

Which LLM serving stack has the highest throughput on A30?

vLLM leads on A30 across all metrics — 8–16% faster than TRT-LLM (PyTorch backend) and 25–41% faster than TGI at 4 GPUs. The ranking is vLLM > TRT-LLM > TGI on every model tested. The vLLM advantage narrows for larger models (16% at 7B → 8% at 14B), since compute-bound workloads benefit less from vLLM's memory management optimizations.

Which stack has the lowest Time to First Token (TTFT)?

On A30 at 4 GPUs, vLLM has the lowest TTFT: 23ms for Llama-2-7B, vs 41ms for TGI and 63ms for TRT-LLM. vLLM's TTFT also improves as GPUs are added (54ms on 1 GPU → 23ms on 4 GPUs), while TGI's stays flat at ~41ms regardless of GPU count. All three stacks are under 90ms TTFT, comfortably within interactive range.

Can Llama-2-70B run on 4x A30 or 4x V100?

No. Llama-2-70B requires approximately 140 GB in FP16. Four A30 GPUs provide 96 GB total, and four V100 GPUs provide 128 GB — both fall short. This model is out of scope for both platforms benchmarked here.
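
The arithmetic behind this answer is a weights-only estimate: parameter count times bytes per parameter (2 bytes for FP16 or BF16). The sketch below applies it to both clusters. It deliberately ignores KV cache, activations, and framework overhead, so passing the check is necessary but not sufficient; the helper names are hypothetical.

```python
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes); FP16/BF16 = 2 bytes."""
    return params_billion * bytes_per_param


def weights_fit(params_billion: float, cluster_gb: float, bytes_per_param: int = 2) -> bool:
    """True if the weights alone fit in the cluster's total VRAM."""
    return weights_gb(params_billion, bytes_per_param) <= cluster_gb


# Total VRAM per cluster, from the GPU specs used in this benchmark.
CLUSTERS_GB = {"4x A30": 4 * 24, "4x V100": 4 * 32}

for name, total_gb in CLUSTERS_GB.items():
    verdict = "fits" if weights_fit(70, total_gb) else "does not fit"
    print(f"70B FP16 ({weights_gb(70):.0f} GB) on {name} ({total_gb} GB): {verdict}")
```

A 32B model passes this check on 96 GB (64 GB of weights), yet the report found that DeepSeek-32B does not run on 4x A30 once vLLM's KV cache is included, which shows why the estimate is only a lower bound.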
