Get ₹2,000 free credits to test your AI workloads
Sign up and complete ID verification to unlock free credits. Deploy on NVIDIA H200, H100, and L40S GPUs—no commitment required.
Overview
As LLMs move from research artifacts into production systems, the infrastructure question becomes as critical as the model question. Choosing the wrong GPU or serving stack doesn't just cost money — it determines which models you can run at all, how fast you can serve them, and whether your latency targets are even achievable.
Most benchmarks compare GPUs in isolation or focus on training efficiency. What matters in real deployment is the full picture: which stacks actually run on your hardware, how they compare in throughput and latency under concurrent load, and whether the cost tradeoffs justify the upgrade path.
This report benchmarks 7 LLMs ranging from 7B to 70B parameters across 4 serving stacks (vLLM, TGI, TensorRT-LLM, and Triton) on two GPU clusters: 4x NVIDIA A30 (24 GB) and 4x Tesla V100 (32 GB). The objective is to produce a complete, engineering-grade reference for teams deciding between these platforms for LLM inference workloads.
Key Findings
- vLLM is the only stack that runs on both A30 and V100
- A30 outperforms V100 by 24–35% in throughput despite less VRAM per GPU
- Stack ranking on A30: vLLM > TRT-LLM > TGI — vLLM leads TRT-LLM by 8–16% and TGI by 25–41%
- A30 is 18–34% cheaper per million tokens than V100
- BF16 offers negligible benefit over FP16 on A30 (differences of about 1% or less), and is completely unavailable on V100
- Scaling efficiency is 61–74% from 1 to 4 GPUs over PCIe
- vLLM TTFT improves dramatically with more GPUs (54ms → 23ms for 7B on 1→4 GPU); TGI stays flat at ~41ms
Quick Reference: A30 vs V100 at a Glance
| Metric | A30 (4 GPU, vLLM) | V100 (4 GPU, vLLM) | Winner |
|---|---|---|---|
| Best throughput (7B model) | 526.4 tok/s | 400.2 tok/s | A30 +32% |
| Best throughput (13B model) | 309.7 tok/s | 228.8 tok/s | A30 +35% |
| p90 latency (7B model) | 1,951 ms | 2,561 ms | A30 −24% |
| Cost per 1M tokens (7B) | INR 190 | INR 278 | A30 −32% |
| Serving stacks supported | 3 (vLLM, TGI, TRT-LLM) | 1 (vLLM only) | A30 |
| BF16 support | Yes | No | A30 |
| Max model size (4 GPU) | 14B (vLLM) / 32B (TGI) | 32B | V100 (for 32B) |
| Power efficiency | 4.12 tok/s/W | 3.41 tok/s/W | A30 +21% |
1. Test Environment
Hardware
| Spec | 4x A30 | 4x V100 |
|---|---|---|
| GPU | NVIDIA A30 | Tesla V100-PCIE |
| Architecture | Ampere (sm_80) | Volta (sm_70) |
| VRAM per GPU | 24 GB | 32 GB |
| Total VRAM | 96 GB | 128 GB |
| BF16 Support | Yes | No |
| Interconnect | PCIe | PCIe |
| Hourly Cost | INR 360/hr | INR 400/hr |
| Driver | ~550.x (approx)¹ | 580.126.09 |
¹ A30 server was decommissioned before exact driver version was recorded. Version estimated from CUDA compatibility matrix.
Software Stack
| Component | Image | Digest (sha256) | Pulled |
|---|---|---|---|
| OS | Ubuntu 22.04 LTS | — | — |
| vLLM | vllm/vllm-openai:latest | 2296a2a7e1ce… | 2026-03-07 |
| TGI | ghcr.io/huggingface/text-generation-inference:latest | e6b0af6e0bf6… | 2026-01-08 |
| TensorRT-LLM | nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc6 | 6f7842605abc… | 2026-02-27 |
| Triton | nvcr.io/nvidia/tritonserver:26.02-trtllm-python-py3 | e29ed3221ac3… | 2026-02-14 |
Note: :latest tags were pulled on the dates shown. Full digests are recorded in Appendix B for reproducibility.
Models Tested
| Model | Parameters | FP16 Size (approx) |
|---|---|---|
| meta-llama/Llama-2-7b-hf | 7B | ~14 GB |
| mistralai/Mistral-7B-v0.1 | 7B | ~14 GB |
| deepseek-ai/DeepSeek-R1-Distill-Llama-8B | 8B | ~16 GB |
| meta-llama/Llama-2-13b-hf | 13B | ~26 GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | 14B | ~28 GB |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 32B | ~64 GB |
| meta-llama/Llama-2-70b-hf | 70B | ~140 GB |
Benchmark Configuration
| Parameter | Value |
|---|---|
| Prompt length | 128 tokens |
| Generation length | 256 tokens |
| Batch size (concurrent clients) | 4 |
| Warmup | 30 seconds |
| Measurement window | 120 seconds |
| Sampling | Greedy (temperature=0.0) |
| GPU monitoring | nvidia-smi at 1s intervals |
2. Serving Stack Compatibility Matrix
A critical finding before any throughput numbers: most serving stacks no longer support V100. This is not a configuration issue — it's an architectural split driven by the industry's shift to Ampere as the minimum supported compute capability.
| Stack | A30 (Ampere) | V100 (Volta) | Notes |
|---|---|---|---|
| vLLM | WORKS | WORKS | Only stack supporting both architectures |
| TGI | WORKS | FAILS | CUDA error: no kernel image for sm_70 — compiled for Ampere+ only |
| TensorRT-LLM | WORKS | FAILS | Container prints "GPU not supported", engine build crashes |
| Triton | PARTIAL | FAILS | A30: version mismatch with TRT engines. V100: "No supported GPU detected" |
Triton on A30 failed due to a version mismatch: engines were compiled with TensorRT-LLM 1.3.0rc6, but the Triton 26.02 container only supports TRT-LLM backend v1.1.0. The compiled engines could not be deserialized, resulting in a C++ crash. Rebuilding engines with a matching compiler version would resolve this but was not attempted due to the 12-hour build time.
Implication: Organizations running V100 infrastructure are effectively locked into vLLM for LLM inference. TGI, TensorRT-LLM, and Triton have all moved to Ampere-or-newer as their minimum supported architecture. For V100 deployments, stack diversity is no longer an option: everything depends on a single stack, vLLM.

3. Throughput Results
Understanding the Empty Cells
Before reading the tables, here's why certain cells are empty:
| Symbol | Meaning | Explanation |
|---|---|---|
| OOM | Out of Memory | Model weights exceed available VRAM. E.g., 13B FP16 (~26GB) cannot fit on 1x A30 (24GB). |
| crash | Engine Crash | vLLM v1 engine initialization bug on Volta (V100) affecting DeepSeek models: DeepSeek-8B and 14B crash on 1 GPU, and DeepSeek-32B crashes on 2 GPUs. Each runs at the next larger GPU count, where tensor parallelism takes a different code path. |
| — | Not attempted / OOM | Configuration was skipped because the model physically cannot fit. vLLM also uses ~22–24GB per GPU for KV cache pre-allocation, so even 8B models fail on 1x A30 with vLLM (but succeed with TGI's lighter memory footprint). |
Why vLLM shows more "—" than TGI on A30: vLLM pre-allocates nearly all VRAM for its PagedAttention KV cache (22.6–24.1 GB per GPU). On A30's 24 GB GPUs, this leaves almost zero headroom after loading even a 7B model. TGI uses a more conservative memory strategy (~21 GB per GPU), allowing it to run models on fewer GPUs where vLLM cannot.
A30 vLLM Throughput: FP16 vs BF16 Across 1–4 GPUs
| Model | Precision | 1 GPU | 2 GPUs | 4 GPUs |
|---|---|---|---|---|
| Llama-2-7B | FP16 | 215.9 | 341.6 | 526.4 |
| Llama-2-7B | BF16 | 215.6 | 340.7 | 526.8 |
| Mistral-7B | FP16 | 199.4 | 324.1 | 497.8 |
| Mistral-7B | BF16 | 199.4 | 323.8 | 497.1 |
| DeepSeek-8B | FP16 | — (OOM) | 305.7 | 466.2 |
| DeepSeek-8B | BF16 | — (OOM) | 306.3 | 463.4 |
| Llama-2-13B | FP16 | — (OOM) | 188.0 | 309.7 |
| Llama-2-13B | BF16 | — (OOM) | 187.4 | 309.8 |
| DeepSeek-14B | FP16 | — (OOM) | — (OOM) | 267.8 |
| DeepSeek-14B | BF16 | — (OOM) | — (OOM) | 267.7 |
| DeepSeek-32B | FP16 | — (OOM) | — (OOM) | — (OOM) |
| DeepSeek-32B | BF16 | — (OOM) | — (OOM) | OOM Crash |
| Llama-2-70B | FP16/BF16 | — (OOM) | — (OOM) | — (OOM) |
Why 8B models can't run on 1 A30 GPU with vLLM, and 7B models barely fit: vLLM pre-allocates a large contiguous KV cache on startup. A 7B model in FP16 uses ~14 GB for weights, and vLLM reserves an additional ~9 GB for the KV cache, totaling ~23 GB. On the 24 GB A30, this leaves <1 GB of headroom, so any additional allocation triggers OOM. Only the two 7B models (Llama-2-7B and Mistral-7B) squeezed through; DeepSeek-8B, with ~16 GB of weights, did not.
Why 13B needs 2 GPUs and 14B needs 4 with vLLM: Llama-2-13B in FP16 requires ~26 GB just for weights, exceeding a single A30's 24 GB. On 2 GPUs (48 GB total), the 13B model fits (13 GB per GPU for weights + 10 GB KV cache). The 14B model is slightly larger (~14 GB per GPU on 2 GPUs), which leaves no room for that reservation, so it needs all 4 GPUs under vLLM.
Why 32B/70B fail even on 4 GPUs: DeepSeek-32B requires ~64 GB for weights. 4x A30 = 96 GB total, but vLLM's per-GPU overhead leaves insufficient room. The 70B model needs ~140 GB — no configuration on A30 (96 GB max) can accommodate it.
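The memory arithmetic in the three explanations above can be sketched in a few lines. This is a rough estimate only: it assumes 2 bytes per parameter for FP16/BF16 and an even split of weights across tensor-parallel ranks, and the function names are illustrative rather than part of any serving stack.

```python
def fp16_weights_gb(params_billion: float) -> float:
    """Approximate FP16/BF16 weight size: 2 bytes per parameter."""
    return params_billion * 2.0


def weights_per_gpu_gb(params_billion: float, tensor_parallel: int) -> float:
    """Weight shard per GPU, assuming an even split across tensor-parallel ranks."""
    return fp16_weights_gb(params_billion) / tensor_parallel


def headroom_gb(params_billion: float, tensor_parallel: int, vram_gb: float) -> float:
    """VRAM left per GPU after weights, before KV cache and runtime overhead."""
    return vram_gb - weights_per_gpu_gb(params_billion, tensor_parallel)


if __name__ == "__main__":
    models = [("Llama-2-7B", 7), ("Llama-2-13B", 13),
              ("DeepSeek-32B", 32), ("Llama-2-70B", 70)]
    for name, size_b in models:
        for tp in (1, 2, 4):
            left = headroom_gb(size_b, tp, 24.0)  # 24 GB per A30
            print(f"{name} on {tp}x A30: {left:+.1f} GB left per GPU after weights")
```

A negative result means the weights alone do not fit. A small positive result (for example, 8 GB per GPU for the 32B model on 4x A30) still has to cover the ~9–10 GB KV cache reservation quoted above, which is consistent with the OOM entries in the table.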
A30 TGI Throughput: FP16 vs BF16 Across 1–4 GPUs
| Model | Precision | 1 GPU | 2 GPUs | 4 GPUs |
|---|---|---|---|---|
| Llama-2-7B | FP16 | 193.2 | 273.5 | 374.3 |
| Llama-2-7B | BF16 | 193.6 | 269.6 | 374.4 |
| Mistral-7B | FP16 | 179.4 | 254.9 | 359.2 |
| Mistral-7B | BF16 | 182.1 | 262.0 | 363.2 |
| DeepSeek-8B | FP16 | 169.8 | 242.6 | 340.1 |
| DeepSeek-8B | BF16 | 172.8 | 247.8 | 341.1 |
| Llama-2-13B | FP16 | — (OOM) | 163.8 | 245.2 |
| Llama-2-13B | BF16 | — (OOM) | 164.0 | 243.6 |
| DeepSeek-14B | FP16 | — (OOM) | 148.5 | 214.9 |
| DeepSeek-14B | BF16 | — (OOM) | 148.2 | 215.3 |
| DeepSeek-32B | FP16 | — (OOM) | — (OOM) | 120.8 |
| DeepSeek-32B | BF16 | — (OOM) | — (OOM) | 120.7 |
| Llama-2-70B | FP16/BF16 | — (OOM) | — (OOM) | — (OOM) |
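TGI runs DeepSeek-8B on 1 GPU where vLLM cannot. TGI's memory footprint is ~21 GB per GPU vs vLLM's ~23 GB, giving it just enough headroom to load an 8B model on a single 24 GB A30 (the 7B models run on 1 GPU under both stacks). TGI also fits DeepSeek-14B on 2 GPUs and DeepSeek-32B on 4 GPUs, which vLLM cannot. However, 13B+ models still exceed single-GPU capacity on both stacks.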
A30 TensorRT-LLM Throughput: FP16 and BF16 (4 GPUs only)
Tested using trtllm-serve (TRT-LLM 1.3.0rc6, PyTorch backend). Engine build and serving handled automatically from HuggingFace checkpoints. Only 4-GPU configs attempted — the PyTorch backend requires all GPUs to be specified upfront.
| Model | FP16 (tok/s) | BF16 (tok/s) |
|---|---|---|
| Llama-2-7B | 453.0 | 451.9 |
| Mistral-7B | 438.2 | 438.0 |
| DeepSeek-8B | 419.2 | 420.1 |
| Llama-2-13B | 278.0 | 278.6 |
| DeepSeek-14B | 248.4 | 248.3 |
Note: --backend pytorch mode does not fully compile optimized TensorRT engines. The dedicated TRT engine compilation path (manual build) would likely close the gap with vLLM further but was out of scope here.
A30 Three-Stack Head-to-Head (4 GPUs, FP16)
| Model | vLLM (tok/s) | TRT-LLM (tok/s) | TGI (tok/s) | Fastest |
|---|---|---|---|---|
| Llama-2-7B | 526.4 | 453.0 | 374.3 | vLLM (+16% over TRT) |
| Mistral-7B | 497.8 | 438.2 | 359.2 | vLLM (+14% over TRT) |
| DeepSeek-8B | 466.2 | 419.2 | 340.1 | vLLM (+11% over TRT) |
| Llama-2-13B | 309.7 | 278.0 | 245.2 | vLLM (+11% over TRT) |
| DeepSeek-14B | 267.8 | 248.4 | 214.9 | vLLM (+8% over TRT) |
Ranking: vLLM > TRT-LLM > TGI on all models.
vLLM leads TRT-LLM by 8–16% and TGI by 25–41%. The gap between vLLM and TRT-LLM narrows for larger models (16% at 7B → 8% at 14B), suggesting compute-bound workloads benefit less from vLLM's memory management advantages. TRT-LLM's PyTorch backend sits between the two: 13–23% faster than TGI, but slower than vLLM with its PagedAttention and continuous batching.
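The percentage leads quoted for the head-to-head table follow directly from the throughput columns. A minimal sketch of that arithmetic, using the 4-GPU FP16 numbers from the table (the helper name `lead_pct` is illustrative):

```python
# 4-GPU FP16 throughput on A30 (tok/s), copied from the head-to-head table.
A30_4GPU_FP16 = {
    "Llama-2-7B":   {"vLLM": 526.4, "TRT-LLM": 453.0, "TGI": 374.3},
    "Mistral-7B":   {"vLLM": 497.8, "TRT-LLM": 438.2, "TGI": 359.2},
    "DeepSeek-8B":  {"vLLM": 466.2, "TRT-LLM": 419.2, "TGI": 340.1},
    "Llama-2-13B":  {"vLLM": 309.7, "TRT-LLM": 278.0, "TGI": 245.2},
    "DeepSeek-14B": {"vLLM": 267.8, "TRT-LLM": 248.4, "TGI": 214.9},
}


def lead_pct(faster: float, slower: float) -> float:
    """How much faster `faster` is than `slower`, in percent."""
    return (faster / slower - 1.0) * 100.0


if __name__ == "__main__":
    for model, t in A30_4GPU_FP16.items():
        print(f"{model}: vLLM vs TRT-LLM {lead_pct(t['vLLM'], t['TRT-LLM']):+.0f}%, "
              f"vLLM vs TGI {lead_pct(t['vLLM'], t['TGI']):+.0f}%, "
              f"TRT-LLM vs TGI {lead_pct(t['TRT-LLM'], t['TGI']):+.0f}%")
```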

V100 LLM Inference Throughput: vLLM Only (FP16, tokens/sec)
Only vLLM supports V100. TGI, TensorRT-LLM, and Triton's latest versions reject V100 hardware. BF16 is not supported on V100 (Volta architecture, compute capability 7.0 — BF16 requires 8.0+).
| Model | 1 GPU | 2 GPUs | 4 GPUs |
|---|---|---|---|
| Llama-2-7B | 153.8 | 250.0 | 400.2 |
| Mistral-7B | 152.8 | 239.1 | 388.8 |
| DeepSeek-8B | crash | 228.5 | 374.5 |
| Llama-2-13B | 77.4 | 136.9 | 228.8 |
| DeepSeek-14B | crash | 133.7 | 210.0 |
| DeepSeek-32B | OOM | crash | 110.0 |
| Llama-2-70B | OOM | OOM | OOM |
V100-specific notes:
- 7B models run on 1 GPU on both platforms, but V100's 32 GB (vs A30's 24 GB) leaves more headroom for vLLM's KV cache, enough that even Llama-2-13B fits on a single V100 (it OOMs on a single A30).
- DeepSeek-8B and 14B crash on 1 GPU due to a vLLM v1 engine initialization bug specific to DeepSeek's grouped-query attention on Volta (sm_70). The same models work on 2+ GPUs because tensor parallelism uses a different multi-process code path that avoids the buggy kernel.
- DeepSeek-32B crashes on 2 GPUs — same vLLM engine bug, not a VRAM issue (2×32 GB = 64 GB is enough for 32B weights).
- Llama-2-70B OOMs everywhere — 70B in FP16 needs ~140 GB, 4×V100 = 128 GB. Neither platform can accommodate it.
- Llama-2-13B on 1 GPU shows 77.4 tok/s with 13.2s p90 latency — the model barely fits in 32 GB and is compute-bound on a single V100 Tensor Core. Usable for batch processing but not interactive inference.
A30 vs V100 Cross-Hardware Throughput Comparison (vLLM FP16, 4 GPUs)
| Model | A30 (tok/s) | V100 (tok/s) | V100/A30 Ratio | Winner |
|---|---|---|---|---|
| Llama-2-7B | 526.4 | 400.2 | 0.76x | A30 by 32% |
| Mistral-7B | 497.8 | 388.8 | 0.78x | A30 by 28% |
| DeepSeek-8B | 466.2 | 374.5 | 0.80x | A30 by 24% |
| Llama-2-13B | 309.7 | 228.8 | 0.74x | A30 by 35% |
| DeepSeek-14B | 267.8 | 210.0 | 0.78x | A30 by 28% |
| DeepSeek-32B | — | 110.0 | — | V100 (A30 OOM) |
A30 is 24–35% faster than V100 on every comparable model, despite having 8 GB less VRAM per GPU. Three factors explain this:
- Ampere Tensor Cores (3rd gen) are architecturally superior to Volta's 1st-gen Tensor Cores, with higher FP16 FLOPS per clock cycle.
- Memory bandwidth efficiency: A30's HBM2e delivers better effective bandwidth than V100's HBM2, even though raw bandwidth specs are similar.
- CUDA kernel optimization: Modern frameworks (vLLM, PyTorch) have Ampere-specific kernel paths that exploit sm_80 features like async memory copy and hardware-accelerated attention.
The one exception under vLLM: DeepSeek-32B. Only V100 can run it with vLLM (on 4 GPUs) because 4×32 GB = 128 GB > 64 GB needed. A30's 96 GB total is technically sufficient for the weights, but vLLM's KV cache overhead pushes it over the edge. (On A30, TGI does serve the 32B model on 4 GPUs at 120.8 tok/s.)

4. Latency Results (p50 / p90 / p99, milliseconds)
Lower is better. p50 = median, p90 = 90th percentile, p99 = 99th percentile. All values in milliseconds.
Note on latency variance: with one exception (DeepSeek-14B on 2x V100, discussed below), p50 and p90 are within 10 ms of each other, and p99 is within 20 ms of p90. This extremely tight distribution is due to greedy decoding (temperature=0.0) with a fixed prompt: every request does essentially identical compute work. In production with variable prompts, temperature > 0, and mixed batch sizes, expect significantly wider p50-p99 spreads.
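For readers reproducing the percentile columns from raw per-request latencies, here is a minimal sketch. It assumes the nearest-rank definition of a percentile; the report does not state which definition its tooling used, so small differences against interpolating implementations are possible. The sample values are synthetic, not measured.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Synthetic end-to-end latencies in milliseconds (illustrative only).
    demo_ms = [1940 + i for i in range(20)]
    p50, p90, p99 = (percentile(demo_ms, p) for p in (50, 90, 99))
    print(f"p50={p50} ms  p90={p90} ms  p99={p99} ms  spread={p99 - p50} ms")
```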

A30 vLLM Inference Latency: p50, p90, p99 at 4 GPUs
| Model | p50 | p90 | p99 | Spread (p99-p50) | Interpretation |
|---|---|---|---|---|---|
| Llama-2-7B | 1,944 | 1,951 | 1,960 | 16ms | ~2s — good for interactive |
| Mistral-7B | 2,057 | 2,062 | 2,065 | 8ms | ~2s — good for interactive |
| DeepSeek-8B | 2,196 | 2,201 | 2,206 | 10ms | ~2.2s — acceptable for interactive |
| Llama-2-13B | 3,307 | 3,311 | 3,315 | 8ms | ~3.3s — borderline for interactive |
| DeepSeek-14B | 3,823 | 3,832 | 3,837 | 14ms | ~3.8s — batch processing recommended |
A30 vLLM Latency Scaling: 1 GPU to 4 GPUs
| Model | Metric | 1 GPU | 2 GPUs | 4 GPUs |
|---|---|---|---|---|
| Llama-2-7B | p50/p90/p99 | 4,747 / 4,749 / 4,750 | 2,998 / 2,999 / 3,005 | 1,944 / 1,951 / 1,960 |
| Mistral-7B | p50/p90/p99 | 5,135 / 5,137 / 5,138 | 3,159 / 3,163 / 3,164 | 2,057 / 2,062 / 2,065 |
| DeepSeek-8B | p50/p90/p99 | OOM | 3,347 / 3,351 / 3,355 | 2,196 / 2,201 / 2,206 |
| Llama-2-13B | p50/p90/p99 | OOM | 5,454 / 5,458 / 5,461 | 3,307 / 3,311 / 3,315 |
| DeepSeek-14B | p50/p90/p99 | OOM | OOM | 3,823 / 3,832 / 3,837 |
A30 TGI Latency vs vLLM: p50, p90, p99 at 4 GPUs
| Model | p50 | p90 | p99 | vs vLLM p90 |
|---|---|---|---|---|
| Llama-2-7B | 2,746 | 2,750 | 2,754 | TGI 41% slower |
| Mistral-7B | 2,861 | 2,867 | 2,872 | TGI 39% slower |
| DeepSeek-8B | 3,019 | 3,025 | 3,031 | TGI 37% slower |
| Llama-2-13B | 4,202 | 4,208 | 4,213 | TGI 27% slower |
| DeepSeek-14B | 4,789 | 4,797 | 4,803 | TGI 25% slower |
| DeepSeek-32B | 8,491 | 8,499 | 8,507 | vLLM OOM |
V100 vLLM Latency Scaling: 1 GPU to 4 GPUs
| Model | Metric | 1 GPU | 2 GPUs | 4 GPUs |
|---|---|---|---|---|
| Llama-2-7B | p50/p90/p99 | 6,658 / 6,659 / 6,669 | 4,096 / 4,097 / 4,107 | 2,559 / 2,561 / 2,573 |
| Mistral-7B | p50/p90/p99 | 6,706 / 6,708 / 6,718 | 4,284 / 4,287 / 4,296 | 2,634 / 2,635 / 2,644 |
| DeepSeek-8B | p50/p90/p99 | crash | 4,482 / 4,486 / 4,490 | 2,734 / 2,736 / 2,744 |
| Llama-2-13B | p50/p90/p99 | 13,228 / 13,231 / 13,237 | 7,482 / 7,484 / 7,494 | 4,475 / 4,478 / 4,487 |
| DeepSeek-14B | p50/p90/p99 | crash | 7,616 / 7,619 / 8,367 | 4,875 / 4,877 / 4,885 |
| DeepSeek-32B | p50/p90/p99 | OOM | crash | 9,310 / 9,316 / 9,326 |
Anomaly: DeepSeek-14B on 2x V100 shows p99=8,367 ms vs p90=7,619 ms — a 748 ms gap. All other configs have < 20 ms p99-p90 spread. This suggests an occasional long-tail stall, possibly from KV cache reallocation at near-full VRAM (31,542 MB used out of 32,768 MB).
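A long-tail check like the one described above is easy to automate. The sketch below scans the V100 latency table for configurations whose p99 exceeds p90 by more than a threshold; the 20 ms default mirrors the spread quoted in the text, and the function name `tail_gaps` is illustrative.

```python
# (model, gpu_count) -> (p50, p90, p99) in ms, copied from the V100 vLLM table.
V100_LATENCY_MS = {
    ("Llama-2-7B", 1): (6658, 6659, 6669),
    ("Llama-2-7B", 2): (4096, 4097, 4107),
    ("Llama-2-7B", 4): (2559, 2561, 2573),
    ("Mistral-7B", 1): (6706, 6708, 6718),
    ("Mistral-7B", 2): (4284, 4287, 4296),
    ("Mistral-7B", 4): (2634, 2635, 2644),
    ("DeepSeek-8B", 2): (4482, 4486, 4490),
    ("DeepSeek-8B", 4): (2734, 2736, 2744),
    ("Llama-2-13B", 1): (13228, 13231, 13237),
    ("Llama-2-13B", 2): (7482, 7484, 7494),
    ("Llama-2-13B", 4): (4475, 4478, 4487),
    ("DeepSeek-14B", 2): (7616, 7619, 8367),
    ("DeepSeek-14B", 4): (4875, 4877, 4885),
    ("DeepSeek-32B", 4): (9310, 9316, 9326),
}


def tail_gaps(results, threshold_ms=20):
    """Return {config: p99 - p90} for configs whose tail gap exceeds the threshold."""
    return {
        cfg: p99 - p90
        for cfg, (_p50, p90, p99) in results.items()
        if p99 - p90 > threshold_ms
    }


if __name__ == "__main__":
    for (model, gpus), gap in tail_gaps(V100_LATENCY_MS).items():
        print(f"{model} on {gpus}x V100: p99 exceeds p90 by {gap} ms")
```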
Key latency observations:
- V100 latency is 24–35% higher than A30 at equivalent 4-GPU configurations (vLLM, p90).
- Llama-2-13B on 1x V100 has a 13.2-second p90 — effectively unusable for real-time serving. With 4 GPUs it drops to 4.5s, still only suitable for batch/async workloads.
- For interactive use (< 3s p90), you need 4 GPUs on either platform, and only 7B/8B models qualify (p90 of 2.0–2.2s on 4x A30, 2.6–2.7s on 4x V100). Llama-2-7B on 2x A30 is borderline at 2,999 ms.
- 32B models have 8–9s latency even on 4 GPUs — only viable for offline/batch inference on both platforms.
- Latency variance is negligible (< 1% p50-to-p99 spread in every configuration except DeepSeek-14B on 2x V100) under our controlled test conditions. Production deployments with variable inputs will see much higher variance.
5. Time to First Token (TTFT)
TTFT measures how quickly the first token is returned after a request is sent, which makes it the most important latency metric for interactive/chat applications. The end-to-end latency reported in Section 4 is dominated by generation time, not prefill. TTFT isolates prefill cost. Measured via streaming responses on the A30 v2 benchmark run.
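As an illustration of how TTFT can be separated from end-to-end latency on a streaming response, the sketch below times the arrival of the first chunk from any iterable of streamed chunks. It is not the harness used for this report: `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` merely stands in for a real server stream with one prefill delay followed by evenly spaced tokens.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_ms, total_ms, chunk_count) for a stream of chunks."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter()  # first token arrived: prefill is done
        count += 1
    end = time.perf_counter()
    if first is None:
        raise ValueError("stream produced no chunks")
    return (first - start) * 1000.0, (end - start) * 1000.0, count


def fake_stream(prefill_s: float = 0.02, per_token_s: float = 0.002,
                tokens: int = 16) -> Iterator[str]:
    """Stand-in for a server stream (assumption, not a real client)."""
    time.sleep(prefill_s)
    for i in range(tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, n = measure_stream(fake_stream())
    print(f"TTFT {ttft:.1f} ms, end-to-end {total:.1f} ms over {n} chunks")
```

With a real client, the iterable would be the chunk iterator of a streaming HTTP response; the timing logic stays the same.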
A30 — vLLM TTFT (FP16, milliseconds)
| Model | 1 GPU | 2 GPUs | 4 GPUs |
|---|---|---|---|
| Llama-2-7B | 54 | 34 | 23 |
| Mistral-7B | 58 | 36 | 24 |
| DeepSeek-8B | — | 38 | 25 |
| Llama-2-13B | — | 62 | 38 |
| DeepSeek-14B | — | — | 43 |
A30 — TGI TTFT (FP16, milliseconds)
| Model | 1 GPU | 2 GPUs | 4 GPUs |
|---|---|---|---|
| Llama-2-7B | 42 | 41 | 41 |
| Mistral-7B | 42 | 41 | 41 |
| DeepSeek-8B | 50 | 41 | 41 |
| Llama-2-13B | — | 42 | 42 |
| DeepSeek-14B | — | 51 | 50 |
A30 — TensorRT-LLM TTFT (FP16, 4 GPUs only, milliseconds)
| Model | 4 GPUs |
|---|---|
| Llama-2-7B | 63 |
| Mistral-7B | 62 |
| DeepSeek-8B | 63 |
| Llama-2-13B | 71 |
| DeepSeek-14B | 90 |
Key TTFT observations:
- vLLM TTFT scales dramatically with more GPUs — 54ms → 23ms for Llama-2-7B (1→4 GPU). Tensor parallelism splits the prefill computation, directly reducing time to first token.
- TGI TTFT stays flat at ~41ms regardless of GPU count. TGI's prefill doesn't scale with parallelism, suggesting a fixed scheduling overhead.
- At 4 GPUs, vLLM wins on TTFT (23ms vs 41ms TGI vs 63ms TRT-LLM for 7B). vLLM's prefill is 2.7x faster than TRT-LLM's PyTorch backend — a surprising reversal given TRT-LLM's reputation for optimized inference.
- All TTFT values are at or under 90ms, well within the acceptable range for interactive applications. End-to-end latency (2–10 seconds) is dominated by generation, not prefill.
- BF16 TTFT is identical to FP16 (< 1ms difference), consistent with throughput findings.
6. VRAM Utilization
A30 (24 GB per GPU)
| Stack | Peak VRAM per GPU | Notes |
|---|---|---|
| vLLM | 22,639 – 24,091 MB | Uses 93–100% of available VRAM. Aggressive memory pre-allocation. |
| TGI | 21,061 – 21,721 MB | Uses 87–90%. More conservative, allows single-GPU 8B runs that vLLM cannot fit. |
V100 (32 GB per GPU)
| Config | Peak VRAM per GPU | Notes |
|---|---|---|
| 7B models | 30,078 – 31,010 MB | 92–95% utilization |
| 13B models | 30,036 – 31,288 MB | 92–96% utilization |
| 14B models | 31,542 – 31,760 MB | 97% utilization |
| 32B (4 GPU) | 32,342 MB | 99% — running at the absolute limit |
vLLM pre-allocates nearly all available VRAM for KV cache, which maximizes throughput but means VRAM headroom is minimal. The 32B model on V100 uses 32,342 out of 32,768 MB — just 426 MB of headroom. Any production variation in prompt length could cause OOM events.
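Peak VRAM and headroom figures like these can be derived from per-second `nvidia-smi` samples. The sketch below parses CSV text in the shape produced by `nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits -l 1`. The exact invocation used for this report is not documented, so treat the field list as an assumption. In the sample, only the 32,342 / 32,768 MB pair comes from the text; the other values are illustrative.

```python
import csv
import io


def summarize(csv_text: str) -> dict:
    """Per-GPU peak memory, headroom, and mean utilization from sampled rows
    of the form: index, memory.used (MiB), memory.total (MiB), utilization (%)."""
    stats = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines between samples
        idx, used, total, util = (field.strip() for field in row)
        s = stats.setdefault(int(idx), {"peak": 0, "total": int(total), "util": []})
        s["peak"] = max(s["peak"], int(used))
        s["util"].append(float(util))
    return {
        idx: {
            "peak_mib": s["peak"],
            "peak_pct": 100.0 * s["peak"] / s["total"],
            "headroom_mib": s["total"] - s["peak"],
            "mean_util_pct": sum(s["util"]) / len(s["util"]),
        }
        for idx, s in stats.items()
    }


SAMPLE = """0, 32100, 32768, 97
0, 32342, 32768, 99
0, 32200, 32768, 98
"""

if __name__ == "__main__":
    for gpu, s in summarize(SAMPLE).items():
        print(f"GPU {gpu}: peak {s['peak_mib']} MiB ({s['peak_pct']:.0f}%), "
              f"headroom {s['headroom_mib']} MiB, mean util {s['mean_util_pct']:.1f}%")
```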
7. Scaling Efficiency
Scaling efficiency = throughput_4GPU / (4 × throughput_1GPU) × 100%
| Model | A30 (vLLM FP16) | V100 (vLLM FP16) |
|---|---|---|
| Llama-2-7B | 61.0% | 65.0% |
| Mistral-7B | 62.4% | 63.6% |
| Llama-2-13B | — | 73.9% |
Key observations:
- Scaling efficiency is 61–74%, meaning significant overhead from tensor parallelism over PCIe.
- Larger models (13B) scale better than smaller ones (7B) because the compute-to-communication ratio improves.
- V100 shows slightly better scaling than A30, possibly due to V100's higher per-GPU VRAM reducing memory pressure during tensor-parallel communication.
- NVLink would significantly improve these numbers — our test used PCIe interconnects only.
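The scaling-efficiency formula from this section, applied to the throughput tables in Section 3, can be sketched as follows. The function names are illustrative; the inputs are the 1-GPU and 4-GPU vLLM FP16 throughputs.

```python
def scaling_efficiency(throughput_n: float, throughput_1: float, n_gpus: int) -> float:
    """Scaling efficiency in percent: throughput_N / (N x throughput_1) x 100."""
    return throughput_n / (n_gpus * throughput_1) * 100.0


def speedup(throughput_n: float, throughput_1: float) -> float:
    """Raw speedup over a single GPU."""
    return throughput_n / throughput_1


if __name__ == "__main__":
    runs = [
        ("A30  Llama-2-7B", 215.9, 526.4),
        ("A30  Mistral-7B", 199.4, 497.8),
        ("V100 Mistral-7B", 152.8, 388.8),
        ("V100 Llama-2-13B", 77.4, 228.8),
    ]
    for name, one_gpu, four_gpu in runs:
        eff = scaling_efficiency(four_gpu, one_gpu, 4)
        print(f"{name}: {speedup(four_gpu, one_gpu):.2f}x speedup, {eff:.1f}% efficiency")
```

An efficiency of 61–74% on 4 GPUs corresponds to a 2.4–3x speedup, since speedup equals efficiency times GPU count.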

8. FP16 vs BF16
A30 Results (Ampere — supports both)
| Model | Stack | 4 GPU FP16 (tok/s) | 4 GPU BF16 (tok/s) | Difference |
|---|---|---|---|---|
| Llama-2-7B | vLLM | 526.4 | 526.8 | +0.1% |
| Mistral-7B | vLLM | 497.8 | 497.1 | -0.1% |
| DeepSeek-8B | vLLM | 466.2 | 463.4 | -0.6% |
| Llama-2-7B | TGI | 374.3 | 374.4 | +0.0% |
| Mistral-7B | TGI | 359.2 | 363.2 | +1.1% |
BF16 provides no meaningful throughput advantage over FP16 on these models and hardware. The differences are within noise (at most 1.1%). BF16's theoretical advantage (wider dynamic range, simpler hardware path) does not translate to measurable throughput gains for inference at this scale.
V100 (Volta — BF16 NOT supported)
All BF16 runs on V100 failed with:
ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0.
Your Tesla V100-PCIE-32GB GPU has compute capability 7.0.
This is a hardware limitation of the Volta architecture. V100 supports FP16 (via Tensor Cores) and FP32, but not BF16. If you are porting workflows that rely on BF16 (common in newer model releases), V100 will require explicit FP16 conversion or will fail at runtime.
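One way to avoid the runtime failure is to choose the precision from the GPU's compute capability before launching the server. A minimal sketch, assuming the capability is available as a `(major, minor)` tuple (PyTorch's `torch.cuda.get_device_capability()` returns one); the helper name `pick_dtype` is hypothetical, and the returned strings are intended for a dtype option such as vLLM's `--dtype`.

```python
from typing import Tuple


def pick_dtype(compute_capability: Tuple[int, int]) -> str:
    """Choose a 16-bit inference dtype from (major, minor) compute capability.

    BF16 needs compute capability 8.0 or higher (Ampere and newer);
    older GPUs such as V100 (7.0) fall back to FP16.
    """
    major, _minor = compute_capability
    return "bfloat16" if major >= 8 else "float16"


if __name__ == "__main__":
    for name, cc in [("Tesla V100 (sm_70)", (7, 0)), ("NVIDIA A30 (sm_80)", (8, 0))]:
        print(f"{name}: use dtype={pick_dtype(cc)}")
```

Note that falling back to FP16 changes numerics for checkpoints released in BF16, so output quality should be spot-checked after conversion.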
9. Cost Analysis
Cost per 1M tokens = (hourly_rate × runtime_seconds / 3600) / (tokens_generated / 1,000,000)
Best Configurations (4 GPUs, FP16)
| Model | A30 vLLM | A30 TGI | V100 vLLM | A30 vs V100 Savings |
|---|---|---|---|---|
| Llama-2-7B | INR 190 | INR 267 | INR 278 | A30 saves 32% |
| Mistral-7B | INR 201 | INR 278 | INR 286 | A30 saves 30% |
| DeepSeek-8B | INR 215 | INR 294 | INR 297 | A30 saves 28% |
| Llama-2-13B | INR 323 | INR 408 | INR 486 | A30 saves 34% |
| DeepSeek-14B | INR 374 | INR 465 | INR 529 | A30 saves 29% |
| DeepSeek-32B | — | INR 828 | INR 1,011 | A30 saves 18% |
A30 is the clear cost winner — 18–34% cheaper per million tokens than V100, while also being faster. The cost advantage comes from two compounding factors:
- A30 has a lower hourly rate (INR 360 vs 400)
- A30 generates tokens faster (Ampere efficiency)
DeepSeek-32B is the only configuration where V100 has any cost relevance: A30 cannot run the model under vLLM, so V100 is the only vLLM option. A30 with TGI still serves it for 18% less (INR 828 vs INR 1,011).
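The cost formula in this section reduces to hourly rate divided by tokens generated per hour. A short sketch that reproduces several table entries from the hourly rates in Section 1 and the 4-GPU throughputs in Section 3 (the function name is illustrative):

```python
def cost_per_million_tokens(hourly_rate_inr: float, tok_per_s: float) -> float:
    """INR per 1M generated tokens at a sustained throughput.

    Equivalent to (hourly_rate x runtime_s / 3600) / (tokens / 1e6)
    with tokens = tok_per_s x runtime_s, so the runtime cancels out.
    """
    tokens_per_hour = tok_per_s * 3600.0
    return hourly_rate_inr / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    rows = [
        ("A30 vLLM  Llama-2-7B", 360, 526.4),
        ("A30 TGI   Llama-2-7B", 360, 374.3),
        ("V100 vLLM Llama-2-7B", 400, 400.2),
        ("A30 vLLM  Llama-2-13B", 360, 309.7),
    ]
    for name, rate, tps in rows:
        print(f"{name}: INR {cost_per_million_tokens(rate, tps):.0f} per 1M tokens")
```

As Section 16.6 notes, this figure covers steady-state generation only and excludes model loading and idle time.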

10. Power Efficiency
Tokens per second per watt (higher is better):
| Model | GPU Count | A30 vLLM | V100 vLLM |
|---|---|---|---|
| Llama-2-7B | 1 GPU | 1.32 tok/s/W | 1.00 tok/s/W |
| Llama-2-7B | 4 GPUs | 4.12 tok/s/W | 3.41 tok/s/W |
| Mistral-7B | 4 GPUs | 3.90 tok/s/W | 3.41 tok/s/W |
| Llama-2-13B | 4 GPUs | 2.24 tok/s/W | 1.83 tok/s/W |
A30 is 14–32% more power-efficient than V100 across the configurations above (21–22% for the Llama-2 models on 4 GPUs). Both GPUs draw less power per GPU as more GPUs are added (workload distribution reduces per-GPU load), but A30's Ampere architecture extracts more useful compute per watt. For large-scale deployments where power costs are a factor, this compounds the cost advantage beyond just the INR/hr rate difference.

11. GPU Utilization
Average GPU compute utilization during steady-state inference (higher = better hardware utilization):
A30
| Model | GPU Count | vLLM | TGI |
|---|---|---|---|
| Llama-2-7B | 1 | 99.3% | 92.3% |
| Llama-2-7B | 2 | 98.3% | 89.1% |
| Llama-2-7B | 4 | 98.3% | 86.0% |
| Mistral-7B | 1 | 99.3% | 93.3% |
| Mistral-7B | 2 | 98.3% | 89.2% |
| Mistral-7B | 4 | 98.3% | 86.3% |
| DeepSeek-8B | 2 | 98.5% | 90.4% |
| DeepSeek-8B | 4 | 98.3% | 86.8% |
| Llama-2-13B | 2 | 99.4% | 92.9% |
| Llama-2-13B | 4 | 99.2% | 89.9% |
| DeepSeek-14B | 4 | 99.2% | 91.5% |
| DeepSeek-32B | 4 | — | 94.5% |
Key observations:
- vLLM saturates the GPU (98–99%) across all configs — its continuous batching + PagedAttention keeps compute units fully occupied.
- TGI drops to 86% at 4 GPUs — tensor parallel overhead and TGI's batching strategy leave more idle cycles.
- TGI utilization drops with more GPUs (93% → 86% from 1→4 GPU), while vLLM stays flat. This helps explain why vLLM scales better than TGI from 1 to 4 GPUs (see Section 7 for the scaling-efficiency definition).
- DeepSeek-32B on TGI hits 94.5% — the largest model keeps TGI's batching more fully occupied, consistent with the narrowing TGI vs vLLM gap at larger model sizes.
12. Model-Specific Observations
Llama-2-7B
Best-performing model across both hardware platforms. Highest throughput (526 tok/s on A30, 400 tok/s on V100) and lowest cost. If you need maximum throughput for a general-purpose model, Llama-2-7B is the pick.
Mistral-7B
Performs within 1–8% of Llama-2-7B depending on stack and GPU count (5% behind on 4x A30 with vLLM) despite having a different architecture (sliding window attention). No significant advantage or disadvantage across the tested configurations.
DeepSeek-R1-Distill-Llama-8B
Slower than Llama-2-7B (6–12% less throughput, depending on platform and GPU count), roughly in line with its larger 8B parameter count. Crashed on V100 single-GPU due to vLLM v1 engine compatibility issues with Volta's GQA kernel path.
Llama-2-13B
~40% slower than 7B models (as expected for ~2x parameters). Scaling efficiency is the best of all models (73.9% on V100), confirming that larger models benefit more from parallelism due to the improved compute-to-communication ratio.
DeepSeek-R1-Distill-Qwen-14B
Similar throughput to Llama-2-13B despite having slightly more parameters. The Qwen architecture handles tensor parallelism well, producing comparable scaling behavior to Llama-class models.
DeepSeek-R1-Distill-Qwen-32B
Only runs on 4 GPUs (needs ~64 GB VRAM). Throughput is 110–121 tok/s. The 32B model pushes V100 to 99% VRAM utilization, leaving almost no headroom. This configuration should not be used for production workloads without careful OOM mitigation.
Llama-2-70B
Could not run on A30 (96 GB total < 140 GB needed) or V100 (128 GB total < 140 GB needed). A 70B model in FP16 requires ~140 GB — neither cluster has sufficient total VRAM. OOM documented on both platforms across all GPU counts tested.
13. Documented Failures Summary
A30 (75 failures out of 125 total runs, 50 successes)
| Failure Category | Count | Affected Models | Explanation |
|---|---|---|---|
| Triton version mismatch | 28 | All models × all configs | TRT-LLM 1.3.0rc6 engine incompatible with Triton 26.02's 1.1.0 backend |
| TensorRT-LLM engine build failure | 22 | 7B, 8B, 13B (all GPU/precision combos) | Engine compilation failed — these models lack pre-built TRT-LLM engine configs, manual engine build crashed |
| TGI launch failure | 18 | Llama-2-7B, Mistral-7B, DeepSeek-8B (all GPU/precision combos) | Container started but produced 0 tokens — model download or weight loading failure (no error notes captured by automation) |
| TensorRT-LLM OOM | 6 | 14B (4 GPU), 32B (4 GPU), 70B (4 GPU) — FP16+BF16 | Engine build OOM — model weights exceed available VRAM during TRT compilation |
| vLLM OOM | 1 | DeepSeek-32B BF16 (4 GPU) | 32B BF16 requires ~64 GB + KV cache, exceeds 96 GB total across 4×A30 |
| Total | 75 | | |
V100 (138 failures out of 152 total runs)
| Failure Category | Count | Explanation |
|---|---|---|
| TGI unsupported | 38 | Latest TGI has no CUDA kernels for sm_70 (Volta) |
| TensorRT-LLM unsupported | 38 | Container rejects V100, engine build fails |
| Triton unsupported | 38 | Container rejects V100 architecture |
| BF16 unsupported | 19 | V100 compute capability 7.0 < 8.0 required |
| vLLM engine crash | 3 | DeepSeek-8B and 14B on 1 GPU, DeepSeek-32B on 2 GPUs |
| OOM | 2 | 70B exceeds 128 GB, 32B exceeds 32 GB (1 GPU) |
The failure rate on V100 (90.8%) is not primarily due to hardware deficiency — it's an ecosystem shift. Three of the four stacks tested have dropped Volta support entirely. The practical implication is that V100 infrastructure is now a maintenance burden for LLM serving, not a viable long-term deployment platform.
14. Recommendations
For V100 Infrastructure
- Use vLLM exclusively. It is the only serving stack that supports V100 in its latest release.
- Avoid models larger than 14B unless you have 4 GPUs available. 32B is possible on 4×V100 but at 99% VRAM with no headroom.
- Consider migrating to Ampere. The ecosystem is actively dropping Volta support, and the trend will accelerate.
For A30 Infrastructure
- Use vLLM over TGI for 25–41% better throughput and 20–29% lower cost per token.
- FP16 is sufficient — BF16 provides no measurable benefit for inference throughput.
- Budget for 4 GPUs — scaling from 1 to 4 GPUs gives 2.4–3x throughput improvement (61–74% efficiency).
For New Deployments
- A30 > V100 for LLM inference. A30 is faster, cheaper, more power-efficient, and has broader serving stack support.
- 70B models are out of scope for both platforms — neither A30 (96 GB) nor V100 (128 GB) can accommodate Llama-2-70B's ~140 GB FP16 footprint.
Serving Stack Recommendations
- vLLM: Best overall. Highest throughput, best TTFT scaling, broadest hardware support, active development.
- TGI: Viable alternative on Ampere+, but 25–41% slower than vLLM. Simpler API. TTFT doesn't improve with more GPUs.
- TensorRT-LLM: 8–16% slower than vLLM with PyTorch backend. Highest TTFT of the three stacks. V100 unsupported. Fully compiled TRT engines (manual build) may close the throughput gap with vLLM.
- Triton: Version compatibility issues make it impractical without careful container version pinning.
15. Methodology Notes
- Duration: 120-second measurement windows (after 30s warmup) to match real-world burst inference patterns.
- Batch size: 4 concurrent async clients to simulate realistic multi-user load.
- Prompt: Fixed 128-token prompt ("The " repeated) for reproducibility. Not representative of real prompt diversity.
- Limitations: Single prompt length/generation length tested. Production workloads vary significantly. No quantization (INT8/FP8) was tested.
- TTFT: Measured via streaming responses on A30 v2 run for all three stacks. See Section 5.
- Reproducibility: All runs used deterministic sampling (temperature=0.0). Docker images were pulled on the dates listed in the Software Stack table (vLLM and TGI as :latest), with digests recorded for pinning. Raw CSV data available for independent analysis.
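The measurement procedure described in this section (concurrent clients, a warmup period, then a fixed measurement window) can be sketched with `asyncio`. This is not the harness used for the report: `run_benchmark` is a hypothetical name, `send_request` is a caller-supplied coroutine that returns the number of generated tokens, and counting only requests that start after warmup and finish inside the window is an assumption about the accounting.

```python
import asyncio
import time


async def run_benchmark(send_request, clients=4, warmup_s=30.0, window_s=120.0):
    """Drive `clients` concurrent request loops; report throughput and latencies
    for requests that start after warmup and finish inside the window."""
    start = time.perf_counter()
    measure_from = start + warmup_s
    deadline = measure_from + window_s
    tokens = 0
    latencies_ms = []

    async def client():
        nonlocal tokens
        while True:
            t0 = time.perf_counter()
            if t0 >= deadline:
                break
            generated = await send_request()
            t1 = time.perf_counter()
            if t0 >= measure_from and t1 <= deadline:
                tokens += generated
                latencies_ms.append((t1 - t0) * 1000.0)

    await asyncio.gather(*(client() for _ in range(clients)))
    return {"tok_per_s": tokens / window_s, "latencies_ms": latencies_ms}


async def fake_request():
    """Stand-in for a real HTTP call: 10 ms per request, 256 tokens each."""
    await asyncio.sleep(0.01)
    return 256


if __name__ == "__main__":
    # Short durations so the demo finishes quickly; the report used 30 s + 120 s.
    result = asyncio.run(run_benchmark(fake_request, warmup_s=0.05, window_s=0.3))
    print(f"{result['tok_per_s']:.0f} tok/s over {len(result['latencies_ms'])} requests")
```

Requests that straddle the warmup boundary or the end of the window are discarded, which slightly understates throughput but keeps every counted latency fully inside the measurement window.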
16. Limitations & Caveats
This section documents known limitations so that results are interpreted correctly and future work can address the gaps.
16.1 Statistical Rigor
- A30 data is single-run. Each A30 configuration was benchmarked once. The A30 server was decommissioned before multi-run data could be collected. Observed p50-to-p90 spreads of < 5% suggest low variance, but no standard deviation can be reported.
- V100 multi-run variance analysis is complete — 3 independent runs × 3 sequence lengths across all working configurations confirms < 0.6% std/mean, validating single-run methodology. See Section 17.
16.2 Sequence Length Coverage
- Primary results use a single fixed configuration: 128 input tokens + 256 output tokens. This represents a "medium" workload but does not capture behavior at extremes.
- V100 supplementary data (Section 17) adds short (32+64) and long (512+512) sequence benchmarks across 3 independent runs. Throughput degrades 2–21% from short to long sequences depending on model architecture.
- Real-world prompts have highly variable lengths (1 token to 128K+ tokens for long-context models). Our fixed-length prompts do not capture KV cache pressure from long contexts.
16.3 Workload Realism
- Synthetic prompts: All benchmarks use "The " repeated N times. Real prompts have diverse token distributions that can affect attention computation patterns.
- Fixed concurrency: All tests use 4 concurrent clients. Production systems may see hundreds of concurrent requests with dynamic batching behaving differently at scale.
- No streaming metrics for V100: V100 TTFT was not measured. A30 TTFT is available in Section 4.5.
- Temperature=0: Greedy decoding. Sampling (temperature > 0) adds overhead that is not captured here.
16.4 Stack Coverage Gaps
- V100 is vLLM-only. TGI, TensorRT-LLM, and Triton's latest containers reject Volta GPUs entirely. We did not test older container versions that may still support V100 — this is a deliberate choice to benchmark current-generation software.
- No quantization tested. INT8, INT4 (GPTQ/AWQ), and FP8 quantization can dramatically change the throughput/VRAM tradeoffs. These were out of scope.
- No speculative decoding, prefix caching, or chunked prefill — advanced vLLM features that can significantly boost real-world performance were not enabled.
16.5 Hardware Limitations
- PCIe interconnect only. Both clusters use PCIe, not NVLink. Multi-GPU scaling results would likely differ with NVLink; we estimate 10–20% better scaling efficiency, but this was not measured.
- No CPU/memory profiling. CPU utilization and system memory metrics were not systematically collected. GPU metrics come from nvidia-smi sampled at 1-second intervals, which may miss sub-second spikes.
- Thermal throttling not controlled for. Extended benchmark runs on V100 may trigger thermal throttling, particularly on the 4-GPU configuration. No thermal monitoring beyond nvidia-smi power readings was performed.
16.6 Cost Model Simplification
Cost per 1M tokens is calculated using instance-level hourly rates (INR 360/hr for A30, INR 400/hr for V100) and measured throughput. This does not account for:
- Server startup/model loading time (can be 2–10 minutes per model)
- Idle time between requests in production
- Network transfer costs
- Storage costs for model weights
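The cost calculation described above can be sketched as follows. This is a minimal illustration rather than the benchmark's own script: the function name is hypothetical, and the throughput inputs are the 4-GPU vLLM Llama-2-7B figures from the Quick Reference table.

```python
def cost_per_million_tokens(hourly_rate_inr: float, tokens_per_sec: float) -> float:
    """Cost in INR to generate one million tokens at a sustained throughput.

    Assumes the instance is fully busy for the whole hour, which is the
    simplification this section warns about.
    """
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_inr / tokens_per_hour * 1_000_000


# Hourly rates from this section; throughput from the Quick Reference table.
a30_cost = cost_per_million_tokens(360, 526.4)
v100_cost = cost_per_million_tokens(400, 400.2)
saving = 1 - a30_cost / v100_cost

print(f"A30:  INR {a30_cost:.0f} per 1M tokens")
print(f"V100: INR {v100_cost:.0f} per 1M tokens")
print(f"A30 saving: {saving:.0%}")
```

Because the formula divides by sustained throughput, any idle time or model-loading time raises the effective cost above these figures.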
16.7 Missing Configurations
| Gap | Reason | Impact |
|---|---|---|
| Llama-2-70B on V100 | OOM even at 4×32 GB (140 GB FP16 > 128 GB) | Cannot compare largest model on V100 |
| Llama-2-70B BF16 on A30 | OOM at 4×24 GB (140 GB > 96 GB) | No 70B results on A30; FP16 needs the same 140 GB and does not fit either |
| DeepSeek-8B @ 1 GPU (V100) | vLLM v1 engine crash (RuntimeError) | Missing single-GPU baseline for this model |
| DeepSeek-14B @ 1 GPU (V100) | vLLM v1 engine crash | Missing single-GPU baseline |
| DeepSeek-32B @ ≤2 GPUs (V100) | OOM / engine crash | Only 4-GPU data available |
| All TGI on V100 | Container built for sm_80+ only | No cross-stack comparison possible on V100 |
| All TRT-LLM on V100 | Container rejects Volta GPUs | No TRT-LLM data on V100 |
| All Triton on V100 | Container rejects Volta GPUs | No Triton data on V100 |
| TRT-LLM large models on A30 | Engine build OOM for ≥13B models | TRT-LLM data only for 7B models on A30 |
17. V100 Supplementary: Multi-Run Variance & Sequence Length Analysis
126 benchmark runs: 3 independent runs × 3 sequence lengths × 14 model+GPU configurations. All V100 vLLM FP16.
17.1 Throughput Variance (mean ± std, tok/s)
| Model | GPUs | Short (32+64) | Medium (128+256) | Long (512+512) |
|---|---|---|---|---|
| Llama-2-7B | 1 | 169.0 ± 0.6 | 153.7 ± 0.0 | 121.4 ± 0.0 |
| Llama-2-7B | 2 | 267.4 ± 0.0 | 249.5 ± 0.0 | 204.1 ± 0.0 |
| Llama-2-7B | 4 | 424.7 ± 0.2 | 400.2 ± 0.1 | 337.5 ± 0.2 |
| Mistral-7B | 1 | 155.6 ± 0.1 | 152.8 ± 0.0 | 142.9 ± 0.0 |
| Mistral-7B | 2 | 240.3 ± 0.0 | 239.0 ± 0.0 | 227.5 ± 0.0 |
| Mistral-7B | 4 | 390.0 ± 0.2 | 389.3 ± 0.0 | 369.2 ± 0.4 |
| DeepSeek-8B | 2 | 228.5 ± 1.3 | 228.6 ± 0.1 | 218.1 ± 0.0 |
| DeepSeek-8B | 4 | 373.2 ± 2.3 | 373.2 ± 0.5 | 355.3 ± 0.2 |
| Llama-2-13B | 1 | 82.6 ± 0.1 | 77.4 ± 0.0 | 64.8 ± 0.1 |
| Llama-2-13B | 2 | 144.2 ± 0.5 | 136.6 ± 0.0 | 115.0 ± 0.0 |
| Llama-2-13B | 4 | 239.6 ± 0.8 | 228.6 ± 0.1 | 196.6 ± 0.1 |
| DeepSeek-14B | 2 | 134.4 ± 0.5 | 134.5 ± 0.0 | 129.1 ± 0.1 |
| DeepSeek-14B | 4 | 210.3 ± 0.3 | 210.1 ± 0.0 | 201.3 ± 0.2 |
| DeepSeek-32B | 4 | 109.4 ± 0.0 | 110.0 ± 0.0 | 107.0 ± 0.0 |
Variance is negligible. Standard deviation is < 1% of the mean in every configuration across the 126 runs. The maximum std observed is 2.3 tok/s (DeepSeek-8B, 4 GPUs, short sequences), about 0.6% of the mean. This indicates that vLLM produces highly reproducible results on V100 under controlled conditions. It also suggests, but cannot prove, that the single-run A30 data in Section 3 is reliable, since no multi-run validation was possible on that hardware.
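The relative-spread figure quoted above can be reproduced directly from the table. The sketch below uses only the four cells with the largest reported standard deviations; the variable names are illustrative.

```python
# (configuration, mean tok/s, std tok/s) copied from the Section 17.1 table.
cells = [
    ("DeepSeek-8B, 4 GPUs, short", 373.2, 2.3),
    ("DeepSeek-8B, 2 GPUs, short", 228.5, 1.3),
    ("Llama-2-13B, 4 GPUs, short", 239.6, 0.8),
    ("Llama-2-7B, 1 GPU, short", 169.0, 0.6),
]

# Coefficient of variation: standard deviation as a fraction of the mean.
relative_std = {name: std / mean for name, mean, std in cells}

worst_name = max(relative_std, key=relative_std.get)
worst_value = relative_std[worst_name]

for name, value in relative_std.items():
    print(f"{name}: {value:.2%}")
print(f"Largest relative std: {worst_name} ({worst_value:.2%})")
```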
17.2 Throughput Degradation: Short to Long Sequences
| Model | 4 GPUs Short | 4 GPUs Long | Drop |
|---|---|---|---|
| Llama-2-7B | 424.7 | 337.5 | −20.5% |
| Mistral-7B | 390.0 | 369.2 | −5.3% |
| DeepSeek-8B | 373.2 | 355.3 | −4.8% |
| Llama-2-13B | 239.6 | 196.6 | −17.9% |
| DeepSeek-14B | 210.3 | 201.3 | −4.3% |
| DeepSeek-32B | 109.4 | 107.0 | −2.2% |
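- Llama-2 models degrade 18–21%. Llama-2-7B and 13B use full multi-head attention, so each decoded token reads a KV cache that grows linearly with context, and prefill attention scales quadratically with prompt length.
- Mistral-7B drops only 5.3%. The longest tested sequence (512+512 = 1,024 tokens) is well below Mistral's 4,096-token sliding window, so the window is not yet active at these lengths. The more likely cause is grouped-query attention (GQA), which shrinks the KV cache read per decoded token.
- DeepSeek models also degrade < 5%, which suggests their attention layout handles these context lengths efficiently.
- DeepSeek-32B drops only 2.2%. At this model size, per-token cost is likely dominated by the weight matrices (feed-forward and projection layers) rather than by attention over the KV cache.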
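For reference, the Drop column is the relative change from short-sequence to long-sequence throughput. A minimal sketch that recomputes it from the table values (names are illustrative):

```python
# model -> (short tok/s, long tok/s) at 4 GPUs, from the Section 17.2 table.
throughput = {
    "Llama-2-7B": (424.7, 337.5),
    "Mistral-7B": (390.0, 369.2),
    "DeepSeek-8B": (373.2, 355.3),
    "Llama-2-13B": (239.6, 196.6),
    "DeepSeek-14B": (210.3, 201.3),
    "DeepSeek-32B": (109.4, 107.0),
}


def relative_drop_pct(short: float, long: float) -> float:
    """Percentage change going from short to long sequences (negative = slower)."""
    return (long - short) / short * 100


drops = {model: relative_drop_pct(s, l) for model, (s, l) in throughput.items()}
for model, drop in drops.items():
    print(f"{model}: {drop:.1f}%")
```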
17.3 Latency at Different Sequence Lengths (p50, ms, 4 GPUs)
| Model | Short (32+64) | Medium (128+256) | Long (512+512) |
|---|---|---|---|
| Llama-2-7B | 603 | 2,559 | 6,068 |
| Mistral-7B | 656 | 2,630 | 5,547 |
| DeepSeek-8B | 684 | 2,744 | 5,763 |
| Llama-2-13B | 1,066 | 4,479 | 10,415 |
| DeepSeek-14B | 1,217 | 4,874 | 10,170 |
| DeepSeek-32B | 2,340 | 9,305 | 19,132 |
For interactive use (< 2 s p50 latency on V100), only short sequences qualify, and only for models up to 14B (0.6–1.2 s); DeepSeek-32B misses the target even there at 2.3 s. Medium sequences push all models into the 2.5–9.3 s range. Long sequences (5.5–19 s) are batch-only across the board.
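The latency and throughput tables appear to be linked by a simple closed-loop relation: with 4 concurrent clients (Section 16.3), each waiting for a complete response, throughput is roughly concurrency × output tokens / latency. This relation is inferred from the published numbers and is not stated by the benchmark harness; the sketch below checks it against a few table entries.

```python
CONCURRENCY = 4  # concurrent clients, per Section 16.3


def implied_throughput(gen_len: int, p50_ms: float, concurrency: int = CONCURRENCY) -> float:
    """Output tokens per second implied by per-request latency in a closed loop."""
    return concurrency * gen_len / (p50_ms / 1000)


# (label, output tokens, p50 latency in ms, reported mean tok/s) from Sections 17.1 and 17.3.
checks = [
    ("Llama-2-7B short", 64, 603, 424.7),
    ("Llama-2-7B medium", 256, 2559, 400.2),
    ("Llama-2-7B long", 512, 6068, 337.5),
    ("Llama-2-13B long", 512, 10415, 196.6),
    ("DeepSeek-32B medium", 256, 9305, 110.0),
]

errors = {}
for label, gen_len, p50_ms, reported in checks:
    implied = implied_throughput(gen_len, p50_ms)
    errors[label] = abs(implied - reported) / reported
    print(f"{label}: implied {implied:.1f} tok/s, reported {reported} tok/s")
```

If the relation holds, the throughput figures count output tokens only, and latency targets can be translated into throughput targets (and back) for a given concurrency.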
Appendix A: Raw Data Files
| File | Description |
|---|---|
| metrics_ALL_STACKS_a30_final.csv | A30 v1 complete dataset (125 rows) |
| metrics_a30_v2.csv | A30 v2 TRT-LLM retry + TTFT data (48+ rows, includes ttft_p50/p90/p99 columns) |
| metrics_v100_final.csv | V100 complete dataset (152 rows) |
| metrics_v100_multirun.csv | V100 multi-run × multi-seq supplementary (126 rows) |
CSV Schema
```
run_id, stack, model, hardware, gpu_count, precision, batch_size,
prompt_len, gen_len, tokens_generated, runtime_sec, tokens_per_sec,
p50_ms, p90_ms, p99_ms, peak_vram_mb_per_gpu, avg_vram_mb_per_gpu,
gpu_util_pct_avg, power_w_avg, cpu_util_pct, run_start, run_end,
docker_image, notes, cost_per_1M_tokens
```
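A minimal sketch of how the multi-run file could be summarized into mean ± std per configuration, using only the Python standard library. It assumes the CSV has a header row with the column names listed above and uses the sample standard deviation; the report does not say which estimator was used. The rows in `SAMPLE` are illustrative placeholders, not benchmark data.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize(csv_file):
    """Group rows by (model, gpu_count, prompt_len, gen_len) and return
    {key: (mean tokens_per_sec, sample std)} for each configuration."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        key = (
            row["model"],
            int(row["gpu_count"]),
            int(row["prompt_len"]),
            int(row["gen_len"]),
        )
        groups[key].append(float(row["tokens_per_sec"]))

    summary = {}
    for key, values in groups.items():
        # stdev needs at least two samples; report 0.0 for single-run configs.
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[key] = (statistics.mean(values), std)
    return summary


# Placeholder rows for illustration only (not taken from the benchmark).
SAMPLE = """model,gpu_count,prompt_len,gen_len,tokens_per_sec
example-model,4,128,256,100.0
example-model,4,128,256,102.0
example-model,4,128,256,104.0
"""

summary = summarize(io.StringIO(SAMPLE))
for key, (mean, std) in summary.items():
    print(key, f"{mean:.1f} ± {std:.1f} tok/s")
```

To run it on the real data, pass an open file handle instead, for example `summarize(open("metrics_v100_multirun.csv", newline=""))`.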
Appendix B: Environment Versions & Docker Digests
A30 Server (decommissioned — exact versions not recoverable):
GPU: 4x NVIDIA A30 (24GB)
Driver: NVIDIA 550.x (approximate — server deleted before exact version recorded)
CUDA: 12.x (approximate — bundled with driver)
Docker images: same registry tags as V100, pulled ~1 week earlier
V100 Server:
GPU: 4x Tesla V100-PCIE-32GB
Driver: NVIDIA 580.126.09
CUDA: 13.0 (driver) / containers use their bundled CUDA
Docker Image Digests (for exact reproducibility):
vllm/vllm-openai:latest
sha256:2296a2a7e1ce1dc59c6577ba5900f4e9910b76c4a0cb134833a8137f92404dfa
Pulled: 2026-03-07
ghcr.io/huggingface/text-generation-inference:latest
sha256:e6b0af6e0bf65337b84a19f15d74660c7892192f555fb0b68d3f3d62bf0c1e9a
Pulled: 2026-01-08
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc6
sha256:6f7842605abc44cb0f119bdb12e34cab2b1e6a4e39d2af43f16af644600b9bdd
Pulled: 2026-02-27
nvcr.io/nvidia/tritonserver:26.02-trtllm-python-py3
sha256:e29ed3221ac3d3cc4128cf0ecf4f172db361bb4f474de146c160d370b211e679
Pulled: 2026-02-14
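Because `latest` tags move over time, reproducing these runs means pulling by digest rather than by tag. The sketch below builds `repository@sha256:...` references from the values above. It assumes the recorded values are registry manifest digests (as shown under RepoDigests) rather than local image IDs; if they are image IDs, pulling by them will not work. The helper name is hypothetical.

```python
# Repository -> digest, copied from the list above.
IMAGE_DIGESTS = {
    "vllm/vllm-openai":
        "sha256:2296a2a7e1ce1dc59c6577ba5900f4e9910b76c4a0cb134833a8137f92404dfa",
    "ghcr.io/huggingface/text-generation-inference":
        "sha256:e6b0af6e0bf65337b84a19f15d74660c7892192f555fb0b68d3f3d62bf0c1e9a",
    "nvcr.io/nvidia/tensorrt-llm/release":
        "sha256:6f7842605abc44cb0f119bdb12e34cab2b1e6a4e39d2af43f16af644600b9bdd",
    "nvcr.io/nvidia/tritonserver":
        "sha256:e29ed3221ac3d3cc4128cf0ecf4f172db361bb4f474de146c160d370b211e679",
}


def pinned_reference(repository: str, digest: str) -> str:
    """Return an image reference pinned to an exact digest."""
    prefix = "sha256:"
    hex_part = digest[len(prefix):]
    if not digest.startswith(prefix) or len(hex_part) != 64:
        raise ValueError(f"not a sha256 digest: {digest!r}")
    int(hex_part, 16)  # raises ValueError if the digest is not hexadecimal
    return f"{repository}@{digest}"


pinned = [pinned_reference(repo, digest) for repo, digest in IMAGE_DIGESTS.items()]
for reference in pinned:
    print("docker pull", reference)
```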
Conclusion
This benchmark establishes a clear, data-driven answer to one of the most common infrastructure questions in LLM deployment today: should you run on NVIDIA A30 or Tesla V100?
On every dimension measured for production LLM inference (tokens per second, p90 latency, cost per million tokens, power efficiency, and serving stack compatibility), the A30 comes out ahead. It delivers 24–35% higher throughput, 26–35% lower latency, and 18–34% lower cost per million tokens than V100, while supporting three serving stacks versus V100's one. The one scenario where V100 remains relevant is DeepSeek-32B: the weights plus vLLM's KV cache and runtime overhead did not fit in the A30 cluster's 96 GB of total VRAM. This is a narrow edge case.
Beyond hardware, this benchmark reveals a more structural finding: the LLM serving ecosystem has moved on from Volta. TGI, TensorRT-LLM, and Triton have all dropped sm_70 support in their current container releases. Teams still running V100 infrastructure are locked into vLLM as their only option — with no cross-stack comparison, no BF16, and no path to TensorRT-LLM's potential throughput ceiling.
For teams choosing a GPU for multi-GPU LLM serving in 2026, the decision is straightforward:
- Use A30 with vLLM for the best combination of throughput, cost per million tokens, and serving stack flexibility.
- Use FP16 — BF16 adds no measurable inference throughput benefit on Ampere at this scale.
- Plan for 4 GPUs — PCIe scaling plateaus at 61–74% efficiency, but the absolute throughput gain from 1 to 4 GPUs is still 2.4–3x.
- Avoid V100 for new deployments — the ecosystem is actively deprecating Volta support and the gap will only widen.
Frequently Asked Questions
Is NVIDIA A30 better than V100 for LLM inference?
Yes, across every tested metric. The A30 delivers 24–35% higher tokens per second, 26–35% lower p90 latency, and 18–34% lower cost per million tokens than the V100 on comparable models. It also supports three serving stacks (vLLM, TGI, TensorRT-LLM) versus V100's one (vLLM only). The one exception is DeepSeek-32B, which ran only on 4×V100 (128 GB): on the A30 cluster, the weights plus vLLM's KV cache overhead did not fit in 96 GB.
Does vLLM support V100 (Tesla Volta GPUs)?
Yes — vLLM is the only major LLM serving stack that still supports V100 in its current release. TGI, TensorRT-LLM, and Triton's latest containers all reject Volta (sm_70) architecture with CUDA kernel or container-level errors. V100 deployments are effectively locked into vLLM.
What is the cost per million tokens for LLM inference on A30 vs V100?
At 4 GPUs with vLLM (FP16), A30 costs approximately INR 190/million tokens for Llama-2-7B vs INR 278 on V100 — a 32% saving. For Llama-2-13B, A30 costs INR 323 vs INR 486 on V100 — a 34% saving. A30's advantage comes from both a lower hourly rate (INR 360 vs 400/hr) and higher throughput from its Ampere architecture.
Is BF16 faster than FP16 for LLM inference on A30?
No. Across all models and serving stacks tested on A30 (Ampere, sm_80), BF16 and FP16 throughput differ by less than 1.1% — well within measurement noise. BF16 is not available on V100 at all (requires compute capability 8.0+, V100 is 7.0). Use FP16 for maximum hardware compatibility with no throughput penalty.
What is the multi-GPU scaling efficiency for LLM inference?
On both A30 and V100 over PCIe, scaling efficiency from 1 to 4 GPUs is 61–74%. Larger models scale better than smaller ones: on V100, Llama-2-13B reaches 73.9%, while Llama-2-7B reaches about 65% (400.2 / 153.7 / 4, using the Section 17.1 medium-sequence means). NVLink interconnects would likely improve these numbers; both clusters in this benchmark use PCIe only.
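Scaling efficiency here means the measured speedup divided by the ideal (linear) speedup. A minimal sketch, applied to the V100 medium-sequence means from Section 17.1, which can differ slightly from the primary-run figures quoted elsewhere in the report:

```python
def scaling_efficiency(throughput_1gpu: float, throughput_ngpu: float, n_gpus: int) -> float:
    """Fraction of ideal linear speedup achieved when going from 1 to n GPUs."""
    if throughput_1gpu <= 0 or n_gpus < 1:
        raise ValueError("throughput must be positive and n_gpus >= 1")
    speedup = throughput_ngpu / throughput_1gpu
    return speedup / n_gpus


# model -> (1-GPU tok/s, 4-GPU tok/s), medium sequences, Section 17.1.
v100_medium = {
    "Llama-2-7B": (153.7, 400.2),
    "Mistral-7B": (152.8, 389.3),
    "Llama-2-13B": (77.4, 228.6),
}

efficiency = {
    model: scaling_efficiency(one, four, 4)
    for model, (one, four) in v100_medium.items()
}
for model, value in efficiency.items():
    one, four = v100_medium[model]
    print(f"{model}: {four / one:.2f}x speedup, {value:.1%} efficiency")
```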
Which LLM serving stack has the highest throughput on A30?
vLLM leads on A30 across all metrics — 8–16% faster than TRT-LLM (PyTorch backend) and 25–41% faster than TGI at 4 GPUs. The ranking is vLLM > TRT-LLM > TGI on every model tested. The vLLM advantage narrows for larger models (16% at 7B → 8% at 14B), since compute-bound workloads benefit less from vLLM's memory management optimizations.
Which stack has the lowest Time to First Token (TTFT)?
On A30 at 4 GPUs, vLLM has the lowest TTFT: 23 ms for Llama-2-7B, vs 41 ms for TGI and 63 ms for TRT-LLM. vLLM's TTFT also drops as GPUs are added (54 ms on 1 GPU → 23 ms on 4 GPUs), while TGI's stays flat at ~41 ms regardless of GPU count. All three stacks are under 90 ms TTFT, comfortably within interactive range.
Can Llama-2-70B run on 4x A30 or 4x V100?
No. Llama-2-70B requires approximately 140 GB in FP16. Four A30 GPUs provide 96 GB total, and four V100 GPUs provide 128 GB — both fall short. This model is out of scope for both platforms benchmarked here.
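The arithmetic behind this answer is a weights-only estimate: parameters × bytes per parameter, compared against total VRAM. The sketch below is a rough lower bound, not a capacity planner. It ignores KV cache, activations, and runtime overhead, which is why a model can pass this check and still fail to load, as the report found for DeepSeek-32B on A30.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (FP16 and BF16 both use 2 bytes)."""
    return params_billions * bytes_per_param


def weights_fit(params_billions: float, n_gpus: int, vram_gb_per_gpu: float,
                bytes_per_param: int = 2) -> bool:
    """Necessary (not sufficient) condition: do the weights alone fit in total VRAM?"""
    return weight_memory_gb(params_billions, bytes_per_param) <= n_gpus * vram_gb_per_gpu


clusters = {"4x A30": (4, 24), "4x V100": (4, 32)}
for name, (n_gpus, vram) in clusters.items():
    for params in (70, 32):
        fits = weights_fit(params, n_gpus, vram)
        print(f"{params}B FP16 ({weight_memory_gb(params):.0f} GB) on {name} "
              f"({n_gpus * vram} GB): weights {'fit' if fits else 'do not fit'}")
```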


