Best GPU for LLM Training: A100, H100, H200 Compared

Deploybase · March 2, 2026 · GPU Cloud

Best GPU for LLM Training: Overview

LLM training GPU choice = model size + budget + time.

A100: proven, 80GB, cost-effective. H100: 3x faster, NVLink, 50-70% time savings. H200: 141GB memory. B200: newest, 3x H100 cost.

Startups: A100. Time-critical: H100. Match GPU to model size and budget.


GPU Specification Comparison

| Specification | A100 PCIe | H100 PCIe | H200 | B200 | RTX 4090 |
| --- | --- | --- | --- | --- | --- |
| Memory (VRAM) | 80GB | 80GB | 141GB | 192GB | 24GB |
| Memory Type | HBM2e | HBM2e | HBM3e | HBM3e | GDDR6X |
| Bandwidth | 2.0 TB/s | 2.0 TB/s | 4.8 TB/s | 8.0 TB/s | 1.0 TB/s |
| FP32 Throughput | 19.5 TFLOPS | 60 TFLOPS | 79 TFLOPS | 180 TFLOPS | 82.6 TFLOPS |
| Tensor Float 32 | 156 TFLOPS | 1.4 PFLOPS | 1.8 PFLOPS | 5.3 PFLOPS | 1.3 PFLOPS |
| TDP | 250W | 350W | 575W | 1,000W | 575W |
| Architecture | Ampere | Hopper | Hopper | Blackwell | Ada |
| NVLink Support | Yes (200 GB/s) | Yes (900 GB/s) | Yes (900 GB/s) | Yes (1.8 TB/s) | No (PCIe only) |
| Release Year | 2020 | 2023 | 2024 | 2025 | 2022 |

Data from NVIDIA datasheets and DeployBase GPU database as of March 21, 2026.


Pricing and Hourly Cost

Single-GPU Cloud Rental Prices (RunPod On-Demand)

| GPU | VRAM | $/Hour | $/Month (730 hrs) | Annual |
| --- | --- | --- | --- | --- |
| A100 PCIe | 80GB | $1.19 | $869 | $10,428 |
| A100 SXM | 80GB | $1.39 | $1,015 | $12,180 |
| H100 PCIe | 80GB | $1.99 | $1,453 | $17,440 |
| H100 SXM | 80GB | $2.69 | $1,964 | $23,568 |
| H200 | 141GB | $3.59 | $2,621 | $31,452 |
| B200 | 192GB | $5.98 | $4,365 | $52,380 |
| RTX 4090 | 24GB | $0.34 | $248 | $2,976 |

Pricing from the RunPod official API as of March 21, 2026. Lambda runs 30-50% higher; AWS and Azure carry similar premiums.


Cost-Per-TFLOP Analysis

Raw throughput isn't useful without cost context. Cost-per-TFLOP reveals which GPU gives best compute bang-for-buck.

Calculation Method

Cost-per-TFLOP = Hourly rate / Peak TFLOPS

| GPU | $/Hour | FP32 TFLOPS | $/TFLOP/hr | Efficiency Rank |
| --- | --- | --- | --- | --- |
| A100 | $1.19 | 19.5 | $0.061 | 5th |
| H100 | $1.99 | 60 | $0.033 | 2nd (tied) |
| H200 | $3.59 | 79 | $0.045 | 4th |
| B200 | $5.98 | 180 | $0.033 | 2nd (tied) |
| RTX 4090 | $0.34 | 82.6 | $0.004 | 1st |

Surprise result: RTX 4090 is most efficient per TFLOP. But it only has 24GB VRAM, limiting training workloads.

For practical training (large models, large batches), A100 is most efficient after accounting for memory constraints.
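
The cost-per-TFLOP ranking above is straightforward to reproduce. A minimal sketch, using the rates and FP32 figures from this article's tables (the function name is illustrative):

```python
# Rank GPUs by rental cost per TFLOP-hour of peak FP32 compute,
# using the $/hour and TFLOPS figures from the tables above.
gpus = {
    "A100":     (1.19, 19.5),
    "H100":     (1.99, 60.0),
    "H200":     (3.59, 79.0),
    "B200":     (5.98, 180.0),
    "RTX 4090": (0.34, 82.6),
}

def cost_per_tflop(rate_per_hr: float, tflops: float) -> float:
    """Dollars per TFLOP of peak FP32 compute, per hour of rental."""
    return rate_per_hr / tflops

# Sort cheapest compute first; RTX 4090 comes out on top, A100 last.
ranked = sorted(gpus.items(), key=lambda kv: cost_per_tflop(*kv[1]))
for name, (rate, tflops) in ranked:
    print(f"{name:>8}: ${cost_per_tflop(rate, tflops):.3f}/TFLOP/hr")
```

Note that the raw ranking ignores memory: the 4090's win evaporates once a workload needs more than 24GB.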


A100: The Workhorse

Specs

80GB HBM2e memory. 2.0 TB/s bandwidth. 19.5 TFLOPS FP32. Ampere architecture (released 2020).

Cost Profile

$1.19/hour on RunPod (cheapest option with real large-scale capability). $869/month for continuous training.

Training Performance

A100 trains a 7B parameter model in ~24-36 hours on a single GPU. Scales well to 8x GPUs via NVLink, achieving 95%+ parallel efficiency. 80GB memory handles batch sizes 32-64 for 7B models, 8-16 for 13B models.

Strengths

  • Proven infrastructure. Every cloud provider has A100 inventory.
  • Cost-effective. Cheapest dollar-per-TFLOP in practical training scenarios.
  • Memory-bandwidth sweet spot. Enough for most fine-tuning.
  • NVLink efficiency. Multi-GPU setups scale near-linearly.
  • Availability. Easier to book 8x A100 than 8x H100.
  • Mature software ecosystem. CUDA 12, cuDNN 8.6+ fully optimized.

Weaknesses

  • Slow training for large models (30B+). Batch sizes limited by 80GB VRAM.
  • Bandwidth bottleneck. 2.0 TB/s ceiling limits throughput on attention layers.
  • 2020 architecture. Latest optimization tricks (e.g. FlashAttention-3, which targets Hopper) run less efficiently or not at all.
  • Slow on transformer layers with large sequence lengths (>2K tokens).

Best For

  • Fine-tuning existing models (Llama 7B/13B, Mistral)
  • Research on 7-13B scale models
  • Cost-sensitive projects with 2-4 week timelines
  • Batch sizes under 64 (per GPU)
  • Multi-GPU training where NVLink efficiency matters

H100: The Performance Leader

Specs

80GB HBM2e memory (PCIe variant). 2.0 TB/s bandwidth. 60 TFLOPS FP32. Hopper architecture (released 2023).

Cost Profile

$1.99/hour (PCIe) on RunPod. $1,453/month for continuous training.

Training Performance

H100 trains a 7B model in 8-12 hours on a single GPU, 3-4x faster than A100. An 8x H100 SXM cluster achieves 95%+ efficiency via NVLink at 900 GB/s. Batch sizes match the A100 (same 80GB VRAM), but throughput per batch is roughly 3x higher.

Training larger models (13B-70B) is practical. 70B model trains in ~60-80 hours on 8x H100 SXM.

Strengths

  • 3-4x faster than A100 for same cost structure.
  • NVLink on SXM variant (900 GB/s) enables true multi-GPU scaling.
  • Proven Hopper architecture. Mature software stack (CUDA 12, cuDNN 8.6+).
  • Inference speed. H100 also faster for batch serving (not just training).
  • Tensor Float 32 (TF32) precision optimizations for transformers.
  • Better memory latency hiding (Hopper improvement over Ampere).

Weaknesses

  • $1.99/hr minimum (67% more than A100).
  • Still 80GB VRAM limit. No advantage for memory-heavy models.
  • NVLink requires SXM variant (more expensive, $2.69/hr vs $1.99/hr).
  • PCIe variant (cheaper) loses NVLink efficiency on multi-GPU jobs.
  • Availability can be constrained during peak demand.

Best For

  • Production training with strict timelines
  • 13B-30B model fine-tuning
  • Teams prioritizing speed over cost
  • Multi-GPU clusters (8+ GPUs) where NVLink efficiency matters
  • Time-critical research projects

H200: The Next Generation

Specs

141GB HBM3e memory. 4.8 TB/s bandwidth. 79 TFLOPS FP32. Hopper variant (released 2024).

Cost Profile

$3.59/hour on RunPod. $2,621/month for continuous training.

Training Performance

H200 matches H100 computational throughput (both Hopper). The advantage is 141GB VRAM (76% more than H100). Enables:

  • Larger batch sizes: 128-256 per GPU (vs 32-64 on H100)
  • Longer sequence lengths: 8K+ context without pipeline parallelism
  • Bigger models: 70B fine-tuning on single GPU becomes practical

Bandwidth rises from 2.0 TB/s to 4.8 TB/s (2.4x). Memory-bound operations (attention, gradient accumulation) run up to 2.4x faster than on H100.
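
To see why capacity matters, here is a back-of-envelope VRAM estimate for full fine-tuning. The 16-bytes-per-parameter figure is a common rule of thumb for mixed-precision Adam (fp16 weights and gradients plus fp32 master weights and two optimizer moments); actual usage varies with activations and implementation:

```python
# Rough per-parameter memory costs (rule-of-thumb assumptions):
BYTES_PER_PARAM_FULL = 16   # full fine-tune: fp16 weights + grads,
                            # fp32 master weights + Adam m and v
BYTES_PER_PARAM_INFER = 2   # fp16 weights only (inference / frozen base)

def training_vram_gb(params_billions: float) -> float:
    """Optimizer-state VRAM for full fine-tuning, excluding activations."""
    return params_billions * 1e9 * BYTES_PER_PARAM_FULL / 1e9

for size in (7, 13, 30, 70):
    print(f"{size}B full fine-tune: ~{training_vram_gb(size):.0f} GB of states")
```

By this estimate a 7B full fine-tune already needs ~112 GB of states, and 70B needs ~1,120 GB; the H200's 141GB roughly covers fp16 weights for a 70B model (~140 GB), which is why single-GPU 70B work in practice leans on LoRA/QLoRA or quantization rather than raw capacity alone.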

Strengths

  • Memory advantage is real. Single-GPU training for 70B models.
  • Bandwidth for memory-bound ops (attention with long sequences).
  • Same per-token computation cost as H100 but trains larger models faster.
  • Future-proof. New models optimized for HBM3e bandwidth.
  • Long-context training becomes practical without complex sharding.

Weaknesses

  • 3x cost of A100 ($3.59 vs $1.19/hr).
  • Overkill for small models (<13B). Extra memory unused.
  • Inventory constraints. Fewer H200 available than A100/H100.
  • Limited software optimization yet. FlashAttention v3 and similar tools still maturing on H200.

Best For

  • Large model fine-tuning (30B-70B single GPU)
  • Research requiring long context windows (8K+)
  • Time-sensitive production projects with big budgets
  • Teams training for inference (batch size matters more than throughput)

B200: Large-Scale Flagship

Specs

192GB HBM3e memory. 8.0 TB/s bandwidth. 180 TFLOPS FP32. Blackwell architecture (released 2025, limited availability).

Cost Profile

$5.98/hour on RunPod. $4,365/month for continuous training. Availability extremely limited.

Training Performance

B200 is the newest hardware but not universally better. As with A100 vs H100, the premium buys capability per GPU rather than cheaper compute: its cost per TFLOP roughly matches H100. 192GB memory enables:

  • 405B-class models on a fraction of the GPUs an H100 cluster needs (still with sharding and offload)
  • Massive batch sizes (512+)
  • Full-model training without sharding

Bandwidth at 8.0 TB/s dominates for attention layers. Single-GPU attention training 5-10x faster than A100.

Strengths

  • Largest VRAM (192GB). Only option for very large models.
  • Fastest for memory-bound operations.
  • Latest architecture. Best long-term investment.
  • Better compute per watt despite the higher TDP (1,000W vs the H100 PCIe's 350W, for 3x the FP32 throughput).

Weaknesses

  • 3x cost of H100 ($5.98 vs $1.99/hr).
  • Not cheaper per TFLOP. Teams are paying for memory and bandwidth, not discounted compute.
  • Availability extremely scarce (March 2026).
  • Software ecosystem still maturing.
  • Power requirements massive (requires specialized infrastructure).

Best For

  • Training 70B+ models from scratch
  • Massive batch inference serving (1M+ req/day)
  • Enterprises with unlimited budgets
  • Foundation model development

RTX 4090: Budget Training

Specs

24GB GDDR6X memory. 1.0 TB/s bandwidth. 82.6 TFLOPS FP32. Ada architecture.

Cost Profile

$0.34/hour on RunPod. $248/month for continuous training.

Training Performance

RTX 4090 is designed for gaming, not training. But it's a legitimate option for:

  • Fine-tuning small models (3B-7B)
  • Prototype training before scaling
  • Cost-conscious research

24GB memory limits batch sizes to 8-16. Trains 7B model in 40-60 hours (4-5x slower than A100). Not suitable for larger models without gradient checkpointing and other memory tricks.
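
The "memory tricks" mentioned above mostly target activation memory. A rough sketch of the effect of sqrt-depth gradient checkpointing (the shapes and the one-tensor-per-layer simplification are illustrative assumptions, not measured numbers):

```python
import math

# Crude activation-memory model: one fp16 tensor of shape
# (batch, seq_len, hidden) kept per layer. Without checkpointing,
# all L layers' activations stay resident for the backward pass;
# with sqrt(L) checkpointing, only ~sqrt(L) live at once, at the
# cost of roughly one extra forward pass of recompute.
def activation_gb(batch, seq_len, hidden, n_layers,
                  bytes_per=2, checkpoint=False):
    per_layer = batch * seq_len * hidden * bytes_per
    layers_resident = math.sqrt(n_layers) if checkpoint else n_layers
    return per_layer * layers_resident / 1e9

# Assumed 7B-class config: hidden=4096, 32 layers, batch 16, 2K context.
full = activation_gb(16, 2048, 4096, 32)
ckpt = activation_gb(16, 2048, 4096, 32, checkpoint=True)
print(f"no checkpointing: ~{full:.1f} GB, with checkpointing: ~{ckpt:.1f} GB")
```

The several-fold reduction is what makes 7B fine-tuning plausible on a 24GB card at small batch sizes, traded against extra compute per step.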

Strengths

  • Cheap. $0.34/hr is 70% cheaper than A100.
  • Available everywhere (RTX 4090 is common).
  • Adequate for small models and fine-tuning.
  • Good price-to-TFLOP for inference.

Weaknesses

  • 24GB memory is tight. Limits model size and batch size.
  • GDDR6X bandwidth is 1.0 TB/s, half the A100's 2.0 TB/s HBM2e.
  • No NVLink. Multi-GPU training via PCIe is slow.
  • Not designed for 24/7 training (gaming hardware).
  • GDDR6X memory is consumer-grade, less reliable for long training runs.

Best For

  • Budget-first experiments
  • Fine-tuning 3B-7B models
  • Prototyping before scaling to A100
  • Teaching/learning (low stakes)

Multi-GPU Interconnect Analysis

NVLink is NVIDIA's GPU-to-GPU interconnect. On SXM parts it provides up to 900 GB/s per GPU (H100/H200 SXM; Ampere-era A100 SXM is lower, per the interconnect table below). This is critical for multi-GPU training because gradients and activations must be exchanged between GPUs constantly.

8x A100 SXM with NVLink: 95%+ parallel efficiency. Each GPU works on roughly 1/8 of the model, and gradient exchange happens at NVLink speeds rather than over PCIe. Training throughput: roughly 2-4 hours for a 7B model, 20-40 hours for 70B (see the multi-GPU table below).

8x H100 PCIe (no NVLink): 60-70% parallel efficiency. PCIe 5.0 provides 256 GB/s bandwidth. Gradient communication slower. More idle time waiting for network. Training throughput: similar per-hour cost but slower wall-clock time.

Per-GPU rates are similar, but delivered throughput is not: H100 PCIe multi-GPU jobs stall on communication, while H100 SXM with NVLink scales cleanly.

For large-scale training (30B+ models), NVLink efficiency cuts training time by 30-50%. This matters when wall-clock time is critical.
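
The efficiency gap can be expressed as effective throughput. A small sketch using the figures quoted above (~95% parallel efficiency for NVLink SXM clusters, ~65% for PCIe-only; both are this article's estimates, not benchmarks):

```python
# Effective cluster throughput = per-GPU peak × GPU count × parallel
# efficiency. Efficiency figures follow the text above.
def effective_tflops(per_gpu_tflops: float, n_gpus: int,
                     efficiency: float) -> float:
    return per_gpu_tflops * n_gpus * efficiency

nvlink = effective_tflops(60, 8, 0.95)  # 8x H100 SXM with NVLink
pcie   = effective_tflops(60, 8, 0.65)  # 8x H100 PCIe, no NVLink
print(f"8x H100 SXM (NVLink): {nvlink:.0f} effective TFLOPS")
print(f"8x H100 PCIe:         {pcie:.0f} effective TFLOPS")
print(f"NVLink advantage:     {nvlink / pcie:.2f}x")
```

At these assumptions the SXM cluster delivers about 1.46x the work per wall-clock hour, which compounds into the 30-50% time savings cited above.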

Interconnect Comparison

| Interconnect | Bandwidth | Latency | Best For |
| --- | --- | --- | --- |
| PCIe 4.0 | 64 GB/s | ~1-2 µs | Single-GPU, small clusters |
| PCIe 5.0 | 256 GB/s | ~0.5-1 µs | 2-4 GPU clusters |
| NVLink (Ampere) | 200 GB/s per GPU | ~0.2 µs | 8+ GPU clusters (A100) |
| NVLink (Hopper) | 900 GB/s per GPU | ~0.1 µs | 8+ GPU clusters (H100/H200) |
| NVLink (Blackwell) | 1.8 TB/s per GPU | ~0.05 µs | 16+ GPU clusters (B200) |

H100 NVLink (900 GB/s) is 3.5x faster than PCIe 5.0. This compounds across many GPUs.

Cost of Multi-GPU Training

8x A100 SXM, training a 7B model (24 hours):

  • Cost: $1.39 × 8 × 24 = $267
  • Throughput: ~100M tokens/hour (combined across 8 GPUs)
  • Total tokens: 2.4B tokens trained
  • Cost per 1B tokens trained: $111

8x H100 SXM, training same 7B model (12 hours):

  • Cost: $2.69 × 8 × 12 = $258
  • Throughput: ~200M tokens/hour (combined)
  • Total tokens: 2.4B tokens trained
  • Cost per 1B tokens trained: $108

Similar total cost, but H100 trains in half the time. For time-sensitive projects, H100 wins. For budget-constrained projects, A100 wins.
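
The cost-per-billion-tokens comparison above reduces to a few lines. The throughput figures are the article's estimates, not benchmarks:

```python
# Cost per billion training tokens for a multi-GPU cluster, reproducing
# the 8x A100 vs 8x H100 comparison above.
def cost_per_billion_tokens(rate_per_gpu_hr: float, n_gpus: int,
                            hours: float, tokens_per_hour: float) -> float:
    total_cost = rate_per_gpu_hr * n_gpus * hours
    total_tokens = tokens_per_hour * hours
    return total_cost / (total_tokens / 1e9)

a100 = cost_per_billion_tokens(1.39, 8, 24, 100e6)  # 8x A100 SXM
h100 = cost_per_billion_tokens(2.69, 8, 12, 200e6)  # 8x H100 SXM
print(f"A100: ${a100:.0f} per 1B tokens")
print(f"H100: ${h100:.0f} per 1B tokens")
```

Because hours cancel against tokens, the metric is really (cluster hourly cost) / (tokens per hour): near-identical here, which is why the choice hinges on wall-clock time rather than budget.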


Training Time Estimates

Single-GPU Training Times (Approximate)

| Model | A100 (80GB) | H100 (80GB) | H200 (141GB) | B200 (192GB) |
| --- | --- | --- | --- | --- |
| 7B (1 epoch, 1B tokens) | 12-24 hrs | 3-8 hrs | 2-5 hrs | 1-2 hrs |
| 13B (1 epoch, 2B tokens) | 24-48 hrs | 8-16 hrs | 5-12 hrs | 2-5 hrs |
| 30B (1 epoch, 5B tokens) | 60-120 hrs | 20-40 hrs | 12-24 hrs | 5-10 hrs |
| 70B (1 epoch, 10B tokens) | 150-300 hrs | 50-100 hrs | 30-60 hrs | 10-20 hrs |

Times assume:

  • Standard transformer training loop
  • Batch size appropriate to model size
  • No gradient checkpointing or other tricks
  • Single precision (FP32)

Multi-GPU Training Times (8x Cluster)

Add 10-20% overhead for gradient synchronization and communication.

| Model | 8x A100 | 8x H100 | 8x H200 |
| --- | --- | --- | --- |
| 7B | 2-4 hrs | 0.5-1 hr | 0.3-0.6 hr |
| 13B | 3-6 hrs | 1-2 hrs | 0.6-1.5 hrs |
| 30B | 8-15 hrs | 3-5 hrs | 1.5-3 hrs |
| 70B | 20-40 hrs | 7-12 hrs | 4-8 hrs |
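
The multi-GPU numbers follow from the single-GPU times plus the 10-20% synchronization overhead noted above. A minimal sketch (the 18-hour single-GPU figure is an assumed example within the A100 range from the single-GPU table):

```python
# Naive multi-GPU scaling: divide single-GPU time by GPU count, then
# add 10-20% for gradient synchronization and communication.
def cluster_hours(single_gpu_hours: float, n_gpus: int,
                  overhead: float = 0.15) -> float:
    return single_gpu_hours / n_gpus * (1 + overhead)

# e.g. a 7B run that takes ~18 hrs on one A100:
print(f"~{cluster_hours(18, 8):.1f} hrs on 8x A100")
```

With 15% overhead this lands around 2.6 hours, inside the table's 2-4 hour band; a fixed overhead fraction is itself an approximation, since communication cost grows with cluster size.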

Use Case Recommendations

Fine-Tuning a 7B Model on Custom Data

Start with A100 or H100.

A100: $1.19/hr × 24 hrs = $28.56, training a 7B model in 24-36 hours.
H100: $1.99/hr × 8 hrs ≈ $16, training the same model 3-4x faster.

If the team has a week, A100 is fine. If results are needed within 24 hours, use H100.

Training 70B Model from Scratch

Need multi-GPU setup. H200 or B200 on single machine, or 8x H100/H200 cluster.

8x H100 SXM: $2.69 × 8 × 200 hrs = $4,304. Trains 70B LLaMA in ~200 hours.
2x H200: $3.59 × 2 × 150 hrs = $1,077, but two GPUs deliver roughly a third of the 8x H100 cluster's aggregate compute, so expect far longer wall-clock times despite the lower bill.
1x B200: Not practical. 70B weights plus gradients and optimizer state far exceed 192GB.

For budget: 8x H100. For speed: 8x H200 or 16x H100.

Research Project with Tight Budget

Use A100 or RTX 4090 clusters.

4x RTX 4090: $0.34 × 4 × 48 hrs = $65.28. Fine-tunes small models in the prototype phase.
4x A100: $1.19 × 4 × 48 hrs = $228.48. Trains larger models faster, at 3.5x the cost.


Real-World Training Scenarios

Scenario 1: Fine-Tune Llama 7B on Internal Documentation (24 Hours)

Assumptions:

  • 10M tokens of custom training data
  • Batch size 32
  • 4 epochs

Single A100: ~24 hours, $1.19 × 24 = $28.56
Single H100: ~6 hours, $1.99 × 6 ≈ $12

Despite the higher hourly rate, the H100 run finishes so much sooner that it costs less overall.

Scenario 2: Train 13B Model from Scratch (1 Week Timeline)

Assumptions:

  • 100B tokens corpus
  • Batch size 256
  • 1 epoch

4x A100 SXM: $1.39 × 4 × 168 hrs = $934
4x H100 SXM: $2.69 × 4 × 84 hrs = $904

Similar cost. H100 finishes in 3.5 days vs 7 days for A100. H100 more valuable if iteration speed matters.

Scenario 3: Continuous Fine-Tuning Service (Monthly)

Assume 50 fine-tuning jobs per month, 10 hours each, 7B models.

A100: $1.19 × 50 × 10 = $595/month
H100: $1.99 × 50 × 10 = $995/month
H200: $3.59 × 50 × 10 = $1,795/month

A100 is cost-effective for continuous workloads. H100 only if time-to-value matters more than monthly spend.

Scenario 4: Large Foundation Model Pre-training (405B LLaMA)

Need large cluster. B200 or 16x H100.

1x B200: Can't fit 405B with optimizer state. Need at least 2x B200, in practice more, with sharding and offload.
2x B200: $5.98 × 2 × 400 hrs = $4,784 (rough estimate for 405B from scratch)
16x H100: $2.69 × 16 × 500 hrs = $21,520 (same model, ~4.5x the cost)

B200 wins for massive foundation models, but only with a distributed training framework that can shard the model and optimizer state across GPUs.


FAQ

Should I use A100 or H100 for fine-tuning? A100 for cost efficiency. H100 if you need results fast and can afford the $0.80/hr premium. For most teams, A100 is fine for fine-tuning.

Is H200 worth 3x the cost of A100? Only if you're training models over 30B parameters or need long-context training. For 7B-13B, A100 is sufficient.

Can I mix GPU types in a cluster? Not recommended. Heterogeneous clusters (A100 + H100) have efficiency loss due to uneven throughput. Use homogeneous clusters.

What about AMD MI300X? AMD is cheaper ($1.50-2.00/hr) but software ecosystem is immature. CUDA is standard. Use NVIDIA unless ROI from cost savings justifies AMD risk.

How much faster is H100 than A100? Per-GPU throughput is 3-4x higher. Wall-clock training time is 50-70% faster. Scales non-linearly with cluster size due to communication overhead.

Is B200 future-proof for training? Yes, but at premium cost. Unless you're training 405B-class models, H200 or H100 is sufficient for 2026-2028.

What's the break-even point between buying and renting? Rent if under 500 GPU-hours/month. Buy if over 1,500 GPU-hours/month across your fleet. Break-even: roughly 17 months on a $15K A100 at the $1.19/hr rate above; sooner against pricier hyperscaler rates.
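
A straight-line payback sketch at the RunPod rate quoted earlier. This ignores power, hosting, resale value, and depreciation, all of which shift the real figure, and payback comes faster against higher hyperscaler rates:

```python
# Months until a purchased GPU's cost equals cumulative rental spend.
def break_even_months(purchase_price: float, rental_rate_hr: float,
                      usage_hrs_per_month: float) -> float:
    return purchase_price / (rental_rate_hr * usage_hrs_per_month)

# Continuous 24/7 use (~730 hrs/month) at the $1.19/hr A100 rate:
months = break_even_months(15_000, 1.19, 730)
print(f"~{months:.1f} months to break even")
```

At $1.19/hr this works out to just over 17 months; lighter usage (a few hundred GPU-hours a month) pushes payback past three years, which is why renting wins for intermittent workloads.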

Should I use spot instances to save cost? Spot instances can save 50-70% on hourly rate. But interruption risk is high during peak demand. Use for batch jobs, not interactive training.


