Best GPU for LLM Training: A100, H100, H200 Compared

Deploybase · March 2, 2026 · GPU Cloud

Best GPU for LLM Training: Overview

LLM training GPU choice = model size + budget + time.

A100: proven, 80GB, cost-effective. H100: 3x faster, NVLink, 50-70% time savings. H200: 141GB memory. B200: newest, 3x H100 cost.

Startups: A100. Time-critical: H100. Match GPU to model size and budget.


GPU Specification Comparison

| Specification | A100 PCIe | H100 PCIe | H200 | B200 | RTX 4090 |
| --- | --- | --- | --- | --- | --- |
| Memory (VRAM) | 80GB | 80GB | 141GB | 192GB | 24GB |
| Memory Type | HBM2e | HBM2e | HBM3e | HBM3e | GDDR6X |
| Bandwidth | 2.0 TB/s | 2.0 TB/s | 4.8 TB/s | 8.0 TB/s | 1.0 TB/s |
| FP32 Throughput | 19.5 TFLOPS | 60 TFLOPS | 79 TFLOPS | 180 TFLOPS | 82.6 TFLOPS |
| Tensor Float 32 | 156 TFLOPS | 1.4 PFLOPS | 1.8 PFLOPS | 5.3 PFLOPS | 1.3 PFLOPS |
| TDP | 250W | 350W | 575W | 1,000W | 575W |
| Architecture | Ampere | Hopper | Hopper | Blackwell | Ada |
| NVLink Support | Yes (200 GB/s) | Yes (900 GB/s) | Yes (900 GB/s) | Yes (1.8 TB/s) | No (PCIe only) |
| Release Year | 2020 | 2023 | 2024 | 2025 | 2022 |

Data from NVIDIA datasheets and DeployBase GPU database as of March 21, 2026.


Pricing and Hourly Cost

Single-GPU Cloud Rental Prices (RunPod On-Demand)

| GPU | VRAM | $/Hour | $/Month (730 hrs) | Annual |
| --- | --- | --- | --- | --- |
| A100 PCIe | 80GB | $1.19 | $869 | $10,428 |
| A100 SXM | 80GB | $1.39 | $1,015 | $12,180 |
| H100 PCIe | 80GB | $1.99 | $1,453 | $17,440 |
| H100 SXM | 80GB | $2.69 | $1,964 | $23,568 |
| H200 | 141GB | $3.59 | $2,621 | $31,452 |
| B200 | 192GB | $5.98 | $4,365 | $52,380 |
| RTX 4090 | 24GB | $0.34 | $248 | $2,976 |

Pricing from the RunPod official API as of March 21, 2026. Lambda runs 30-50% higher; AWS and Azure carry similar premiums.


Cost-Per-TFLOP Analysis

Raw throughput isn't useful without cost context. Cost-per-TFLOP reveals which GPU gives best compute bang-for-buck.

Calculation Method

Cost-per-TFLOP = Hourly rate / Peak TFLOPS

| GPU | $/Hour | FP32 TFLOPS | $/TFLOP/hr | Efficiency Rank |
| --- | --- | --- | --- | --- |
| A100 | $1.19 | 19.5 | $0.061 | 5th |
| H100 | $1.99 | 60 | $0.033 | 2nd (tied) |
| H200 | $3.59 | 79 | $0.045 | 4th |
| B200 | $5.98 | 180 | $0.033 | 2nd (tied) |
| RTX 4090 | $0.34 | 82.6 | $0.004 | 1st |

Surprise result: RTX 4090 is most efficient per TFLOP. But it only has 24GB VRAM, limiting training workloads.

For practical training (large models, large batches), A100 is most efficient after accounting for memory constraints.
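
The cost-per-TFLOP ranking above is straightforward to reproduce. A minimal sketch, using the rates and FP32 figures from this article's tables (the function name is illustrative):

```python
# Rank GPUs by rental cost per TFLOP-hour of peak FP32 compute,
# using the $/hour and TFLOPS figures from the tables above.
gpus = {
    "A100":     (1.19, 19.5),
    "H100":     (1.99, 60.0),
    "H200":     (3.59, 79.0),
    "B200":     (5.98, 180.0),
    "RTX 4090": (0.34, 82.6),
}

def cost_per_tflop(rate_per_hr: float, tflops: float) -> float:
    """Dollars per TFLOP of peak FP32 compute, per hour of rental."""
    return rate_per_hr / tflops

# Sort cheapest compute first; RTX 4090 comes out on top, A100 last.
ranked = sorted(gpus.items(), key=lambda kv: cost_per_tflop(*kv[1]))
for name, (rate, tflops) in ranked:
    print(f"{name:>8}: ${cost_per_tflop(rate, tflops):.3f}/TFLOP/hr")
```

Note that the raw ranking ignores memory: the 4090's win evaporates once a workload needs more than 24GB.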


A100: The Workhorse

Specs

80GB HBM2e memory. 2.0 TB/s bandwidth. 19.5 TFLOPS FP32. Ampere architecture (released 2020).

Cost Profile

$1.19/hour on RunPod (cheapest option with real large-scale capability). $869/month for continuous training.

Training Performance

A100 trains a 7B parameter model in ~24-36 hours on a single GPU. Scales well to 8x GPUs via NVLink, achieving 95%+ parallel efficiency. 80GB memory handles batch sizes 32-64 for 7B models, 8-16 for 13B models.

Strengths

  • Proven infrastructure. Every cloud provider has A100 inventory.
  • Cost-effective. Cheapest dollar-per-TFLOP in practical training scenarios.
  • Memory-bandwidth sweet spot. Enough for most fine-tuning.
  • NVLink efficiency. Multi-GPU setups scale near-linearly.
  • Availability. Easier to book 8x A100 than 8x H100.
  • Mature software ecosystem. CUDA 12, cuDNN 8.6+ fully optimized.

Weaknesses

  • Slow training for large models (30B+). Batch sizes limited by 80GB VRAM.
  • Bandwidth bottleneck. 2.0 TB/s ceiling limits throughput on attention layers.
  • 2020 architecture. Latest optimization tricks (e.g. FlashAttention-3, which targets Hopper) run less efficiently or not at all.
  • Slow on transformer layers with large sequence lengths (>2K tokens).

Best For

  • Fine-tuning existing models (Llama 7B/13B, Mistral)
  • Research on 7-13B scale models
  • Cost-sensitive projects with 2-4 week timelines
  • Batch sizes under 64 (per GPU)
  • Multi-GPU training where NVLink efficiency matters

H100: The Performance Leader

Specs

80GB HBM2e memory (PCIe variant). 2.0 TB/s bandwidth. 60 TFLOPS FP32. Hopper architecture (released 2023).

Cost Profile

$1.99/hour (PCIe) on RunPod. $1,453/month for continuous training.

Training Performance

H100 trains a 7B model in 8-12 hours on a single GPU, 3-4x faster than A100. An 8x H100 SXM cluster achieves 95%+ efficiency via NVLink at 900 GB/s. Batch sizes match the A100 (same 80GB VRAM), but throughput per batch is roughly 3x higher.

Training larger models (13B-70B) is practical. 70B model trains in ~60-80 hours on 8x H100 SXM.

Strengths

  • 3-4x faster than A100 for same cost structure.
  • NVLink on SXM variant (900 GB/s) enables true multi-GPU scaling.
  • Proven Hopper architecture. Mature software stack (CUDA 12, cuDNN 8.6+).
  • Inference speed. H100 also faster for batch serving (not just training).
  • Tensor Float 32 (TF32) precision optimizations for transformers.
  • Better memory latency hiding (Hopper improvement over Ampere).

Weaknesses

  • $1.99/hr minimum (67% more than A100).
  • Still 80GB VRAM limit. No advantage for memory-heavy models.
  • NVLink requires SXM variant (more expensive, $2.69/hr vs $1.99/hr).
  • PCIe variant (cheaper) loses NVLink efficiency on multi-GPU jobs.
  • Availability can be constrained during peak demand.

Best For

  • Production training with strict timelines
  • 13B-30B model fine-tuning
  • Teams prioritizing speed over cost
  • Multi-GPU clusters (8+ GPUs) where NVLink efficiency matters
  • Time-critical research projects

H200: The Next Generation

Specs

141GB HBM3e memory. 4.8 TB/s bandwidth. 79 TFLOPS FP32. Hopper variant (released 2024).

Cost Profile

$3.59/hour on RunPod. $2,621/month for continuous training.

Training Performance

H200 matches H100 computational throughput (both Hopper). The advantage is 141GB VRAM (76% more than H100). Enables:

  • Larger batch sizes: 128-256 per GPU (vs 32-64 on H100)
  • Longer sequence lengths: 8K+ context without pipeline parallelism
  • Bigger models: 70B fine-tuning on single GPU becomes practical

Bandwidth rises from 2.0 TB/s to 4.8 TB/s (2.4x). Memory-bound operations (attention, gradient accumulation) run up to 2.4x faster than on H100.
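
To see why capacity matters, here is a back-of-envelope VRAM estimate for full fine-tuning. The 16-bytes-per-parameter figure is a common rule of thumb for mixed-precision Adam (fp16 weights and gradients plus fp32 master weights and two optimizer moments); actual usage varies with activations and implementation:

```python
# Rough per-parameter memory costs (rule-of-thumb assumptions):
BYTES_PER_PARAM_FULL = 16   # full fine-tune: fp16 weights + grads,
                            # fp32 master weights + Adam m and v
BYTES_PER_PARAM_INFER = 2   # fp16 weights only (inference / frozen base)

def training_vram_gb(params_billions: float) -> float:
    """Optimizer-state VRAM for full fine-tuning, excluding activations."""
    return params_billions * 1e9 * BYTES_PER_PARAM_FULL / 1e9

for size in (7, 13, 30, 70):
    print(f"{size}B full fine-tune: ~{training_vram_gb(size):.0f} GB of states")
```

By this estimate a 7B full fine-tune already needs ~112 GB of states, and 70B needs ~1,120 GB; the H200's 141GB roughly covers fp16 weights for a 70B model (~140 GB), which is why single-GPU 70B work in practice leans on LoRA/QLoRA or quantization rather than raw capacity alone.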

Strengths

  • Memory advantage is real. Single-GPU training for 70B models.
  • Bandwidth for memory-bound ops (attention with long sequences).
  • Same per-token computation cost as H100 but trains larger models faster.
  • Future-proof. New models optimized for HBM3e bandwidth.
  • Long-context training becomes practical without complex sharding.

Weaknesses

  • 3x cost of A100 ($3.59 vs $1.19/hr).
  • Overkill for small models (<13B). Extra memory unused.
  • Inventory constraints. Fewer H200 available than A100/H100.
  • Limited software optimization yet. FlashAttention v3 and similar tools still maturing on H200.

Best For

  • Large model fine-tuning (30B-70B single GPU)
  • Research requiring long context windows (8K+)
  • Time-sensitive production projects with big budgets
  • Teams training for inference (batch size matters more than throughput)

B200: Large-Scale Flagship

Specs

192GB HBM3e memory. 8.0 TB/s bandwidth. 180 TFLOPS FP32. Blackwell architecture (released 2025, limited availability).

Cost Profile

$5.98/hour on RunPod. $4,365/month for continuous training. Availability extremely limited.

Training Performance

B200 is the newest hardware but not universally better. As with A100 vs H100, the premium buys capability per GPU rather than cheaper compute: its cost per TFLOP roughly matches H100. 192GB memory enables:

  • 405B-class models on a fraction of the GPUs an H100 cluster needs (still with sharding and offload)
  • Massive batch sizes (512+)
  • Full-model training without sharding

Bandwidth at 8.0 TB/s dominates for attention layers. Single-GPU attention training 5-10x faster than A100.

Strengths

  • Largest VRAM (192GB). Only option for very large models.
  • Fastest for memory-bound operations.
  • Latest architecture. Best long-term investment.
  • Better compute per watt despite the higher TDP (1,000W vs the H100 PCIe's 350W, for 3x the FP32 throughput).

Weaknesses

  • 3x cost of H100 ($5.98 vs $1.99/hr).
  • Not cheaper per TFLOP. Teams are paying for memory and bandwidth, not discounted compute.
  • Availability extremely scarce (March 2026).
  • Software ecosystem still maturing.
  • Power requirements massive (requires specialized infrastructure).

Best For

  • Training 70B+ models from scratch
  • Massive batch inference serving (1M+ req/day)
  • Enterprises with unlimited budgets
  • Foundation model development

RTX 4090: Budget Training

Specs

24GB GDDR6X memory. 1.0 TB/s bandwidth. 82.6 TFLOPS FP32. Ada architecture.

Cost Profile

$0.34/hour on RunPod. $248/month for continuous training.

Training Performance

RTX 4090 is designed for gaming, not training. But it's a legitimate option for:

  • Fine-tuning small models (3B-7B)
  • Prototype training before scaling
  • Cost-conscious research

24GB memory limits batch sizes to 8-16. Trains 7B model in 40-60 hours (4-5x slower than A100). Not suitable for larger models without gradient checkpointing and other memory tricks.
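
The "memory tricks" mentioned above mostly target activation memory. A rough sketch of the effect of sqrt-depth gradient checkpointing (the shapes and the one-tensor-per-layer simplification are illustrative assumptions, not measured numbers):

```python
import math

# Crude activation-memory model: one fp16 tensor of shape
# (batch, seq_len, hidden) kept per layer. Without checkpointing,
# all L layers' activations stay resident for the backward pass;
# with sqrt(L) checkpointing, only ~sqrt(L) live at once, at the
# cost of roughly one extra forward pass of recompute.
def activation_gb(batch, seq_len, hidden, n_layers,
                  bytes_per=2, checkpoint=False):
    per_layer = batch * seq_len * hidden * bytes_per
    layers_resident = math.sqrt(n_layers) if checkpoint else n_layers
    return per_layer * layers_resident / 1e9

# Assumed 7B-class config: hidden=4096, 32 layers, batch 16, 2K context.
full = activation_gb(16, 2048, 4096, 32)
ckpt = activation_gb(16, 2048, 4096, 32, checkpoint=True)
print(f"no checkpointing: ~{full:.1f} GB, with checkpointing: ~{ckpt:.1f} GB")
```

The several-fold reduction is what makes 7B fine-tuning plausible on a 24GB card at small batch sizes, traded against extra compute per step.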

Strengths

  • Cheap. $0.34/hr is 70% cheaper than A100.
  • Available everywhere (RTX 4090 is common).
  • Adequate for small models and fine-tuning.
  • Good price-to-TFLOP for inference.

Weaknesses

  • 24GB memory is tight. Limits model size and batch size.
  • GDDR6X bandwidth is 1.0 TB/s, half the A100's 2.0 TB/s HBM2e.
  • No NVLink. Multi-GPU training via PCIe is slow.
  • Not designed for 24/7 training (gaming hardware).
  • GDDR6X memory is consumer-grade, less reliable for long training runs.

Best For

  • Budget-first experiments
  • Fine-tuning 3B-7B models
  • Prototyping before scaling to A100
  • Teaching/learning (low stakes)

Multi-GPU Interconnect Analysis

NVLink is NVIDIA's GPU-to-GPU interconnect. On SXM parts it provides up to 900 GB/s per GPU (H100/H200 SXM; Ampere-era A100 SXM is lower, per the interconnect table below). This is critical for multi-GPU training because gradients and activations must be exchanged between GPUs constantly.

8x A100 SXM with NVLink: 95%+ parallel efficiency. Each GPU works on roughly 1/8 of the model, and gradient exchange happens at NVLink speeds rather than over PCIe. Training throughput: roughly 2-4 hours for a 7B model, 20-40 hours for 70B (see the multi-GPU table below).

8x H100 PCIe (no NVLink): 60-70% parallel efficiency. PCIe 5.0 provides 256 GB/s bandwidth. Gradient communication slower. More idle time waiting for network. Training throughput: similar per-hour cost but slower wall-clock time.

Per-GPU rates are similar, but delivered throughput is not: H100 PCIe multi-GPU jobs stall on communication, while H100 SXM with NVLink scales cleanly.

For large-scale training (30B+ models), NVLink efficiency cuts training time by 30-50%. This matters when wall-clock time is critical.
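
The efficiency gap can be expressed as effective throughput. A small sketch using the figures quoted above (~95% parallel efficiency for NVLink SXM clusters, ~65% for PCIe-only; both are this article's estimates, not benchmarks):

```python
# Effective cluster throughput = per-GPU peak × GPU count × parallel
# efficiency. Efficiency figures follow the text above.
def effective_tflops(per_gpu_tflops: float, n_gpus: int,
                     efficiency: float) -> float:
    return per_gpu_tflops * n_gpus * efficiency

nvlink = effective_tflops(60, 8, 0.95)  # 8x H100 SXM with NVLink
pcie   = effective_tflops(60, 8, 0.65)  # 8x H100 PCIe, no NVLink
print(f"8x H100 SXM (NVLink): {nvlink:.0f} effective TFLOPS")
print(f"8x H100 PCIe:         {pcie:.0f} effective TFLOPS")
print(f"NVLink advantage:     {nvlink / pcie:.2f}x")
```

At these assumptions the SXM cluster delivers about 1.46x the work per wall-clock hour, which compounds into the 30-50% time savings cited above.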

Interconnect Comparison

| Interconnect | Bandwidth | Latency | Best For |
| --- | --- | --- | --- |
| PCIe 4.0 | 64 GB/s | ~1-2 µs | Single-GPU, small clusters |
| PCIe 5.0 | 256 GB/s | ~0.5-1 µs | 2-4 GPU clusters |
| NVLink (Ampere) | 200 GB/s per GPU | ~0.2 µs | 8+ GPU clusters (A100) |
| NVLink (Hopper) | 900 GB/s per GPU | ~0.1 µs | 8+ GPU clusters (H100/H200) |
| NVLink (Blackwell) | 1.8 TB/s per GPU | ~0.05 µs | 16+ GPU clusters (B200) |

H100 NVLink (900 GB/s) is 3.5x faster than PCIe 5.0. This compounds across many GPUs.

Cost of Multi-GPU Training

8x A100 SXM, training a 7B model (24 hours):

  • Cost: $1.39 × 8 × 24 = $267
  • Throughput: ~100M tokens/hour (combined across 8 GPUs)
  • Total tokens: 2.4B tokens trained
  • Cost per 1B tokens trained: $111

8x H100 SXM, training same 7B model (12 hours):

  • Cost: $2.69 × 8 × 12 = $258
  • Throughput: ~200M tokens/hour (combined)
  • Total tokens: 2.4B tokens trained
  • Cost per 1B tokens trained: $108

Similar total cost, but H100 trains in half the time. For time-sensitive projects, H100 wins. For budget-constrained projects, A100 wins.
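
The cost-per-billion-tokens comparison above reduces to a few lines. The throughput figures are the article's estimates, not benchmarks:

```python
# Cost per billion training tokens for a multi-GPU cluster, reproducing
# the 8x A100 vs 8x H100 comparison above.
def cost_per_billion_tokens(rate_per_gpu_hr: float, n_gpus: int,
                            hours: float, tokens_per_hour: float) -> float:
    total_cost = rate_per_gpu_hr * n_gpus * hours
    total_tokens = tokens_per_hour * hours
    return total_cost / (total_tokens / 1e9)

a100 = cost_per_billion_tokens(1.39, 8, 24, 100e6)  # 8x A100 SXM
h100 = cost_per_billion_tokens(2.69, 8, 12, 200e6)  # 8x H100 SXM
print(f"A100: ${a100:.0f} per 1B tokens")
print(f"H100: ${h100:.0f} per 1B tokens")
```

Because hours cancel against tokens, the metric is really (cluster hourly cost) / (tokens per hour): near-identical here, which is why the choice hinges on wall-clock time rather than budget.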


Training Time Estimates

Single-GPU Training Times (Approximate)

| Model | A100 (80GB) | H100 (80GB) | H200 (141GB) | B200 (192GB) |
| --- | --- | --- | --- | --- |
| 7B (1 epoch, 1B tokens) | 12-24 hrs | 3-8 hrs | 2-5 hrs | 1-2 hrs |
| 13B (1 epoch, 2B tokens) | 24-48 hrs | 8-16 hrs | 5-12 hrs | 2-5 hrs |
| 30B (1 epoch, 5B tokens) | 60-120 hrs | 20-40 hrs | 12-24 hrs | 5-10 hrs |
| 70B (1 epoch, 10B tokens) | 150-300 hrs | 50-100 hrs | 30-60 hrs | 10-20 hrs |

Times assume:

  • Standard transformer training loop
  • Batch size appropriate to model size
  • No gradient checkpointing or other tricks
  • Single precision (FP32)

Multi-GPU Training Times (8x Cluster)

Add 10-20% overhead for gradient synchronization and communication.

| Model | 8x A100 | 8x H100 | 8x H200 |
| --- | --- | --- | --- |
| 7B | 2-4 hrs | 0.5-1 hr | 0.3-0.6 hr |
| 13B | 3-6 hrs | 1-2 hrs | 0.6-1.5 hrs |
| 30B | 8-15 hrs | 3-5 hrs | 1.5-3 hrs |
| 70B | 20-40 hrs | 7-12 hrs | 4-8 hrs |
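
The multi-GPU numbers follow from the single-GPU times plus the 10-20% synchronization overhead noted above. A minimal sketch (the 18-hour single-GPU figure is an assumed example within the A100 range from the single-GPU table):

```python
# Naive multi-GPU scaling: divide single-GPU time by GPU count, then
# add 10-20% for gradient synchronization and communication.
def cluster_hours(single_gpu_hours: float, n_gpus: int,
                  overhead: float = 0.15) -> float:
    return single_gpu_hours / n_gpus * (1 + overhead)

# e.g. a 7B run that takes ~18 hrs on one A100:
print(f"~{cluster_hours(18, 8):.1f} hrs on 8x A100")
```

With 15% overhead this lands around 2.6 hours, inside the table's 2-4 hour band; a fixed overhead fraction is itself an approximation, since communication cost grows with cluster size.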

Use Case Recommendations

Fine-Tuning a 7B Model on Custom Data

Start with A100 or H100.

A100: $1.19/hr × 24 hrs = $28.56, training a 7B model in 24-36 hours.
H100: $1.99/hr × 8 hrs ≈ $16, training the same model 3-4x faster.

If the team has a week, A100 is fine. If results are needed within 24 hours, use H100.

Training 70B Model from Scratch

Need multi-GPU setup. H200 or B200 on single machine, or 8x H100/H200 cluster.

8x H100 SXM: $2.69 × 8 × 200 hrs = $4,304. Trains 70B LLaMA in ~200 hours.
2x H200: $3.59 × 2 × 150 hrs = $1,077, but two GPUs deliver roughly a third of the 8x H100 cluster's aggregate compute, so expect far longer wall-clock times despite the lower bill.
1x B200: Not practical. 70B weights plus gradients and optimizer state far exceed 192GB.

For budget: 8x H100. For speed: 8x H200 or 16x H100.

Research Project with Tight Budget

Use A100 or RTX 4090 clusters.

4x RTX 4090: $0.34 × 4 × 48 hrs = $65.28. Fine-tunes small models in the prototype phase.
4x A100: $1.19 × 4 × 48 hrs = $228.48. Trains larger models faster, at 3.5x the cost.


Real-World Training Scenarios

Scenario 1: Fine-Tune Llama 7B on Internal Documentation (24 Hours)

Assumptions:

  • 10M tokens of custom training data
  • Batch size 32
  • 4 epochs

Single A100: ~24 hours, $1.19 × 24 = $28.56
Single H100: ~6 hours, $1.99 × 6 ≈ $12

Despite the higher hourly rate, the H100 run finishes so much sooner that it costs less overall.

Scenario 2: Train 13B Model from Scratch (1 Week Timeline)

Assumptions:

  • 100B tokens corpus
  • Batch size 256
  • 1 epoch

4x A100 SXM: $1.39 × 4 × 168 hrs = $934
4x H100 SXM: $2.69 × 4 × 84 hrs = $904

Similar cost. H100 finishes in 3.5 days vs 7 days for A100. H100 more valuable if iteration speed matters.

Scenario 3: Continuous Fine-Tuning Service (Monthly)

Assume 50 fine-tuning jobs per month, 10 hours each, 7B models.

A100: $1.19 × 50 × 10 = $595/month
H100: $1.99 × 50 × 10 = $995/month
H200: $3.59 × 50 × 10 = $1,795/month

A100 is cost-effective for continuous workloads. H100 only if time-to-value matters more than monthly spend.

Scenario 4: Large Foundation Model Pre-training (405B LLaMA)

Need large cluster. B200 or 16x H100.

1x B200: Can't fit 405B with optimizer state. Need at least 2x B200, in practice more, with sharding and offload.
2x B200: $5.98 × 2 × 400 hrs = $4,784 (rough estimate for 405B from scratch)
16x H100: $2.69 × 16 × 500 hrs = $21,520 (same model, ~4.5x the cost)

B200 wins for massive foundation models, but only with a distributed training framework that can shard the model and optimizer state across GPUs.


FAQ

Should I use A100 or H100 for fine-tuning? A100 for cost efficiency. H100 if you need results fast and can afford the $0.80/hr premium. For most teams, A100 is fine for fine-tuning.

Is H200 worth 3x the cost of A100? Only if you're training models over 30B parameters or need long-context training. For 7B-13B, A100 is sufficient.

Can I mix GPU types in a cluster? Not recommended. Heterogeneous clusters (A100 + H100) have efficiency loss due to uneven throughput. Use homogeneous clusters.

What about AMD MI300X? AMD is cheaper ($1.50-2.00/hr) but software ecosystem is immature. CUDA is standard. Use NVIDIA unless ROI from cost savings justifies AMD risk.

How much faster is H100 than A100? Per-GPU throughput is 3-4x higher. Wall-clock training time is 50-70% faster. Scales non-linearly with cluster size due to communication overhead.

Is B200 future-proof for training? Yes, but at premium cost. Unless you're training 405B-class models, H200 or H100 is sufficient for 2026-2028.

What's the break-even point between buying and renting? Rent if under 500 GPU-hours/month. Buy if over 1,500 GPU-hours/month across your fleet. Break-even: roughly 17 months on a $15K A100 at the $1.19/hr rate above; sooner against pricier hyperscaler rates.
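
A straight-line payback sketch at the RunPod rate quoted earlier. This ignores power, hosting, resale value, and depreciation, all of which shift the real figure, and payback comes faster against higher hyperscaler rates:

```python
# Months until a purchased GPU's cost equals cumulative rental spend.
def break_even_months(purchase_price: float, rental_rate_hr: float,
                      usage_hrs_per_month: float) -> float:
    return purchase_price / (rental_rate_hr * usage_hrs_per_month)

# Continuous 24/7 use (~730 hrs/month) at the $1.19/hr A100 rate:
months = break_even_months(15_000, 1.19, 730)
print(f"~{months:.1f} months to break even")
```

At $1.19/hr this works out to just over 17 months; lighter usage (a few hundred GPU-hours a month) pushes payback past three years, which is why renting wins for intermittent workloads.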

Should I use spot instances to save cost? Spot instances can save 50-70% on hourly rate. But interruption risk is high during peak demand. Use for batch jobs, not interactive training.


