Contents
- Best GPU for LLM Training: Overview
- GPU Specification Comparison
- Pricing and Hourly Cost
- Cost-Per-TFLOP Analysis
- A100: The Workhorse
- H100: The Performance Leader
- H200: The Next Generation
- B200: Large-Scale Flagship
- RTX 4090: Budget Training
- Multi-GPU Interconnect Analysis
- Training Time Estimates
- Use Case Recommendations
- Real-World Training Scenarios
- FAQ
- Related Resources
- Sources
Best GPU for LLM Training: Overview
Choosing a training GPU comes down to three variables: model size, budget, and deadline.
The A100 is the proven, cost-effective 80GB workhorse. The H100 is roughly 3-4x faster, adds 900 GB/s NVLink on the SXM variant, and cuts wall-clock training time by 50-70%. The H200 brings 141GB of memory for larger models and longer contexts. The B200 is the newest flagship at roughly 5x the H100's hourly cost.
Budget-conscious startups should default to the A100; time-critical teams should pay for the H100. Match the GPU to the model size and the budget.
GPU Specification Comparison
| Specification | A100 PCIe | H100 PCIe | H200 | B200 | RTX 4090 |
|---|---|---|---|---|---|
| Memory (VRAM) | 80GB | 80GB | 141GB | 192GB | 24GB |
| Memory Type | HBM2e | HBM2e | HBM3e | HBM3e | GDDR6X |
| Bandwidth | 2.0 TB/s | 2.0 TB/s | 4.8 TB/s | 8.0 TB/s | 1.0 TB/s |
| FP32 Throughput | 19.5 TFLOPS | 60 TFLOPS | 79 TFLOPS | 180 TFLOPS | 82.6 TFLOPS |
| Tensor Float 32 | 156 TFLOPS | 1.4 PFLOPS | 1.8 PFLOPS | 5.3 PFLOPS | 1.3 PFLOPS |
| TDP | 250W | 350W | 575W | 1,000W | 575W |
| Architecture | Ampere | Hopper | Hopper | Blackwell | Ada |
| NVLink Support | Yes (600 GB/s) | Yes (900 GB/s) | Yes (900 GB/s) | Yes (1.8 TB/s) | No (PCIe only) |
| Release Year | 2020 | 2023 | 2024 | 2025 | 2022 |
Data from NVIDIA datasheets and DeployBase GPU database as of March 21, 2026.
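Before comparing prices, it helps to know which of these cards can even hold a given training run. A common rule of thumb (an assumption, not a benchmark): full fine-tuning with Adam in mixed precision needs roughly 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights and two optimizer moments), before activations. This is why single-GPU runs of 7B+ models on 80GB cards in practice lean on gradient checkpointing, optimizer offload, or parameter-efficient methods like LoRA, which need far less.

```python
# Back-of-envelope VRAM check against the spec table above.
# bytes_per_param=16 is an assumed rule of thumb for full fine-tuning
# with Adam in mixed precision; LoRA/QLoRA need far less.
GPUS_GB = {"A100": 80, "H100": 80, "H200": 141, "B200": 192, "RTX 4090": 24}

def training_footprint_gb(params_billion: float, bytes_per_param: int = 16) -> float:
    """Approximate memory for weights + grads + Adam state (no activations)."""
    return params_billion * 1e9 * bytes_per_param / 1e9  # GB

def gpus_that_fit(params_billion: float) -> list:
    """GPUs whose VRAM covers the footprint without memory tricks."""
    need = training_footprint_gb(params_billion)
    return [name for name, gb in GPUS_GB.items() if gb >= need]

if __name__ == "__main__":
    for size in (1, 7, 13):
        need = training_footprint_gb(size)
        print(f"{size}B params -> ~{need:.0f} GB; fits on: {gpus_that_fit(size)}")
```

For a 7B model this estimates ~112GB, which only the H200 and B200 cover outright; smaller footprints fit everywhere.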
Pricing and Hourly Cost
Single-GPU Cloud Rental Prices (RunPod On-Demand)
| GPU | VRAM | $/Hour | $/Month (730 hrs) | Annual |
|---|---|---|---|---|
| A100 PCIe | 80GB | $1.19 | $869 | $10,428 |
| A100 SXM | 80GB | $1.39 | $1,015 | $12,180 |
| H100 PCIe | 80GB | $1.99 | $1,453 | $17,440 |
| H100 SXM | 80GB | $2.69 | $1,964 | $23,568 |
| H200 | 141GB | $3.59 | $2,621 | $31,452 |
| B200 | 192GB | $5.98 | $4,365 | $52,380 |
| RTX 4090 | 24GB | $0.34 | $248 | $2,976 |
Pricing from RunPod official API as of March 21, 2026. Lambda pricing is 30-50% higher. AWS and Azure similarly premium.
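The monthly and annual columns above follow directly from the hourly rate and the article's 730-hour month. A quick sketch (rates copied from the table; small rounding differences from the printed annual figures are expected):

```python
# Reproduce the $/Month and Annual columns from the hourly rate,
# using the article's 730-hour month. Rates are from the table above.
RATES = {"A100 PCIe": 1.19, "H100 PCIe": 1.99, "H200": 3.59,
         "B200": 5.98, "RTX 4090": 0.34}

def monthly(rate_per_hr: float, hours: int = 730) -> float:
    """Cost of running one GPU continuously for a 730-hour month."""
    return rate_per_hr * hours

def annual(rate_per_hr: float) -> float:
    return monthly(rate_per_hr) * 12

for gpu, rate in RATES.items():
    print(f"{gpu}: ${monthly(rate):,.0f}/mo, ${annual(rate):,.0f}/yr")
```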
Cost-Per-TFLOP Analysis
Raw throughput isn't useful without cost context. Cost-per-TFLOP reveals which GPU gives best compute bang-for-buck.
Calculation Method
Cost-per-TFLOP = Hourly rate / Peak TFLOPS
| GPU | $/Hour | FP32 TFLOPS | $/TFLOP/hr | Efficiency Rank |
|---|---|---|---|---|
| A100 | $1.19 | 19.5 | $0.061 | 5th |
| H100 | $1.99 | 60 | $0.033 | 2nd (tied) |
| H200 | $3.59 | 79 | $0.045 | 4th |
| B200 | $5.98 | 180 | $0.033 | 2nd (tied) |
| RTX 4090 | $0.34 | 82.6 | $0.004 | 1st |
Surprise result: RTX 4090 is most efficient per TFLOP. But it only has 24GB VRAM, limiting training workloads.
For practical training (large models, large batches), the comparison shifts to the 80GB-class data-center cards: the A100 has the lowest hourly rate, while the H100 delivers the best compute per dollar.
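The $/TFLOP column can be recomputed and ranked in a few lines (rates and FP32 figures taken from the tables above):

```python
# Recompute $/TFLOP/hr = hourly rate / peak FP32 TFLOPS, then sort
# ascending (cheapest compute first). Figures are from the tables above.
SPECS = {  # gpu: (usd_per_hour, fp32_tflops)
    "A100": (1.19, 19.5),
    "H100": (1.99, 60.0),
    "H200": (3.59, 79.0),
    "B200": (5.98, 180.0),
    "RTX 4090": (0.34, 82.6),
}

ranked = sorted(SPECS.items(), key=lambda kv: kv[1][0] / kv[1][1])
for gpu, (rate, tflops) in ranked:
    print(f"{gpu}: ${rate / tflops:.3f}/TFLOP/hr")
```

Running this confirms the ordering: RTX 4090 first, H100 and B200 nearly tied, then H200, with the A100 last on raw compute per dollar.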
A100: The Workhorse
Specs
80GB HBM2e memory. 2.0 TB/s bandwidth. 19.5 TFLOPS FP32. Ampere architecture (released 2020).
Cost Profile
$1.19/hour on RunPod (cheapest option with real large-scale capability). $869/month for continuous training.
Training Performance
A100 trains a 7B parameter model in ~24-36 hours on a single GPU. Scales well to 8x GPUs via NVLink, achieving 95%+ parallel efficiency. 80GB memory handles batch sizes 32-64 for 7B models, 8-16 for 13B models.
Strengths
- Proven infrastructure. Every cloud provider has A100 inventory.
- Cost-effective. Lowest hourly rate of any 80GB data-center GPU.
- Memory-bandwidth sweet spot. Enough for most fine-tuning.
- NVLink efficiency. Multi-GPU setups scale near-linearly.
- Availability. Easier to book 8x A100 than 8x H100.
- Mature software ecosystem. CUDA 12, cuDNN 8.6+ fully optimized.
Weaknesses
- Slow training for large models (30B+). Batch sizes limited by 80GB VRAM.
- Bandwidth bottleneck. 2.0 TB/s ceiling limits throughput on attention layers.
- 2020 architecture. Newer optimizations (e.g. FlashAttention v3, which targets Hopper) don't benefit Ampere.
- Slow on transformer layers with large sequence lengths (>2K tokens).
Best For
- Fine-tuning existing models (Llama 7B/13B, Mistral)
- Research on 7-13B scale models
- Cost-sensitive projects with 2-4 week timelines
- Batch sizes under 64 (per GPU)
- Multi-GPU training where NVLink efficiency matters
H100: The Performance Leader
Specs
80GB HBM2e memory (PCIe variant). 2.0 TB/s bandwidth. 60 TFLOPS FP32. Hopper architecture (released 2023).
Cost Profile
$1.99/hour (PCIe) on RunPod. $1,453/month for continuous training.
Training Performance
H100 trains a 7B model in 8-12 hours on a single GPU, 3-4x faster than an A100. An 8x H100 SXM cluster achieves 95%+ parallel efficiency via NVLink at 900 GB/s. Batch sizes match the A100 (same 80GB VRAM), but compute throughput per batch is roughly 3x higher.
Training larger models (13B-70B) is practical. 70B model trains in ~60-80 hours on 8x H100 SXM.
Strengths
- 3-4x faster than A100 for same cost structure.
- NVLink on SXM variant (900 GB/s) enables true multi-GPU scaling.
- Proven Hopper architecture. Mature software stack (CUDA 12, cuDNN 8.6+).
- Inference speed. H100 also faster for batch serving (not just training).
- Tensor Float 32 (TF32) precision optimizations for transformers.
- Better memory latency hiding (Hopper improvement over Ampere).
Weaknesses
- $1.99/hr minimum (67% more than A100).
- Still 80GB VRAM limit. No advantage for memory-heavy models.
- NVLink requires SXM variant (more expensive, $2.69/hr vs $1.99/hr).
- PCIe variant (cheaper) loses NVLink efficiency on multi-GPU jobs.
- Availability can be constrained during peak demand.
Best For
- Production training with strict timelines
- 13B-30B model fine-tuning
- Teams prioritizing speed over cost
- Multi-GPU clusters (8+ GPUs) where NVLink efficiency matters
- Time-critical research projects
H200: The Next Generation
Specs
141GB HBM3e memory. 4.8 TB/s bandwidth. 79 TFLOPS FP32. Hopper variant (released 2024).
Cost Profile
$3.59/hour on RunPod. $2,621/month for continuous training.
Training Performance
H200 matches H100 computational throughput (both Hopper). The advantage is 141GB VRAM (76% more than H100). Enables:
- Larger batch sizes: 128-256 per GPU (vs 32-64 on H100)
- Longer sequence lengths: 8K+ context without pipeline parallelism
- Bigger models: 70B fine-tuning on a single GPU becomes practical with parameter-efficient methods (full fine-tuning of 70B still exceeds 141GB)
Bandwidth rises 2.4x over the H100 PCIe's 2.0 TB/s, to 4.8 TB/s. Memory-bound operations (attention, gradient accumulation) can run up to 2.4x faster than on an H100.
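For a purely memory-bound kernel, runtime is approximately bytes moved divided by bandwidth, so the H100-to-H200 speedup on such operations is simply the bandwidth ratio. A sketch with an illustrative (assumed) working-set size:

```python
# time ≈ bytes moved / bandwidth for memory-bound kernels.
# Bandwidth figures are from the spec table; 40 GB is an illustrative
# working set (e.g. weights + a large KV cache read once), not a benchmark.
BW_TBPS = {"A100": 2.0, "H100 PCIe": 2.0, "H200": 4.8, "B200": 8.0}

def membound_time_ms(gigabytes_moved: float, gpu: str) -> float:
    """Lower-bound time to stream `gigabytes_moved` through memory once."""
    return gigabytes_moved / (BW_TBPS[gpu] * 1000) * 1000  # ms

gb = 40
for gpu in ("H100 PCIe", "H200"):
    print(f"{gpu}: {membound_time_ms(gb, gpu):.1f} ms per pass")
print(f"speedup: {BW_TBPS['H200'] / BW_TBPS['H100 PCIe']:.1f}x")
```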
Strengths
- Memory advantage is real. Single-GPU fine-tuning of 70B-class models (with memory-efficient methods).
- Bandwidth for memory-bound ops (attention with long sequences).
- Same per-token computation cost as H100 but trains larger models faster.
- Future-proof. New models optimized for HBM3e bandwidth.
- Long-context training becomes practical without complex sharding.
Weaknesses
- 3x cost of A100 ($3.59 vs $1.19/hr).
- Overkill for small models (<13B). Extra memory unused.
- Inventory constraints. Fewer H200 available than A100/H100.
- Software stack still maturing. FlashAttention v3 and similar kernels are not yet fully tuned for H200.
Best For
- Large model fine-tuning (30B-70B single GPU)
- Research requiring long context windows (8K+)
- Time-sensitive production projects with big budgets
- Teams training for inference (batch size matters more than throughput)
B200: Large-Scale Flagship
Specs
192GB HBM3e memory. 8.0 TB/s bandwidth. 180 TFLOPS FP32. Blackwell architecture (released 2025, limited availability).
Cost Profile
$5.98/hour on RunPod. $4,365/month for continuous training. Availability extremely limited.
Training Performance
B200 is the newest hardware, but not universally better. As with the A100-to-H100 jump, the gain is less about per-token speed and more about capability per GPU. 192GB memory enables:
- 405B-class models on far fewer GPUs than an equivalent H100 cluster
- Massive batch sizes (512+)
- Full-model training without sharding
Bandwidth at 8.0 TB/s (4x the A100's 2.0 TB/s) dominates for attention layers; memory-bound attention steps can run several times faster than on an A100.
Strengths
- Largest VRAM (192GB). Only option for very large models.
- Fastest for memory-bound operations.
- Latest architecture. Best long-term investment.
- Better performance per watt despite the higher TDP (1,000W vs the H100 PCIe's 350W, with far more compute delivered per watt).
Weaknesses
- 5x cost of H100 ($5.98 vs $1.99/hr).
- Not 5x faster. Teams are paying for memory, not speed.
- Availability extremely scarce (March 2026).
- Software ecosystem still maturing.
- Power requirements massive (requires specialized infrastructure).
Best For
- Training 70B+ models from scratch
- Massive batch inference serving (1M+ req/day)
- Enterprises with unlimited budgets
- Foundation model development
RTX 4090: Budget Training
Specs
24GB GDDR6X memory. 1.0 TB/s bandwidth. 82.6 TFLOPS FP32. Ada architecture.
Cost Profile
$0.34/hour on RunPod. $248/month for continuous training.
Training Performance
RTX 4090 is designed for gaming, not training. But it's a legitimate option for:
- Fine-tuning small models (3B-7B)
- Prototype training before scaling
- Cost-conscious research
24GB memory limits batch sizes to 8-16. Trains a 7B model in 40-60 hours (roughly 2-4x slower than an A100). Not suitable for larger models without gradient checkpointing and other memory tricks.
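One of those memory tricks is gradient accumulation: when 24GB caps the per-step micro-batch, you run several forward/backward passes and only then take an optimizer step, recovering a larger effective batch. The bookkeeping is trivial (pure-Python sketch, no framework dependency; batch sizes are the article's figures):

```python
# Gradient accumulation: effective_batch = micro_batch * accumulation_steps.
# ceil division so the target batch is always reached or exceeded.
def accumulation_steps(target_batch: int, micro_batch: int) -> int:
    """How many micro-batches to accumulate per optimizer step."""
    return -(-target_batch // micro_batch)  # ceil division

# e.g. emulate an A100-style batch of 64 with the 4090's micro-batch of 8:
print(accumulation_steps(64, 8))  # 8 accumulation steps per update
```

The trade-off is wall-clock time: eight accumulation passes take roughly eight times as long as one large-batch step would on a bigger card.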
Strengths
- Cheap. $0.34/hr is 70% cheaper than A100.
- Available everywhere (RTX 4090 is common).
- Adequate for small models and fine-tuning.
- Good price-to-TFLOP for inference.
Weaknesses
- 24GB memory is tight. Limits model size and batch size.
- GDDR6X bandwidth is 1.0 TB/s, half the A100's HBM2e (2.0 TB/s).
- No NVLink. Multi-GPU training via PCIe is slow.
- Not designed for 24/7 training (gaming hardware).
- GDDR6X memory is consumer-grade (no ECC), riskier for long unattended training runs.
Best For
- Budget-first experiments
- Fine-tuning 3B-7B models
- Prototyping before scaling to A100
- Teaching/learning (low stakes)
Multi-GPU Interconnect Analysis
NVLink Efficiency Matters
NVLink is NVIDIA's GPU-to-GPU interconnect, providing 900 GB/s per GPU on H100 SXM and 600 GB/s on A100 SXM. This is critical for multi-GPU training because gradients and activations must be exchanged between GPUs constantly.
8x A100 SXM with NVLink: 95%+ parallel efficiency. Each GPU works on roughly 1/8 of the model, and gradient communication runs at NVLink speeds (600 GB/s). Wall-clock time: 8-12 hours for a 7B model, 60-80 hours for a 70B model.
8x H100 PCIe (no NVLink): 60-70% parallel efficiency. PCIe 5.0 provides roughly 64 GB/s per direction over an x16 link, so gradient communication is far slower and GPUs spend more time idle waiting on transfers.
The hourly cost is similar; the efficiency is not. H100 PCIe multi-GPU is slow. H100 SXM with NVLink is fast.
For large-scale training (30B+ models), NVLink efficiency cuts training time by 30-50%. This matters when wall-clock time is critical.
Interconnect Comparison
| Interconnect | Bandwidth | Latency | Best For |
|---|---|---|---|
| PCIe 4.0 (x16) | ~32 GB/s per direction | ~1-2 µs | Single-GPU, small clusters |
| PCIe 5.0 (x16) | ~64 GB/s per direction | ~0.5-1 µs | 2-4 GPU clusters |
| NVLink (Ampere) | 600 GB/s per GPU | ~0.2 µs | 8+ GPU clusters (A100) |
| NVLink (Hopper) | 900 GB/s per GPU | ~0.1 µs | 8+ GPU clusters (H100/H200) |
| NVLink (Blackwell) | 1.8 TB/s per GPU | ~0.05 µs | 16+ GPU clusters (B200) |
H100 NVLink (900 GB/s) is an order of magnitude faster than PCIe 5.0, and the advantage compounds across many GPUs.
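To see why the interconnect dominates at scale, consider a ring all-reduce of the gradient tensor: each GPU moves roughly 2(N-1)/N times the gradient size per step. A sketch with assumed parameters (fp16 gradients, bandwidth figures from the table; this is an idealized lower bound, not a benchmark):

```python
# Ring all-reduce traffic per GPU ≈ 2*(N-1)/N * gradient bytes.
# Dividing by link bandwidth gives an idealized per-step sync time.
def allreduce_seconds(params_billion: float, n_gpus: int,
                      link_gb_per_s: float, bytes_per_grad: int = 2) -> float:
    """Approximate time for one ring all-reduce of fp16 gradients."""
    grad_bytes = params_billion * 1e9 * bytes_per_grad
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return traffic / (link_gb_per_s * 1e9)

for name, bw in (("NVLink (Hopper, 900 GB/s)", 900.0),
                 ("PCIe 5.0 x16 (~64 GB/s)", 64.0)):
    t = allreduce_seconds(7, 8, bw)
    print(f"7B grads over 8 GPUs, {name}: {t:.3f} s per step")
```

Synchronizing 7B fp16 gradients across 8 GPUs takes tens of milliseconds over NVLink but hundreds over PCIe, and that gap repeats every optimizer step.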
Cost of Multi-GPU Training
8x A100 SXM, training a 7B model (24 hours):
- Cost: $1.39 × 8 × 24 = $267
- Throughput: ~100M tokens/hour (combined across 8 GPUs)
- Total tokens: 2.4B tokens trained
- Cost per 1B tokens trained: $111
8x H100 SXM, training same 7B model (12 hours):
- Cost: $2.69 × 8 × 12 = $258
- Throughput: ~200M tokens/hour (combined)
- Total tokens: 2.4B tokens trained
- Cost per 1B tokens trained: $108
Similar total cost, but H100 trains in half the time. For time-sensitive projects, H100 wins. For budget-constrained projects, A100 wins.
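The per-token cost math above generalizes to any cluster: multiply rate, GPU count, and hours, then divide by tokens trained. Using the figures from the two runs above:

```python
# Cost per billion tokens = (rate * gpus * hours) / tokens_billion.
# Inputs are the article's 8x A100 SXM and 8x H100 SXM runs.
def cost_per_billion_tokens(rate_per_hr: float, n_gpus: int,
                            hours: float, tokens_billion: float) -> float:
    return rate_per_hr * n_gpus * hours / tokens_billion

a100 = cost_per_billion_tokens(1.39, 8, 24, 2.4)  # 8x A100 SXM, 24 hrs
h100 = cost_per_billion_tokens(2.69, 8, 12, 2.4)  # 8x H100 SXM, 12 hrs
print(f"A100: ${a100:.0f}/B tokens, H100: ${h100:.0f}/B tokens")
```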
Training Time Estimates
Single-GPU Training Times (Approximate)
| Model | A100 (80GB) | H100 (80GB) | H200 (141GB) | B200 (192GB) |
|---|---|---|---|---|
| 7B (1 epoch, 1B tokens) | 12-24 hrs | 3-8 hrs | 2-5 hrs | 1-2 hrs |
| 13B (1 epoch, 2B tokens) | 24-48 hrs | 8-16 hrs | 5-12 hrs | 2-5 hrs |
| 30B (1 epoch, 5B tokens) | 60-120 hrs | 20-40 hrs | 12-24 hrs | 5-10 hrs |
| 70B (1 epoch, 10B tokens) | 150-300 hrs | 50-100 hrs | 30-60 hrs | 10-20 hrs |
Times assume:
- Standard transformer training loop
- Batch size appropriate to model size
- No gradient checkpointing or other tricks
- Single precision (FP32)
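Published training-time figures vary enormously with precision, batch size, and achieved utilization, so it helps to have a formula to sanity-check them. A standard back-of-envelope is C ≈ 6·N·D FLOPs for N parameters and D tokens; dividing by peak throughput times an assumed utilization ("MFU") gives a rough lower bound on wall-clock time. The MFU value below is an assumption, and real runs (including the table above, which reflects optimized setups) can land far from this estimate:

```python
# Training compute estimate: C ≈ 6 * N * D FLOPs (N params, D tokens).
# Wall-clock ≈ C / (peak TFLOPS * assumed MFU). MFU=0.4 is an assumption.
def train_hours(params_b: float, tokens_b: float,
                peak_tflops: float, mfu: float = 0.4) -> float:
    flops = 6 * params_b * 1e9 * tokens_b * 1e9
    return flops / (peak_tflops * 1e12 * mfu) / 3600

# e.g. 7B params, 1B tokens on an A100 using TF32 tensor cores (156 TFLOPS):
print(f"{train_hours(7, 1, 156):.0f} hours (assumed 40% MFU)")
```

Doubling the tokens or parameters doubles the estimate; moving to a faster precision or a higher-MFU stack shrinks it proportionally.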
Multi-GPU Training Times (8x Cluster)
Add 10-20% overhead for gradient synchronization and communication.
| Model | 8x A100 | 8x H100 | 8x H200 |
|---|---|---|---|
| 7B | 2-4 hrs | 0.5-1 hr | 0.3-0.6 hr |
| 13B | 3-6 hrs | 1-2 hrs | 0.6-1.5 hrs |
| 30B | 8-15 hrs | 3-5 hrs | 1.5-3 hrs |
| 70B | 20-40 hrs | 7-12 hrs | 4-8 hrs |
Use Case Recommendations
Fine-Tuning a 7B Model on Custom Data
Start with A100 or H100.
- A100: $1.19/hr × 24 hrs = $28.56. Trains a 7B model in 24-36 hours.
- H100: $1.99/hr × 8 hrs = $15.92. Trains the same model 3-4x faster.
With a week of slack, the A100 is fine. For results within 24 hours, use the H100.
Training 70B Model from Scratch
Need multi-GPU setup. H200 or B200 on single machine, or 8x H100/H200 cluster.
- 8x H100 SXM: $2.69 × 8 × 200 hrs = $4,304. Trains 70B LLaMA in ~200 hours.
- 8x H200: $3.59 × 8 × 150 hrs = $4,308. Trains the same model in ~150 hours.
- 1x B200: not practical. A single GPU can't hold a 70B model plus optimizer state for from-scratch training.
For budget: 8x H100. For speed: 8x H200 or 16x H100.
Research Project with Tight Budget
Use A100 or RTX 4090 clusters.
- 4x RTX 4090: $0.34 × 4 × 48 hrs = $65.28. Fine-tunes small models in the prototype phase.
- 4x A100: $1.19 × 4 × 48 hrs = $228.48. Trains larger models faster, at 3.5x the cost.
Real-World Training Scenarios
Scenario 1: Fine-Tune Llama 7B on Internal Documentation (24 Hours)
Assumptions:
- 10M tokens of custom training data
- Batch size 32
- 4 epochs
- Single A100: ~24 hours, $1.19 × 24 = $28.56
- Single H100: ~6 hours, $1.99 × 6 = $11.94
H100 is cheaper by wall-clock time despite higher hourly rate.
Scenario 2: Train 13B Model from Scratch (1 Week Timeline)
Assumptions:
- 100B tokens corpus
- Batch size 256
- 1 epoch
- 4x A100 SXM: $1.39 × 4 × 168 hrs = $934
- 4x H100 SXM: $2.69 × 4 × 84 hrs = $904
Similar cost. H100 finishes in 3.5 days vs 7 days for A100. H100 more valuable if iteration speed matters.
Scenario 3: Continuous Fine-Tuning Service (Monthly)
Assume 50 fine-tuning jobs per month, 10 hours each, 7B models.
- A100: $1.19 × 50 × 10 = $595/month
- H100: $1.99 × 50 × 10 = $995/month
- H200: $3.59 × 50 × 10 = $1,795/month
A100 is cost-effective for continuous workloads. H100 only if time-to-value matters more than monthly spend.
Scenario 4: Large Foundation Model Pre-training (405B LLaMA)
Need large cluster. B200 or 16x H100.
- 1x B200: can't fit 405B with optimizer state. At least 2x B200 needed.
- 2x B200: $5.98 × 2 × 400 hrs = $4,784 (rough estimate for 405B from scratch)
- 16x H100: $2.69 × 16 × 500 hrs = $21,520 (same model, ~4.5x the cost)
B200 wins for massive foundation models, but only with a distributed training framework that supports model and optimizer sharding.
FAQ
Should I use A100 or H100 for fine-tuning? A100 for cost efficiency. H100 if you need results fast and can afford $0.80/hr premium. For most teams, A100 is fine for fine-tuning.
Is H200 worth 3x the cost of A100? Only if you're training models over 30B parameters or need long-context training. For 7B-13B, A100 is sufficient.
Can I mix GPU types in a cluster? Not recommended. Heterogeneous clusters (A100 + H100) have efficiency loss due to uneven throughput. Use homogeneous clusters.
What about AMD MI300X? AMD is cheaper ($1.50-2.00/hr) but software ecosystem is immature. CUDA is standard. Use NVIDIA unless ROI from cost savings justifies AMD risk.
How much faster is H100 than A100? Per-GPU throughput is 3-4x higher. Wall-clock training time is 50-70% faster. Scales non-linearly with cluster size due to communication overhead.
Is B200 future-proof for training? Yes, but at premium cost. Unless you're training 405B-class models, H200 or H100 is sufficient for 2026-2028.
What's the break-even point between buying and renting? Rent if under 500 GPU-hours/month. Buy if over 1,500 GPU-hours/month (continuous 24/7 use). Break-even: roughly 17 months of continuous rental against a $15K A100 at $1.19/hr.
Should I use spot instances to save cost? Spot instances can save 50-70% on hourly rate. But interruption risk is high during peak demand. Use for batch jobs, not interactive training.
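The rent-vs-buy arithmetic is simple to sketch: divide the purchase price by the rental burn rate to get months of continuous use at break-even. The $15K A100 street price is the article's figure, and this ignores power, hosting, and resale value, all of which shift the answer:

```python
# Break-even months = purchase price / (hourly rate * hours per month).
# $15K A100 price is the article's figure; power/colo costs are ignored.
def breakeven_months(purchase_usd: float, rate_per_hr: float,
                     hours_per_month: int = 730) -> float:
    return purchase_usd / (rate_per_hr * hours_per_month)

print(f"{breakeven_months(15_000, 1.19):.1f} months")  # A100 at $1.19/hr
```

At partial utilization the break-even stretches proportionally, which is why renting wins for anything under a few hundred GPU-hours per month.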
Related Resources
- GPU Pricing Comparison
- AI Image Generation GPU Guide
- Cheapest GPT-4 Alternative
- AI Infrastructure for Startups
- RunPod GPU Pricing