Contents
- How Many GPUs to Train an LLM: How Many Do Developers Need
- GPU Memory Requirements
- Distributed Training Strategies
- Training Time Calculations
- Cost Analysis by Model Size
- GPU Recommendation Matrix
- Optimization Techniques
- FAQ
- Related Resources
- Sources
How Many GPUs to Train an LLM: How Many Do Developers Need
How many GPUs you need to train an LLM depends on model size, context length, batch size, and dataset size. A single H100 handles 7B-13B models fine; 70B+ needs 8-64 GPUs in a distributed setup. This guide breaks down the math: memory, time, and cost as of March 2026.
GPU Memory Requirements
Memory Formula
Training memory = Parameters + Gradients + Optimizer State + Activations
Rough calculation (model size = the FP16 weight footprint, i.e. 2 bytes per parameter):
- Parameters (FP16): Model size (GB)
- Gradients (FP16): Model size (GB)
- Optimizer state (Adam moments in FP16): 2x Model size (GB)
- Activations: scale with Batch size x Seq length x Hidden dim x Layers
Total VRAM ≈ 4x Model Size (GB) + Batch Memory (with FP32 master weights and moments, the multiplier is closer to 8x)
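As a sanity check, the 4x rule can be turned into a small estimator. This is an illustrative sketch, not a profiler: it assumes FP16 weights, gradients, and Adam moments, plus a crude linear activation term; the `hidden` and `layers` defaults are placeholder values, not any particular model's configuration.

```python
def training_vram_gb(params_b, batch_tokens=0, hidden=4096, layers=32,
                     bytes_per_activation=2):
    """Rough training-VRAM estimate (GB) for params_b billion parameters.

    Assumes FP16 weights (2 B/param), FP16 gradients (2 B/param), and
    FP16 Adam moments (4 B/param) -- the 4x rule above. With FP32
    master weights and moments, the multiplier moves toward 8x.
    """
    weights = 2 * params_b       # GB, since params_b is in billions
    grads = 2 * params_b
    optimizer = 4 * params_b
    # Crude activation term: tokens in flight x hidden dim x layers x bytes
    activations = batch_tokens * hidden * layers * bytes_per_activation / 1e9
    return weights + grads + optimizer + activations

print(round(training_vram_gb(7)))   # 56  -> matches 4 x 14 GB for a 7B model
print(round(training_vram_gb(70)))  # 560 -> why 70B cannot train on one GPU
```

Plug in batch_tokens = batch size x sequence length to see how quickly activations dominate at long context.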
Model-Specific Memory
| Model | Size | FP16 VRAM (batch=1) | FP16 VRAM (batch=32) | Quantization Impact |
|---|---|---|---|---|
| Mistral 7B | 7B | 14GB | 28GB | 8-bit: 7GB, 4-bit: 3.5GB |
| Llama 2 13B | 13B | 26GB | 52GB+ | 8-bit: 13GB, 4-bit: 6.5GB |
| Llama 2 70B | 70B | 140GB | 260GB+ | 8-bit: 70GB, 4-bit: 35GB |
| Llama 3.1 405B | 405B | 810GB | 1,200GB+ | 8-bit: 405GB, 4-bit: 202.5GB |
Single GPU Training Limits
| GPU | VRAM | Recommended Models |
|---|---|---|
| RTX 4090 | 24GB | Mistral 7B (quantized) |
| A10 | 24GB | Mistral 7B (quantized) |
| A100 40GB | 40GB | Llama 2 7B (batch=2-4) |
| A100 80GB | 80GB | Mistral 7B (batch=32) or Llama 2 13B (batch=8) |
| H100 80GB | 80GB | Llama 2 13B (batch=16) or Llama 2 70B (4-bit quantization + LoRA only) |
| H200 141GB | 141GB | Llama 2 70B (batch=16) |
Distributed Training Strategies
Strategy 1: Data Parallel (DP)
Replicate model across GPUs, split batch.
When to use: 2-8 GPUs, models <70B parameters
Memory requirement: Same as single GPU (model fits on one GPU)
Speedup: Near-linear (85-95% efficiency with 8 GPUs)
Example: 4xA100 for Mistral 7B training
- Batch size per GPU: 32
- Total batch: 128
- Training time: ~115 days for 100B tokens (at 2,500 TPS per GPU)
Strategy 2: Tensor Parallelism (TP)
Shard model layers across GPUs.
When to use: 8-64 GPUs, models 70B+
Memory requirement: Model size / Number of GPUs
Speedup: 70-85% efficiency (communication overhead grows)
Example: 8xH100 for Llama 2 70B
- 70B model / 8 GPUs = 8.75B per GPU
- Memory per GPU: 35GB (includes activations/optimizer)
- Training time: ~41 days for 100B tokens (at 3,500 TPS per GPU)
Strategy 3: Pipeline Parallelism (PP)
Partition model by layers across GPUs.
When to use: 16-128 GPUs, models 70B+
Memory requirement: Model size / Number of GPUs + activation buffers
Speedup: 50-70% efficiency (bubble overhead)
Example: 16xH100 for Llama 3.1 405B
- 405B model / 16 GPUs = 25.3B per GPU
- Memory per GPU: 80GB
- Training time: 10-15 days for 100B tokens
Strategy 4: Fully Sharded Data Parallel (FSDP)
Shard parameters, gradients, and optimizer state across data-parallel workers (ZeRO-3 style), gathering weights layer by layer as needed. PyTorch native.
When to use: 8-256 GPUs, any model size
Memory requirement: Model size / Number of GPUs + per-GPU batch activations
Speedup: 75-90% efficiency
Example: 32xH100 for Llama 3.1 405B
- 405B model / 32 GPUs = 12.7B per GPU
- Memory per GPU: 80GB
- Training time: 3-5 days for 100B tokens
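The shard arithmetic in the three examples above is the same division each time; a quick helper makes it explicit. This is illustrative only — real shards also carry activations, buffers, and communication workspace on top of the parameter slice.

```python
def params_per_gpu_b(total_params_b, num_gpus):
    # Tensor, pipeline, and FSDP sharding all split the parameters
    # (roughly evenly) across GPUs.
    return total_params_b / num_gpus

def fp16_weight_gb_per_gpu(total_params_b, num_gpus):
    # FP16 weight bytes per GPU: 2 bytes per parameter.
    return 2 * total_params_b / num_gpus

# The shard sizes quoted in the examples above:
print(round(params_per_gpu_b(70, 8), 2))          # 8.75 B params/GPU (8-GPU TP)
print(round(params_per_gpu_b(405, 16), 1))        # 25.3 B params/GPU (16-GPU PP)
print(round(params_per_gpu_b(405, 32), 1))        # 12.7 B params/GPU (32-GPU FSDP)
print(round(fp16_weight_gb_per_gpu(405, 32), 1))  # 25.3 GB of weights per GPU
```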
Training Time Calculations
Base Formula
Training time (seconds) ≈ (6 x Parameters x Tokens) / (GPUs x GPU FLOPS x Utilization)
The factor of 6 approximates the FLOPs per parameter per token (forward pass plus a roughly 2x-heavier backward pass).
Simplified calculation: Training time ≈ Tokens / (GPUs x Tokens per second per GPU)
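Both forms as code, for back-of-envelope use. The 6-FLOPs-per-parameter-per-token constant is the common rule of thumb; the `efficiency` and `utilization` values are assumptions you must supply, and the two forms only agree when the assumed TPS is consistent with what the GPU can actually sustain for that model size.

```python
def train_seconds_tps(tokens, gpus, tps_per_gpu, efficiency=1.0):
    # Simplified throughput form: total tokens over cluster tokens/sec.
    return tokens / (gpus * tps_per_gpu * efficiency)

def train_seconds_flops(params, tokens, gpus, flops_per_gpu, utilization):
    # Compute form: ~6 FLOPs per parameter per token
    # (forward pass plus a roughly 2x-heavier backward pass).
    return 6 * params * tokens / (gpus * flops_per_gpu * utilization)

# 70B model, 1T tokens, 8 GPUs at 3,500 tokens/sec each:
print(round(train_seconds_tps(1e12, 8, 3500) / 86400))  # 413 days
```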
Tokens Per Second (TPS) by GPU (approximate; actual throughput varies heavily with model size, sequence length, and parallelism)
| GPU | Single GPU | Notes |
|---|---|---|
| H100 | 3,500 TPS | With tensor parallelism |
| A100 | 2,500 TPS | SXM variant recommended |
| H200 | 4,000 TPS | Newest, faster memory |
| A10 | 1,200 TPS | Entry-level datacenter GPU; not recommended for production training |
| RTX 4090 | 1,500 TPS | Good for 7B models only |
Example: Training Llama 2 70B
Scenario: 1 trillion token training, batch size 2,048
Calculation:
- Model: 70B parameters
- Dataset: 1 trillion tokens
- Batch size: 2,048 (distributed across GPUs)
- Tokens per GPU second: 2,500 (A100) or 3,500 (H100)
Option 1: 8xA100
- Total throughput: 8 x 2,500 = 20,000 tokens/second
- Training time: 1,000,000,000,000 / 20,000 = 50 million seconds
- Hours: 13,889 hours = 579 days (continuous)
- Cost (8 GPUs at $1.39/GPU-hour): 13,889 x 8 x $1.39 ≈ $154,400
Option 2: 8xH100
- Total throughput: 8 x 3,500 = 28,000 tokens/second
- Training time: 35.7 million seconds = 9,921 hours = 413 days
- Cost (8 GPUs at $2.69/GPU-hour): 9,921 x 8 x $2.69 ≈ $213,500
Option 3: 64xH100 (Pipeline + Tensor Parallel)
- Total throughput: 64 x 3,500 = 224,000 tokens/second
- Efficiency: 75% (parallelism overhead)
- Effective throughput: 168,000 tokens/second
- Training time: 5.95 million seconds = 1,653 hours = 69 days
- Cost (64 GPUs at $2.69/GPU-hour): 1,653 x 64 x $2.69 ≈ $284,700
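The three options can be reproduced with one helper. It assumes per-GPU-hour billing (the RunPod-style pricing the tables below use); days and dollars are rounded, and the efficiency factor is the parallelism-overhead assumption from Option 3.

```python
def cluster_days_and_cost(tokens, gpus, tps_per_gpu, usd_per_gpu_hour,
                          efficiency=1.0):
    seconds = tokens / (gpus * tps_per_gpu * efficiency)
    hours = seconds / 3600
    cost = hours * gpus * usd_per_gpu_hour  # billed per GPU-hour
    return hours / 24, cost

for gpus, tps, rate, eff in [(8, 2500, 1.39, 1.0),     # Option 1: 8xA100
                             (8, 3500, 2.69, 1.0),     # Option 2: 8xH100
                             (64, 3500, 2.69, 0.75)]:  # Option 3: 64xH100
    days, cost = cluster_days_and_cost(1e12, gpus, tps, rate, eff)
    print(f"{gpus} GPUs: {days:.0f} days, ${cost:,.0f}")
```

Note how total spend rises with GPU count (parallelism overhead buys nothing for free) while wall-clock time collapses.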
Cost Analysis by Model Size
Mistral 7B Training (100B tokens)
| Config | GPUs | TPS | Days | Cost/Day | Total Cost |
|---|---|---|---|---|---|
| 1xH100 | 1 | 3.5K | 331 | $64 | $21,184 |
| 2xH100 | 2 | 6.5K | 178 | $129 | $22,962 |
| 4xH100 | 4 | 12K | 96 | $259 | $24,864 |
Llama 2 13B Training (100B tokens)
| Config | GPUs | TPS | Days | Cost/Day | Total Cost |
|---|---|---|---|---|---|
| 1xH100 | 1 | 3.2K | 362 | $64 | $23,168 |
| 4xH100 | 4 | 11K | 105 | $259 | $27,195 |
| 8xH100 | 8 | 20K | 58 | $518 | $30,044 |
Llama 2 70B Training (1T tokens)
| Config | GPUs | TPS | Days | Cost/Day | Total Cost |
|---|---|---|---|---|---|
| 8xH100 | 8 | 28K | 413 | $518 | $213,934 |
| 16xH100 | 16 | 50K | 231 | $1,036 | $239,316 |
| 32xH100 | 32 | 90K | 130 | $2,072 | $269,360 |
| 64xH100 | 64 | 168K | 69 | $4,144 | $285,936 |
Costs computed using RunPod H100 SXM pricing ($2.69/hour).
GPU Recommendation Matrix
| Model | Dataset | Recommended GPUs | Alternative | Notes |
|---|---|---|---|---|
| Mistral 7B | 50B tokens | 2xH100 | 1xA100 | Minimal viable |
| Mistral 7B | 300B tokens | 4xH100 | 8xA100 | Research-grade |
| Llama 2 7B | 100B tokens | 1xH100 | 2xA100 | Single-GPU feasible |
| Llama 2 13B | 100B tokens | 4xH100 | 8xA100 | Tensor parallelism helps |
| Llama 2 70B | 1T tokens | 32xH100 | 64xA100 | Distributed essential |
| Llama 3.1 405B | 15T tokens | 256xH100 | 512xA100 | Enterprise-scale |
Optimization Techniques
Technique 1: Gradient Accumulation
Train with smaller batches, accumulate gradients across steps.
Impact: Reduces per-GPU batch size by 4-16x while maintaining effective batch size
Memory savings: 25-50%
Example:
- Without accumulation: batch 32, memory 80GB
- With accumulation: micro-batch 8, 4 accumulation steps (effective batch 32), memory 40GB
- Same effective batch size, same convergence, roughly half the memory
Trade-off: Training step time increases 10-30% (overhead)
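The equivalence is easy to verify numerically. This framework-free sketch uses a hypothetical one-parameter linear model: averaging the gradients of four micro-batches of 8 reproduces the gradient of one batch of 32.

```python
def grad_mse(w, batch):
    # d/dw of mean((w*x - y)^2) for a one-parameter linear model
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(i), 3.0 * i + 1.0) for i in range(32)]  # toy targets y = 3x + 1
w = 0.5

full_grad = grad_mse(w, data)                       # one batch of 32
accum_grad = sum(grad_mse(w, data[i:i + 8])         # 4 micro-batches of 8,
                 for i in range(0, 32, 8)) / 4      # gradients averaged

print(abs(full_grad - accum_grad) < 1e-9)  # True
```

Only the activations for 8 samples are ever alive at once, which is exactly where the memory savings come from.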
Technique 2: Mixed Precision (FP32 + FP16)
Run most operations in FP16 while keeping master weights and loss scaling in FP32.
Impact: roughly 2x memory savings on activations and weights, minimal quality loss
Implementation: Automatic via torch.cuda.amp (torch.amp in recent PyTorch releases)
Example:

```python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()              # scales the loss to avoid FP16 gradient underflow
optimizer.zero_grad()
with autocast():                   # runs eligible ops in FP16
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()      # backward pass on the scaled loss
scaler.step(optimizer)             # unscales gradients, skips the step on inf/NaN
scaler.update()
```

Quality loss: typically <1% relative to full-FP32 training
Technique 3: Flash Attention
Memory-efficient attention implementation.
Impact: 50-60% memory reduction for attention, 20-30% speedup
Memory reduction: standard attention materializes score matrices of size Batch x Heads x Seq length^2 per layer; Flash Attention computes attention in on-chip tiles without ever storing them, cutting attention memory by 60%+
Example: 2K context window training
- Standard attention: 1.6GB per GPU
- Flash Attention: 0.64GB per GPU
- Savings: 960MB per GPU, ≈7.7GB across an 8-GPU cluster
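The score-matrix memory that Flash Attention avoids can be computed directly. The batch and head counts below are illustrative assumptions, not the (unstated) configuration behind the figures above.

```python
def attn_scores_gb(batch, heads, seq_len, bytes_per_el=2):
    # Standard attention materializes a (batch x heads x seq x seq)
    # FP16 score matrix per layer; Flash Attention computes the same
    # result in on-chip tiles and never stores it.
    return batch * heads * seq_len ** 2 * bytes_per_el / 1e9

# Hypothetical config: batch 4, 32 heads, FP16
print(round(attn_scores_gb(4, 32, 2048), 2))  # 1.07 GB per layer at 2K context
print(round(attn_scores_gb(4, 32, 8192), 1))  # 17.2 GB per layer at 8K context
```

The quadratic growth in sequence length is why Flash Attention matters most for long-context training.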
Technique 4: Activation Checkpointing
Recompute activations instead of storing.
Impact: 30-40% memory reduction, 20% slowdown
Trade-off: extra GPU compute vs GPU memory (activations are recomputed during the backward pass)
Use when: memory-constrained (e.g. single-node training with a large batch size or long context)
Skip when: compute-bound with memory to spare; the ~20% recompute overhead then just slows training
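A simplified memory model shows the trade: store only every k-th layer boundary and recompute the k layers in between during backward. Real frameworks (e.g. torch.utils.checkpoint) differ in detail, and the per-layer-GB figure here is a placeholder.

```python
def activation_gb(layers, per_layer_gb, checkpoint_every=1):
    # Stored: one boundary activation per checkpoint segment.
    stored = (layers / checkpoint_every) * per_layer_gb
    # Peak extra memory while recomputing one segment during backward.
    peak_recompute = checkpoint_every * per_layer_gb
    return stored + peak_recompute

print(activation_gb(32, 1.0))                      # 33.0 GB (k=1: no saving)
print(activation_gb(32, 1.0, checkpoint_every=8))  # 12.0 GB
# The optimum sits near k = sqrt(layers), at the cost of roughly
# one extra forward pass.
```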
FAQ
Can I train a 70B model on a single H100? Not for full training: the FP16 weights alone (140GB) exceed the 80GB of VRAM, so single-GPU work on 70B is limited to 4-bit quantization plus LoRA-style fine-tuning. Even setting memory aside, 1T tokens at single-GPU throughput would take years. Practical minimum: 8xH100, which brings 1T tokens to roughly 14 months; 32-64 GPUs gets it down to 2-4 months.
What's the minimum number of GPUs for distributed training to make sense? For data parallelism, as few as 2: efficiency stays at 85-95% up to 8 GPUs. Model sharding (tensor, pipeline, or FSDP) becomes worthwhile at 8+ GPUs, once the model no longer fits on a single device.
Does distributed training cost more than single-GPU training? Slightly, in raw GPU-hours: with per-GPU-hour billing, 8 GPUs for 52 days costs about the same as 1 GPU for 413 days, plus a 10-25% parallelism-overhead premium (visible in the cost tables above). In practice the ~8x faster wall-clock time is worth far more than that premium.
Can I mix different GPU types (H100 + A100) in distributed training? Possible but inefficient. Slower GPU becomes bottleneck. All GPUs must have similar performance for >85% efficiency.
How much does dataset quality affect GPU count needed? High-quality data converges 20-30% faster (fewer tokens needed to reach a target loss); poor data requires 20-30% more tokens. The GPU count stays the same, but training time and cost change accordingly.
Is it better to train one big model or many small models? One large model (70B) is more efficient than multiple 13B models on same cluster. Single large model amortizes distributed training overhead better.
Can I reduce GPU count by using longer training time? Yes, but wall-clock time explodes. At ~3,500 TPS per H100, Llama 2 70B on 1T tokens takes roughly 9 years on 1 GPU, 14 months on 8 GPUs, and about 2 months on 64. Beyond 32-64 GPUs, other bottlenecks (I/O, network, data loading) start to dominate.
What's the cost to train a small fine-tuned 7B model? Fine-tuning Mistral 7B on 100K examples (~1B tokens) on a single H100 takes about 3.3 days at 3,500 TPS, roughly $210 at $2.69/hour. A practical option for startups.
Related Resources
- Complete GPU Pricing Guide
- Best GPU for LLM Training
- Cheapest GPT-4 Alternative
- Fine-Tuning Guide
- RunPod GPU Pricing
- H100 Specifications and Benchmarks
- A100 Specifications