How Many GPUs Do You Need to Train an LLM?

Deploybase · November 12, 2025 · LLM Guides


How many GPUs you need to train an LLM depends on model size, context length, batch size, and dataset scale. A single H100 can train a 7B-13B model; 70B and larger calls for 8-64 GPUs in a distributed setup. This guide breaks down the math: memory, time, and cost, with pricing as of November 2025.

GPU Memory Requirements

Memory Formula

Training memory = Weights + Gradients + Optimizer State + Activations

Rough calculation (per parameter):

  • Weights (FP16): 2 bytes
  • Gradients (FP16): 2 bytes
  • Optimizer state (Adam, FP16 moment buffers): 2 x 2 bytes
  • Activations: scales with Batch size x Seq length x Hidden dim x Layers

Total VRAM ≈ 4x FP16 weight size + activation memory. A full-precision Adam setup (FP32 moments plus FP32 master weights) needs roughly double.
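As a sanity check, the rule of thumb above fits in a few lines of Python. This is a rough sketch: `training_vram_gb` is a name invented here, it assumes FP16 Adam moment buffers, and it ignores fragmentation and framework overhead.

```python
def training_vram_gb(params_billions: float, activation_gb: float = 0.0) -> float:
    """Rough FP16 training footprint: weights + gradients + two Adam
    moment buffers (2 bytes per parameter each), plus activation memory."""
    weights = params_billions * 2      # FP16 weights: 2 bytes per parameter
    gradients = weights                # FP16 gradients, same size
    optimizer = 2 * weights           # Adam: two moment buffers
    return weights + gradients + optimizer + activation_gb

# Mistral 7B: 14 GB of weights, ~56 GB total before activations
```

A full-precision Adam configuration (FP32 moments plus FP32 master weights) roughly doubles these figures.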

Model-Specific Memory

| Model | Size | FP16 VRAM (batch=1) | FP16 VRAM (batch=32) | Quantization impact |
|---|---|---|---|---|
| Mistral 7B | 7B | 14GB | 28GB | 8-bit: 7GB, 4-bit: 3.5GB |
| Llama 2 13B | 13B | 26GB | 52GB+ | 8-bit: 13GB, 4-bit: 6.5GB |
| Llama 2 70B | 70B | 140GB | 260GB+ | 8-bit: 70GB, 4-bit: 35GB |
| Llama 3 405B | 405B | 810GB | 1,200GB+ | 8-bit: 405GB, 4-bit: 202.5GB |

Single GPU Training Limits

| GPU | VRAM | Recommended models |
|---|---|---|
| RTX 4090 | 24GB | Mistral 7B (quantized) |
| A10 | 24GB | Mistral 7B (quantized) |
| A100 40GB | 40GB | Llama 2 7B (batch=2-4) |
| A100 80GB | 80GB | Mistral 7B (batch=32) or Llama 2 13B (batch=8) |
| H100 80GB | 80GB | Llama 2 13B (batch=16) or Llama 2 70B (with techniques) |
| H200 141GB | 141GB | Llama 2 70B (batch=16) |

Distributed Training Strategies

Strategy 1: Data Parallel (DP)

Replicate model across GPUs, split batch.

When to use: 2-8 GPUs, models <70B parameters

Memory requirement: Same as single GPU (model fits on one GPU)

Speedup: Near-linear (85-95% efficiency with 8 GPUs)

Example: 4xA100 for Mistral 7B training

  • Batch size per GPU: 32
  • Total batch: 128
  • Training time: ~115 days for 100B tokens (at ~2,500 tokens/s per A100)

Strategy 2: Tensor Parallelism (TP)

Shard model layers across GPUs.

When to use: 8-64 GPUs, models 70B+

Memory requirement: Model size / Number of GPUs

Speedup: 70-85% efficiency (communication overhead grows)

Example: 8xH100 for Llama 2 70B

  • 70B model / 8 GPUs = 8.75B per GPU
  • Memory per GPU: 35GB (includes activations/optimizer)
  • Training time: ~41 days for 100B tokens (at ~3,500 tokens/s per H100)
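The shard arithmetic above fits in a tiny helper. A sketch with a hypothetical name (`tp_shard`), modeling an ideal even split with no replicated layers:

```python
def tp_shard(params_billions: float, num_gpus: int) -> tuple[float, float]:
    """Per-GPU parameter shard (billions) and its FP16 weight footprint
    (GB) under an ideal tensor-parallel split."""
    shard = params_billions / num_gpus
    return shard, shard * 2  # FP16: 2 bytes per parameter

# Llama 2 70B on 8 GPUs: 8.75B parameters and 17.5 GB of FP16 weights
# per GPU; gradients, optimizer state, and activations bring the total
# toward the ~35 GB figure above
```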

Strategy 3: Pipeline Parallelism (PP)

Partition model by layers across GPUs.

When to use: 16-128 GPUs, models 70B+

Memory requirement: Model size / Number of GPUs + activation buffers

Speedup: 50-70% efficiency (bubble overhead)

Example: 16xH100 for Llama 3 405B

  • 405B model / 16 GPUs = 25.3B per GPU
  • Memory per GPU: 80GB
  • Training time: ~35 days for 100B tokens (at ~3,500 tokens/s per H100 and ~60% pipeline efficiency)

Strategy 4: Fully Sharded Data Parallel (FSDP)

Shard parameters, gradients, and optimizer state across data-parallel workers (ZeRO-3 style), so each GPU holds only a slice of the model at rest. PyTorch native.

When to use: 8-256 GPUs, any model size

Memory requirement: Model size / Number of GPUs + per-GPU batch activations

Speedup: 75-90% efficiency

Example: 32xH100 for Llama 3 405B

  • 405B model / 32 GPUs = 12.7B per GPU
  • Memory per GPU: 80GB
  • Training time: ~13 days for 100B tokens (at ~3,500 tokens/s per H100 and ~80% efficiency)

Training Time Calculations

Base Formula

Training compute ≈ 6 x Parameters x Tokens (FLOPs: ~2 for the forward pass, ~4 for the backward pass), so:

Training time (seconds) = (6 x Parameters x Tokens) / (GPUs x GPU FLOPS x Utilization)

Simplified calculation: Training time (seconds) ≈ Tokens / (GPUs x Tokens per second per GPU)
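Both estimates can be sketched in Python. Function names are invented here; `utilization` is the fraction of peak FLOPS actually achieved, often 30-50% in practice.

```python
def training_hours(params: float, tokens: float, num_gpus: int,
                   tflops_per_gpu: float, utilization: float = 0.4) -> float:
    """Wall-clock hours from the ~6 FLOPs/parameter/token rule of thumb."""
    total_flops = 6 * params * tokens
    effective_flops = num_gpus * tflops_per_gpu * 1e12 * utilization
    return total_flops / effective_flops / 3600

def training_hours_from_tps(tokens: float, num_gpus: int,
                            tps_per_gpu: float) -> float:
    """Simpler estimate from measured per-GPU token throughput."""
    return tokens / (num_gpus * tps_per_gpu) / 3600

# 1T tokens on 8 GPUs at 2,500 tokens/s each -> about 13,889 hours
```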

Tokens Per Second (TPS) by GPU

| GPU | Throughput (single GPU) | Notes |
|---|---|---|
| H100 | 3,500 TPS | With tensor parallelism |
| A100 | 2,500 TPS | SXM variant recommended |
| H200 | 4,000 TPS | Newest, faster memory |
| A10 | 1,200 TPS | Entry-level datacenter GPU; not recommended for production training |
| RTX 4090 | 1,500 TPS | Good for 7B models only |

Example: Training Llama 2 70B

Scenario: 1 trillion token training, batch size 2,048

Calculation:

  • Model: 70B parameters
  • Dataset: 1 trillion tokens
  • Batch size: 2,048 (distributed across GPUs)
  • Tokens per GPU second: 2,500 (A100) or 3,500 (H100)

Option 1: 8xA100

  • Total throughput: 8 x 2,500 = 20,000 tokens/second
  • Training time: 1,000,000,000,000 / 20,000 = 50 million seconds
  • Hours: 13,889 hours = 579 days (continuous)
  • Cost (8 GPUs at $1.39/hour each): 13,889 hours x 8 x $1.39 ≈ $154,000

Option 2: 8xH100

  • Total throughput: 8 x 3,500 = 28,000 tokens/second
  • Training time: 36 million seconds = 10,000 hours = 417 days
  • Cost (8 GPUs at $2.69/hour each): 10,000 hours x 8 x $2.69 ≈ $215,000

Option 3: 64xH100 (Pipeline + Tensor Parallel)

  • Total throughput: 64 x 3,500 = 224,000 tokens/second
  • Efficiency: 75% (parallelism overhead)
  • Effective throughput: 168,000 tokens/second
  • Training time: 6 million seconds = 1,667 hours = 69 days
  • Cost (64 GPUs at $2.69/hour each): 1,667 hours x 64 x $2.69 ≈ $287,000
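All three options follow the same arithmetic, so a small helper (names invented here) makes it reusable; total cost is hours times GPU count times the per-GPU hourly rate.

```python
def train_time_and_cost(tokens: float, num_gpus: int, tps_per_gpu: float,
                        usd_per_gpu_hour: float,
                        efficiency: float = 1.0) -> tuple[float, float]:
    """(hours, total USD) for a run, derating throughput by parallel efficiency."""
    throughput = num_gpus * tps_per_gpu * efficiency   # tokens/second
    hours = tokens / throughput / 3600
    return hours, hours * num_gpus * usd_per_gpu_hour

# Option 3: 1T tokens on 64xH100 at 75% efficiency, $2.69/GPU-hour
hours, cost = train_time_and_cost(1e12, 64, 3500, 2.69, efficiency=0.75)
# -> roughly 1,653 hours (~69 days) and ~$285,000
```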

Cost Analysis by Model Size

Mistral 7B Training (100B tokens)

| Config | GPUs | TPS | Days | Cost/Day | Total Cost |
|---|---|---|---|---|---|
| 1xH100 | 1 | 3.5K | 262 | $64 | $16,768 |
| 2xH100 | 2 | 6.5K | 142 | $129 | $18,318 |
| 4xH100 | 4 | 12K | 73 | $259 | $18,907 |

Llama 2 13B Training (100B tokens)

| Config | GPUs | TPS | Days | Cost/Day | Total Cost |
|---|---|---|---|---|---|
| 1xH100 | 1 | 3.2K | 283 | $64 | $18,112 |
| 4xH100 | 4 | 11K | 73 | $259 | $18,907 |
| 8xH100 | 8 | 20K | 41 | $518 | $21,238 |

Llama 2 70B Training (1T tokens)

| Config | GPUs | TPS | Days | Cost/Day | Total Cost |
|---|---|---|---|---|---|
| 8xH100 | 8 | 28K | 417 | $518 | $216,006 |
| 16xH100 | 16 | 50K | 231 | $1,036 | $239,316 |
| 32xH100 | 32 | 90K | 130 | $2,072 | $269,360 |
| 64xH100 | 64 | 168K | 69 | $4,144 | $285,936 |

Costs computed using RunPod H100 SXM pricing ($2.69/hour).

GPU Recommendation Matrix

| Model | Dataset | Recommended GPUs | Alternative | Notes |
|---|---|---|---|---|
| Mistral 7B | 50B tokens | 2xH100 | 1xA100 | Minimal viable |
| Mistral 7B | 300B tokens | 4xH100 | 8xA100 | Research-grade |
| Llama 2 7B | 100B tokens | 1xH100 | 2xA100 | Single-GPU feasible |
| Llama 2 13B | 100B tokens | 4xH100 | 8xA100 | Tensor parallelism helps |
| Llama 2 70B | 1T tokens | 32xH100 | 64xA100 | Distributed essential |
| Llama 3 405B | 15T tokens | 256xH100 | 512xA100 | Enterprise-scale |

Optimization Techniques

Technique 1: Gradient Accumulation

Train with smaller batches, accumulate gradients across steps.

Impact: Reduces per-GPU batch size by 4-16x while maintaining effective batch size

Memory savings: 25-50%

Example:

  • Without accumulation: batch 32, memory 80GB
  • With accumulation: micro-batch 8, accumulated over 4 steps, memory 40GB
  • Same effective batch size, same convergence, roughly half the activation memory

Trade-off: Training step time increases 10-30% (overhead)
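The key property, that averaging equal-sized micro-batch gradients reproduces the full-batch gradient, can be checked with a toy scalar model (pure Python; the numbers here are hypothetical):

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the scalar model y ~= w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 13.9, 16.0]
w = 0.5

full_grad = mse_grad(w, xs, ys)            # one batch of 8

accumulated = 0.0
for i in range(0, len(xs), 2):             # four micro-batches of 2
    accumulated += mse_grad(w, xs[i:i+2], ys[i:i+2])
accumulated /= 4                           # average before the optimizer step
# full_grad and accumulated agree to floating-point precision
```

Only the micro-batch activations are ever resident, which is where the memory saving comes from.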

Technique 2: Mixed Precision (FP32 + FP16)

Train in FP16, maintain select tensors in FP32.

Impact: 2x memory savings, minimal quality loss

Implementation: Automatic via torch.cuda.amp

Example:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for input, target in loader:
    optimizer.zero_grad()
    with autocast():
        output = model(input)
        loss = criterion(output, target)
    scaler.scale(loss).backward()  # scale the loss so FP16 gradients don't underflow
    scaler.step(optimizer)         # unscales gradients, then runs the optimizer step
    scaler.update()                # adjusts the loss scale for the next iteration

Quality loss: typically <1% versus full FP32 training for most models

Technique 3: Flash Attention

Memory-efficient attention implementation.

Impact: 50-60% memory reduction for attention, 20-30% speedup

Memory reduction: standard attention materializes a (Batch x Heads x Seq length^2) score matrix; Flash Attention computes it in tiles and never stores it in full

Example: 2K context window training

  • Standard attention: 1.6GB per GPU
  • Flash Attention: 0.64GB per GPU
  • Savings: ~960MB per GPU, ≈7.7GB across an 8-GPU cluster
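The quadratic term is easy to quantify. A sketch (the batch size and head count below are hypothetical, not taken from the example above):

```python
def attention_scores_gb(batch: int, heads: int, seq_len: int,
                        bytes_per_elem: int = 2) -> float:
    """FP16 size (GB) of the batch x heads x seq^2 score matrix that
    standard attention materializes and Flash Attention never stores."""
    return batch * heads * seq_len ** 2 * bytes_per_elem / 1e9

# Score memory grows with the square of context length:
# going from 2K to 8K context is a 16x jump
```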

Technique 4: Activation Checkpointing

Recompute activations instead of storing.

Impact: 30-40% memory reduction, 20% slowdown

Trade-off: GPU compute time vs GPU memory (activations are recomputed on the GPU during the backward pass)

Use when: Memory-constrained (single node training with high batch size)

Skip when: Compute is the bottleneck and memory is not (the extra forward pass slows every step by ~20%)
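A rough way to see the savings is the classic sqrt(L) checkpointing schedule: keep activations only at segment boundaries and recompute one segment at a time during the backward pass. A sketch; the function name and per-layer figure are hypothetical, and real schedules vary.

```python
import math

def peak_activation_gb(num_layers: int, per_layer_gb: float,
                       checkpointing: bool = False) -> float:
    """Peak activation memory for the backward pass. Without checkpointing,
    every layer's activations stay resident; with a sqrt(L) schedule, only
    segment boundaries plus the segment being recomputed are live."""
    if not checkpointing:
        return num_layers * per_layer_gb
    segments = math.ceil(math.sqrt(num_layers))
    per_segment = math.ceil(num_layers / segments)
    return (segments + per_segment) * per_layer_gb

# 64 layers at 0.5 GB each: 32 GB resident vs 8 GB with checkpointing
```

The activation-only saving is larger than the 30-40% total figure above because weights and optimizer state are unaffected by checkpointing.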

FAQ

Can I train a 70B model on a single H100? Not practically. The training state alone exceeds 80GB without aggressive offloading, and at ~3,500 tokens/second, 1T tokens would take roughly 9 years. Practical minimum: 8xH100, which puts 1T tokens at roughly 14 months; 32-64 GPUs brings that down to 2-4 months.

What's the minimum number of GPUs for distributed training to make sense? For data parallelism, 2 GPUs already pay off (85-95% efficiency up to 8 GPUs). Model parallelism (tensor or pipeline) typically makes sense from 8 GPUs, once the model no longer fits on a single device.

Does distributed training cost more than single-GPU training? Not by much. Total GPU-hours are similar (8 GPUs for 50 days ≈ 1 GPU for 400 days), with a modest premium for parallelism overhead; the real win is wall-clock time, which cuts iteration and opportunity cost.

Can I mix different GPU types (H100 + A100) in distributed training? Possible but inefficient. Slower GPU becomes bottleneck. All GPUs must have similar performance for >85% efficiency.

How much does dataset quality affect GPU count needed? High-quality data converges 20-30% faster (fewer tokens needed to reach a target loss); poor data requires 20-30% more tokens. GPU count stays the same, but training time and cost change.

Is it better to train one big model or many small models? One large model (70B) is more efficient than multiple 13B models on same cluster. Single large model amortizes distributed training overhead better.

Can I reduce GPU count by using longer training time? Yes, with diminishing practicality: halving GPU count roughly doubles wall-clock time, and below 8 GPUs a 70B run stretches into multiple years. Beyond 32-64 GPUs, other bottlenecks (I/O, network, data loading) start to dominate instead.

What's the cost to train a small fine-tuned 7B model? Fine-tuning Mistral 7B on 100K examples (1B tokens) on single H100: $64 x 3 days = $192. Practical option for startups.
