Contents
- LLM VRAM Requirements: Overview
- Quick Reference Table
- Inference Memory Calculation Formulas
- Training Memory Calculation Formulas
- Quantization Comparison
- KV Cache and Batch Size Impact
- Popular Model VRAM Breakdown
- Multi-GPU Considerations
- Cost Implications
- FAQ
- Related Resources
- Sources
LLM VRAM Requirements: Overview
As of March 2026, LLM VRAM requirements depend on three variables: model size in parameters, quantization method, and usage pattern (inference vs. training).
The core formula: VRAM (GB) = (Parameters × Bytes per Parameter) / 1 billion
Llama 3 70B uncompressed in full precision (fp32) = (70B × 4 bytes) / 1B = 280GB. That's before KV cache, workspace buffers, or attention tensors. Impractical.
In production, no team runs full precision. int8 quantization cuts this to 140GB. int4 cuts it further to 70GB. For inference, KV cache adds 20-40% overhead depending on batch size and sequence length. For training, optimizer state (Adam or AdamW) can double or triple memory usage.
The goal: fit the model in available VRAM with headroom for intermediate tensors. Run out of memory and training crashes or inference latency spikes catastrophically.
Quick Reference Table
Inference (single batch, seq_len=2048 tokens)
| Model | Params | FP32 | FP16 | int8 | int4 |
|---|---|---|---|---|---|
| Llama 3 8B | 8B | 32GB | 16GB | 8GB | 4GB |
| Mistral 7B | 7B | 28GB | 14GB | 7GB | 3.5GB |
| Llama 3 70B | 70B | 280GB | 140GB | 70GB | 35GB |
| Llama 3.1 405B | 405B | 1,620GB | 810GB | 405GB | 202.5GB |
| Qwen 72B | 72B | 288GB | 144GB | 72GB | 36GB |
These figures are model weights only. Add 20-40% for KV cache and workspace, depending on batch size and sequence length.
Training (full fine-tuning, batch_size=4)
| Model | Params | Optimizer | Total with Optimizer | Multi-GPU (H100 SXM) |
|---|---|---|---|---|
| Llama 3 8B | 8B | Adam | ~96GB | 2x H100 SXM (160GB) |
| Llama 3 70B | 70B | Adam | ~560GB | 8x H100 SXM (640GB) |
| Llama 3.1 405B | 405B | Adam | ~3.2TB | 8+ GPUs, sharding required |
Gradient checkpointing enabled (standard practice). Without it, multiply by 1.5-2x.
Inference Memory Calculation Formulas
The formula for inference VRAM is:
VRAM = Model Weights + KV Cache + Workspace Buffers
Model Weight Memory
Memory = (Parameters × Bytes per Parameter) / 1 billion
| Quantization | Bits per Param | Bytes per Param | Llama 3 70B Example |
|---|---|---|---|
| FP32 (full precision) | 32 | 4 | 280GB |
| FP16 / BF16 (half precision) | 16 | 2 | 140GB |
| int8 (8-bit quantization) | 8 | 1 | 70GB |
| int4 with grouping | 4 | 0.5 | 35GB |
| int2 (extreme, rare) | 2 | 0.25 | 17.5GB |
Example calculation: Llama 3 70B in int8.
- Parameters: 70 billion
- Bytes per parameter: 1
- Memory: (70B × 1) / 1B = 70GB
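The weight-memory formula can be sketched as a small helper (decimal GB, matching the tables above):

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Model-weight VRAM in GB: parameters x bytes-per-param / 1e9."""
    bytes_per_param = bits_per_param / 8
    return params_billions * 1e9 * bytes_per_param / 1e9

# Llama 3 70B at common precisions
for bits in (32, 16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(70, bits):.1f}GB")
# 32-bit: 280.0GB, 16-bit: 140.0GB, 8-bit: 70.0GB, 4-bit: 35.0GB
```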
KV Cache Memory
KV cache stores intermediate key and value tensors used during token generation.
KV Cache = 2 × num_layers × batch_size × sequence_length × kv_hidden_dim × bytes_per_param
The factor of 2 covers keys and values. With grouped-query attention (GQA), kv_hidden_dim is the KV projection width, not the full hidden size.
For Llama 3 70B (80 layers; 8 KV heads × head_dim 128 gives kv_hidden_dim = 1024; fp16 cache), that is roughly 0.33MB per token:
| Batch | Seq Length | KV Cache Size | Total with Model (int8, 70GB) |
|---|---|---|---|
| 1 | 2048 | ~0.7GB | ~70.7GB |
| 4 | 2048 | ~2.7GB | ~72.7GB |
| 8 | 2048 | ~5.4GB | ~75.4GB |
| 16 | 2048 | ~10.7GB | ~80.7GB |
| 32 | 2048 | ~21.5GB | ~91.5GB |
KV cache grows linearly with batch size and sequence length. At batch 32, the cache alone exceeds 20GB. Teams serving many concurrent requests face KV cache explosions.
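A minimal sketch of this calculation, assuming the Llama 3 70B dimensions given above (80 layers, GQA KV width 1024, fp16 cache):

```python
def kv_cache_gb(batch: int, seq_len: int, num_layers: int,
                kv_hidden_dim: int, bytes_per_param: int = 2) -> float:
    """KV cache in GB: 2 (keys + values) x layers x tokens x KV width x bytes."""
    return 2 * num_layers * batch * seq_len * kv_hidden_dim * bytes_per_param / 1e9

# Llama 3 70B: 80 layers, kv_hidden_dim = 8 KV heads x head_dim 128 = 1024
for batch in (1, 4, 8, 16, 32):
    print(f"batch {batch}: {kv_cache_gb(batch, 2048, 80, 1024):.1f}GB")
```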
Workspace and Temporary Buffers
Activations, attention computations, and temporary buffers during inference: roughly 10-20% of model weight memory.
Example: Llama 3 70B in int8.
- Model weights: 70GB
- Workspace (~10%): 7GB
- KV cache (batch 4): ~2.7GB
- Total: ~79.7GB
Barely fits on a single H100 PCIe (80GB). Any larger batch pushes VRAM past 80GB.
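Combining the three components gives a quick estimator; the 10% workspace fraction used here is the low end of the range above:

```python
def inference_vram_gb(weights_gb: float, kv_gb: float,
                      workspace_frac: float = 0.10) -> float:
    """Total inference VRAM: weights + KV cache + workspace buffers."""
    return weights_gb + kv_gb + workspace_frac * weights_gb

# Llama 3 70B int8, batch 4, seq_len 2048
total = inference_vram_gb(weights_gb=70, kv_gb=2.7)
print(f"{total:.1f}GB")  # 79.7GB, barely under an 80GB H100
```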
Training Memory Calculation Formulas
Training is memory-intensive. You must track four components: model weights, gradients, optimizer state (Adam or AdamW), and activations.
Full Training VRAM = Weights + Gradients + Optimizer + Activations
For Llama 3 70B with Adam optimizer:
- Weights (fp16): 140GB
- Gradients (fp16): 140GB
- Optimizer state (moment, variance): 280GB
- Activations (gradient checkpointing enabled): 70GB
- Total: 630GB
This requires 8x H100 GPUs (8 × 80GB = 640GB) just to hold the model and optimizer.
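The accounting above can be wrapped in a rough estimator; the optimizer multiplier (2x weights) and activation fraction (0.5x weights with checkpointing) follow the figures used in this section:

```python
def training_vram_gb(params_billions: float, bytes_per_param: int = 2,
                     optimizer_mult: float = 2.0,
                     activation_frac: float = 0.5) -> float:
    """Weights + gradients + optimizer state + checkpointed activations, in GB."""
    weights = params_billions * bytes_per_param   # fp16 weights, decimal GB
    grads = weights                               # gradients match weight dtype
    optimizer = optimizer_mult * weights          # Adam: two moments per weight
    activations = activation_frac * weights       # with gradient checkpointing
    return weights + grads + optimizer + activations

print(f"{training_vram_gb(70):.0f}GB")  # 630GB for Llama 3 70B in fp16
```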
Gradient Checkpointing
Recompute activations during backward pass instead of storing them. Trade compute (recomputation) for memory (less storage).
Impact: Reduces activation memory 5-10x. Increases training time 20-40% due to recomputation overhead.
Most teams use gradient checkpointing by default. It's a standard trade-off.
Memory-Efficient Training Techniques
LoRA (Low-Rank Adaptation):
- Only fine-tune small adapter weights, not full model
- Base model in fp16: 140GB
- LoRA adapters: 2-4GB
- Optimizer state for adapters only: 4-8GB
- Total: ~150GB for Llama 3 70B
Fits on 2x H100 SXM (160GB) instead of 8x. Massive cost savings.
QLoRA (Quantized LoRA):
- Load base model in int4: 35GB
- LoRA adapters in fp16: 2-4GB
- Optimizer for adapters: 2-4GB
- Total: ~40-45GB for Llama 3 70B
Fits on a single A100 PCIe or H100. Extreme cost efficiency at slight quality cost.
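A QLoRA setup along these lines can be sketched with Hugging Face Transformers and PEFT. This is a configuration sketch, not a definitive recipe: the model id, rank, and target modules are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit NF4: weights occupy ~35GB instead of 140GB fp16
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",   # assumed model id
    quantization_config=bnb,
    device_map="auto",
)

# Attach small trainable LoRA adapters; only these receive gradients
# and optimizer state, which is what keeps total VRAM near 40-45GB
lora = LoraConfig(r=16, lora_alpha=32,                 # assumed rank/scale
                  target_modules=["q_proj", "v_proj"],  # assumed targets
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```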
DeepSpeed Zero-3:
- Partition optimizer state, gradients, and weights across GPUs
- With 8 GPUs: 630GB / 8 ≈ 79GB per GPU
Enables training models whose state exceeds any single GPU's memory by distributing it horizontally across devices.
Quantization Comparison
Quantization trades precision for memory and speed.
int8 Quantization (1 byte per parameter)
Memory: 50% of fp16. Quality loss: negligible for inference. Speed: slightly faster due to smaller model.
Example: Llama 3 70B in int8 = 70GB. Fits on a single H100, though with limited headroom for KV cache and batching.
Suitable for production inference. Standard choice for inference-only deployments.
int4 Quantization (0.5 bytes per parameter)
Memory: 25% of fp16. Quality loss: detectable but acceptable on most tasks. Speed: less data to move per token, so often faster than int8.
Example: Llama 3 70B in int4 = 35GB. Fits comfortably on an H100 or 80GB A100 with large KV cache headroom.
Trade: slight quality drop (1-3% accuracy loss on benchmarks). Worth it for cost/speed gains.
GPTQ (Post-Training Quantization with Grouping)
Quantizes weights in groups, not globally. Better quality than naive int4.
Memory: ~35GB for Llama 3 70B. Speed: slower than int8 but better accuracy than naive int4.
Tooling: HuggingFace Transformers includes GPTQ loaders. vLLM supports GPTQ natively.
AWQ (Activation-aware Weight Quantization)
Prioritizes important weight values. Best int4 quality.
Memory: ~35GB for Llama 3 70B. Speed: comparable to GPTQ.
Tooling: good support in vLLM and SGLang. Slightly faster than GPTQ on hardware with custom kernels.
Real-World Example: Llama 3 70B on H100 PCIe (80GB)
- FP16: 140GB (doesn't fit; OOM immediately)
- int8: 70GB weights + ~10GB cache/workspace = barely fits at small batch sizes
- int4 (GPTQ): 35GB weights + ~10GB overhead = fits with ~35GB headroom for large batches
The choice: int8 for quality, int4 for throughput. Both fit on a single H100.
KV Cache and Batch Size Impact
Inference Batch Size Effects
Batch size dominates inference VRAM usage and throughput.
Llama 3 70B on H100 PCIe (80GB total, int8 weights)
| Batch | KV Cache | Total VRAM | Practical? | Headroom |
|---|---|---|---|---|
| 1 | ~0.7GB | ~70.7GB | Yes | ~9GB |
| 4 | ~2.7GB | ~72.7GB | Yes | ~7GB |
| 8 | ~5.4GB | ~75.4GB | Tight | ~4.6GB |
| 16 | ~10.7GB | ~80.7GB | No | OOM |
| 32 | ~21.5GB | ~91.5GB | No | OOM |
Batch 16+ exceeds a single H100's 80GB before workspace buffers are even counted. Workaround: continuous batching or sequence packing.
Continuous batching: don't wait for every sequence in a batch to finish. The scheduler works token by token, admitting new requests as old ones complete, so the number of in-flight sequences (and therefore the KV cache) stays within a fixed memory budget while the GPU stays busy.
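The headroom arithmetic behind these limits can be sketched directly: given a VRAM budget, subtract weights and workspace, then divide by the per-sequence KV cost.

```python
def max_batch_size(vram_budget_gb: float, weights_gb: float,
                   workspace_gb: float, kv_per_seq_gb: float) -> int:
    """Largest batch whose KV cache still fits in the remaining VRAM."""
    headroom = vram_budget_gb - weights_gb - workspace_gb
    return max(int(headroom // kv_per_seq_gb), 0)

# Llama 3 70B int8 on an 80GB H100: ~0.67GB of KV cache per 2048-token sequence
print(max_batch_size(80, 70, 7, 0.67))  # 4
```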
Training Batch Size Effects
Training uses gradient checkpointing. In practice, teams fix a small per-GPU micro-batch and reach larger effective batches through gradient accumulation (with DeepSpeed handling sharding in distributed setups), so per-GPU VRAM stays roughly flat. Batch size instead affects gradient noise and convergence.
Typical sweet spot: batch 4-8 per GPU for 70B models on H100 clusters.
| Micro-Batch | Total Memory (unsharded) | H100 Count for Llama 3 70B |
|---|---|---|
| 1 | ~630GB | 8x |
| 4 | ~630GB | 8x |
| 8 | ~630GB | 8x |
Larger global batch sizes come from more gradient accumulation steps, not more per-GPU VRAM.
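Gradient accumulation arithmetic is simple enough to state as code; the 4 × 8 × 8 figures below are illustrative:

```python
def effective_batch(micro_batch: int, accum_steps: int, num_gpus: int) -> int:
    """Global batch: per-GPU micro-batch x accumulation steps x data-parallel GPUs."""
    return micro_batch * accum_steps * num_gpus

# Reach a global batch of 256 without growing per-GPU activation memory
print(effective_batch(micro_batch=4, accum_steps=8, num_gpus=8))  # 256
```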
Popular Model VRAM Breakdown
Llama 3 8B
- Weights (int8): 8GB
- Inference KV cache (batch 4): ~1.1GB
- Total inference: ~9GB (fits A10, L40, A6000, H100)
- Training (full fine-tune, gradient checkpointing): ~96GB (exceeds a single 80GB GPU; needs 2x H100 or one MI300X)
Mistral 7B
- Weights (int8): 7GB
- Inference KV cache (batch 4): ~1.1GB
- Total inference: ~8.1GB (fits A10, RTX 4090, L40)
- Training: ~84GB with gradient checkpointing (exceeds a single 80GB GPU; needs 2x H100 or one MI300X)
Llama 3 70B
- Weights (int8): 70GB
- Inference KV cache (batch 4): ~2.7GB
- Total inference: ~72.7GB (fits H100 PCIe or GH200; the 48GB L40S is too small)
- Training: ~630GB (8x H100 SXM recommended)
See GPU models for cloud provider pricing on suitable hardware.
Llama 3.1 405B
- Weights (int8): 405GB
- Inference KV cache (batch 4): ~4.2GB
- Total inference: ~409GB (requires 6x H100 SXM or 3x MI300X)
- Training: impossible on current single-GPU hardware. Requires distributed training at scale.
Qwen 72B
- Weights (int8): 72GB
- Inference KV cache (batch 4): ~2.7GB
- Total inference: ~74.7GB (fits H100, close to the limit)
- Training: ~650GB (slightly over 8x H100 SXM's 640GB; use offloading or LoRA)
GPT-4 (Estimated)
Specifications are not public. Assumed ~1.76T parameters (mixture-of-experts architecture).
- Full-precision inference: ~7TB (impossible on cloud hardware)
- int4 inference: ~880GB (requires 5x MI300X or 11+ H100 SXM)
- Training: not available for cloud rental (proprietary)
Multi-GPU Considerations
Tensor Parallelism
Split model across multiple GPUs. Each GPU holds 1/N of the model. Requires fast inter-GPU communication (NVLink or high-speed Ethernet).
Example: Llama 3 70B across 2x H100 SXM.
- Per-GPU model size: 35GB (int8)
- Per-GPU KV cache (batch 8): 0.5GB
- Per-GPU total: 35.5GB (fits comfortably)
Requires NVLink for <1ms inter-GPU latency. Standard in professional GPU clusters.
Pipeline Parallelism
Split model into sequential stages, each on a different GPU. Slower than tensor parallelism due to pipeline bubbles, but works on heterogeneous hardware.
Less common in practice. Tensor parallelism is preferred.
Data Parallelism with DeepSpeed
Shard optimizer state, gradients, and weights across GPUs. Each GPU holds partial weights.
Example: Llama 3 70B training across 8x H100 with DeepSpeed Zero-3.
- Per-GPU memory after sharding: 630GB / 8 ≈ 79GB
- Fits on each 80GB H100 with about 1GB of headroom (very tight; CPU offloading is often enabled as a safety margin)
Standard approach for distributed fine-tuning. Requires Hugging Face Transformers + DeepSpeed integration.
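The sharding arithmetic reduces to dividing total training state by GPU count; a one-line sketch using the 630GB figure from above:

```python
def zero3_per_gpu_gb(total_state_gb: float, num_gpus: int) -> float:
    """ZeRO-3 partitions weights, gradients, and optimizer state evenly across GPUs."""
    return total_state_gb / num_gpus

# Llama 3 70B training state across 8x H100
print(f"{zero3_per_gpu_gb(630, 8):.1f}GB per GPU")  # 78.8GB per GPU
```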
Cost Implications
VRAM drives GPU selection, which drives cost.
Llama 3 70B Inference Cost
Serving ~1M tokens/day (about 30M tokens/month) at batch 4:
Option 1: Dual H100 PCIe (for model sharding)
- Cost: 2 × $1.99/hr = $3.98/hr
- 30% utilization: 30% × $3.98 × 730 hrs/month = $870/month
- Per token: $870 / (30M tokens/month) = $0.000029/token
Option 2: Single MI300X (fits Llama 3 70B int8 without sharding)
- Cost: $3.49/hr (RunPod pricing)
- At 50% utilization: 50% × $3.49 × 730 = $1,274/month
- The single card sustains higher throughput (assume ~50M tokens/month), so per token: $1,274 / 50M = $0.000025
MI300X's extra memory eliminates dual-GPU overhead. At equivalent throughput, the larger card is cheaper.
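The cost comparison can be reproduced with two small helpers; the hourly rates, utilization, and token volumes are the assumptions stated above:

```python
def monthly_cost(hourly_rate: float, utilization: float,
                 hours: float = 730) -> float:
    """Monthly GPU cost, assuming usage-based billing at a given utilization."""
    return hourly_rate * utilization * hours

def cost_per_token(monthly_usd: float, tokens_per_month: float) -> float:
    """Amortized serving cost per token."""
    return monthly_usd / tokens_per_month

dual_h100 = monthly_cost(2 * 1.99, 0.30)   # ~$872/month
mi300x = monthly_cost(3.49, 0.50)          # ~$1,274/month
print(f"{cost_per_token(dual_h100, 30e6):.6f}")  # 0.000029
print(f"{cost_per_token(mi300x, 50e6):.6f}")     # 0.000025
```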
Fine-Tuning Llama 3 70B with LoRA
LoRA approach (parameter-efficient):
- Llama 3 70B int4 (QLoRA): 45GB
- Single H100 on RunPod: $1.99/hr × 10 hours training = $19.90
vs full fine-tuning:
- 8x H100 on RunPod: $2.69 × 8 × 10 = $215.20
LoRA saves $195 (91% cheaper) while maintaining near-original quality.
FAQ
What happens if I run out of VRAM?
PyTorch throws RuntimeError: CUDA out of memory. Inference crashes immediately. Training loses all progress since the last saved checkpoint. (Offloading state to CPU RAM is possible but 10-100x slower.) Prevention: calculate VRAM upfront, test with smaller batch sizes, enable gradient checkpointing.
Should we use mixed precision (fp16) or quantization? For inference: quantization (int8 or int4) is preferable. Smaller model, faster memory bandwidth, negligible quality loss. For training: mixed precision (fp16 + fp32) is standard. Faster than fp32 alone, maintains numerical stability better than pure fp16.
Can Llama 2 70B fit on an A100? A100 has 80GB HBM2e. Llama 2 70B (older than Llama 3) in FP16: 140GB (doesn't fit). int8: 70GB (fits with 10GB headroom, tight for batching). Training: impossible (requires 560GB+ with optimizer).
What's the difference between KV cache and activations? KV cache: stored intermediate key/value tensors used during token generation. Grows with batch and sequence length. Activations: temporary tensors computed during forward pass. Recomputed during backward (gradient checkpointing) or stored (memory-hungry, faster).
How do I estimate VRAM for a custom model?
Count parameters: model.num_parameters() in Hugging Face Transformers (or sum p.numel() over model.parameters() in plain PyTorch). Multiply by bytes per parameter (0.5 for int4, 1 for int8, 2 for fp16, 4 for fp32). Add ~20% for workspace and activations. For training, multiply the weight memory by 4-5x (gradients + optimizer state + activations).
Can we use CPU RAM as swap for VRAM? Yes, but it is catastrophically slow. DeepSpeed and Hugging Face Accelerate support CPU offloading, but throughput drops 10-50x depending on PCIe bandwidth. Useful for development or emergencies, never production.
Related Resources
- GPU Hardware Specifications and Pricing
- GPU vs CPU for AI Workloads
- What is a Cloud GPU
- AI Tokens Explained