LLM VRAM Requirements: How Much GPU Memory for AI Models?

Deploybase · July 28, 2025 · AI Infrastructure

LLM VRAM Requirements: Overview

LLM VRAM requirements depend on three variables: model size in parameters, quantization method, and usage pattern (inference vs. training).

The core formula: VRAM (GB) = (Parameters × Bytes per Parameter) / 1 billion

Llama 3 70B uncompressed in full precision (fp32) = (70B × 4 bytes) / 1B = 280GB. That's before KV cache, workspace buffers, or attention tensors. Impractical.

In production, no team runs full precision. Half precision (fp16) halves this to 140GB; int8 quantization cuts it to 70GB; int4 to 35GB. For inference, KV cache adds overhead that grows with batch size and sequence length. For training, optimizer state (Adam or AdamW) can double or triple memory usage.

The goal: fit the model in available VRAM with headroom for intermediate tensors. Run out of memory and training crashes or inference latency spikes catastrophically.
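The weight-memory formula above is easy to script. A minimal sketch (the function name and quantization table are illustrative, not from any library):

```python
# Bytes per parameter for common quantization levels.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, quant: str) -> float:
    """VRAM for model weights alone: parameters x bytes per parameter."""
    # (params_billions * 1e9 * bytes) / 1e9 simplifies to params_billions * bytes.
    return params_billions * BYTES_PER_PARAM[quant]

print(weight_memory_gb(70, "fp32"))  # Llama 3 70B full precision -> 280.0
print(weight_memory_gb(70, "int8"))  # -> 70.0
```

This covers weights only; KV cache and workspace buffers come on top, as the sections below show.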


Quick Reference Table

Inference (single batch, seq_len=2048 tokens)

| Model | Params | FP32 | FP16 | int8 | int4 |
|---|---|---|---|---|---|
| Llama 3 8B | 8B | 32GB | 16GB | 8GB | 4GB |
| Mistral 7B | 7B | 28GB | 14GB | 7GB | 3.5GB |
| Llama 3 70B | 70B | 280GB | 140GB | 70GB | 35GB |
| Llama 3.1 405B | 405B | 1,620GB | 810GB | 405GB | 202.5GB |
| Qwen 72B | 72B | 288GB | 144GB | 72GB | 36GB |

These are model weights only. Add KV cache on top: small at batch 1, but growing linearly to tens of GB at large batch sizes and long sequences.

Training (full fine-tuning, batch_size=4)

| Model | Params | Optimizer | Total with Optimizer | Multi-GPU (H100 SXM) |
|---|---|---|---|---|
| Llama 3 8B | 8B | Adam | ~96GB | 2x H100 SXM (does not fit one 80GB card) |
| Llama 3 70B | 70B | Adam | ~560GB (excl. activations) | 8x H100 SXM (640GB) |
| Llama 3.1 405B | 405B | Adam | ~3.2TB | 8+ GPUs, sharding required |

Gradient checkpointing enabled (standard practice). Without it, multiply by 1.5-2x.


Inference Memory Calculation Formulas

The formula for inference VRAM is:

VRAM = Model Weights + KV Cache + Workspace Buffers

Model Weight Memory

Memory = (Parameters × Bytes per Parameter) / 1 billion

| Quantization | Bits per Param | Bytes per Param | Llama 3 70B Example |
|---|---|---|---|
| FP32 (full precision) | 32 | 4 | 280GB |
| FP16 / BF16 (half precision) | 16 | 2 | 140GB |
| int8 (8-bit quantization) | 8 | 1 | 70GB |
| int4 with grouping | 4 | 0.5 | 35GB |
| int2 (extreme, rare) | 2 | 0.25 | 17.5GB |

Example calculation: Llama 3 70B in int8.

  • Parameters: 70 billion
  • Bytes per parameter: 1
  • Memory: (70B × 1) / 1B = 70GB

KV Cache Memory

KV cache stores intermediate key and value tensors used during token generation.

KV Cache = 2 × num_layers × batch_size × sequence_length × kv_dim × bytes_per_param

where kv_dim = num_kv_heads × head_dim, and the leading 2 covers the key and value tensors. Models with grouped-query attention (GQA), including Llama 3, use far fewer KV heads than attention heads, which keeps the cache manageable.

For Llama 3 70B (80 layers, 8 KV heads × head_dim 128, so kv_dim=1024, fp16 cache):

| Batch | Seq Length | KV Cache Size | Total with int8 Weights |
|---|---|---|---|
| 1 | 2048 | ~0.7GB | ~70.7GB |
| 4 | 2048 | ~2.7GB | ~72.7GB |
| 8 | 2048 | ~5.4GB | ~75.4GB |
| 16 | 2048 | ~10.7GB | ~80.7GB |
| 32 | 2048 | ~21.5GB | ~91.5GB |

KV cache grows linearly with batch size and sequence length. By batch 16 the int8 model already spills past 80GB. Teams serving many concurrent requests face KV cache explosions.
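The per-layer KV cache accounting, including grouped-query attention, can be sketched as a small calculator (function name and the example model shapes are illustrative):

```python
def kv_cache_gb(batch: int, seq_len: int, num_layers: int,
                num_kv_heads: int, head_dim: int,
                bytes_per_param: float = 2.0) -> float:
    """KV cache size: 2 (K and V) x layers x tokens x kv_dim x bytes."""
    kv_dim = num_kv_heads * head_dim
    total_bytes = 2 * num_layers * batch * seq_len * kv_dim * bytes_per_param
    return total_bytes / 1e9

# Llama 3 70B: 80 layers, 8 KV heads of head_dim 128, fp16 cache, batch 4.
print(round(kv_cache_gb(4, 2048, 80, 8, 128), 2))  # -> 2.68
```

With full multi-head attention (num_kv_heads equal to the attention head count) the same formula yields a cache roughly 8x larger, which is why GQA matters for serving.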

Workspace and Temporary Buffers

Activations, attention computations, and temporary buffers during inference: roughly 10-20% of model weight memory.

Example: Llama 3 70B in int8.

  • Model weights: 70GB
  • Workspace (~10%): 7GB
  • KV cache (batch 4, fp16, GQA): ~2.7GB
  • Total: ~79.7GB

Fits on a single H100 PCIe (80GB) with almost no headroom. Push the batch much past 4 and VRAM exceeds 80GB.


Training Memory Calculation Formulas

Training is memory-intensive. Must track:

  1. Model weights (fp16): 140GB for Llama 3 70B
  2. Gradients (same size as weights): 140GB
  3. Optimizer state (Adam: 2x weights): 280GB
  4. Activations (with gradient checkpointing): 70GB

Full Training VRAM = Weights + Gradients + Optimizer + Activations

For Llama 3 70B with Adam optimizer:

  • Weights (fp16): 140GB
  • Gradients (fp16): 140GB
  • Optimizer state (moment, variance): 280GB
  • Activations (gradient checkpointing enabled): 70GB
  • Total: 630GB

This requires 8x H100 GPUs (8 × 80GB = 640GB) just to hold the model and optimizer.
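The four-component sum is straightforward to encode. A sketch under the article's simplified accounting (Adam moments stored at the same width as the weights; the function name and defaults are illustrative):

```python
def training_memory_gb(params_billions: float, bytes_per_param: float = 2.0,
                       optimizer_multiplier: float = 2.0,
                       activation_gb: float = 0.0) -> float:
    """Weights + gradients + optimizer state + activations."""
    weights = params_billions * bytes_per_param
    gradients = weights                          # same dtype as weights
    optimizer = optimizer_multiplier * weights   # Adam: two moment tensors
    return weights + gradients + optimizer + activation_gb

# Llama 3 70B, fp16, Adam, ~70GB activations with gradient checkpointing.
print(training_memory_gb(70, 2.0, 2.0, 70))  # -> 630.0
```

Note that many mixed-precision setups also keep fp32 master weights and fp32 optimizer moments, which pushes the total well beyond this simplified estimate.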

Gradient Checkpointing

Recompute activations during backward pass instead of storing them. Trade compute (recomputation) for memory (less storage).

Impact: Reduces activation memory 5-10x. Increases training time 20-40% due to recomputation overhead.

Most teams use gradient checkpointing by default. It's a standard trade-off.

Memory-Efficient Training Techniques

LoRA (Low-Rank Adaptation):

  • Only fine-tune small adapter weights, not full model
  • Base model in fp16: 140GB
  • LoRA adapters: 2-4GB
  • Optimizer state for adapters only: 4-8GB
  • Total: ~150GB for Llama 3 70B

Fits on 2x H100 SXM instead of 8x. Massive cost savings.

QLoRA (Quantized LoRA):

  • Load base model in int4: 35GB
  • LoRA adapters in fp16: 2-4GB
  • Optimizer for adapters: 2-4GB
  • Total: ~40-45GB for Llama 3 70B

Fits on a single A100 PCIe or H100. Extreme cost efficiency at slight quality cost.
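The three budgets (full fine-tune, LoRA, QLoRA) can be compared with a small helper. A sketch using the article's estimates; the hard-coded adapter and optimizer sizes (3GB and 6GB) are mid-range assumptions, not measured values:

```python
def tuning_budgets_gb(params_billions: float) -> dict:
    """Rough VRAM budgets for fine-tuning a Llama-class model."""
    fp16 = params_billions * 2.0   # base weights in fp16
    int4 = params_billions * 0.5   # base weights in int4
    return {
        "full_ft": fp16 * 4 + fp16 / 2,  # weights + grads + Adam + activations
        "lora":    fp16 + 3 + 6,         # frozen fp16 base + adapters + adapter optimizer
        "qlora":   int4 + 3 + 3,         # int4 base + adapters + adapter optimizer
    }

b = tuning_budgets_gb(70)
print(round(b["full_ft"]))  # -> 630
print(round(b["lora"]))     # -> 149
print(round(b["qlora"]))    # -> 41
```

The two-orders-of-magnitude gap between full fine-tuning and QLoRA is what drives the cost comparison later in the article.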

DeepSpeed ZeRO-3:

  • Partitions optimizer state, gradients, and weights across GPUs
  • With 8 GPUs: 630GB / 8 ≈ 79GB per GPU

Enables training on smaller GPU clusters by distributing memory horizontally across devices.
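The sharding arithmetic, sketched (function names are illustrative; real ZeRO-3 also keeps some unsharded working buffers per GPU, so treat the result as a lower bound):

```python
import math

def zero3_per_gpu_gb(total_training_gb: float, num_gpus: int) -> float:
    """ZeRO-3 partitions weights, gradients, and optimizer state evenly."""
    return total_training_gb / num_gpus

def gpus_needed(total_training_gb: float, gpu_vram_gb: float = 80.0) -> int:
    """Minimum GPU count so each shard fits in one card's VRAM."""
    return math.ceil(total_training_gb / gpu_vram_gb)

print(zero3_per_gpu_gb(630, 8))  # -> 78.75
print(gpus_needed(630))          # -> 8
```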


Quantization Comparison

Quantization trades precision for memory and speed.

int8 Quantization (1 byte per parameter)

Memory: 50% of fp16. Quality loss: negligible for inference. Speed: slightly faster due to smaller model.

Example: Llama 3 70B in int8 = 70GB. Fits on single H100 with room for KV cache and batching.

Suitable for production inference. Standard choice for inference-only deployments.

int4 Quantization (0.5 bytes per parameter)

Memory: 25% of fp16. Quality loss: detectable but acceptable on most tasks. Speed: half the memory traffic of int8, often faster inference.

Example: Llama 3 70B in int4 = 35GB. Fits comfortably on any H100 or A100 with massive KV cache headroom.

Trade: slight quality drop (1-3% accuracy loss on benchmarks). Worth it for cost/speed gains.

GPTQ (group-wise post-training quantization)

Quantizes weights group by group with error compensation, rather than with one global scale. Better quality than naive int4.

Memory: ~35GB for Llama 3 70B. Speed: slower than int8 but better accuracy than naive int4.

Tooling: HuggingFace Transformers includes GPTQ loaders. vLLM supports GPTQ natively.

AWQ (Activation-aware Weight Quantization)

Protects the small fraction of weight channels that matter most to activations. Best int4 quality.

Memory: ~35GB for Llama 3 70B. Speed: comparable to GPTQ.

Tooling: good support in vLLM and SGLang. Slightly faster than GPTQ on hardware with custom kernels.

Real-World Example: Llama 3 70B on H100 PCIe (80GB)

  • FP16: 140GB (doesn't fit, OOM immediately)
  • int8: 70GB weights + 10GB cache/workspace = fits
  • int4 (GPTQ): 35GB weights + 10GB overhead = fits with ~35GB headroom for large batches

The choice: int8 for quality, int4 for throughput. Both fit on single H100.
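That decision can be automated: pick the highest-precision format whose weights plus overhead fit the card. A sketch (the function, its candidate list, and the 10% overhead fraction are illustrative assumptions):

```python
from typing import Optional

def best_quant_that_fits(params_billions: float, vram_gb: float,
                         overhead_frac: float = 0.10) -> Optional[str]:
    """Return the highest-precision quantization whose weights + overhead fit."""
    for name, bytes_pp in (("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)):
        if params_billions * bytes_pp * (1 + overhead_frac) <= vram_gb:
            return name
    return None

print(best_quant_that_fits(70, 80))   # H100 80GB        -> int8
print(best_quant_that_fits(8, 24))    # RTX 4090 / A10   -> fp16
```

Anything that returns None (e.g. a 405B model against a single 80GB card) needs multi-GPU sharding.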


KV Cache and Batch Size Impact

Inference Batch Size Effects

Batch size dominates inference VRAM usage and throughput.

Llama 3 70B on H100 PCIe (80GB total, int8 weights)

| Batch | KV Cache | Total VRAM | Practical? | Risk |
|---|---|---|---|---|
| 1 | ~0.7GB | ~70.7GB | Yes | ~9GB headroom |
| 4 | ~2.7GB | ~72.7GB | Yes | ~7GB headroom |
| 8 | ~5.4GB | ~75.4GB | Tight | ~4.6GB headroom |
| 16 | ~10.7GB | ~80.7GB | No | Exceeds 80GB |
| 32+ | ~21.5GB+ | ~91.5GB+ | OOM | Will crash |

Batch 16+ causes OOM on a single H100. Workaround: continuous batching or sequence packing.

Continuous batching: don't wait for every sequence in a batch to finish. Schedule at the iteration level, admitting new requests as completed sequences free their KV cache slots. The GPU stays as busy as with a large static batch while the in-flight batch (and its KV cache) stays small, typically 4-8 sequences instead of 32.
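A continuous-batching scheduler admits new requests only while their KV cache fits beside the weights. A toy admission check (all names and the 2GB reserve are illustrative assumptions, not any serving framework's API):

```python
def max_inflight_sequences(vram_gb: float, weights_gb: float,
                           kv_per_seq_gb: float, reserve_gb: float = 2.0) -> int:
    """How many sequences' KV caches fit alongside the model weights."""
    free = vram_gb - weights_gb - reserve_gb   # VRAM left after weights + reserve
    return max(0, int(free // kv_per_seq_gb))

# H100 80GB, Llama 3 70B int8 weights, ~0.7GB KV per 2048-token sequence.
print(max_inflight_sequences(80, 70, 0.7))  # -> 11
```

Production systems such as vLLM refine this with paged KV allocation, but the budget check is the same idea.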

Training Batch Size Effects

Training uses gradient checkpointing. In distributed training the per-GPU micro-batch is usually held fixed, so per-GPU VRAM barely moves; larger global batches come from gradient accumulation across steps and data parallelism across GPUs. Batch size therefore mainly affects gradient noise and convergence, not memory.

Typical sweet spot: micro-batch 4-8 per GPU for 70B models on H100 clusters.

| Global Batch | Total Memory (sharded via DeepSpeed) | H100 Count for Llama 3 70B |
|---|---|---|
| 1 | ~630GB | 8x |
| 4 | ~630GB | 8x |
| 8 | ~630GB | 8x |

Scaling global batch size adds gradient accumulation steps, not per-GPU VRAM.


Per-Model VRAM Requirements

Llama 3 8B

  • Weights (int8): 8GB
  • Inference KV cache (batch 4, fp16): ~1GB
  • Total inference: ~9GB (fits A10, L40, A6000, H100)
  • Training (full fine-tune, gradient checkpointing): ~96GB, just over a single 80GB card

Mistral 7B

  • Weights (int8): 7GB
  • Inference KV cache (batch 4, fp16): ~1GB
  • Total inference: ~8GB (fits A10, RTX 4090, L40)
  • Training: ~84GB with gradient checkpointing, just over a single 80GB card

Llama 3 70B

  • Weights (int8): 70GB
  • Inference KV cache (batch 4, fp16): ~2.7GB
  • Total inference: ~72.7GB (fits H100 PCIe 80GB or GH200; too large for a 48GB L40S)
  • Training: ~630GB (8x H100 SXM recommended)

See GPU models for cloud provider pricing on suitable hardware.

Llama 3.1 405B

  • Weights (int8): 405GB
  • Inference KV cache (batch 4, fp16): ~4GB
  • Total inference: ~409GB (requires 6x H100 SXM or 3x MI300X)
  • Training: impossible on a single GPU. Requires distributed training at scale.

Qwen 72B

  • Weights (int8): 72GB
  • Inference KV cache (batch 4, fp16): ~2.7GB
  • Total inference: ~75GB (fits H100, close to the limit)
  • Training: ~570GB (8x H100 SXM)

GPT-4 (Estimated)

Specifications are not public. Assumed ~1.76T parameters (mixture-of-experts architecture).

  • Full-precision (fp32) inference: ~7TB (impossible on cloud hardware)
  • int4 inference: ~880GB (requires 5x MI300X or 11x H100 SXM)
  • Training: not available for cloud rental (proprietary)

Multi-GPU Considerations

Tensor Parallelism

Split model across multiple GPUs. Each GPU holds 1/N of the model. Requires fast inter-GPU communication (NVLink or high-speed Ethernet).

Example: Llama 3 70B across 2x H100 SXM.

  • Per-GPU model size: 35GB (int8)
  • Per-GPU KV cache (batch 8): 0.5GB
  • Per-GPU total: 35.5GB (fits comfortably)

Requires NVLink to keep inter-GPU latency in the low microseconds. Standard in professional GPU clusters.

Pipeline Parallelism

Split model into sequential stages, each on a different GPU. Slower than tensor parallelism due to pipeline bubbles, but works on heterogeneous hardware.

Less common in practice. Tensor parallelism is preferred.

Data Parallelism with DeepSpeed

Shard optimizer state, gradients, and weights across GPUs. Each GPU holds partial weights.

Example: Llama 3 70B training across 8x H100 with DeepSpeed ZeRO-3.

  • Per-GPU sharded memory: 630GB / 8 ≈ 79GB
  • Fits on each H100 (80GB) with ~1GB headroom

Standard approach for distributed fine-tuning. Requires Hugging Face Transformers + DeepSpeed integration.


Cost Implications

VRAM drives GPU selection, which drives cost.

Llama 3 70B Inference Cost

Serving 1M tokens/day, batch 4:

Option 1: Dual H100 PCIe (for model sharding)

  • Cost: 2 × $1.99/hr = $3.98/hr
  • 30% utilization: 30% × $3.98 × 730 hrs/month = $870/month
  • Per token: $870 / (30M tokens/month) = $0.000029/token

Option 2: Single MI300X (fits Llama 70B int8 without sharding)

  • Cost: $3.49/hr (RunPod pricing)
  • 50% utilization: 50% × $3.49 × 730 = $1,274/month
  • Per token: $1,274 / (50M tokens/month) = $0.000025/token

MI300X's extra memory (192GB) eliminates dual-GPU sharding overhead. This comparison also assumes the single card sustains higher monthly volume (50M vs 30M tokens); at that throughput, the larger card comes out slightly cheaper per token.
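The per-token arithmetic above generalizes to any GPU and utilization level. A sketch reproducing both options (the function is illustrative; rates are the article's example prices):

```python
def cost_per_token(hourly_rate: float, utilization: float,
                   tokens_per_month: float, hours_per_month: float = 730) -> float:
    """Monthly GPU cost at a given utilization, divided by token volume."""
    monthly_cost = hourly_rate * utilization * hours_per_month
    return monthly_cost / tokens_per_month

# Option 1: dual H100 PCIe (2 x $1.99/hr) at 30% utilization, 30M tokens/month.
print(f"{cost_per_token(3.98, 0.30, 30e6):.6f}")  # -> 0.000029
# Option 2: single MI300X ($3.49/hr) at 50% utilization, 50M tokens/month.
print(f"{cost_per_token(3.49, 0.50, 50e6):.6f}")  # -> 0.000025
```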

Fine-Tuning Llama 3 70B with LoRA

LoRA approach (parameter-efficient):

  • Llama 3 70B int4 (QLoRA): 45GB
  • Single H100 on RunPod: $1.99/hr × 10 hours training = $19.90

vs full fine-tuning:

  • 8x H100 on RunPod: $2.69 × 8 × 10 = $215.20

LoRA saves $195 (91% cheaper) while maintaining near-original quality.


FAQ

What happens if I run out of VRAM? PyTorch throws RuntimeError: CUDA out of memory. Inference crashes immediately. Training loses all progress (unless using checkpointing for emergency swap to CPU RAM, which is 10-100x slower). Prevention: calculate VRAM upfront, test with smaller batch sizes, enable gradient checkpointing.

Should we use mixed precision (fp16) or quantization? For inference: quantization (int8 or int4) is preferable. Smaller model, faster memory bandwidth, negligible quality loss. For training: mixed precision (fp16 + fp32) is standard. Faster than fp32 alone, maintains numerical stability better than pure fp16.

Can Llama 2 70B fit on an A100? A100 has 80GB HBM2e. Llama 2 70B (older than Llama 3) in FP16: 140GB (doesn't fit). int8: 70GB (fits with 10GB headroom, tight for batching). Training: impossible (requires 560GB+ with optimizer).

What's the difference between KV cache and activations? KV cache: stored intermediate key/value tensors used during token generation. Grows with batch and sequence length. Activations: temporary tensors computed during forward pass. Recomputed during backward (gradient checkpointing) or stored (memory-hungry, faster).

How do I estimate VRAM for a custom model? Count parameters: model.num_parameters() in PyTorch. Multiply by bytes per parameter (0.5 for int4, 1 for int8, 2 for fp16, 4 for fp32). Add 20% for workspace and activations. For training, multiply by 3-4 (gradients + optimizer + activations).
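That FAQ recipe, as a one-function sketch (the name and the fixed 4x training multiplier are illustrative simplifications of the "3-4x" rule of thumb):

```python
def estimate_vram_gb(num_params: int, bytes_per_param: float,
                     training: bool = False) -> float:
    """Weights + 20% workspace for inference; ~4x weights for training."""
    weights_gb = num_params * bytes_per_param / 1e9
    if training:
        return weights_gb * 4   # gradients + optimizer + activations
    return weights_gb * 1.2     # +20% workspace and activations

# A 7B-parameter model in int8 for inference:
print(round(estimate_vram_gb(7_000_000_000, 1.0), 1))  # -> 8.4
```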

Can we use CPU RAM as swap for VRAM? Yes, but catastrophically slow. Useful for development, not production. SwiftTransformer and DeepSpeed support offloading, but throughput drops 10-50x depending on PCIe bandwidth. Only use in emergency or development.


