Contents
- How Much VRAM to Run an LLM: Overview
- Memory Calculation Fundamentals
- VRAM Requirements by Model
- Optimization Strategies
- GPU Selection by Model and Use Case
- Performance Characteristics by Hardware
- FAQ
- Related Resources
- Sources
How Much VRAM to Run an LLM: Overview
Determining VRAM requirements for language models involves understanding parameter counts, precision formats, and inference patterns. Correct sizing prevents expensive GPU underutilization and out-of-memory failures.
Memory Calculation Fundamentals
Base Model Memory
Each parameter in a neural network consumes memory proportional to the number format used. FP32 (full precision) requires 4 bytes per parameter. BF16 and FP16 reduce this to 2 bytes per parameter.
Base memory formula:
Model parameters × bytes per parameter = base model memory
A 7 billion parameter model in FP32:
7B parameters × 4 bytes = 28GB VRAM
The same model in BF16:
7B parameters × 2 bytes = 14GB VRAM
This foundational calculation represents model weights only, excluding activations and working memory.
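The formula above translates into a small helper (a minimal sketch; the function name and dtype table are illustrative, not from any library):

```python
# Weight-only VRAM estimate: parameters x bytes per parameter.
# Billions of parameters x bytes/param conveniently yields GB directly.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}

def weight_memory_gb(params_billions: float, dtype: str = "bf16") -> float:
    """Memory for model weights only, excluding activations and caches."""
    return params_billions * BYTES_PER_PARAM[dtype]

print(weight_memory_gb(7.0, "fp32"))  # 28.0 GB
print(weight_memory_gb(7.0, "bf16"))  # 14.0 GB
```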
Activation Memory
Forward and backward passes generate activation tensors. These activations consume additional memory beyond model weights, typically 2-3× base model memory during training.
Training a 7B parameter model requires approximately:
- Base model: 14GB (BF16)
- Activations: 28-42GB
- Optimizer states: 28GB (for AdamW)
- Total: 70-84GB VRAM
This explains why training large models demands H100 (80GB) or requires distributed training across multiple GPUs.
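The training breakdown above can be checked with a rough estimator (the byte counts per parameter follow the figures above and are rules of thumb, not exact values):

```python
# Rough training footprint following the breakdown above: BF16 weights,
# optimizer states at ~4 bytes/parameter (per the 28GB figure for 7B),
# and activations at 2-3x the weight memory.
def training_memory_gb(params_b: float) -> tuple[float, float]:
    weights = params_b * 2.0        # BF16: 2 bytes per parameter
    optimizer = params_b * 4.0      # optimizer states, as above
    act_low, act_high = weights * 2, weights * 3
    return weights + optimizer + act_low, weights + optimizer + act_high

print(training_memory_gb(7))  # (70.0, 84.0) -> the 70-84GB range above
```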
Inference Memory Overhead
Inference has lower memory requirements than training. Single-batch inference on a 7B model needs:
- Model weights: 14GB (BF16)
- KV cache: 2-4GB (depends on sequence length)
- Workspace: 1-2GB
- Total: 17-20GB
Multi-batch inference increases KV cache proportionally. Batch size 32 increases total requirements to 40-50GB for the same 7B model.
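A rough KV-cache estimator makes the batch and sequence-length scaling explicit (layer count and hidden size here assume Llama 2 7B-style dimensions; real caches add framework-specific overhead):

```python
# KV-cache footprint: 2 (keys and values) x layers x tokens x hidden dim
# x bytes per element, per sequence in the batch.
def kv_cache_gb(batch: int, seq_len: int, n_layers: int = 32,
                hidden: int = 4096, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * seq_len * hidden * bytes_per_elem * batch / 1e9

print(round(kv_cache_gb(1, 2048), 2))   # ~1.07 GB per 2048-token sequence
print(round(kv_cache_gb(32, 2048), 1))  # scales linearly with batch size
```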
VRAM Requirements by Model
Small Models (1-3B Parameters)
Smaller models like Phi 2 (2.7B), TinyLlama (1.1B), and MobileLLM (1.3B) optimize for edge devices and cost-conscious development.
Memory requirements:
- Phi 2.7B in FP32: 11GB VRAM
- Phi 2.7B in BF16: 5.5GB VRAM
- TinyLlama 1.1B in BF16: 2.2GB VRAM
- MobileLLM 1.3B in BF16: 2.6GB VRAM
These small models fit on consumer GPUs like RTX 4090 (24GB) or older A100 40GB, enabling development on commodity hardware. The trade-off: smaller models provide lower quality outputs for complex reasoning tasks. Phi 2 excels at code generation and math problems but underperforms on open-ended writing.
Inference performance on RTX 4090:
- Single-batch inference: 100-200 tokens/second
- Batch size 4: 300-400 tokens/second aggregate
- Batch size 8: 400-500 tokens/second aggregate
For production inference, small models on consumer GPUs suit applications with modest throughput requirements (100-500 QPS maximum). Higher throughput demands scaling to specialized data center GPUs.
Cost considerations for small model inference:
- RTX 4090 inference on Latitude: $0.90/hour
- Monthly: $648
- Cost per inference request (assuming 50 tokens/second, 1 QPS, i.e. 3,600 requests/hour): roughly $0.00025/request
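The per-request arithmetic is simple enough to sketch (the hourly rate and QPS are the article's illustrative figures, not live quotes):

```python
# Cost per request at a steady request rate: hourly GPU rate divided by
# requests served per hour.
def cost_per_request(hourly_rate: float, qps: float) -> float:
    return hourly_rate / (qps * 3600)

# $0.90/hour at 1 QPS = 3,600 requests/hour:
print(f"${cost_per_request(0.90, 1):.5f}")  # $0.00025
```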
Mid-Range Models (7-13B Parameters)
Llama 2 7B, Mistral 7B, and Code Llama 7B represent the efficiency frontier. These models balance capability and resource requirements, powering most production inference deployments in 2026.
Memory requirements:
- Llama 2 7B in FP32: 28GB
- Llama 2 7B in BF16: 14GB
- Mistral 7B in BF16: 14GB
- Mistral 7B with Flash Attention: 11GB (roughly 20% lower, via reduced activation memory)
- Llama 2 13B in BF16: 26GB
- Code Llama 34B in BF16: 68GB
Hardware recommendations:
- A100 40GB: Supports Llama 2 7B with batch size 8-16
- A100 80GB: Supports Llama 2 13B with batch size 8-12
- H100 80GB: Supports Llama 2 13B with batch size 32-64
- RunPod H100 SXM at $2.69/hour: Ideal for mid-range model inference
Inference performance on A100 40GB running Llama 2 7B:
- Single-batch latency: 50ms per token
- Batch size 8: 400-600 tokens/second aggregate throughput
- Batch size 16: 700-900 tokens/second aggregate throughput
Production economics for Llama 2 7B inference:
- A100 40GB via Latitude: $2.10/hour
- Monthly cost: $1,512
- Assuming 50 requests/hour average, 10 tokens/request: $2.10 ÷ 50 ≈ $0.042 per inference request
- Cost competitive with LLM API pricing for Claude 3 and GPT-4 alternatives
Training considerations:
- Fine-tuning Llama 2 7B requires 40GB minimum (16-bit precision)
- Full training of 7B from scratch demands 80GB+ with optimizer states
- Distributed training across 2-4 GPUs enables larger batch sizes and faster convergence
Large Models (34-70B Parameters)
Llama 2 70B, Code Llama 34B, and Mistral Large (45B) require high-end GPUs. Training becomes impractical on single instances, requiring distributed approaches. These models achieve substantial quality improvements over 7B variants, justifying infrastructure complexity for production systems.
Memory requirements:
- Llama 2 34B in BF16: 68GB
- Llama 2 70B in BF16: 140GB (FP32 would require 280GB)
- Code Llama 34B in BF16: 68GB
- Mistral Large 45B in BF16: 90GB
Single-GPU inference challenges:
An H100 provides 80GB of HBM, so Llama 2 70B in BF16 (140GB) cannot fit on a single H100. Even 34B models strain the H100's capacity at reasonable batch sizes.
Multi-GPU inference solutions:
Tensor parallelism splits each layer's weight matrices across multiple GPUs:
- 2×H100 cluster: Each GPU holds part of every layer. Llama 2 70B fits with 70GB per GPU
- 4×H100 cluster: 35GB per GPU, enabling larger batch sizes
- 8×H100 cluster: 17.5GB per GPU, achieving batch size 16-32 production inference
Network overhead varies by clustering approach:
- Ray Distributed: 5-10% overhead for GPUs on same machine
- TCP networking (different machines): 15-25% overhead
- CoreWeave optimized networking: 5-8% overhead with 400Gbps fabric
Production inference recommendations:
For Llama 2 70B, 4×H100 cluster recommended for:
- Batch size 4-8 inference
- 50-100ms latency targets
- 1,000+ concurrent users
Cost via CoreWeave:
- 4×H100: $24.62/hour
- Monthly: $17,727
- Cost per inference request (assuming 100 QPS, 10 tokens): $24.62 ÷ 360,000 requests/hour ≈ $0.00007 per request
This cost structure makes large model inference economically viable for applications with significant query volume.
Extra-Large Models (200B+ Parameters)
GPT-3 (175B), Chinchilla (70B optimized training), and similar frontier models exceed any single GPU's memory. Production deployments require 16-32 GPU clusters with sophisticated distributed inference pipelines.
Optimization Strategies
Quantization for Memory Reduction
Quantizing model weights from FP32 to INT8 reduces memory by 75%. Running Llama 2 70B in INT8 requires approximately 70GB instead of 280GB.
Quantization trade-offs:
- INT8: Halves memory relative to BF16, slight quality degradation
- GPTQ: 4-bit quantization, maintains quality near FP16
- NF4: 4-bit with better distribution for transformer models
For production deployment on limited GPUs, quantization is essential. Testing on Lambda (H100 SXM at $3.78/hour) or RunPod ($2.69/hour) allows rapid experimentation before production deployment.
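To make the memory trade-off concrete, here is a toy symmetric INT8 quantization sketch in plain Python (an illustration of the core idea only, not how GPTQ or NF4 are actually implemented):

```python
# Toy symmetric INT8 quantization: FP32 values become int8 codes plus one
# FP32 scale factor, cutting storage ~75% at the cost of rounding error.
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    scale = max(abs(v) for v in values) / 127 or 1.0  # guard all-zero input
    return [round(v / scale) for v in values], scale

def dequantize_int8(codes: list[int], scale: float) -> list[float]:
    return [c * scale for c in codes]

weights = [0.12, -0.5, 0.33, 1.0, -0.98]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
# Reconstruction error stays within half the quantization step (scale / 2)
print(max(abs(w - r) for w, r in zip(weights, restored)))
```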
KV Cache Optimization
During inference, key-value caches for attention mechanisms consume 1-4GB per GPU for typical sequence lengths (2048 tokens). Techniques reducing cache size:
- Flash Attention: Reduces memory by 40-60% through algorithmic improvements
- Paged Attention: Virtual memory management reduces cache fragmentation
- Offloading: Moving KV cache to system RAM saves GPU memory at latency cost
Using Flash Attention on A100 enables Llama 2 7B at batch size 16-32, producing 200+ tokens/second throughput on a single GPU.
Model Sharding for Large Models
Tensor parallelism distributes models across multiple GPUs. Llama 2 70B sharded across 8×H100 GPUs consumes 17.5GB per GPU instead of 140GB on a single device.
The trade-off: inter-GPU communication overhead reduces throughput compared to single-GPU inference. Typical latency increases from 50ms to 75-100ms with 8-way parallelism.
Production deployments typically shard across 4-8 GPUs, balancing memory efficiency and latency.
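Per-GPU weight memory under N-way sharding follows directly from the split (an even split is assumed here; real deployments add some replication and communication buffers):

```python
# Per-GPU weight memory under N-way tensor parallelism, even split assumed.
def per_gpu_weight_gb(params_b: float, n_gpus: int,
                      bytes_per_param: int = 2) -> float:
    return params_b * bytes_per_param / n_gpus

# Llama 2 70B in BF16 across 2, 4, and 8 GPUs:
for n in (2, 4, 8):
    print(n, per_gpu_weight_gb(70, n))  # 70.0, 35.0, 17.5 GB
```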
GPU Selection by Model and Use Case
Development and Prototyping
For experimenting with models before production:
- 1-3B models: RTX 4090 (24GB) via Latitude at $0.90/hour
- 7B models: RTX 4090 (24GB) with quantization, or A100 40GB
- 13B models: A100 40GB for comfortable batch sizes
Development costs:
- Prototyping Llama 2 7B: 100 hours × $0.90 = $90
- Prototyping Llama 2 13B: 100 hours × $2.10 (A100) = $210
This cost remains negligible compared to engineering time. Prototyping before production deployment prevents expensive scaling mistakes.
Single-GPU Production Inference
Suitable for applications with <100 QPS requirement:
- 7B models: H100 via RunPod at $2.69/hour
- 13B models: H100 required for batch size 8+
- 34B models: Feasible only at small batch sizes; 68GB of weights leave little headroom on a single H100
Monthly costs:
- H100 continuous: $1,944
- Adding infrastructure overhead: $2,100-2,500 monthly
Single-GPU inference makes economic sense when request volume justifies dedicating $2,500+ monthly to a single model.
Multi-GPU Training
Recommended configurations by model size:
- 7B model fine-tuning: 2×A100 for distributed fine-tuning
- 7B model full training: 4-8×A100 or 2×H100
- 34B model: 8×H100 minimum
- 70B model: 16×H100 minimum for production training
Training timeline estimates:
- 7B model on 2×A100: 7,776 GPU-hours ÷ 2 GPUs = 3,888 hours ≈ 5.4 months continuous
- 7B model on 8×A100: 7,776 ÷ 8 = 972 hours ≈ 40 days continuous
- 70B model on 16×H100: ~200 hours with strong distributed training efficiency
Cost-efficient training infrastructure:
Using Alibaba Cloud with 1-year commitment:
- 8×A100 continuous: $9.80/GPU-hour × 8 GPUs × 0.75 (commitment discount) = $58.80/hour
- 500-hour project cost: $29,400
- Same project on AWS on-demand: $43,680+ (49% more expensive)
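The GPU-hour and cost arithmetic above can be sketched as (perfect scaling is assumed; real distributed jobs lose some efficiency to communication):

```python
# Wall-clock time and project cost from a total GPU-hour budget.
def wall_clock_hours(total_gpu_hours: float, n_gpus: int) -> float:
    return total_gpu_hours / n_gpus

def project_cost(wall_hours: float, cluster_rate_per_hour: float) -> float:
    return wall_hours * cluster_rate_per_hour

print(wall_clock_hours(7776, 8))        # 972.0 hours on 8 GPUs
print(round(project_cost(500, 58.80)))  # 29400 -> the $29,400 figure above
```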
Performance Characteristics by Hardware
GPU Memory Bandwidth Impact
Memory bandwidth directly affects inference latency. Models with high compute-to-memory ratios (transformers) are memory-bandwidth-bound:
- H100 3,350GB/sec: High throughput for bandwidth-intensive ops
- A100 1,935GB/sec: 42% less bandwidth than H100
- L40S 864GB/sec: 74% less bandwidth than H100
- RTX 4090 1,008GB/sec: Higher bandwidth than L40S
For batch size 32 inference of Llama 2 70B:
- H100: 150ms latency
- A100: 200ms latency (33% slower)
- L40S: 280ms latency (87% slower)
This performance differential justifies H100 pricing premium for production serving demanding latency targets.
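Because decoding is memory-bandwidth-bound, a useful floor on per-token latency is model size divided by bandwidth, treating the weights as streamed through one device's memory once per token (a simplified roofline estimate that ignores sharding, KV-cache traffic, and compute overlap):

```python
# Bandwidth floor on decode latency: each generated token must read the
# full set of model weights from memory at least once.
def min_token_latency_ms(weight_gb: float, bandwidth_gb_s: float) -> float:
    return weight_gb / bandwidth_gb_s * 1000

# Llama 2 70B in BF16 (140GB of weights), per the bandwidth figures above:
print(round(min_token_latency_ms(140, 3350), 1))  # H100: ~41.8 ms floor
print(round(min_token_latency_ms(140, 1935), 1))  # A100: ~72.4 ms floor
```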
FAQ
Q: Can GPT-3 size models run on A100 80GB? A: No. GPT-3 175B requires distributed inference across 8+ H100 GPUs minimum. Single A100 lacks sufficient memory even with quantization.
Q: What's the minimum GPU for running Llama 2 7B? A: A100 40GB or H100 can run single-batch inference. RTX 4090 (24GB) with INT8 or 4-bit quantization enables development workloads.
Q: How much VRAM improvement comes from BF16 versus FP32? A: BF16 halves memory requirements compared to FP32 (2 bytes vs 4 bytes per parameter). FP8 reduces to 1 byte, achieving 75% memory savings but with quality trade-offs.
Q: Does increasing sequence length increase VRAM requirements? A: Yes. KV cache scales linearly with sequence length. 4096-token sequences roughly double cache size compared to 2048 tokens.
Q: Can I run Llama 2 70B on RTX 4090? A: No. RTX 4090 24GB is insufficient for 70B models. Quantization to INT4 reduces requirements to approximately 35GB, still exceeding RTX 4090's 24GB. A100 40GB is the practical minimum.
Related Resources
- GPU Pricing Comparison
- NVIDIA H100 Price Guide
- NVIDIA A100 Price Guide
- NVIDIA H200 Price Guide
- NVIDIA B200 Price Guide
- Lambda GPU Pricing
- RunPod GPU Pricing
- LLM API Pricing
Sources
- NVIDIA GPU memory specifications (as of March 2026)
- Language model architecture documentation
- Inference optimization research and benchmarks
- DeployBase LLM infrastructure analysis
- Hardware manufacturer specifications