Contents
- Best GPU for Fine Tuning: Overview
- VRAM Requirements by Model Size
- GPU Comparison Table
- LoRA vs Full Fine-Tuning
- Single GPU vs Multi-GPU Training
- Cloud GPU Pricing
- Cost-Efficiency Rankings
- Model-Specific Recommendations
- Optimization Techniques
- Advanced Optimization Strategies
- Real-World Benchmarks
- FAQ
- Sources
Best GPU for Fine Tuning: Overview
Fine-tuning GPU choice comes down to model size, budget, and training time. VRAM is the bottleneck: a 7B model needs about 16GB with LoRA, while full fine-tuning a 70B model needs 560GB+ (LoRA cuts that to roughly 80GB).
Quick Reference:
- 7B models: RTX 4090 ($0.34/hr) or L40S ($0.79/hr) sufficient
- 13B models: A100 PCIe ($1.19/hr) with LoRA
- 70B models: H100 PCIe or A100 SXM (80GB) with LoRA
- Cost: $0.34 to $5.98/hr depending on GPU and LoRA choice
VRAM Requirements by Model Size
Memory Math
VRAM needed depends on three factors:
- Model weights in bfloat16 (2 bytes per parameter)
- Gradients and optimizer states (roughly 4x the weight memory with Adam: gradients plus first- and second-moment estimates)
- Activations (batch size, sequence length, layers)
For a 7B parameter model in full fine-tuning with batch size 1:
- Weights: 7B * 2 bytes = 14GB
- Optimizer states: 14GB * 4 = 56GB
- Activations: 4-6GB (batch 1, sequence 2048)
- Total: 74GB minimum
This exceeds most single GPUs. Solution: LoRA or gradient checkpointing.
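The memory math above can be wrapped in a quick estimator. This is a rough sketch, not exact accounting: the 4x optimizer multiplier and the 4GB activation floor are the assumptions just stated.

```python
def full_ft_vram_gb(params_billions, bytes_per_param=2,
                    optimizer_multiplier=4, activations_gb=4.0):
    """Rough VRAM estimate (GB) for full fine-tuning at batch size 1.

    - weights: bf16, 2 bytes per parameter
    - optimizer states: ~4x the weight memory (gradients + Adam moments)
    - activations: lower-bound estimate for batch 1, sequence 2048
    """
    weights_gb = params_billions * bytes_per_param
    optimizer_gb = weights_gb * optimizer_multiplier
    return weights_gb + optimizer_gb + activations_gb

print(full_ft_vram_gb(7))  # 14 + 56 + 4 = 74.0 GB minimum
```

Plugging in larger model sizes shows immediately why full fine-tuning forces multi-GPU setups.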
VRAM by Model Size (Full Fine-Tuning, Batch 1)
| Model Size | Min VRAM | Recommended | GPU Examples |
|---|---|---|---|
| 7B | 64GB | 80GB | 1x A100 80GB |
| 13B | 120GB | 160GB | 2x A100 or 2x H100 |
| 34B | 280GB | 320GB | 4x A100 or 4x H100 |
| 70B | 560GB | 640GB | 8x A100 or 8x H100 |
Full fine-tuning scales poorly. 70B models need 8-GPU setups costing thousands per day. For per-GPU pricing, see the comparison table below.
VRAM with LoRA (Rank 8)
LoRA freezes the base model and trains small adapter matrices, so optimizer states are kept only for the adapters, not the base weights.
| Model Size | Min VRAM | Recommended | GPU Examples |
|---|---|---|---|
| 7B | 16GB | 24GB | RTX 4090, L40S |
| 13B | 24GB | 32GB | A100 PCIe, L40S |
| 34B | 48GB | 64GB | A100 PCIe, H100 PCIe |
| 70B | 80GB | 96GB | A100 SXM, H100 PCIe |
LoRA makes fine-tuning practical on consumer and mid-range GPUs. See A100 vs H100 comparison for performance details.
QLoRA (Quantization + LoRA)
Quantizing base model to 4-bit reduces VRAM further:
| Model Size | Min VRAM | GPU Examples |
|---|---|---|
| 7B | 8GB | RTX 4090 with care |
| 13B | 10GB | RTX 4090, A10 |
| 34B | 16GB | RTX 4090, L4 |
| 70B | 24GB | L40S, A100 PCIe |
QLoRA sacrifices speed for VRAM. Training 70B on L40S is feasible but takes 2-3x longer than A100 SXM.
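As a concrete illustration, this kind of setup is commonly expressed with the Hugging Face transformers, bitsandbytes, and peft libraries. A minimal sketch: the model id, rank, and target modules are illustrative choices, not prescriptions from this guide, and the run needs a GPU with the VRAM budgeted above.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Freeze the base model in 4-bit NF4 to shrink its VRAM footprint
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",       # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

# Train only small rank-8 adapters on top of the quantized weights
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()     # adapters are a tiny fraction of 70B
```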
GPU Comparison Table
March 2026 Pricing and Specs
| GPU | VRAM | Bandwidth | Cost/hr | Provider |
|---|---|---|---|---|
| RTX 4090 | 24GB | 1,008 GB/s | $0.34 | RunPod |
| L4 | 24GB | 300 GB/s | $0.44 | RunPod |
| L40S | 48GB | 864 GB/s | $0.79 | RunPod |
| A100 PCIe | 80GB | 2,039 GB/s | $1.19 | RunPod |
| A100 SXM | 80GB | 2,039 GB/s | $1.39 | RunPod |
| H100 PCIe | 80GB | 2,000 GB/s | $1.99 | RunPod |
| H100 SXM | 80GB | 3,456 GB/s | $2.69 | RunPod |
| H200 | 141GB | 4,800 GB/s | $3.59 | RunPod |
| B200 | 192GB | 8,000 GB/s | $5.98 | RunPod |
Pricing reflects March 2026 rates. Lambda and CoreWeave offer similar models at different prices.
Cost Per GB VRAM
| GPU | $/GB/hr |
|---|---|
| RTX 4090 | $0.0142 |
| L40S | $0.0165 |
| A100 PCIe | $0.0149 |
| H100 PCIe | $0.0249 |
| H100 SXM | $0.0336 |
| B200 | $0.0311 |
| H200 | $0.0255 |
RTX 4090 is cheapest by VRAM cost. However, slower training means higher total cost for long jobs.
Speed Comparison (Tokens/Second During Training)
For 7B model with LoRA, batch size 4:
| GPU | Tokens/sec | Time for 10M tokens |
|---|---|---|
| RTX 4090 | 480 | 5.8 hours |
| L40S | 640 | 4.3 hours |
| A100 PCIe | 920 | 3.0 hours |
| H100 PCIe | 1,440 | 1.9 hours |
H100 is 3x faster than RTX 4090. For the 10M-token job above:
- RTX 4090: 5.8 hrs * $0.34 = $1.97
- H100 PCIe: 1.9 hrs * $1.99 = $3.78
H100 costs roughly twice as much but finishes in a third of the time. When iteration speed matters, that premium is often worth paying.
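A small helper makes the table's arithmetic reusable. It assumes steady throughput, which real jobs only approximate:

```python
def job_time_and_cost(tokens, tokens_per_sec, dollars_per_hour):
    """Wall-clock hours and dollars for a training job at steady throughput."""
    hours = tokens / tokens_per_sec / 3600
    return hours, hours * dollars_per_hour

# 10M tokens on the two GPUs compared above (rates from the pricing table)
for name, tps, rate in [("RTX 4090", 480, 0.34), ("H100 PCIe", 1440, 1.99)]:
    hours, cost = job_time_and_cost(10_000_000, tps, rate)
    print(f"{name}: {hours:.1f} hrs, ${cost:.2f}")
```

Swap in your own token count and the current hourly rate to compare any two GPUs.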
LoRA vs Full Fine-Tuning
Quality Comparison
Full fine-tuning updates all weights. Quality is highest but expensive.
LoRA trains only rank-8 (or rank-4) adapters. Quality depends on rank selection.
| Approach | VRAM (7B) | Quality | Training Time |
|---|---|---|---|
| Full FT | 64GB+ | 100% | 1.0x baseline |
| LoRA-8 | 16GB | 94-96% | 1.0x baseline |
| LoRA-4 | 12GB | 90-92% | 1.0x baseline |
| QLoRA-4 | 8GB | 88-90% | 2.5x baseline |
LoRA-8 captures 95% of quality at 1/4 the VRAM. This is the practical standard.
Cost Example: Fine-Tune Llama 2 7B
Task: Fine-tune on 10M tokens of domain-specific data.
Full Fine-Tuning
- Required: A100 8x setup ($21.60/hr)
- Training time: 15 hours
- Cost: 15 * $21.60 = $324
LoRA-8
- Required: L40S ($0.79/hr)
- Training time: 15 hours
- Cost: 15 * $0.79 = $11.85
Savings: 96.3%
Quality difference: 95-96% versus 100%. For most applications, negligible.
Single GPU vs Multi-GPU Training
Distributed Training Overhead
Adding GPUs reduces training time but introduces overhead:
- Data loading synchronization
- Gradient aggregation
- Network latency between GPUs
Efficiency typically drops 10-20% per added GPU.
Scaling Laws
For a 13B model with LoRA:
| Setup | Training Time | Throughput | Cost |
|---|---|---|---|
| 1x A100 | 20 hours | 850 T/s | $23.80 |
| 2x A100 | 11 hours | 1,550 T/s | $26.18 |
| 4x A100 | 6 hours | 2,800 T/s | $28.56 |
Going from 2 to 4 GPUs saves 5 hours but raises the cost per job. Diminishing returns kick in.
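The scaling pattern can be modeled with a simple per-GPU efficiency penalty, using the 10-20% overhead figure from above. This is illustrative; real overhead depends on interconnect and framework:

```python
def multi_gpu_throughput(single_gpu_tps, num_gpus, efficiency_loss=0.10):
    """Aggregate tokens/sec when each added GPU costs a fixed efficiency share."""
    efficiency = (1 - efficiency_loss) ** (num_gpus - 1)
    return single_gpu_tps * num_gpus * efficiency

print(multi_gpu_throughput(850, 2))  # ~1,530 T/s, close to the measured 1,550
print(multi_gpu_throughput(850, 4))  # ~2,479 T/s
```

The model underestimates the measured 4x number (2,800 T/s), but it captures the shape of the curve: each added GPU buys less than the last.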
Multi-GPU Strategy
For most fine-tuning, use 1-2 GPUs. Beyond 2 GPUs, infrastructure complexity (Distributed Data Parallel, DeepSpeed) outweighs time savings for typical workloads.
Exception: 70B models benefit from 2+ GPUs even with LoRA.
Cloud GPU Pricing
RunPod Pricing (March 2026)
RunPod offers on-demand and reserved pricing:
| GPU | On-Demand | Reserved (annual) |
|---|---|---|
| RTX 4090 | $0.34/hr | $1,700/yr |
| L40S | $0.79/hr | $3,960/yr |
| A100 PCIe | $1.19/hr | $5,950/yr |
| H100 PCIe | $1.99/hr | $9,950/yr |
Reserved pricing reduces hourly cost by 40-45%. Check H100 pricing for detailed cost breakdowns.
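Whether reserved pricing pays off depends on utilization: the break-even point is simply the annual price divided by the on-demand rate (figures from the table above):

```python
def reserved_break_even_hours(annual_price, on_demand_per_hour):
    """Hours per year above which a reserved instance beats on-demand."""
    return annual_price / on_demand_per_hour

# RTX 4090: $1,700/yr reserved vs $0.34/hr on-demand
hours = reserved_break_even_hours(1700, 0.34)
print(f"{hours:.0f} hrs/yr (~{hours / 8760:.0%} utilization)")
```

At roughly 5,000 hours, about 57% utilization, the reserved RTX 4090 starts paying for itself; below that, stay on-demand.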
Lambda Pricing (March 2026)
| GPU | Cost/hr |
|---|---|
| A10 | $0.86 |
| A100 PCIe | $1.48 |
| GH200 | $1.99 |
| H100 PCIe | $2.86 |
| H100 SXM | $3.78 |
Lambda's H100 SXM ($3.78/hr) and H100 PCIe ($2.86/hr) both cost more than RunPod's equivalents ($2.69/hr and $1.99/hr respectively).
CoreWeave Pricing (March 2026)
CoreWeave focuses on large-scale deployments (8x GPU pods):
| Pod | Cost/hr |
|---|---|
| 8x A100 | $21.60 |
| 8x H100 | $49.24 |
| 8x H200 | $50.44 |
| 8x B200 | $68.80 |
Per-GPU rates are higher than RunPod's single-GPU prices, and pods require committing to all 8 GPUs; the appeal is infrastructure built for large multi-GPU jobs.
Cost-Efficiency Rankings
For fine-tuning a 13B model with LoRA on 10M tokens:
Rank 1: RTX 4090 + LoRA
- Hardware cost: $0.34/hr
- Training time: 12.5 hours
- Total: $4.25
- Best for: Budget-conscious teams
Rank 2: L40S + LoRA
- Hardware cost: $0.79/hr
- Training time: 9.5 hours
- Total: $7.51
- Best for: Balanced cost and speed
Rank 3: A100 PCIe + LoRA
- Hardware cost: $1.19/hr
- Training time: 6.5 hours
- Total: $7.74
- Best for: Speed matters slightly more
Rank 4: H100 PCIe + LoRA
- Hardware cost: $1.99/hr
- Training time: 4 hours
- Total: $7.96
- Best for: Iteration speed critical
For this 13B LoRA task, cost differences are small ($4.25 to $7.96). Choose based on iteration speed requirements.
For 70B + Full Fine-Tuning
Costs explode without LoRA. An 8x A100 pod at $21.60/hr running for 160 hours = $3,456.
Full fine-tuning is only viable at scale (models used in production, training many variants).
Model-Specific Recommendations
7B Models (Mistral 7B, Llama 2 7B)
Best Choice: RTX 4090 + LoRA
- Cost: ~$4
- Training time: 12 hours for 10M tokens
- Reasoning: Sufficient VRAM, cost-effective
Alternative: L4 + QLoRA
- Cost: ~$9 (20 hrs x $0.44/hr)
- Training time: 20 hours (roughly 2x slower)
- Trade-off: Needs less VRAM headroom, but slower and no cheaper; pick it only if an RTX 4090 is unavailable
13B Models (Llama 2 13B)
Best Choice: L40S + LoRA
- Cost: ~$8
- Training time: 9 hours for 10M tokens
- Reasoning: Balanced speed and cost
Alternative: A100 PCIe + LoRA (for faster iteration)
- Cost: ~$8
- Training time: 6 hours
- Trade-off: Slightly faster for same cost
34B Models (Code Llama 34B)
Best Choice: A100 PCIe + LoRA or 2x L40S
- Cost: ~$12-16 per 10M tokens
- Training time: 8-10 hours
- Reasoning: Need larger VRAM for 34B
70B Models (Llama 2 70B)
Best Choice: H100 PCIe + LoRA
- Cost: ~$20 per 10M tokens
- Training time: 10 hours
- Reasoning: 80GB VRAM required, speed justifies cost
Alternative: A100 SXM + LoRA
- Cost: ~$28 (20 hours x $1.39/hr)
- Training time: 20 hours
- Trade-off: Slower and slightly more expensive; 80GB meets the minimum but leaves little headroom, so some teams split the job across 2 GPUs
Optimization Techniques
Memory Optimization
Gradient Checkpointing: Trade computation for memory. Recompute activations instead of storing them. Reduces VRAM by 20-30%, increases compute time by 20%.
Flash Attention: Use Flash Attention implementation for lower VRAM usage during attention computation. Especially helpful for long sequences.
Mixed Precision: Train in bfloat16 instead of float32. Halves VRAM, no quality loss on modern GPUs.
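With the Hugging Face Trainer, the three memory techniques above map to a few flags. A minimal sketch: the model id is illustrative, flag names follow recent transformers releases, and Flash Attention 2 must be installed and supported by the model:

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",              # illustrative model id
    torch_dtype=torch.bfloat16,               # mixed precision: bf16 weights
    attn_implementation="flash_attention_2",  # Flash Attention kernels
)
model.gradient_checkpointing_enable()         # recompute activations, save VRAM

args = TrainingArguments(
    output_dir="out",
    bf16=True,                                # bf16 autocast during training
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,            # effective batch size of 8
)
```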
Speed Optimization
Compiler: Use PyTorch compile (Triton backend) for 10-15% speedup on A100/H100.
FSDP: Fully Sharded Data Parallel for multi-GPU scaling. Distributes model across GPUs more efficiently than simple DDP.
Token Dropping: Skip gradient computation on some tokens (randomly). Reduces compute by 20%, minimal quality impact.
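Token dropping is the simplest of the three to illustrate. A pure-Python sketch of the sampling step (the keep probability matches the ~20% compute reduction mentioned above; real implementations apply this mask to the loss):

```python
import random

def sample_token_mask(num_tokens, keep_prob=0.8, seed=0):
    """Boolean mask over tokens: True = include this token in the loss."""
    rng = random.Random(seed)
    return [rng.random() < keep_prob for _ in range(num_tokens)]

mask = sample_token_mask(2048)
kept = sum(mask)
print(f"kept {kept}/2048 tokens (~{kept / 2048:.0%})")
```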
Combined: QLoRA on RTX 4090
- Model: Llama 2 70B
- Quantization: 4-bit
- LoRA: rank 8
- Gradient checkpointing: enabled
- Flash Attention: enabled
- Batch size: 2

Results:
- VRAM used: 19.5GB (out of 24GB)
- Throughput: 280 tokens/sec
- Training 10M tokens: 10 hours
- Cost: $3.40 total
This configuration makes 70B fine-tuning accessible on consumer hardware.
Advanced Optimization Strategies
Production Fine-Tuning Pipelines
Real production systems combine multiple optimization techniques:
Scenario: Fine-tune a 13B model weekly on 100M new tokens
Setup:
- Hardware: 2x L40S ($0.79/hr each)
- Optimization: LoRA-8 + gradient checkpointing + Flash Attention
- Batch size: 4 per GPU (8 total)
- Data preprocessing: Parallel on CPU
Cost breakdown:
- Data prep: 2 hours = $0 (CPU-only)
- Training: 50 hours = 50 * $1.58 = $79
- Evaluation: 2 hours = $3.16
- Total: $82.16 per week
Cost per million tokens trained: ~$0.82
This represents excellent efficiency. Single GPU (A100) would cost $200+.
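The weekly breakdown scripts easily; the phase durations and the 2x L40S rate are the assumptions stated above:

```python
def pipeline_cost(phase_hours, dollars_per_hour):
    """Per-phase cost of a training pipeline, given {phase: hours} and a rate."""
    return {phase: hours * dollars_per_hour for phase, hours in phase_hours.items()}

# 2x L40S at $0.79/hr each = $1.58/hr for the pod
costs = pipeline_cost({"training": 50, "evaluation": 2}, dollars_per_hour=1.58)
total = sum(costs.values())
print(costs, f"-> total ${total:.2f} per week")
```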
Quantization Strategies
For budget-constrained scenarios:
4-bit Quantization (QLoRA):
- VRAM reduction: 50-60%
- Speed penalty: 20-30%
- Quality: 92-95% of full precision
- Best for: Tight VRAM budgets
8-bit Quantization:
- VRAM reduction: 30-40%
- Speed penalty: 10-15%
- Quality: 96-98% of full precision
- Best for: Balanced approach
INT8 vs NF4:
- INT8: Simpler, faster inference, slightly worse quality
- NF4: Better quality, slower inference, more complex
For production, NF4 + 8-bit optimizer states is standard.
Real-World Benchmarks
Fine-Tune Mistral 7B on Custom Dataset
Setup: L40S + LoRA-8, batch size 4, 10k training samples (50M tokens)
| Phase | Time | Cost |
|---|---|---|
| Data preparation | 0.5 hr | $0.40 |
| Fine-tuning (50M tokens) | 60 hrs | $47.40 |
| Evaluation | 0.5 hr | $0.40 |
| Total | 61 hrs | $48.20 |
Cost per million tokens: ~$0.96
Fine-Tune Llama 2 70B on Code Dataset
Setup: H100 PCIe + LoRA-16, batch size 2, 20k training samples (100M tokens)
| Phase | Time | Cost |
|---|---|---|
| Data preparation | 1 hr | $1.99 |
| Fine-tuning (100M tokens) | 70 hrs | $139.30 |
| Evaluation | 1 hr | $1.99 |
| Total | 72 hrs | $143.28 |
Cost per million tokens: ~$1.43
FAQ
Q: Is fine-tuning worth it versus prompt engineering?
Fine-tuning is worth it for repeated, specialized tasks. Break-even point: 1,000+ inference calls. For one-off tasks, prompt engineering is cheaper.
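That break-even figure can be estimated from the one-time fine-tuning cost and the per-call saving. The dollar amounts below are illustrative assumptions (an $8 LoRA run, and shorter prompts cutting per-call inference cost from $0.010 to $0.002):

```python
def break_even_calls(fine_tune_cost, cost_per_call_prompted, cost_per_call_tuned):
    """Inference calls needed before a one-time fine-tune pays for itself."""
    saving_per_call = cost_per_call_prompted - cost_per_call_tuned
    return fine_tune_cost / saving_per_call

print(round(break_even_calls(8.0, 0.010, 0.002)))  # 1000 calls
```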
Q: Should I fine-tune or train from scratch?
Fine-tune. Training from scratch costs millions and requires massive compute. Only large companies with proprietary datasets do it.
Q: Can I fine-tune on consumer GPU (RTX 4090)?
Yes. With LoRA and QLoRA, fine-tune 7B-34B models on RTX 4090. 70B requires quantization or multiple GPUs.
Q: What's the difference between LoRA rank 4, 8, 16?
Higher rank captures more information but needs more VRAM and reduces efficiency. Rank 8 is the standard. Use rank 4 for constrained VRAM, rank 16 only if quality suffers.
Q: How many tokens should I use for fine-tuning?
Quality typically plateaus around 10M-20M tokens for smaller models. 70B models benefit from 50M+ tokens. More data beats better hardware.
Q: Which provider should I use?
RunPod for the best per-GPU pricing. Lambda for reliability. CoreWeave for large multi-GPU setups. All three work well.
Q: How should I handle batch size when VRAM is tight?
Reduce the per-device batch size, down to 1 if needed, and use gradient accumulation to preserve the effective batch size; batch size affects speed and VRAM, not final quality, so it is the safest knob to turn first. If VRAM is still exhausted at batch size 1, switch to QLoRA.
Q: When should I move from fine-tuning to custom training?
Fine-tune when you have 10M+ high-quality tokens. Train from scratch only if you have 100B+ proprietary tokens and budget for weeks of training. Most teams fine-tune.
Q: Can I share a fine-tuned model with others?
Yes. Save the LoRA adapters (usually 10-100MB), share with others. They load base model + adapters. Adapters are small and easy to distribute. For proprietary models, check licensing terms.
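The 10-100MB figure follows from the adapter math. A quick estimate, assuming LoRA on the attention q/v projections of a 7B-class model (hidden size 4096, 32 layers; both are illustrative shape assumptions):

```python
def lora_adapter_size_mb(rank, hidden, layers, targets_per_layer=2,
                         bytes_per_param=2):
    """Approximate on-disk size of saved LoRA adapters.

    Each adapted matrix contributes A (hidden x rank) plus B (rank x hidden),
    i.e. 2 * rank * hidden trainable parameters.
    """
    params = layers * targets_per_layer * 2 * rank * hidden
    return params * bytes_per_param / 1e6

# rank-8 adapters on q_proj and v_proj, bf16 storage
print(lora_adapter_size_mb(rank=8, hidden=4096, layers=32))  # ~8.4 MB
```

At rank 8 that is about 8MB, near the low end of the 10-100MB range; targeting more modules or raising the rank scales the size linearly.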
Sources
- RunPod Pricing (March 2026)
- Lambda Pricing (March 2026)
- CoreWeave Pricing (March 2026)
- Fine-tuning VRAM Requirements Study
- Hu et al. (2021), "LoRA: Low-Rank Adaptation of Large Language Models"
- Dettmers et al. (2023), "QLoRA: Efficient Finetuning of Quantized LLMs"