Best GPUs for Fine-Tuning LLMs: VRAM & Cost Guide

Deploybase · March 2, 2026 · GPU Cloud

Best GPU for Fine-Tuning: Overview

Fine-tuning GPU choice comes down to model size, budget, and training time. VRAM is the bottleneck: a 7B model needs about 16GB with LoRA, while full fine-tuning a 70B model needs 560GB+ (LoRA cuts that to roughly 80GB).

Quick Reference:

  • 7B models: RTX 4090 ($0.34/hr) or L40S ($0.79/hr) is sufficient
  • 13B models: A100 PCIe ($1.19/hr) with LoRA
  • 70B models: a single 80GB H100 or A100 with LoRA; 8x A100 for full fine-tuning
  • Cost: $0.34 to $5.98/hr depending on GPU and LoRA choice

VRAM Requirements by Model Size

Memory Math

VRAM needed depends on three factors:

  1. Model weights in bfloat16 (2 bytes per parameter)
  2. Optimizer states (Adam adds roughly 4x the weight memory: gradients plus momentum and variance estimates)
  3. Activations (batch size, sequence length, layers)

For a 7B parameter model in full fine-tuning with batch size 1:

  • Weights: 7B * 2 bytes = 14GB
  • Optimizer states: 14GB * 4 = 56GB
  • Activations: 4-6GB (batch 1, sequence 2048)
  • Total: 74GB minimum

This exceeds most single GPUs. Solution: LoRA or gradient checkpointing.
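The memory math above can be folded into a quick estimator. This is a back-of-envelope sketch: `activation_gb` is a rough constant for batch size 1 and sequence length 2048, not a measured value.

```python
def full_finetune_vram_gb(params_b: float,
                          bytes_per_param: int = 2,
                          optimizer_multiplier: int = 4,
                          activation_gb: float = 4.0) -> float:
    """Rough VRAM estimate (GB) for full fine-tuning at batch size 1.

    weights:   params * 2 bytes (bfloat16)
    optimizer: ~4x weight memory (gradients + Adam momentum + variance)
    """
    weights_gb = params_b * bytes_per_param       # e.g. 7B params -> 14 GB
    optimizer_gb = weights_gb * optimizer_multiplier
    return weights_gb + optimizer_gb + activation_gb

print(full_finetune_vram_gb(7))  # 14 + 56 + 4 = 74.0 GB
```

Plugging in 7 reproduces the 74GB minimum above.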

VRAM by Model Size (Full Fine-Tuning, Batch 1)

| Model Size | Min VRAM | Recommended | GPU Examples |
|---|---|---|---|
| 7B | 64GB | 80GB | 1x A100 80GB |
| 13B | 120GB | 160GB | 2x A100 |
| 34B | 280GB | 320GB | 4x A100 |
| 70B | 560GB | 640GB | 8x A100 or 8x H100 |

Full fine-tuning scales poorly. 70B models need 8-GPU setups costing thousands per day. For pricing on GPUs, explore GPU options.

VRAM with LoRA (Rank 8)

LoRA adds trainable adapters without storing optimizer states for base model.

| Model Size | Min VRAM | Recommended | GPU Examples |
|---|---|---|---|
| 7B | 16GB | 24GB | RTX 4090, L40S |
| 13B | 24GB | 32GB | A100 PCIe, L40S |
| 34B | 48GB | 64GB | A100 PCIe, H100 PCIe |
| 70B | 80GB | 96GB | A100 SXM, H100 PCIe |

LoRA makes fine-tuning practical on consumer and mid-range GPUs. See A100 vs H100 comparison for performance details.
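Mechanically, LoRA keeps the pretrained weight W frozen and learns a low-rank update BA, scaled by alpha/r. A pure-Python toy with tiny, made-up shapes (real adapters sit inside each attention projection, with much larger dimensions):

```python
# Minimal LoRA forward pass: y = W x + (alpha / r) * B (A x).
# Only A and B are trained; the frozen W never changes.

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    base = matvec(W, x)               # frozen pretrained projection
    delta = matvec(B, matvec(A, x))   # low-rank update, rank r
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Toy shapes: W is 3x3, A is r x 3, B is 3 x r (r = 2).
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
A = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]
B = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]
print(lora_forward(W, A, B, [1.0, 2.0, 3.0]))
```

Because only A and B receive gradients, optimizer states exist only for a few million adapter parameters, which is where the VRAM savings in the table come from.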

QLoRA (Quantization + LoRA)

Quantizing base model to 4-bit reduces VRAM further:

| Model Size | Min VRAM | GPU Examples |
|---|---|---|
| 7B | 8GB | RTX 4090 with care |
| 13B | 10GB | RTX 4090, A10 |
| 34B | 16GB | RTX 4090, L4 |
| 70B | 24GB | L40S, A100 PCIe |

QLoRA sacrifices speed for VRAM. Training 70B on L40S is feasible but takes 2-3x longer than A100 SXM.

GPU Comparison Table

March 2026 Pricing and Specs

| GPU | VRAM | Bandwidth | Cost/hr | Provider |
|---|---|---|---|---|
| RTX 4090 | 24GB | 1,008 GB/s | $0.34 | RunPod |
| L4 | 24GB | 300 GB/s | $0.44 | RunPod |
| L40S | 48GB | 864 GB/s | $0.79 | RunPod |
| A100 PCIe | 80GB | 2,039 GB/s | $1.19 | RunPod |
| A100 SXM | 80GB | 2,039 GB/s | $1.39 | RunPod |
| H100 PCIe | 80GB | 2,000 GB/s | $1.99 | RunPod |
| H100 SXM | 80GB | 3,456 GB/s | $2.69 | RunPod |
| H200 | 141GB | 4,800 GB/s | $3.59 | RunPod |
| B200 | 192GB | 8,000 GB/s | $5.98 | RunPod |

Pricing reflects March 2026 rates. Lambda and CoreWeave offer similar models at different prices.

Cost Per GB VRAM

| GPU | $/GB/hr |
|---|---|
| RTX 4090 | $0.0142 |
| L40S | $0.0165 |
| A100 PCIe | $0.0149 |
| H100 PCIe | $0.0249 |
| H100 SXM | $0.0336 |
| B200 | $0.0311 |
| H200 | $0.0255 |

RTX 4090 is cheapest by VRAM cost. However, slower training means higher total cost for long jobs.
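The $/GB figures are just hourly price divided by VRAM; a few lines reproduce and rank them (prices and VRAM taken from the March 2026 table above):

```python
# Recompute the $/GB/hr column and sort cheapest-first.
gpus = {  # name: (VRAM GB, $/hr)
    "RTX 4090": (24, 0.34), "L40S": (48, 0.79), "A100 PCIe": (80, 1.19),
    "H100 PCIe": (80, 1.99), "H100 SXM": (80, 2.69),
    "H200": (141, 3.59), "B200": (192, 5.98),
}
ranked = sorted(gpus.items(), key=lambda kv: kv[1][1] / kv[1][0])
for name, (vram, price) in ranked:
    print(f"{name:10s} ${price / vram:.4f}/GB/hr")
```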

Speed Comparison (Tokens/Second During Training)

For 7B model with LoRA, batch size 4:

| GPU | Tokens/sec | Time for 10M tokens |
|---|---|---|
| RTX 4090 | 480 | 5.8 hours |
| L40S | 640 | 4.3 hours |
| A100 PCIe | 920 | 3.0 hours |
| H100 PCIe | 1,440 | 1.9 hours |

H100 PCIe is 3x faster than RTX 4090 (1,440 vs 480 tokens/sec). For the job above:

  • RTX 4090: 5.8 hrs * $0.34 = $1.97
  • H100 PCIe: 1.9 hrs * $1.99 = $3.78

The H100 run costs roughly twice as much but finishes in a third of the time (iteration speed matters).
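Time and cost fall straight out of throughput; note that at these tokens/sec rates, the listed times correspond to a job of roughly 10M tokens:

```python
# Derive training time and cost from throughput (tokens/sec).
JOB_TOKENS = 10_000_000

gpus = {  # name: (tokens/sec, $/hr)
    "RTX 4090": (480, 0.34), "L40S": (640, 0.79),
    "A100 PCIe": (920, 1.19), "H100 PCIe": (1440, 1.99),
}
for name, (tps, rate) in gpus.items():
    hours = JOB_TOKENS / tps / 3600
    print(f"{name:10s} {hours:4.1f} h  ${hours * rate:.2f}")
```

Scale `JOB_TOKENS` to your dataset to budget a run before renting anything.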

LoRA vs Full Fine-Tuning

Quality Comparison

Full fine-tuning updates all weights. Quality is highest but expensive.

LoRA trains only rank-8 (or rank-4) adapters. Quality depends on rank selection.

| Approach | VRAM (7B) | Quality | Training Time |
|---|---|---|---|
| Full FT | 64GB+ | 100% | 1.0x baseline |
| LoRA-8 | 16GB | 94-96% | 1.0x baseline |
| LoRA-4 | 12GB | 90-92% | 1.0x baseline |
| QLoRA-4 | 8GB | 88-90% | 2.5x baseline |

LoRA-8 captures 95% of quality at 1/4 the VRAM. This is the practical standard.

Cost Example: Fine-Tune Llama 2 7B

Task: Fine-tune on 10B tokens of domain-specific data.

Full Fine-Tuning

  • Required: 8x A100 setup ($21.60/hr)
  • Training time: 15 hours
  • Cost: 15 * $21.60 = $324

LoRA-8

  • Required: L40S ($0.79/hr)
  • Training time: 15 hours
  • Cost: 15 * $0.79 = $11.85

Savings: 96.3%

Quality difference: 95-96% versus 100%. For most applications, negligible.

Single GPU vs Multi-GPU Training

Distributed Training Overhead

Adding GPUs reduces training time but introduces overhead:

  • Data loading synchronization
  • Gradient aggregation
  • Network latency between GPUs

Efficiency typically drops 10-20% per added GPU.

Scaling Laws

For a 13B model with LoRA:

| Setup | Training Time | Throughput | Cost |
|---|---|---|---|
| 1x A100 | 20 hours | 850 T/s | $23.80 |
| 2x A100 | 11 hours | 1,550 T/s | $30.58 |
| 4x A100 | 6 hours | 2,800 T/s | $34.32 |

Going from 2 to 4 GPUs saves only 5 hours while raising the total cost. Diminishing returns kick in.
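A quick way to quantify the overhead is scaling efficiency: measured throughput divided by ideal linear scaling (numbers from the table above):

```python
# Scaling efficiency: measured throughput vs. ideal linear scaling.
base_tps = 850  # 1x A100 throughput
for n_gpus, tps in [(2, 1550), (4, 2800)]:
    efficiency = tps / (base_tps * n_gpus)
    print(f"{n_gpus}x A100: {efficiency:.0%} of linear scaling")
```

The 2x setup runs at ~91% efficiency and the 4x setup at ~82%, consistent with the 10-20% per-GPU overhead noted above.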

Multi-GPU Strategy

For most fine-tuning, use 1-2 GPUs. Beyond 2 GPUs, infrastructure complexity (Distributed Data Parallel, DeepSpeed) outweighs time savings for typical workloads.

Exception: 70B models benefit from 2+ GPUs even with LoRA.

Cloud GPU Pricing

RunPod Pricing (March 2026)

RunPod offers on-demand and reserved pricing:

| GPU | On-Demand | Reserved (annual) |
|---|---|---|
| RTX 4090 | $0.34/hr | $1,700/yr |
| L40S | $0.79/hr | $3,960/yr |
| A100 PCIe | $1.19/hr | $5,950/yr |
| H100 PCIe | $1.99/hr | $9,950/yr |

Reserved pricing reduces hourly cost by 40-45%. Check H100 pricing for detailed cost breakdowns.
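Whether reserved pricing pays off depends on utilization; the break-even point is simply the annual price divided by the on-demand rate (RunPod numbers from the table above):

```python
# Hours per year above which reserved pricing beats on-demand.
plans = {  # name: ($/hr on-demand, $/yr reserved)
    "RTX 4090": (0.34, 1700), "L40S": (0.79, 3960),
    "A100 PCIe": (1.19, 5950), "H100 PCIe": (1.99, 9950),
}
for name, (hourly, annual) in plans.items():
    breakeven = annual / hourly
    print(f"{name:10s} break-even at {breakeven:,.0f} hrs/yr "
          f"({breakeven / 8760:.0%} utilization)")
```

All four GPUs break even around 5,000 hours per year (~57% utilization), so reserved pricing only makes sense for near-continuous workloads.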

Lambda Pricing (March 2026)

| GPU | Cost/hr |
|---|---|
| A10 | $0.86 |
| A100 PCIe | $1.48 |
| GH200 | $1.99 |
| H100 PCIe | $2.86 |
| H100 SXM | $3.78 |

Lambda's SXM ($3.78/hr) is more expensive than RunPod's H100 SXM ($2.69/hr). Lambda's PCIe ($2.86/hr) is also slightly more expensive than RunPod H100 PCIe ($1.99/hr).

CoreWeave Pricing (March 2026)

CoreWeave focuses on large-scale deployments (8x GPU pods):

| Pod | Cost/hr |
|---|---|
| 8x A100 | $21.60 |
| 8x H100 | $49.24 |
| 8x H200 | $50.44 |
| 8x B200 | $68.80 |

More expensive per GPU than RunPod's single-GPU rates, but the pods come as interconnected 8x nodes; you commit to multi-GPU from the start.

Cost-Efficiency Rankings

For fine-tuning a 13B model with LoRA on 10B tokens:

Rank 1: RTX 4090 + LoRA

  • Hardware cost: $0.34/hr
  • Training time: 12.5 hours
  • Total: $4.25
  • Best for: Budget-conscious teams

Rank 2: L40S + LoRA

  • Hardware cost: $0.79/hr
  • Training time: 9.5 hours
  • Total: $7.51
  • Best for: Balanced cost and speed

Rank 3: A100 PCIe + LoRA

  • Hardware cost: $1.19/hr
  • Training time: 6.5 hours
  • Total: $7.74
  • Best for: Speed matters slightly more

Rank 4: H100 PCIe + LoRA

  • Hardware cost: $1.99/hr
  • Training time: 4 hours
  • Total: $7.96
  • Best for: Iteration speed critical

For this 13B LoRA task, cost differences are small ($4.25 to $7.96). Choose based on iteration speed requirements.

For 70B + Full Fine-Tuning

Costs explode without LoRA. An 8x A100 pod at $21.60/hr for 40 hours = $864, and larger models or longer runs scale up from there.

Full fine-tuning is only viable at scale (models used in production, training many variants).

Model-Specific Recommendations

7B Models (Mistral 7B, Llama 2 7B)

Best Choice: RTX 4090 + LoRA

  • Cost: ~$4
  • Training time: 12 hours for 10B tokens
  • Reasoning: Sufficient VRAM, cost-effective

Alternative: L4 + QLoRA

  • Cost: ~$3
  • Training time: 20 hours (2x slower)
  • Trade-off: Cheaper, slower

13B Models (Llama 2 13B)

Best Choice: L40S + LoRA

  • Cost: ~$8
  • Training time: 9 hours for 10B tokens
  • Reasoning: Balanced speed and cost

Alternative: A100 PCIe + LoRA (for faster iteration)

  • Cost: ~$8
  • Training time: 6 hours
  • Trade-off: Slightly faster for same cost

34B Models (Code Llama 34B)

Best Choice: A100 PCIe + LoRA or 2x L40S

  • Cost: ~$12-16 per 10B tokens
  • Training time: 8-10 hours
  • Reasoning: Need larger VRAM for 34B

70B Models (Llama 2 70B)

Best Choice: H100 PCIe + LoRA

  • Cost: ~$20 per 10B tokens
  • Training time: 10 hours
  • Reasoning: 80GB VRAM required, speed justifies cost

Alternative: 2x A100 SXM + LoRA

  • Cost: ~$42 (2 GPUs * $1.39/hr * 15 hours)
  • Training time: 15 hours
  • Trade-off: More GPUs, roughly double the cost

Optimization Techniques

Memory Optimization

Gradient Checkpointing: Trade computation for memory. Recompute activations instead of storing them. Reduces VRAM by 20-30%, increases compute time by 20%.

Flash Attention: Use Flash Attention implementation for lower VRAM usage during attention computation. Especially helpful for long sequences.

Mixed Precision: Train in bfloat16 instead of float32. Halves VRAM, no quality loss on modern GPUs.

Speed Optimization

Compiler: Use torch.compile (TorchInductor, which emits Triton kernels) for a 10-15% speedup on A100/H100.

FSDP: Fully Sharded Data Parallel for multi-GPU scaling. Distributes model across GPUs more efficiently than simple DDP.

Token Dropping: Skip gradient computation on some tokens (randomly). Reduces compute by 20%, minimal quality impact.
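As a rough planning aid, the quoted reductions can be composed multiplicatively. This is a deliberate simplification: in practice each technique targets a different slice of memory (checkpointing hits activations, quantization hits weights), so treat the result as a ballpark, not a measurement.

```python
# Back-of-envelope: compose quoted VRAM reductions multiplicatively.
def apply_reductions(baseline_gb, reductions):
    vram = baseline_gb
    for name, frac in reductions:
        vram *= (1 - frac)
        print(f"after {name:22s}: {vram:5.1f} GB")
    return vram

# Starting from a hypothetical 24 GB LoRA footprint:
apply_reductions(24.0, [
    ("gradient checkpointing", 0.25),  # quoted above as 20-30%
    ("4-bit quantization", 0.55),      # quoted above as 50-60%
])
```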

Combined: QLoRA on RTX 4090

  • Model: Llama 2 70B
  • Quantization: 4-bit
  • LoRA: rank 8
  • Gradient checkpointing: enabled
  • Flash Attention: enabled
  • Batch size: 2

Results:

  • VRAM used: 19.5GB (of 24GB)
  • Throughput: 280 tokens/sec
  • Training 10M tokens: 10 hours
  • Cost: $3.40 total

This configuration makes 70B fine-tuning accessible on consumer hardware.

Advanced Optimization Strategies

Production Fine-Tuning Pipelines

Real production systems combine multiple optimization techniques:

Scenario: Fine-tune 13B model weekly on 100B new tokens

Setup:

  • Hardware: 2x L40S ($0.79/hr each)
  • Optimization: LoRA-8 + gradient checkpointing + Flash Attention
  • Batch size: 4 per GPU (8 total)
  • Data preprocessing: Parallel on CPU

Cost breakdown:

  • Data prep: 2 hours = $0 (CPU-only)
  • Training: 50 hours = 50 * $1.58 = $79
  • Evaluation: 2 hours = $3.16
  • Total: $82.16 per week

Cost per token trained: roughly $0.82 per billion tokens ($82.16 / 100B).

This is excellent efficiency; a single A100 would cost $200+ for the same job.
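The weekly bill is simple arithmetic; a small sketch for budgeting variations of this pipeline (phase durations are the ones assumed above):

```python
# Weekly pipeline cost for the 2x L40S setup described above.
gpu_rate = 2 * 0.79  # two L40S at $0.79/hr each
phases = {  # phase: (hours, $/hr)
    "data prep (CPU only)": (2, 0.0),
    "training": (50, gpu_rate),
    "evaluation": (2, gpu_rate),
}
total = sum(hours * rate for hours, rate in phases.values())
print(f"weekly total: ${total:.2f}")
```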

Quantization Strategies

For budget-constrained scenarios:

4-bit Quantization (QLoRA):

  • VRAM reduction: 50-60%
  • Speed penalty: 20-30%
  • Quality: 92-95% of full precision
  • Best for: Tight VRAM budgets

8-bit Quantization:

  • VRAM reduction: 30-40%
  • Speed penalty: 10-15%
  • Quality: 96-98% of full precision
  • Best for: Balanced approach

INT8 vs NF4:

  • INT8: Simpler, faster inference, slightly worse quality
  • NF4: Better quality, slower inference, more complex

For production, NF4 + 8-bit optimizer states is standard.

Real-World Benchmarks

Fine-Tune Mistral 7B on Custom Dataset

Setup: L40S + LoRA-8, batch size 4, 10k training samples (50B tokens)

| Phase | Time | Cost |
|---|---|---|
| Data preparation | 0.5 hr | $0.40 |
| Fine-tuning (50B tokens) | 60 hrs | $47.40 |
| Evaluation | 0.5 hr | $0.40 |
| Total | 61 hrs | $48.20 |

Cost per million tokens: $0.0009

Fine-Tune Llama 2 70B on Code Dataset

Setup: H100 PCIe + LoRA-16, batch size 2, 20k training samples (100B tokens)

| Phase | Time | Cost |
|---|---|---|
| Data preparation | 1 hr | $1.99 |
| Fine-tuning (100B tokens) | 70 hrs | $139.30 |
| Evaluation | 1 hr | $1.99 |
| Total | 72 hrs | $143.28 |

Cost per million tokens: $0.0014

FAQ

Q: Is fine-tuning worth it versus prompt engineering?

Fine-tuning is worth it for repeated, specialized tasks. Break-even point: 1,000+ inference calls. For one-off tasks, prompt engineering is cheaper.

Q: Should I fine-tune or train from scratch?

Fine-tune. Training from scratch costs millions and requires massive compute. Only large companies with proprietary datasets do it.

Q: Can I fine-tune on consumer GPU (RTX 4090)?

Yes. With LoRA and QLoRA, fine-tune 7B-34B models on RTX 4090. 70B requires quantization or multiple GPUs.

Q: What's the difference between LoRA rank 4, 8, 16?

Higher rank captures more information but needs more VRAM and reduces efficiency. Rank 8 is the standard. Use rank 4 for constrained VRAM, rank 16 only if quality suffers.

Q: How many tokens should I use for fine-tuning?

Quality plateaus around 10B-20B tokens for smaller models. 70B models benefit from 50B+ tokens. More data beats better hardware.

Q: Which provider should I use?

RunPod for the best per-GPU pricing. Lambda for reliability. CoreWeave for large multi-GPU setups. All three work well.

Q: How should I handle batch size when VRAM is tight?

Reduce the micro-batch size first, down to 1 if needed, and use gradient accumulation to keep the effective batch size unchanged; smaller micro-batches need less activation memory. If VRAM is still exhausted at batch size 1, switch to QLoRA. Adjust batch size before sacrificing quality.
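Shrinking the micro-batch doesn't have to change training dynamics: gradient accumulation sums scaled gradients over k small batches before applying one optimizer step, reproducing the large-batch gradient with a fraction of the activation memory. A toy scalar example (made-up data, squared-error loss):

```python
# Toy gradient accumulation: k micro-batches of size 1 reproduce the
# gradient of one batch of size k (for a mean-reduced loss).
def grad(w, x, y):            # d/dw of 0.5 * (w*x - y)^2
    return (w * x - y) * x

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
w = 0.5

# Full batch: average gradient over all 4 examples at once.
full = sum(grad(w, x, y) for x, y in data) / len(data)

# Accumulated: one example at a time, scaled by 1/k, applied once.
acc = 0.0
for x, y in data:
    acc += grad(w, x, y) / len(data)

print(full == acc)  # prints True: same update, less activation memory
```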

Q: When should I move from fine-tuning to custom training?

Fine-tune when you have 10M+ high-quality tokens. Train from scratch only if you have 100B+ proprietary tokens and budget for weeks of training. Most teams fine-tune.

Q: Can I share a fine-tuned model with others?

Yes. Save the LoRA adapters (usually 10-100MB), share with others. They load base model + adapters. Adapters are small and easy to distribute. For proprietary models, check licensing terms.
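Adapter size is easy to estimate: each adapted layer stores two rank-r matrices. The shapes below are assumptions for a generic 7B-class model (32 layers, hidden size 4096, adapters on two attention projections), not properties of any specific checkpoint:

```python
# Rough LoRA adapter size for an assumed 7B-class architecture.
layers, hidden, rank, targets = 32, 4096, 8, 2
params_per_target = 2 * hidden * rank   # A (r x d) plus B (d x r)
total_params = layers * targets * params_per_target
size_mb = total_params * 2 / 1e6        # 2 bytes per param (bfloat16)
print(f"{total_params:,} params ~= {size_mb:.1f} MB")
```

About 4M parameters, roughly 8MB in bfloat16, which is why shipping adapters is so much easier than shipping full model weights.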

Sources

  • RunPod Pricing (March 2026)
  • Lambda Pricing (March 2026)
  • CoreWeave Pricing (March 2026)
  • Fine-tuning VRAM Requirements Study
  • LoRA Paper: Hu et al. 2021
  • QLoRA Paper: Dettmers et al. 2023