Contents
- Overview
- RTX 4090 Cloud Pricing
- RTX 4090 Specifications
- VRAM and Model Fit
- Inference Performance
- Training Performance
- Cost Per Token Analysis
- When to Use RTX 4090
- Limitations and Trade-offs
- Cost vs A100 and H100
- Real-World Workload Examples
- FAQ
- Related Resources
- Sources
Overview
The RTX 4090 costs $0.34/hr on-demand on RunPod. Spot: $0.22/hr (35% off). 24GB GDDR6X, built for gaming but works for LLM inference and light training.
It's cheap, which is why budget teams use it. But there are tradeoffs: memory bandwidth is roughly half the A100's and under a third of the H100's. Training speed suffers. Batch sizes are limited. And GDDR6X heats up under sustained load, triggering thermal throttling during heavy inference.
RTX 4090 Cloud Pricing
| Provider | GPU | VRAM | Spot/hr | On-Demand/hr | Monthly (On-Demand) | Annual |
|---|---|---|---|---|---|---|
| RunPod | RTX 4090 | 24GB | $0.22 | $0.34 | $248 | $2,976 |
Data as of March 2026. RunPod is the primary cloud provider offering RTX 4090 rental. Vast.AI has some RTX 4090 listings from individual miners, but pricing is unstable (ranges $0.15-$0.40/hr depending on supply, time of day, and miner demand).
RTX 4090 Specifications
| Spec | RTX 4090 | A100 | H100 | B200 | Advantage |
|---|---|---|---|---|---|
| Memory | 24GB GDDR6X | 80GB HBM2e | 80GB HBM3 | 192GB HBM3e | B200 (8x) |
| Bandwidth | 1,008 GB/s | 1,935 GB/s | 3,350 GB/s | 8,533 GB/s | B200 (8.5x) |
| Peak FP32 | 82.6 TFLOPS | 19.5 TFLOPS | 67 TFLOPS | 660 TFLOPS | B200 (8x) |
| TF32 Tensor | 331 TFLOPS | 312 TFLOPS | 1,457 TFLOPS | 10,560 TFLOPS | B200 (32x) |
| FP8 Tensor | Not native | Not native | 5,825 TFLOPS | 42,240 TFLOPS | B200 (7.3x) |
| Memory Type | GDDR6X | HBM2e | HBM3 | HBM3e | HBM favors training |
| TDP | 450W | 400W | 700W | 1,000W | RTX 4090 (lowest) |
| Price/hr (On-Demand) | $0.34 | $1.19-$1.48 | $1.99-$3.78 | $5.98-$6.08 | RTX 4090 (cheapest) |
The 4090 is a gaming card. Its memory bandwidth (1,008 GB/s) is tuned for graphics, not tensor workloads. The A100 has 1.9x the bandwidth, the H100 3.3x, the B200 8.5x. For inference workloads where throughput isn't the bottleneck, the cheaper per-hour pricing makes up the gap.
VRAM and Model Fit
24GB is the critical constraint. Models that fit:
Native Precision (FP16):
- 7B parameter models: ~14-16GB (2 bytes per parameter in float16)
- 13B models: ~26-30GB (exceeds VRAM, requires quantization)
4-bit Quantization (QLoRA, bitsandbytes):
- 7B models: ~4GB (~0.5 bytes per parameter, plus quantization overhead)
- 13B models: ~8GB
- 30B models: ~18GB
- 70B models: ~35GB (exceeds VRAM, requires CPU offloading)
Inference with KV-Cache (batch=32):
- 7B models: ~10GB (model + KV cache + batch activations)
- 13B models: ~18GB
- 30B models: ~30GB+ (overflows, causes latency impact or OOM)
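The rules of thumb above can be sketched as a small estimator. This is a rough sketch, not a precise planner: the architecture parameters (layers, KV heads, head dimension) are illustrative values for a Mistral-7B-class model, and real deployments add framework and activation overhead on top.

```python
def weights_gb(params_b: float, bits: int) -> float:
    """Approximate weight memory in GB: params (billions) at a given precision."""
    return params_b * 1e9 * bits / 8 / 1e9

def kv_cache_gb(batch: int, seq_len: int, n_layers: int,
                n_kv_heads: int, head_dim: int, bytes_per_val: int = 2) -> float:
    """Rough KV-cache size: K and V stored per layer per token, FP16 values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val / 1e9

print(weights_gb(7, 16))   # 7B at FP16 -> 14.0 GB
print(weights_gb(7, 4))    # 7B at 4-bit -> 3.5 GB before overhead
# Mistral-7B-like shapes (32 layers, 8 KV heads, head dim 128), batch 32, 1K context
print(round(kv_cache_gb(32, 1024, 32, 8, 128), 1))  # -> ~4.3 GB
```

Adding weights, KV cache, and a few GB of overhead reproduces the fit estimates above: a 4-bit 7B model with a batch-32 KV cache lands near the ~10GB figure.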
VRAM Pressure and Degradation: Pushing past ~20GB of the 24GB triggers memory pressure. Throughput can degrade 10-20% as the runtime fights allocator fragmentation or spills to host memory. Running near-full (~23GB) risks latency spikes and OOM errors.
The 24GB ceiling is the defining limitation. The A100 has 80GB of far faster HBM2e. Use the 4090 for single-model inference (7B-13B max) or fine-tuning with aggressive quantization. Anything bigger needs an A100 or H100.
Inference Performance
Throughput (Tokens Per Second)
Benchmark: Serving Mistral 7B on a single RTX 4090 with KV-cache optimization.
RTX 4090 (24GB GDDR6X):
- Batch size 1: ~140-160 tok/s
- Batch size 8: ~180-200 tok/s
- Batch size 32: ~240-280 tok/s
- Batch size 64: ~280-320 tok/s (VRAM pressure, latency impact)
A100 PCIe (80GB HBM2e):
- Batch size 32: ~280-320 tok/s
- Batch size 128: ~350-400 tok/s
H100 PCIe (80GB HBM3):
- Batch size 32: ~700-750 tok/s
- Batch size 128: ~850-1,000 tok/s
The 4090 is 15-30% slower per token (GDDR6X bandwidth limits, smaller memory). Gap narrows at tiny batch sizes. Single-digit batches actually favor 4090 due to lower overhead.
Latency (P50 Latency Per Token)
- RTX 4090: 3-5ms per token (batch 8-32, no thermal throttling)
- RTX 4090 (throttled): 6-10ms per token (sustained load, GDDR6X heats up)
- A100: 2-3ms per token (batch 32-128)
- H100: 1.0-1.5ms per token (batch 128)
Latency is fine for batch jobs. Bad for interactive chat where <2ms matters.
Thermal throttling happens. The 4090 has a 450W TDP. After 10-15 minutes at full load, it hits 80-85°C. Thermal governors kick in, cut clocks, drop throughput 15-25%. A100 and H100 have better thermal design (built for 24/7 datacenters).
Cost Per Million Tokens
RTX 4090 inference (7B model, batch 32):
- Throughput: 260 tok/s
- Cost: $0.34/hr on-demand ≈ $0.36 per 1M tokens (1M tokens takes ~1.07 hrs at 260 tok/s)
- Cost (spot): $0.22/hr ≈ $0.24 per 1M tokens
A100 PCIe inference (7B model, batch 32):
- Throughput: 300 tok/s
- Cost: $1.19/hr (RunPod spot) ≈ $1.10 per 1M tokens (~0.93 hrs)
H100 PCIe inference (7B model, batch 32):
- Throughput: 850 tok/s
- Cost: $1.99/hr (RunPod spot) ≈ $0.65 per 1M tokens (~0.33 hrs)
The 4090 spot is roughly 64% cheaper per token than H100 spot ($0.24 vs $0.65 per 1M). For batch jobs, it wins on cost.
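Cost per million tokens follows directly from the hourly rate and sustained throughput. A minimal sanity-check using the mid-range throughput figures from the benchmarks above:

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_sec: float) -> float:
    """Dollars to generate 1M tokens at a steady throughput."""
    hours = 1_000_000 / tokens_per_sec / 3600
    return hourly_rate * hours

print(round(cost_per_million_tokens(0.22, 260), 2))  # 4090 spot      -> 0.24
print(round(cost_per_million_tokens(0.34, 260), 2))  # 4090 on-demand -> 0.36
print(round(cost_per_million_tokens(1.99, 850), 2))  # H100 spot      -> 0.65
```

The same function works for any GPU/throughput pair in this guide; only the two inputs change.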
Training Performance
Fine-Tuning with LoRA
Training a 7B model with LoRA (parameter-efficient fine-tuning):
- Batch size: 16
- Training time: 30-40 hours
- Cost (on-demand): $10.20-$13.60
- Cost (spot): $6.60-$8.80
Same job on A100 (spot): 20 hours, $23.80 cost.
The 4090 is slower (1.5-2x) but cheaper ($7-14 vs $24). Budget-constrained teams win on total cost. If speed matters (weekly iterations), A100 is better.
Pre-training (Larger Models)
Pre-training from scratch? Impractical. The bandwidth bottleneck makes 13B+ models inefficient, and the 4090 has no NVLink, so multi-GPU scaling runs over PCIe (12.8 GB/s here vs 900 GB/s NVLink, roughly 70x slower).
A single 4090 trains a 7B model at ~300 samples/sec (batch 32); even at that optimistic rate, 1T tokens would take weeks of continuous compute. Real pre-training needs 8-16+ tightly connected GPUs, and PCIe links between 4090s kill scaling efficiency.
Don't use 4090 for pre-training. Go A100 clusters minimum.
Cost Per Token Analysis
Batch Inference Workload (5M tokens/day)
Processing documents, logs, data annotation. Non-interactive, latency-flexible.
RTX 4090 setup (spot, $0.22/hr):
- Throughput: 260 tok/s aggregate
- Time to process 5M tokens: 19,231 seconds = 5.34 hours/day
- Cost per day: 5.34 hrs × $0.22 = $1.17
- Monthly: $35
- Annual: $426
A100 setup (spot, 1x $1.19/hr):
- Throughput: 300 tok/s
- Time to process 5M tokens: 16,667 seconds = 4.63 hours/day
- Cost per day: 4.63 hrs × $1.19 = $5.51
- Monthly: $165
- Annual: $1,980
The 4090 is ~79% cheaper annually ($426 vs $1,980) and ~13% slower on throughput; cost wins.
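The daily-batch arithmetic above generalizes to any token volume. A small helper, assuming steady throughput and no preemption (spot jobs in practice need retry headroom):

```python
def daily_batch_cost(tokens_per_day: float, tokens_per_sec: float,
                     hourly_rate: float) -> tuple[float, float]:
    """GPU hours and dollars per day to process a fixed daily token volume."""
    hours = tokens_per_day / tokens_per_sec / 3600
    return hours, hours * hourly_rate

# 5M tokens/day on a 4090 spot instance ($0.22/hr, 260 tok/s)
hours, cost = daily_batch_cost(5_000_000, 260, 0.22)
print(round(hours, 2))        # -> 5.34 hours/day
print(round(cost * 30, 0))    # roughly $35/month
```

Swapping in 300 tok/s and $1.19/hr reproduces the A100 line of the comparison.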
Interactive API Inference (100K tokens/day)
Low-throughput interactive service. Latency matters.
RTX 4090 (on-demand, $0.34/hr):
- Throughput: 260 tok/s
- Time to process 100K tokens: 385 seconds = 6.4 minutes/day
- Running cost (keep instance warm 24/7): $0.34/hr × 730 = $248/month
A100 (on-demand, $1.48/hr, Lambda):
- Throughput: 300 tok/s
- Time to process 100K tokens: 333 seconds = 5.5 minutes/day
- Running cost (keep instance warm 24/7): $1.48/hr × 730 = $1,080/month
The 4090 is 77% cheaper monthly. Per-token latency is roughly 1-2ms higher (noticeable but fine for non-real-time APIs).
When to Use RTX 4090
Budget-Conscious Batch Inference
Running 7B-13B models where latency doesn't matter. Batch jobs, overnight processing, document analysis. Cost-per-token favors 4090. Monthly: $248 on-demand or $160 spot (730 hrs continuous).
Stack three 4090s for parallel batch work: ~$744/month on-demand, less than one A100 at $869/month spot, with similar aggregate throughput.
Lightweight Fine-Tuning
Single 7B fine-tuning run: $7-14 spot, $10-14 on-demand. Slower (30-40 hrs vs 6-7 on H100), but acceptable for one-off jobs or R&D.
Fine-tune LoRA on 4090, deploy on same GPU for inference. Full pipeline cost: $20-30 per experiment.
Prototyping and Experimentation
Testing a new inference optimization? Spin up a 4090, test, iterate, kill. Cost-of-failure is low at $0.34/hr on-demand or $0.22/hr spot: a 4-hour experimental run costs ~$1.36 (or ~$0.88 spot), versus ~$8 for the same 4 hours on an H100 on-demand. Budget 20-30 runs: $27-41 on the 4090 vs $160-240 on the H100. The 4090 wins for low-risk iteration.
Development Workloads
Loading model weights, testing pipelines, validating datasets. The 4090 works for dev (not production). Reserve H100 for production. Development cost: $3-5/day on a 4090 vs $20-38/day on an H100 (8-10 hrs at $1.99-$3.78/hr).
Limited-Scale Batch Processing
Processing 1M documents at 512 tokens each = 512M tokens. The 4090 at 260 tok/s handles it in 548 hours. Cost: $186 on-demand or $121 spot.
Offline processing? Great. An A100 at $1.19/hr spot would cost ~$564 (474 hrs). The 4090 spot is ~79% cheaper.
Limitations and Trade-offs
- VRAM ceiling (24GB): 13B+ models need quantization. 70B requires aggressive 2-4-bit (quality loss). Bad for sensitive work (medical, legal).
- Bandwidth bottleneck: Training is slow. GDDR6X (1,008 GB/s) is ~48% slower than A100 (1,935 GB/s). Batch 64+ causes memory contention and latency spikes.
- Thermal throttling: GDDR6X heats up. After 10-15 minutes at full load, clocks drop 15-25% and throughput falls. Data center GPUs have better thermal design.
- Not for production: Latency is 2-3x higher than H100. Throughput is lower. It's a consumer GPU, not datacenter-spec, and thermal issues make 24/7 operation risky. Use for dev and low-volume inference only.
- Memory type matters: GDDR6X is designed for graphics; HBM is designed for compute. The performance gap widens with bigger models and batch sizes.
Cost vs A100 and H100
Hourly Rate
| GPU | On-Demand | Spot | Monthly (On-Demand) | Annual |
|---|---|---|---|---|
| RTX 4090 | $0.34 | $0.22 | $248 | $2,976 |
| A100 PCIe | $1.19 | $1.19* | $869 | $10,425 |
| H100 PCIe | $1.99 | $1.99* | $1,453 | $17,436 |
*A100/H100 shown at RunPod spot rates.
The 4090 on-demand ($0.34) is 71% cheaper than A100 spot ($1.19) and 83% cheaper than H100 spot ($1.99); even 4090 on-demand undercuts A100/H100 spot rates. Apples-to-apples on-demand: 4090 ($0.34) vs Lambda A100 ($1.48) is 77% cheaper.
Cost Per Task (Inference, 7B Model, 1M Tokens)
| GPU | Throughput | Time | Cost (On-Demand) | Cost Per Token |
|---|---|---|---|---|
| RTX 4090 | 260 tok/s | 1.07 hrs | $0.36 | $0.36/M tokens |
| A100 PCIe | 300 tok/s | 0.93 hrs | $1.10 | $1.10/M tokens |
| H100 PCIe | 850 tok/s | 0.33 hrs | $0.65 | $0.65/M tokens |
The 4090 wins on cost-per-token for batch (spot: $0.22/hr ≈ $0.24/M tokens). H100 wins on speed (~3.3x faster, at ~1.8x the on-demand cost-per-token).
Cost Per Task (Fine-Tuning, 7B Model, 100K Examples)
| GPU | Training Time | Cost (On-Demand) | Cost (Spot) |
|---|---|---|---|
| RTX 4090 | 30-40 hrs | $10.20-$13.60 | $6.60-$8.80 |
| A100 PCIe | 18-22 hrs | $21.42-$26.18 | $21.42-$26.18 |
| H100 PCIe | 6-7 hrs | $11.94-$13.93 | $11.94-$13.93 |
H100 nearly matches the 4090 on per-task cost despite a ~6x higher hourly rate, thanks to ~5x faster throughput. The 4090 is slightly cheaper on-demand and clearly cheaper on spot ($7-14 vs $12-14), but takes 5-6x longer.
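The per-task figures in the table are simply training hours times hourly rate. A sketch that reproduces the table rows:

```python
def finetune_cost(hours_low: float, hours_high: float, rate: float) -> tuple[float, float]:
    """Per-task cost range for a fine-tuning job billed by the hour."""
    return hours_low * rate, hours_high * rate

# (GPU, hours low, hours high, on-demand $/hr) from the table above
for gpu, lo, hi, rate in [("RTX 4090", 30, 40, 0.34),
                          ("A100 PCIe", 18, 22, 1.19),
                          ("H100 PCIe", 6, 7, 1.99)]:
    c_lo, c_hi = finetune_cost(lo, hi, rate)
    print(f"{gpu}: ${c_lo:.2f}-${c_hi:.2f}")
```

This makes the crossover visible: a fast, expensive GPU can tie or beat a cheap, slow one on per-task cost whenever its speedup exceeds its price premium.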
Real-World Workload Examples
Example 1: Document Processing Service
Company: document intelligence startup. Processing 50K PDFs/month, 1,000 tokens each (OCR + content extraction).
Monthly tokens: 50M. Need to complete within 30 days.
RTX 4090 on-demand ($0.34/hr):
- Throughput: 260 tok/s
- Hours needed: 50M / 260 = 192,308 seconds = 53.4 hours
- Cost: 53.4 hrs × $0.34 = $18.16/month
- Margin: compute is ~1-2% of revenue (assuming the customer pays $0.02-$0.05/doc = $1,000-2,500/month revenue)
- Profitability: high
RTX 4090 spot ($0.22/hr):
- Cost: 53.4 hrs × $0.22 = $11.75/month (with preemption tolerance)
- Margin: marginally better than on-demand (~$6/month in spot savings)
- Profitability: excellent
A100 on-demand ($1.48/hr via Lambda):
- Hours needed: 50M / 300 = 166,667 seconds = 46.3 hours
- Cost: 46.3 hrs × $1.48 = $68.51/month
- Cost: 46.3 hrs × $1.48 = $68.51/month
- Margin: still profitable, but compute costs nearly 4x the 4090's
- Profitability: lower margin
The 4090 is right for this workload. Cost-per-token wins, and the ~13% speed disadvantage gets absorbed by overnight batch processing.
Example 2: Internal LLM API (Low Volume)
Company: 10 developers, 100 API calls/day each, 200 input + 150 output tokens per call.
Daily: 10 × 100 × (200 + 150) = 350K tokens. Monthly: 10.5M tokens.
RTX 4090 on-demand ($0.34/hr):
- Keep instance running 24/7
- Cost: $0.34 × 730 = $248/month
- Token capacity: 260 tok/s × 86,400 sec/day = 22.4M tokens/day (more than enough)
- Utilization: 10.5M / (22.4M × 30) = 1.6% (very low)
Lambda A100 on-demand ($1.48/hr):
- Keep instance running 24/7
- Cost: $1.48 × 730 = $1,080/month
- Token capacity: 300 tok/s × 86,400 = 25.9M tokens/day
- Utilization: 10.5M / (25.9M × 30) = 1.4%
The 4090 is 77% cheaper. Both are over-provisioned. Consider an API instead (e.g., Mistral Small at $0.10/M input + $0.30/M output ≈ $2/month at this volume).
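At volumes this low, the hosted-API comparison is worth computing explicitly. A sketch using the Mistral Small prices quoted above (illustrative; check current pricing before deciding):

```python
def api_cost(in_tokens_m: float, out_tokens_m: float,
             in_price: float, out_price: float) -> float:
    """Monthly API cost in dollars; prices are per million tokens."""
    return in_tokens_m * in_price + out_tokens_m * out_price

# 10.5M tokens/month split 200 input : 150 output per call
in_m = 10.5 * 200 / 350    # 6.0M input tokens
out_m = 10.5 * 150 / 350   # 4.5M output tokens
print(round(api_cost(in_m, out_m, 0.10, 0.30), 2))  # -> 1.95 (vs $248 self-hosted)
```

The gap only closes when utilization of the always-on GPU climbs; below a few percent utilization, per-token API pricing wins by two orders of magnitude.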
FAQ
Is RTX 4090 suitable for production LLM serving?
Not ideal. Latency is 1.5-3x higher than A100/H100. Throughput is lower. VRAM caps at 13B without quantization. Thermal throttling happens. For production, use A100 or H100. For prototypes or low-volume (<1M tokens/day), the 4090 works.
Can I run a 70B model on RTX 4090?
Technically yes, with 2-4-bit quantization. 70B at 2-bit = ~18GB. But quality degrades (perplexity +10-50%). Avoid for quality-sensitive work. For summarization/extraction, 2-4-bit works fine.
How does RTX 4090 compare to RTX 3090?
The 4090 is 1.5-2x faster (same 24GB VRAM, better architecture). RTX 3090 on RunPod is $0.22/hr vs $0.34 for the 4090, making the 4090 about 55% more expensive. Pick 4090 for speed. 3090 is the budget option for cost-sensitive workloads.
Should I buy or rent RTX 4090?
Retail: $1,600-2,000. Cloud on-demand: $0.34/hr. Breakeven: 4,700-5,900 hours (6-8 months at 24/7). >50% utilization for 12+ months? Buy. Sporadic or <12 months? Rent.
Home power: ~$35-50/month for the card alone (320-450W × 730 hrs × $0.15/kWh), more counting whole-system draw. Annual: $2,000 purchase + ~$420-600 power ≈ $2,400-2,600. Cloud annual: $2,976 on-demand or $1,927 spot. Roughly comparable if you use it consistently.
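The breakeven arithmetic can be sketched as follows; the 0.45 kW average draw and $0.15/kWh rate are assumptions, not measurements, so substitute your own numbers:

```python
def breakeven_hours(purchase_price: float, cloud_rate: float,
                    power_kw: float = 0.0, power_price: float = 0.15) -> float:
    """Hours of use at which buying beats renting, charging electricity
    against the owned card (power_kw is average draw in kilowatts)."""
    return purchase_price / (cloud_rate - power_kw * power_price)

# Ignoring power: $1,800 card vs $0.34/hr on-demand
print(round(breakeven_hours(1800, 0.34)))         # -> 5294 hours
# Counting ~0.45 kW draw at $0.15/kWh
print(round(breakeven_hours(1800, 0.34, 0.45)))   # -> 6606 hours (~9 months at 24/7)
```

Against spot pricing ($0.22/hr) the denominator shrinks, pushing breakeven past 11,000 hours, which is why sporadic users should rent.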
Can I use RTX 4090 for video processing?
Yes. Great for rendering and NVENC (video codec acceleration). 10-15x faster than CPU. Popular for batch transcoding. At $0.34/hr, 1TB overnight is cheap. Monthly: $248.
Does RTX 4090 support TensorRT and model optimization?
Yes. TensorRT, ONNX, and quantization frameworks work. Quantize to fit 24GB, then optimize speed with TensorRT.
What is the memory bandwidth of RTX 4090?
1,008 GB/s (GDDR6X). A100 is 1,935 GB/s (1.9x faster). H100 is 3,350 GB/s (3.3x faster). Bandwidth matters for training and for large-batch autoregressive decoding; at small batch sizes, the 4090's lower hourly price offsets the gap.
Can I use RTX 4090 for multi-GPU training?
Not recommended. No NVLink. PCIe between GPUs is roughly 70x slower (12.8 GB/s vs 900 GB/s NVLink). Multi-GPU training is impractical (communication overhead swamps compute).
Single-GPU fine-tuning works. Multi-GPU fine-tuning (gradient reduction) is slow.
How does RTX 4090 cloud compare to home RTX 4090?
Home: $1,800 initial + ~$35-50/month power (320-450W × 730 hrs × $0.15/kWh) = ~$420-600/year. Cloud: $0.34/hr = $2,976/year continuous, or $248/month.
Home is cheaper above ~50% sustained utilization over a year. Cloud wins for sporadic use (<25%). Breakeven is roughly 5,000-6,000 GPU-hours.
What thermal issues should I be aware of?
GDDR6X heats up under sustained load. Above ~80°C (common under sustained 300W+ draw), throttling kicks in: clock speeds drop 15-25% and throughput falls. Data center GPUs (A100, H100) have better thermal design. For 24/7, the A100 is more reliable. For dev/test (8-10 hrs/day), thermal issues are minimal.
Related Resources
- NVIDIA GPU Pricing Comparison
- NVIDIA RTX 4090 Specifications
- NVIDIA H100 Price Guide
- NVIDIA A100 Price Guide
- RunPod GPU Pricing
Sources
- NVIDIA RTX 4090 Specifications
- RunPod GPU Pricing
- DeployBase GPU Pricing Tracker (March 2026 observations)