Contents
- Overview
- RTX 4090 Cloud Pricing
- RTX 4090 Specifications
- VRAM and Model Fit
- Inference Performance
- Training Performance
- Cost Per Token Analysis
- When to Use RTX 4090
- Limitations and Trade-offs
- Cost vs A100 and H100
- Real-World Workload Examples
- FAQ
- Related Resources
- Sources
Overview
The RTX 4090 costs $0.34/hr on-demand on RunPod. Spot: $0.22/hr (35% off). 24GB GDDR6X, built for gaming but works for LLM inference and light training.
It's cheap, which is why budget teams use it. But there are tradeoffs: memory bandwidth is roughly half the A100's and under a third of the H100's. Training speed suffers. Batch sizes are limited. And GDDR6X heats up under sustained load, triggering thermal throttling during heavy inference.
RTX 4090 Cloud Pricing
| Provider | GPU | VRAM | Spot/hr | On-Demand/hr | Monthly (On-Demand) | Annual |
|---|---|---|---|---|---|---|
| RunPod | RTX 4090 | 24GB | $0.22 | $0.34 | $248 | $2,976 |
Data as of March 2026. RunPod is the primary cloud provider offering RTX 4090 rental. Vast.AI has some RTX 4090 listings from individual miners, but pricing is unstable (ranges $0.15-$0.40/hr depending on supply, time of day, and miner demand).
RTX 4090 Specifications
| Spec | RTX 4090 | A100 | H100 | B200 | Advantage |
|---|---|---|---|---|---|
| Memory | 24GB GDDR6X | 80GB HBM2e | 80GB HBM3 | 192GB HBM3e | B200 (8x) |
| Bandwidth | 1,008 GB/s | 1,935 GB/s | 3,350 GB/s | 8,533 GB/s | B200 (8.5x) |
| Peak FP32 | 82.6 TFLOPS | 19.5 TFLOPS | 67 TFLOPS | 660 TFLOPS | B200 (8x) |
| TF32 Tensor | 331 TFLOPS | 312 TFLOPS | 1,457 TFLOPS | 10,560 TFLOPS | B200 (32x) |
| FP8 Tensor | Not native | Not native | 5,825 TFLOPS | 42,240 TFLOPS | B200 (7.3x) |
| Memory Type | GDDR6X | HBM2e | HBM3 | HBM3e | HBM favors training |
| TDP | 450W | 400W | 700W | 1,000W | RTX 4090 (lowest) |
| Price/hr (On-Demand) | $0.34 | $1.19-$1.48 | $1.99-$3.78 | $5.98-$6.08 | RTX 4090 (cheapest) |
The 4090 is a gaming card. Its memory bandwidth (1,008 GB/s) is tuned for graphics, not tensor workloads. The A100 has 1.9x the bandwidth, the H100 3.3x, the B200 8.5x. For inference workloads where throughput isn't the bottleneck, the cheaper per-hour pricing makes up the gap.
VRAM and Model Fit
24GB is the critical constraint. Models that fit:
Native Precision (FP16):
- 7B parameter models: ~14-16GB (2 bytes per parameter in float16)
- 13B models: ~26-30GB (exceeds VRAM, requires quantization)
4-bit Quantization (QLoRA, bitsandbytes):
- 7B models: ~4GB (~0.5 bytes per parameter, plus quantization overhead)
- 13B models: ~8GB
- 30B models: ~18GB
- 70B models: ~35GB (exceeds VRAM, requires CPU offloading)
Inference with KV-Cache (batch=32):
- 7B models: ~10GB (model + KV cache + batch activations)
- 13B models: ~18GB
- 30B models: ~30GB+ (overflows, causes latency impact or OOM)
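The rules of thumb above can be sketched as a small estimator. This is a rough sketch, not a precise planner: the architecture parameters (layers, KV heads, head dimension) are illustrative values for a Mistral-7B-class model, and real deployments add framework and activation overhead on top.

```python
def weights_gb(params_b: float, bits: int) -> float:
    """Approximate weight memory in GB: params (billions) at a given precision."""
    return params_b * 1e9 * bits / 8 / 1e9

def kv_cache_gb(batch: int, seq_len: int, n_layers: int,
                n_kv_heads: int, head_dim: int, bytes_per_val: int = 2) -> float:
    """Rough KV-cache size: K and V stored per layer per token, FP16 values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val / 1e9

print(weights_gb(7, 16))   # 7B at FP16 -> 14.0 GB
print(weights_gb(7, 4))    # 7B at 4-bit -> 3.5 GB before overhead
# Mistral-7B-like shapes (32 layers, 8 KV heads, head dim 128), batch 32, 1K context
print(round(kv_cache_gb(32, 1024, 32, 8, 128), 1))  # -> ~4.3 GB
```

Adding weights, KV cache, and a few GB of overhead reproduces the fit estimates above: a 4-bit 7B model with a batch-32 KV cache lands near the ~10GB figure.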
VRAM Pressure and Degradation: Pushing past ~20GB of the 24GB triggers memory pressure. Throughput can degrade 10-20% as the runtime fights allocator fragmentation or spills to host memory. Running near-full (~23GB) risks latency spikes and OOM errors.
The 24GB ceiling is the defining limitation. The A100 has 80GB of far faster HBM2e. Use the 4090 for single-model inference (7B-13B max) or fine-tuning with aggressive quantization. Anything bigger needs an A100 or H100.
Inference Performance
Throughput (Tokens Per Second)
Benchmark: Serving Mistral 7B on a single RTX 4090 with KV-cache optimization.
RTX 4090 (24GB GDDR6X):
- Batch size 1: ~140-160 tok/s
- Batch size 8: ~180-200 tok/s
- Batch size 32: ~240-280 tok/s
- Batch size 64: ~280-320 tok/s (VRAM pressure, latency impact)
A100 PCIe (80GB HBM2e):
- Batch size 32: ~280-320 tok/s
- Batch size 128: ~350-400 tok/s
H100 PCIe (80GB HBM3):
- Batch size 32: ~700-750 tok/s
- Batch size 128: ~850-1,000 tok/s
The 4090 is 15-30% slower per token (GDDR6X bandwidth limits, smaller memory). Gap narrows at tiny batch sizes. Single-digit batches actually favor 4090 due to lower overhead.
Latency (P50 Latency Per Token)
- RTX 4090: 3-5ms per token (batch 8-32, no thermal throttling)
- RTX 4090 (throttled): 6-10ms per token (sustained load, GDDR6X heats up)
- A100: 2-3ms per token (batch 32-128)
- H100: 1.0-1.5ms per token (batch 128)
Latency is fine for batch jobs. Bad for interactive chat where <2ms matters.
Thermal throttling happens. The 4090 has a 450W TDP. After 10-15 minutes at full load, it hits 80-85°C. Thermal governors kick in, cut clocks, drop throughput 15-25%. A100 and H100 have better thermal design (built for 24/7 datacenters).
Cost Per Million Tokens
RTX 4090 inference (7B model, batch 32):
- Throughput: 260 tok/s
- Cost: $0.34/hr on-demand ≈ $0.36 per 1M tokens (1M tokens takes ~1.07 hrs at 260 tok/s)
- Cost (spot): $0.22/hr ≈ $0.24 per 1M tokens
A100 PCIe inference (7B model, batch 32):
- Throughput: 300 tok/s
- Cost: $1.19/hr (RunPod spot) ≈ $1.10 per 1M tokens (~0.93 hrs)
H100 PCIe inference (7B model, batch 32):
- Throughput: 850 tok/s
- Cost: $1.99/hr (RunPod spot) ≈ $0.65 per 1M tokens (~0.33 hrs)
The 4090 spot is roughly 64% cheaper per token than H100 spot ($0.24 vs $0.65 per 1M). For batch jobs, it wins on cost.
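Cost per million tokens follows directly from the hourly rate and sustained throughput. A minimal sanity-check using the mid-range throughput figures from the benchmarks above:

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_sec: float) -> float:
    """Dollars to generate 1M tokens at a steady throughput."""
    hours = 1_000_000 / tokens_per_sec / 3600
    return hourly_rate * hours

print(round(cost_per_million_tokens(0.22, 260), 2))  # 4090 spot      -> 0.24
print(round(cost_per_million_tokens(0.34, 260), 2))  # 4090 on-demand -> 0.36
print(round(cost_per_million_tokens(1.99, 850), 2))  # H100 spot      -> 0.65
```

The same function works for any GPU/throughput pair in this guide; only the two inputs change.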
Training Performance
Fine-Tuning with LoRA
Training a 7B model with LoRA (parameter-efficient fine-tuning):
- Batch size: 16
- Training time: 30-40 hours
- Cost (on-demand): $10.20-$13.60
- Cost (spot): $6.60-$8.80
Same job on A100 (spot): 20 hours, $23.80 cost.
The 4090 is slower (1.5-2x) but cheaper ($7-14 vs $24). Budget-constrained teams win on total cost. If speed matters (weekly iterations), A100 is better.
Pre-training (Larger Models)
Pre-training from scratch? Impractical. The bandwidth bottleneck makes 13B+ models inefficient, and the 4090 has no NVLink, so multi-GPU scaling runs over PCIe (12.8 GB/s here vs 900 GB/s NVLink, roughly 70x slower).
A single 4090 trains a 7B model at ~300 samples/sec (batch 32); even at that optimistic rate, 1T tokens would take weeks of continuous compute. Real pre-training needs 8-16+ tightly connected GPUs, and PCIe links between 4090s kill scaling efficiency.
Don't use 4090 for pre-training. Go A100 clusters minimum.
Cost Per Token Analysis
Batch Inference Workload (5M tokens/day)
Processing documents, logs, data annotation. Non-interactive, latency-flexible.
RTX 4090 setup (spot, $0.22/hr):
- Throughput: 260 tok/s aggregate
- Time to process 5M tokens: 19,231 seconds = 5.34 hours/day
- Cost per day: 5.34 hrs × $0.22 = $1.17
- Monthly: $35
- Annual: $426
A100 setup (spot, 1x $1.19/hr):
- Throughput: 300 tok/s
- Time to process 5M tokens: 16,667 seconds = 4.63 hours/day
- Cost per day: 4.63 hrs × $1.19 = $5.51
- Monthly: $165
- Annual: $1,980
The 4090 is ~79% cheaper annually ($426 vs $1,980) and ~13% slower on throughput; cost wins.
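The daily-batch arithmetic above generalizes to any token volume. A small helper, assuming steady throughput and no preemption (spot jobs in practice need retry headroom):

```python
def daily_batch_cost(tokens_per_day: float, tokens_per_sec: float,
                     hourly_rate: float) -> tuple[float, float]:
    """GPU hours and dollars per day to process a fixed daily token volume."""
    hours = tokens_per_day / tokens_per_sec / 3600
    return hours, hours * hourly_rate

# 5M tokens/day on a 4090 spot instance ($0.22/hr, 260 tok/s)
hours, cost = daily_batch_cost(5_000_000, 260, 0.22)
print(round(hours, 2))        # -> 5.34 hours/day
print(round(cost * 30, 0))    # roughly $35/month
```

Swapping in 300 tok/s and $1.19/hr reproduces the A100 line of the comparison.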
Interactive API Inference (100K tokens/day)
Low-throughput interactive service. Latency matters.
RTX 4090 (on-demand, $0.34/hr):
- Throughput: 260 tok/s
- Time to process 100K tokens: 385 seconds = 6.4 minutes/day
- Running cost (keep instance warm 24/7): $0.34/hr × 730 = $248/month
A100 (on-demand, $1.48/hr, Lambda):
- Throughput: 300 tok/s
- Time to process 100K tokens: 333 seconds = 5.5 minutes/day
- Running cost (keep instance warm 24/7): $1.48/hr × 730 = $1,080/month
The 4090 is 77% cheaper monthly. Per-token latency is roughly 1-2ms higher (noticeable but fine for non-real-time APIs).
When to Use RTX 4090
Budget-Conscious Batch Inference
Running 7B-13B models where latency doesn't matter. Batch jobs, overnight processing, document analysis. Cost-per-token favors 4090. Monthly: $248 on-demand or $160 spot (730 hrs continuous).
Stack three 4090s for parallel batch work: ~$744/month on-demand, less than one A100 at $869/month spot, with similar aggregate throughput.
Lightweight Fine-Tuning
Single 7B fine-tuning run: $7-14 spot, $10-14 on-demand. Slower (30-40 hrs vs 6-7 on H100), but acceptable for one-off jobs or R&D.
Fine-tune LoRA on 4090, deploy on same GPU for inference. Full pipeline cost: $20-30 per experiment.
Prototyping and Experimentation
Testing a new inference optimization? Spin up a 4090, test, iterate, kill. Cost-of-failure is low at $0.34/hr on-demand or $0.22/hr spot: a 4-hour experimental run costs ~$1.36 (or ~$0.88 spot), versus ~$8 for the same 4 hours on an H100 on-demand. Budget 20-30 runs: $27-41 on the 4090 vs $160-240 on the H100. The 4090 wins for low-risk iteration.
Development Workloads
Loading model weights, testing pipelines, validating datasets. The 4090 works for dev (not production). Reserve H100 for production. Development cost: $3-5/day on a 4090 vs $20-38/day on an H100 (8-10 hrs at $1.99-$3.78/hr).
Limited-Scale Batch Processing
Processing 1M documents at 512 tokens each = 512M tokens. The 4090 at 260 tok/s handles it in 548 hours. Cost: $186 on-demand or $121 spot.
Offline processing? Great. An A100 at $1.19/hr spot would cost ~$564 (474 hrs). The 4090 spot is ~79% cheaper.
Limitations and Trade-offs
- VRAM ceiling (24GB): 13B+ models need quantization. 70B requires aggressive 2-4-bit (quality loss). Bad for sensitive work (medical, legal).
- Bandwidth bottleneck: Training is slow. GDDR6X (1,008 GB/s) is ~48% slower than A100 (1,935 GB/s). Batch 64+ causes memory contention and latency spikes.
- Thermal throttling: GDDR6X heats up. After 10-15 minutes at full load, clocks drop 15-25% and throughput falls. Data center GPUs have better thermal design.
- Not for production: Latency is 2-3x higher than H100. Throughput is lower. It's a consumer GPU, not datacenter-spec, and thermal issues make 24/7 operation risky. Use for dev and low-volume inference only.
- Memory type matters: GDDR6X is designed for graphics; HBM is designed for compute. The performance gap widens with bigger models and batch sizes.
Cost vs A100 and H100
Hourly Rate
| GPU | On-Demand | Spot | Monthly (On-Demand) | Annual |
|---|---|---|---|---|
| RTX 4090 | $0.34 | $0.22 | $248 | $2,976 |
| A100 PCIe | $1.19 | $1.19* | $869 | $10,425 |
| H100 PCIe | $1.99 | $1.99* | $1,453 | $17,436 |
*A100/H100 shown at RunPod spot rates.
The 4090 on-demand ($0.34) is 71% cheaper than A100 spot ($1.19) and 83% cheaper than H100 spot ($1.99); even 4090 on-demand undercuts A100/H100 spot rates. Apples-to-apples on-demand: 4090 ($0.34) vs Lambda A100 ($1.48) is 77% cheaper.
Cost Per Task (Inference, 7B Model, 1M Tokens)
| GPU | Throughput | Time | Cost (On-Demand) | Cost Per Token |
|---|---|---|---|---|
| RTX 4090 | 260 tok/s | 1.07 hrs | $0.36 | $0.36/M tokens |
| A100 PCIe | 300 tok/s | 0.93 hrs | $1.10 | $1.10/M tokens |
| H100 PCIe | 850 tok/s | 0.33 hrs | $0.65 | $0.65/M tokens |
The 4090 wins on cost-per-token for batch (spot: $0.22/hr ≈ $0.24/M tokens). H100 wins on speed (~3.3x faster, at ~1.8x the on-demand cost-per-token).
Cost Per Task (Fine-Tuning, 7B Model, 100K Examples)
| GPU | Training Time | Cost (On-Demand) | Cost (Spot) |
|---|---|---|---|
| RTX 4090 | 30-40 hrs | $10.20-$13.60 | $6.60-$8.80 |
| A100 PCIe | 18-22 hrs | $21.42-$26.18 | $21.42-$26.18 |
| H100 PCIe | 6-7 hrs | $11.94-$13.93 | $11.94-$13.93 |
H100 nearly matches the 4090 on per-task cost despite a ~6x higher hourly rate, thanks to ~5x faster throughput. The 4090 is slightly cheaper on-demand and clearly cheaper on spot ($7-14 vs $12-14), but takes 5-6x longer.
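The per-task figures in the table are simply training hours times hourly rate. A sketch that reproduces the table rows:

```python
def finetune_cost(hours_low: float, hours_high: float, rate: float) -> tuple[float, float]:
    """Per-task cost range for a fine-tuning job billed by the hour."""
    return hours_low * rate, hours_high * rate

# (GPU, hours low, hours high, on-demand $/hr) from the table above
for gpu, lo, hi, rate in [("RTX 4090", 30, 40, 0.34),
                          ("A100 PCIe", 18, 22, 1.19),
                          ("H100 PCIe", 6, 7, 1.99)]:
    c_lo, c_hi = finetune_cost(lo, hi, rate)
    print(f"{gpu}: ${c_lo:.2f}-${c_hi:.2f}")
```

This makes the crossover visible: a fast, expensive GPU can tie or beat a cheap, slow one on per-task cost whenever its speedup exceeds its price premium.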
Real-World Workload Examples
Example 1: Document Processing Service
Company: document intelligence startup. Processing 50K PDFs/month, 1,000 tokens each (OCR + content extraction).
Monthly tokens: 50M. Need to complete within 30 days.
RTX 4090 on-demand ($0.34/hr):
- Throughput: 260 tok/s
- Hours needed: 50M / 260 = 192,308 seconds = 53.4 hours
- Cost: 53.4 hrs × $0.34 = $18.16/month
- Margin: compute is ~1-2% of revenue (assuming the customer pays $0.02-$0.05/doc = $1,000-2,500/month revenue)
- Profitability: high
RTX 4090 spot ($0.22/hr):
- Cost: 53.4 hrs × $0.22 = $11.75/month (with preemption tolerance)
- Margin: marginally better than on-demand (~$6/month in spot savings)
- Profitability: excellent
A100 on-demand ($1.48/hr via Lambda):
- Hours needed: 50M / 300 = 166,667 seconds = 46.3 hours
- Cost: 46.3 hrs × $1.48 = $68.51/month
- Cost: 46.3 hrs × $1.48 = $68.51/month
- Margin: still profitable, but compute costs nearly 4x the 4090's
- Profitability: lower margin
The 4090 is right for this workload. Cost-per-token wins, and the ~13% speed disadvantage gets absorbed by overnight batch processing.
Example 2: Internal LLM API (Low Volume)
Company: 10 developers, 100 API calls/day each, 200 input + 150 output tokens per call.
Daily: 10 × 100 × (200 + 150) = 350K tokens. Monthly: 10.5M tokens.
RTX 4090 on-demand ($0.34/hr):
- Keep instance running 24/7
- Cost: $0.34 × 730 = $248/month
- Token capacity: 260 tok/s × 86,400 sec/day = 22.4M tokens/day (more than enough)
- Utilization: 10.5M / (22.4M × 30) = 1.6% (very low)
Lambda A100 on-demand ($1.48/hr):
- Keep instance running 24/7
- Cost: $1.48 × 730 = $1,080/month
- Token capacity: 300 tok/s × 86,400 = 25.9M tokens/day
- Utilization: 10.5M / (25.9M × 30) = 1.4%
The 4090 is 77% cheaper. Both are over-provisioned. Consider an API instead (e.g., Mistral Small at $0.10/M input + $0.30/M output ≈ $2/month at this volume).
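At volumes this low, the hosted-API comparison is worth computing explicitly. A sketch using the Mistral Small prices quoted above (illustrative; check current pricing before deciding):

```python
def api_cost(in_tokens_m: float, out_tokens_m: float,
             in_price: float, out_price: float) -> float:
    """Monthly API cost in dollars; prices are per million tokens."""
    return in_tokens_m * in_price + out_tokens_m * out_price

# 10.5M tokens/month split 200 input : 150 output per call
in_m = 10.5 * 200 / 350    # 6.0M input tokens
out_m = 10.5 * 150 / 350   # 4.5M output tokens
print(round(api_cost(in_m, out_m, 0.10, 0.30), 2))  # -> 1.95 (vs $248 self-hosted)
```

The gap only closes when utilization of the always-on GPU climbs; below a few percent utilization, per-token API pricing wins by two orders of magnitude.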
FAQ
Is RTX 4090 suitable for production LLM serving?
Not ideal. Latency is 1.5-3x higher than A100/H100. Throughput is lower. VRAM caps at 13B without quantization. Thermal throttling happens. For production, use A100 or H100. For prototypes or low-volume (<1M tokens/day), the 4090 works.
Can I run a 70B model on RTX 4090?
Technically yes, with 2-4-bit quantization. 70B at 2-bit = ~18GB. But quality degrades (perplexity +10-50%). Avoid for quality-sensitive work. For summarization/extraction, 2-4-bit works fine.
How does RTX 4090 compare to RTX 3090?
The 4090 is 1.5-2x faster (same 24GB VRAM, better architecture). RTX 3090 on RunPod is $0.22/hr vs $0.34 for the 4090, making the 4090 about 55% more expensive. Pick 4090 for speed. 3090 is the budget option for cost-sensitive workloads.
Should I buy or rent RTX 4090?
Retail: $1,600-2,000. Cloud on-demand: $0.34/hr. Breakeven: 4,700-5,900 hours (6-8 months at 24/7). >50% utilization for 12+ months? Buy. Sporadic or <12 months? Rent.
Home power: ~$35-50/month for the card alone (320-450W × 730 hrs × $0.15/kWh), more counting whole-system draw. Annual: $2,000 purchase + ~$420-600 power ≈ $2,400-2,600. Cloud annual: $2,976 on-demand or $1,927 spot. Roughly comparable if you use it consistently.
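The breakeven arithmetic can be sketched as follows; the 0.45 kW average draw and $0.15/kWh rate are assumptions, not measurements, so substitute your own numbers:

```python
def breakeven_hours(purchase_price: float, cloud_rate: float,
                    power_kw: float = 0.0, power_price: float = 0.15) -> float:
    """Hours of use at which buying beats renting, charging electricity
    against the owned card (power_kw is average draw in kilowatts)."""
    return purchase_price / (cloud_rate - power_kw * power_price)

# Ignoring power: $1,800 card vs $0.34/hr on-demand
print(round(breakeven_hours(1800, 0.34)))         # -> 5294 hours
# Counting ~0.45 kW draw at $0.15/kWh
print(round(breakeven_hours(1800, 0.34, 0.45)))   # -> 6606 hours (~9 months at 24/7)
```

Against spot pricing ($0.22/hr) the denominator shrinks, pushing breakeven past 11,000 hours, which is why sporadic users should rent.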
Can I use RTX 4090 for video processing?
Yes. Great for rendering and NVENC (video codec acceleration). 10-15x faster than CPU. Popular for batch transcoding. At $0.34/hr, 1TB overnight is cheap. Monthly: $248.
Does RTX 4090 support TensorRT and model optimization?
Yes. TensorRT, ONNX, and quantization frameworks work. Quantize to fit 24GB, then optimize speed with TensorRT.
What is the memory bandwidth of RTX 4090?
1,008 GB/s (GDDR6X). A100 is 1,935 GB/s (1.9x faster). H100 is 3,350 GB/s (3.3x faster). Bandwidth matters for training and for large-batch autoregressive decoding; at small batch sizes, the 4090's lower hourly price offsets the gap.
Can I use RTX 4090 for multi-GPU training?
Not recommended. No NVLink. PCIe between GPUs is roughly 70x slower (12.8 GB/s vs 900 GB/s NVLink). Multi-GPU training is impractical (communication overhead swamps compute).
Single-GPU fine-tuning works. Multi-GPU fine-tuning (gradient reduction) is slow.
How does RTX 4090 cloud compare to home RTX 4090?
Home: $1,800 initial + ~$35-50/month power (320-450W × 730 hrs × $0.15/kWh) = ~$420-600/year. Cloud: $0.34/hr = $2,976/year continuous, or $248/month.
Home is cheaper above ~50% sustained utilization over a year. Cloud wins for sporadic use (<25%). Breakeven is roughly 5,000-6,000 GPU-hours.
What thermal issues should I be aware of?
GDDR6X heats up under sustained load. Above ~80°C (common under sustained 300W+ draw), throttling kicks in: clock speeds drop 15-25% and throughput falls. Data center GPUs (A100, H100) have better thermal design. For 24/7, the A100 is more reliable. For dev/test (8-10 hrs/day), thermal issues are minimal.
Related Resources
- NVIDIA GPU Pricing Comparison
- NVIDIA RTX 4090 Specifications
- NVIDIA H100 Price Guide
- NVIDIA A100 Price Guide
- RunPod GPU Pricing
Sources
- NVIDIA RTX 4090 Specifications
- RunPod GPU Pricing
- DeployBase GPU Pricing Tracker (March 2026 observations)