Contents
- Overview
- Replicate Model Pricing Reference Guide
- Hardware and Costs
- Economics Analysis
- API-First Inference Model
- FAQ
- Sources
Overview
Replicate bills per second of inference. No hourly reservations. No idle charges. That's fundamentally different from traditional cloud GPU providers.
Per-Second Billing
Replicate bills all workloads per second, rounded to the nearest second. Minimum charges apply to keep the platform running.
Typical rates run $0.001 to $0.10 per second depending on hardware. Developers don't choose the GPU directly; the model they run determines which hardware tier it lands on.
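Per-second billing with nearest-second rounding reduces to one multiplication. A minimal sketch; the rate used here is a hypothetical A40-class figure for illustration, not a published price:

```python
def prediction_cost(runtime_seconds: float, rate_per_second: float) -> float:
    """Cost of one prediction under per-second billing,
    with runtime rounded to the nearest second."""
    billed_seconds = round(runtime_seconds)
    return billed_seconds * rate_per_second

# Hypothetical A40-class rate of $0.000725/sec (illustrative only)
cost = prediction_cost(12.3, 0.000725)  # billed as 12 seconds
```

Because billing is per second rather than per hour, a 12.3-second prediction costs fractions of a cent instead of a full hourly minimum.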
No Idle Time Charges
Replicate doesn't charge while requests sit in queue. Shared GPU pools mean no per-customer resource reservation.
This works best for variable workloads. Batch processing particularly benefits since developers avoid the dead time cost of dedicated hardware.
Replicate Model Pricing Reference Guide
Popular Model Costs
Here's what common models cost:
| Model | Task | Typical Latency | Cost Per Request | Monthly Cost (1,000 requests/day) |
|---|---|---|---|---|
| Llama 2 7B | Text completion | 5s | $0.0009 | $27 |
| Llama 2 13B | Text completion | 8s | $0.0018 | $54 |
| Llama 2 70B | Text completion | 15s | $0.0035 | $105 |
| Mistral 7B | Text completion | 5s | $0.0009 | $27 |
| Stable Diffusion 3 | Image generation | 12s | $0.0072 | $216 |
| DALL-E 3 | Image generation | 20s | $0.012 | $360 |
| Whisper Large | Speech-to-text | 8s per minute of audio | $0.0018 | $54 |
| ControlNet | Image manipulation | 10s | $0.006 | $180 |
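The monthly column follows directly from the per-request cost. A quick sketch to reproduce it; the 1,000-requests-per-day and 30-day-month assumptions come from the table itself:

```python
def monthly_cost(cost_per_request: float, daily_requests: int = 1000, days: int = 30) -> float:
    """Monthly spend at a steady daily request rate."""
    return cost_per_request * daily_requests * days

# Reproduce the table's monthly column, e.g. Llama 2 70B:
llama_70b_monthly = monthly_cost(0.0035)  # 0.0035 * 1000 * 30 = $105
```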
Volume Discount Structure
Replicate discounts scale with request volume:
| Monthly Request Volume | Discount | Effective Pricing |
|---|---|---|
| <100,000 | 0% | Standard rates |
| 100,000-1,000,000 | 5-10% | 5-10% reduction |
| 1,000,000-10,000,000 | 15-20% | 15-20% reduction |
| 10,000,000+ | Custom | Contact sales |
Hit 1M monthly predictions (about 33k daily) and developers get 15-20% savings.
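The tier table maps cleanly to a lookup function. Since Replicate publishes ranges (5-10%, 15-20%) rather than exact rates, band midpoints are assumed here for illustration:

```python
def volume_discount(monthly_requests: int) -> float | None:
    """Discount fraction by monthly request volume.
    Band midpoints are assumed; actual rates vary within the published ranges."""
    if monthly_requests < 100_000:
        return 0.0
    if monthly_requests < 1_000_000:
        return 0.075  # midpoint of the 5-10% band
    if monthly_requests < 10_000_000:
        return 0.175  # midpoint of the 15-20% band
    return None  # 10M+: custom pricing, contact sales

# Llama 2 70B monthly cost at 1.2M requests with the assumed 17.5% discount
effective_cost = 105 * (1 - volume_discount(1_200_000))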
Hardware and Costs
GPU Tier Selection
Replicate hides the hardware layer. Developers pick the model; Replicate picks the GPU. No manual hardware selection.
Common assignments:
- A40 (inference): Most models land here
- A100 (standard): Compute-heavy work
- H100 (premium): When latency or performance matter most
Lambda prices H100 SXM at $3.78/hour. Replicate's H100 runs $0.001525/second, or $5.49/hour continuous — significantly more expensive than Lambda for sustained use, but no idle charges on Replicate make it economical for variable workloads.
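The per-second versus hourly comparison is just a unit conversion, using the rates quoted above:

```python
SECONDS_PER_HOUR = 3600

def per_second_to_hourly(rate_per_second: float) -> float:
    """Continuous-use hourly equivalent of a per-second rate."""
    return rate_per_second * SECONDS_PER_HOUR

replicate_h100 = per_second_to_hourly(0.001525)  # $5.49/hour
lambda_h100 = 3.78
sustained_premium = replicate_h100 / lambda_h100 - 1  # ~45% more on Replicate
```

The premium only applies at full utilization; any idle time shifts the math back toward Replicate.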
Model-Specific Costs
Bigger models cost more per request: they need more GPU memory and take longer to run each inference. Simple math.
- Llama 2 13B: ~$0.0018 per prediction (8-second latency)
- Llama 2 70B: ~$0.0035 per prediction (15-second latency)
- Stable Diffusion 3: ~$0.005-0.010 per prediction (10-25 second generation)
Economics Analysis
Inference Comparison
1000 Stable Diffusion 3 image generations:
- Replicate: 1000 × $0.0075 = $7.50
- Self-hosted H100 (1000 × 10s = 10,000 GPU-seconds ≈ 2.78 hours) on Koyeb at $3.45/hour: 2.78 × $3.45 = $9.58
Replicate saves 21%.
RunPod H100 costs $2.69/hour though. 2.78 × $2.69 = $7.47. So RunPod saves $0.03; essentially a tie.
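The comparison above reduces to two formulas: per-request billing versus GPU-hours on rented hardware. A sketch using the same assumed inputs ($0.0075 per request, 10-second generations):

```python
def replicate_batch_cost(requests: int, cost_per_request: float) -> float:
    """Total cost under per-request billing."""
    return requests * cost_per_request

def hourly_batch_cost(requests: int, seconds_per_request: float, hourly_rate: float) -> float:
    """Total cost of the same batch on an hourly-billed GPU."""
    gpu_hours = requests * seconds_per_request / 3600
    return gpu_hours * hourly_rate

replicate = replicate_batch_cost(1000, 0.0075)   # $7.50
koyeb = hourly_batch_cost(1000, 10, 3.45)        # ~$9.58
runpod = hourly_batch_cost(1000, 10, 2.69)       # ~$7.47
```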
A Llama 2 70B completion at 15-second latency, at an assumed H100 rate of $0.0014/sec, works out to $0.021 per request. The table's $0.0035 per prediction reflects A40 pricing (lower-cost hardware); on H100, 1000 requests come to $21.
70B needs significant VRAM, so compare at H100 rates. 1000 requests × 15 seconds = 15,000 GPU-seconds, or about 4.17 hours:
- RunPod H100 ($2.69/hr): 4.17 hours = $11.21
- Replicate H100 ($0.0014/sec): 1000 × 15s × $0.0014 = $21.00
Developers pay roughly 87% more for Replicate on H100. That's the price of not managing infrastructure.
Real-Time API Endpoint Economics
Text-to-image API with 50 daily requests:
- Replicate: 50 × 30 × $0.0075 = $11.25/month
- Self-hosted A100: ~$1,800/month (Koyeb)
Replicate is 0.6% of the cost.
Low-to-moderate volume? Replicate dominates. At $0.0075 per request against roughly $1,800/month for a dedicated A100, dedicated hardware only breaks even around 240,000 monthly requests. Below that, Replicate lets developers launch without infrastructure overhead.
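Using the figures above ($0.0075 per request versus ~$1,800/month for a dedicated A100), the break-even volume is a one-line computation:

```python
def breakeven_requests(monthly_dedicated_cost: float, cost_per_request: float) -> float:
    """Monthly request volume at which dedicated hardware matches per-request billing."""
    return monthly_dedicated_cost / cost_per_request

breakeven = breakeven_requests(1800, 0.0075)  # 240,000 requests/month
```

Anything below the break-even point favors per-request billing; above it, dedicated hardware starts to pay for itself.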
High-Volume Batch Processing Economics
100,000 Llama 2 70B completions: at this volume, Replicate still comes out ahead, saving roughly 48% versus on-demand dedicated hardware.
Scale to 1M, though, and the math breaks. With commitment discounts, CoreWeave and Alibaba flip it:
- Replicate: $3,500
- RunPod: $2,692.50
- CoreWeave 8×H100: 50 hours × $49.24 = $2,462
- Alibaba 4×H100 (discounted): 50 × $9.80 × 0.75 = $367.50
At million-scale, self-hosted infrastructure costs 90% less.
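The million-scale numbers above can be checked with a small comparison table; the figures are copied from the list, with the Alibaba entry reflecting its assumed 25% commitment discount:

```python
# Million-scale batch cost per provider (figures from the comparison above)
providers = {
    "Replicate": 3500.00,
    "RunPod": 2692.50,
    "CoreWeave 8xH100": 50 * 49.24,       # $2,462.00
    "Alibaba 4xH100": 50 * 9.80 * 0.75,   # $367.50 after 25% discount
}
cheapest = min(providers, key=providers.get)
savings = 1 - providers[cheapest] / providers["Replicate"]  # ~90% less than Replicate
```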
Hybrid Strategy for Variable Workloads
Mix providers to optimize both cost and complexity:
- Prototyping: Replicate (no infrastructure)
- Batch processing (500-5000 monthly): Koyeb with auto-scaling
- Large batch (10,000+ monthly): RunPod or CoreWeave
- Fine-tuning: Lambda or Nebius
This splits the load across the right provider for each job type.
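One way to encode that split is a small routing function. The thresholds come from the list above; the job-type labels are illustrative, not part of any provider's API:

```python
def pick_provider(job_type: str, monthly_requests: int) -> str:
    """Toy router for the hybrid strategy: thresholds from the text above."""
    if job_type == "fine-tuning":
        return "Lambda or Nebius"
    if job_type == "prototyping":
        return "Replicate"
    if monthly_requests >= 10_000:
        return "RunPod or CoreWeave"
    if monthly_requests >= 500:
        return "Koyeb"
    return "Replicate"
```

In practice the cutoffs would be tuned to measured per-request costs rather than hardcoded, but the shape of the decision is the same.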
API-First Inference Model
Replicate's Positioning in AI Stack
Replicate is the modern API-first inference play. Call HTTP endpoints instead of managing infrastructure. That abstraction matters.
Trade-off matrix:
- Self-hosted: Full control, maximum hassle
- Replicate: Simple, moderate costs
- LLM APIs: Simplest, priciest
Replicate sits in the middle. Developers trade some cost for sanity.
Developer Experience Value
Replicate's API simplicity matters more than the spreadsheet shows. The team spends zero time on infrastructure. All energy goes to the product.
For prototyping and MVPs, that per-request model means developers iterate fast. No infrastructure commitments. For many teams, that speed premium is worth the cost.
Model Ecosystem
Thousands of pre-built models live on Replicate. Deploy them instantly. No training complexity.
That matters for teams without ML chops. Developers call an API and get SOTA results. That's not free, but it's valuable.
FAQ
Q: Can I fine-tune models on Replicate? A: Replicate supports model fine-tuning through their training API. Training costs run $0.0025 per second per GPU, similar to inference pricing.
Q: Does Replicate provide guaranteed latency SLAs? A: Replicate does not publish latency SLAs. Queueing and API overhead typically adds 100-500ms at p95, on top of model inference time, depending on queue depth and model complexity.
Q: Can I deploy custom models on Replicate? A: Yes. Custom models require Docker containerization and Cog definition files. Standard models like Llama and Stable Diffusion are pre-deployed.
Q: What's the maximum request concurrency? A: Replicate auto-scales based on incoming requests. Peak concurrency depends on subscription tier and resource availability.
Q: Does Replicate offer discounts for high-volume usage? A: Replicate provides custom production pricing for 1M+ monthly predictions. Contact their sales team for a quote.
Sources
- Replicate official pricing page (as of March 2026)
- Model inference latency benchmarks
- Cloud GPU platform cost comparison studies
- DeployBase infrastructure research