Replicate Pricing Breakdown: Cost Per Token, Model Comparison & Hidden Fees

Deploybase · July 9, 2025 · GPU Pricing

Overview

This guide breaks down Replicate pricing. Replicate doesn't bill per token like OpenAI or Anthropic; it bills per prediction, which in practice means per second of GPU time. That's a different mental model than token-based LLM APIs. The platform handles infrastructure; developers call models over HTTP. That convenience comes at a cost.

Cost varies wildly depending on model, hardware, and prediction duration. A Stable Diffusion image costs fractions of a cent. Five minutes of A100 time runs about $0.42 per prediction, and multi-GPU setups for 70B models cost several times that. Developers get flexibility, but they need to watch the meter.
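For a concrete sense of what a "prediction" is: one HTTP call, one billable stretch of GPU time. A minimal sketch with the official `replicate` Python client (model identifier and prompt are illustrative):

```python
# pip install replicate; set REPLICATE_API_TOKEN in the environment
import replicate

# One call = one prediction = one billable stretch of GPU time.
output = replicate.run(
    "stability-ai/stable-diffusion-3",  # illustrative model identifier
    input={"prompt": "a product mockup of a ceramic mug"},
)
print(output)  # typically a URL (or list of URLs) pointing at the output
```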

Replicate Pricing Model {#pricing-model}

There's no single flat rate here. Pricing depends on hardware: when a prediction runs, Replicate reserves a GPU and bills for the time it uses.

The formula is simple: Prediction Cost = GPU Rate × Duration in Seconds

Multiple hardware tiers are available, from CPU to H100. Listed rates at the time of writing:

| Hardware | Hourly Rate (Approx) | Cost Per Second |
|---|---|---|
| CPU | ~$1.26 | $0.000350 |
| Nvidia T4 | $0.81 | $0.000225 |
| Nvidia A100 (80GB) | $5.04 | $0.001400 |
| Nvidia H100 | $5.49 | $0.001525 |
| Nvidia L40S | $3.51 | $0.000975 |

These are the listed rates from Replicate's pricing page. Rates are fixed per hardware tier, but duration isn't: the same prediction can take more or less time depending on input and load, so per-prediction costs vary run-to-run.
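To sanity-check the examples that follow, here's a tiny cost helper using the per-second rates from the table above (the article's listed figures, not a live price feed):

```python
# Per-second rates from the table above (USD); treat as approximations.
RATE_PER_SECOND = {
    "cpu": 0.000350,
    "t4": 0.000225,
    "a100-80gb": 0.001400,
    "h100": 0.001525,
    "l40s": 0.000975,
}

def prediction_cost(hardware: str, seconds: float) -> float:
    """Prediction Cost = GPU Rate x Duration in Seconds."""
    return RATE_PER_SECOND[hardware] * seconds

# Examples matching the walkthroughs below:
print(f"{prediction_cost('t4', 10):.5f}")          # 0.00225 (10 s Stable Diffusion image)
print(f"{prediction_cost('t4', 30):.5f}")          # 0.00675 (30 s Llama 2 7B prediction)
print(f"{prediction_cost('a100-80gb', 150):.3f}")  # 0.210   (150 s video job)
```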

Cost Per Prediction Explained {#cost-per-prediction}

Let's walk through real costs.

Image Generation (Stable Diffusion)

Stable Diffusion: 8-15 seconds on T4. Let's say 10 seconds.

Cost: 10 × $0.000225/sec = $0.00225/image

100 daily images:

  • Daily: $0.225
  • Monthly: ~$6.75
  • Annual: ~$82

CPU-only inference is also available, but at these rates it's both slower and more expensive per image (180 seconds ≈ $0.06, versus $0.002 on a T4). For image generation, the T4 wins on both speed and cost.

Language Model Inference

Llama 2 7B takes 20-40 seconds on T4 depending on output length. Average: 30 seconds.

Cost: 30 × $0.000225/sec = $0.00675/prediction (on T4)

On A100 (faster, ~10 seconds): 10 × $0.001400/sec = $0.014/prediction

500 daily predictions (T4):

  • Daily: $3.38
  • Monthly: ~$101
  • Annual: ~$1,234

That adds up. For high-volume text work, OpenAI or token-based APIs often cost less.

Video Processing

Video is expensive. A 60-second video on A100 takes 120-180 seconds to process. Call it 150 seconds.

Cost: 150 × $0.001400/sec = $0.21/video (A100)

20 daily videos:

  • Daily: $4.20
  • Monthly: ~$126
  • Annual: ~$1,533

A 10-second input video still costs about $0.17 if processing takes 120 seconds: billing follows processing time, not input length. At $0.17-0.21 per video, costs scale fast with volume.

Hardware Selection and Cost {#hardware-selection}

Hardware choice is the main cost lever. Better GPUs run faster, but speed gains don't match cost scaling.

Stable Diffusion on Different GPUs

Same model, 512×512 image generation:

| GPU | Time to Complete | Approx Cost |
|---|---|---|
| CPU | 180 seconds | ~$0.06 |
| T4 | 10 seconds | $0.002 |
| A100 | 5 seconds | $0.007 |
| H100 | 3 seconds | $0.005 |

T4 provides excellent value for image generation at $0.000225/sec. The A100 ($0.001400/sec) and H100 ($0.001525/sec) finish faster but cost more per image. T4 wins on pure value. H100 wins if latency matters: faster execution means better user experience and higher throughput.

Batch Processing vs Real-Time

Replicate bills per second. 30-second prediction = 30 seconds of cost, always. Peak or 3am, same price.

But latency varies. A T4 might queue for 5 seconds where an H100 queues for 2. If the workload can wait (reports, overnight jobs, periodic analysis), run it on cheap hardware. If users are waiting, pay for fast hardware.

Comparison with RunPod Serverless {#runpod-comparison}

RunPod costs less but demands more work.

Pricing: RunPod A100 = $1.19/hour. Replicate A100 80GB = ~$5.04/hour. Replicate's premium reflects managed infrastructure and a zero-ops serverless model.

The operating models are different, though:

Replicate:

  • Zero infrastructure work
  • Pre-built models (Stable Diffusion, Llama, etc.)
  • Auto-scaling, load balancing
  • Cold start: 2-5 seconds
  • Good for: Speed matters more than marginal cost

RunPod:

  • Developers containerize and manage everything
  • Developers handle scaling and resources
  • Cold start: 5-30 seconds
  • Lower cost on average
  • Good for: Volume justifies the overhead

100 Stable Diffusion predictions daily? On a T4 that's about $0.23/day on Replicate ($0.00225 each), with zero setup. RunPod could plausibly halve the per-prediction cost, but it needs containerization and scaling logic: 20+ hours of engineering. At 100 predictions/day the savings never pay that back; break-even only arrives in the thousands of daily predictions, as the sketch below shows.
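A back-of-envelope way to locate that break-even; the RunPod per-prediction figure and the engineering rate are assumptions for illustration, not quotes:

```python
# Rough break-even: managed per-second billing vs. cheaper self-managed GPU.
# All inputs are assumptions for illustration.
REPLICATE_COST_PER_PRED = 0.00225  # 10 s Stable Diffusion on T4 (from above)
RUNPOD_COST_PER_PRED = 0.00100     # assumed roughly half after setup
ENGINEERING_COST = 20 * 100        # 20 hours x $100/hour to containerize

def days_to_break_even(daily_predictions: int) -> float:
    daily_savings = daily_predictions * (REPLICATE_COST_PER_PRED - RUNPOD_COST_PER_PRED)
    return ENGINEERING_COST / daily_savings

print(f"{days_to_break_even(100):.0f} days")     # ~16000 days: never worth it
print(f"{days_to_break_even(10_000):.0f} days")  # ~160 days: plausible payback
```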

Popular Models and Approximate Costs {#model-costs}

Replicate hosts hundreds of models. Here are common ones and their approximate costs:

| Model | Task | Hardware | Time | Approx Cost |
|---|---|---|---|---|
| Stable Diffusion 3 | Image Generation | T4 | 15s | $0.003 |
| Llama 2 7B | Text Generation | T4 | 25s | $0.006 |
| Whisper (Large) | Speech-to-Text | T4 | 30s | $0.007 |
| DALL-E 3 | Image Generation | A100 | 20s | $0.028 |
| Mistral 7B | Text Generation | A100 | 15s | $0.021 |
| Flux Pro | Image Generation | H100 | 10s | $0.015 |

These are ballpark figures. Actual costs vary with input size: a 2-hour Whisper file takes far longer than 30 seconds, and a 500-token Llama output takes longer than a 50-token one.

Key difference from LLM APIs: Developers pay for GPU time, not tokens. Fast inference is cheap. Slow, token-heavy work is expensive.

Billing Mechanics and Optimization {#billing-mechanics}

How Replicate Actually Charges

Billing starts when a prediction starts: Replicate reserves a GPU. Billing stops when the prediction completes. Failed predictions are billed only for the time they actually ran.

Example: H100 reservation. Prediction runs 45 seconds, succeeds. Bill: 45 × $0.001525/sec = $0.069.

Same prediction fails at 30 seconds. Bill: 30 × $0.001525/sec = $0.046 for those 30 seconds only.
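One way to verify this against real bills: the API reports compute time on each prediction. A minimal sketch with the `replicate` Python client, assuming a placeholder model version running on an H100; `predict_time` is the compute-seconds field in a prediction's metrics:

```python
import replicate

H100_RATE = 0.001525  # per-second rate from the table above

prediction = replicate.predictions.create(
    version="MODEL_VERSION_HASH",  # placeholder: a real model version id
    input={"prompt": "example"},
)
prediction.wait()  # block until the prediction succeeds, fails, or is canceled

# Replicate reports billable compute seconds in the prediction's metrics.
seconds = (prediction.metrics or {}).get("predict_time", 0.0)
print(f"status={prediction.status}, billed ~${seconds * H100_RATE:.4f}")
```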

Cold Start Costs

Model scales to zero, next request is a "cold start." GPU boots, container loads, model weights initialize. Adds 2-10 seconds depending on model and GPU.

Frequent workloads? Negligible. Sparse workloads? It costs more in latency than money: 10 cold starts daily at 5 seconds each = 50 seconds extra (~$0.011/day on T4 at $0.000225/sec), or ~$4/year.

Optimization Techniques

  1. Batch Predictions: Send 10 images in one request instead of 10 separate requests. Batching amortizes cold start overhead.

  2. Keep-Warm Patterns: Send a dummy prediction every 10-15 minutes on critical workloads. The cost is negligible compared with making users wait (see the sketch after this list).

  3. Hardware Right-Sizing: If a T4 finishes in 5 seconds (~$0.0011), an A100 that halves the time to 2.5 seconds still costs more per prediction (2.5 × $0.001400 = $0.0035). Optimize or quantize the model instead.

  4. Async Processing: Flexible latency? Queue overnight reports and run them on the cheapest hardware that fits. Note that the CPU tier isn't automatically the bargain: at these rates, 180 seconds on CPU (~$0.06) costs far more than 10 seconds on a T4 (~$0.002).

  5. Local Caching: Cache predictions locally. A common image prompt generates once and reuses across sessions (also sketched below).
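Minimal sketches of techniques 2 and 5 using the `replicate` Python client; the model identifier, warm-up interval, and cache location are all illustrative assumptions:

```python
import hashlib
import json
import time
from pathlib import Path

import replicate

MODEL = "stability-ai/stable-diffusion-3"  # illustrative model identifier
CACHE_DIR = Path("prediction_cache")
CACHE_DIR.mkdir(exist_ok=True)

def keep_warm(interval_seconds: int = 600) -> None:
    """Technique 2: ping the model every ~10 minutes so it never scales to zero."""
    while True:
        replicate.run(MODEL, input={"prompt": "warmup"})  # tiny billable prediction
        time.sleep(interval_seconds)

def cached_run(inputs: dict):
    """Technique 5: identical inputs generate once and reuse across sessions."""
    key = hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    output = replicate.run(MODEL, input=inputs)
    cache_file.write_text(json.dumps(output, default=str))  # e.g. output URLs
    return output
```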

Scale Analysis {#scale-analysis}

Economics shift at different scales. Cheap at 10 daily predictions. Expensive at 1,000.

Feasibility Thresholds

Below 50 daily predictions: Replicate's convenience wins. 20 images/day costs pennies (roughly $0.05/day on a T4). The engineering to build an alternative costs thousands.

500-2,000 daily: Cost starts to matter. 1,000 images/day on T4 = ~$2.25/day or ~$68/month. Time to evaluate alternatives at scale.

10,000+ daily: Self-hosted or committed GPUs are probably cheaper. Replicate's per-second billing loses to long-term leases at this volume, though self-hosting brings operational overhead.

Volume Discounts

Replicate doesn't advertise volume deals, but call sales. 50,000 daily predictions? Negotiate. Public pricing is mid-market.

Workload Characteristics

Different workloads fit differently:

  • Bursty: Traffic spikes randomly. Replicate auto-scales. Fixed infrastructure over-provisions. Replicate wins.
  • Sparse: Few requests, latency flexible. Replicate handles cold starts. Self-hosted wins only if shared with other workloads.
  • Stable: Predictable, steady. Lease fixed or commit. Self-hosted likely wins on cost.
  • Multi-model: Different models per request. Replicate switches models instantly. Self-hosted needs multiple deployments. Replicate wins operationally, costs might be higher.

Real-World Pricing Examples {#examples}

Example 1: AI Image Generator SaaS

Startup builds a product mockup generator using Stable Diffusion.

Assumptions:

  • 500 daily active users
  • 2 images per user/day (1,000 total)
  • 15 seconds per image on T4

Cost: 1,000 × $0.003 (T4, 15s each) = $3/day

  • Daily: $3
  • Monthly: ~$90
  • Annual: ~$1,095

Revenue at $10/month per user = $5,000/month. After Replicate, ~$4,910 remains for hosting, support, and infrastructure. Good margins for AI.

Example 2: Batch Video Processing Platform

Company processes 100 videos/day (60 seconds each). Processing on A100 takes 120 seconds per video.

Cost: 120 × $0.001400/sec = $0.168/video

  • Daily: $16.80
  • Monthly: ~$504
  • Annual: ~$6,132

At ~$504/month, Replicate still undercuts a leased H100 at ~$1,950/month. Self-hosting only becomes economical once volume grows roughly 4× beyond this.

Example 3: Internal Tool with Llama 2 7B

Company uses Llama 2 7B for internal NLP: email classification, entity extraction, summaries. 30 seconds per prediction on T4.

Cost: $0.00675/prediction (T4, 30s each)

  • Daily: 500 predictions = $3.38
  • Monthly: ~$101
  • Annual: ~$1,234

At ~$101/month, this is cheap to run as-is. Only at far higher volume would it justify engineering time to fine-tune and self-host Llama 2, or a switch to OpenAI's token-based billing.

Hidden Costs and Considerations {#hidden-costs}

Per-second billing is transparent, but several things inflate effective cost.

Output Storage

Replicate stores outputs temporarily, but developers need persistent storage. 100 Stable Diffusion images daily needs somewhere to live.

Options:

  1. Download and store locally (management overhead)
  2. Cloud storage (AWS S3, etc.)

S3 costs $0.023/GB/month. 100 images/day × 2 MB is about 6 GB of new data per month: pennies at first, but storage accumulates month over month (see the projection below).
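A quick projection of that accumulation, assuming the figures above (2 MB per image, S3 standard rates):

```python
S3_RATE_PER_GB_MONTH = 0.023  # AWS S3 standard storage, USD
IMAGES_PER_DAY = 100
MB_PER_IMAGE = 2

for month in (1, 6, 12):
    stored_gb = IMAGES_PER_DAY * MB_PER_IMAGE * 30 * month / 1024
    cost = stored_gb * S3_RATE_PER_GB_MONTH
    print(f"month {month:2d}: {stored_gb:5.1f} GB stored -> ${cost:.2f}/month")
# month  1:   5.9 GB stored -> $0.13/month
# month  6:  35.2 GB stored -> $0.81/month
# month 12:  70.3 GB stored -> $1.62/month
```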

API Request Overhead

Each prediction is an API call, with validation, routing, and logging around it. That overhead is baked into the per-second price rather than billed separately, and in practice it's negligible.

Regional Variation

Replicate operates globally but doesn't vary pricing by region: us-east costs the same as eu-west. Convenient, but it also means there's no regional pricing lever to pull.

Model-Specific Constraints

Some models have minimums or maximums:

  • 5-second minimum charge even if prediction finishes in 0.5 seconds
  • 5-minute timeout regardless of compute needs

Constraints vary by model. Review docs before committing.

Dependency on Third-Party Models

Replicate hosts models but doesn't guarantee them indefinitely: models get updated or deprecated, and cost and behavior can shift with them. There's no long-term cost-stability guarantee.

FAQ

Does Replicate charge during GPU startup time?

Yes, billing starts when the GPU boots if the model cold-starts. This is typically 2-5 seconds. To minimize cold start costs, keep models warm with periodic predictions or batch incoming requests.

Can I estimate prediction time before running?

Replicate provides model-specific documentation with typical runtime ranges. However, actual time varies with input complexity. A 512×512 image prediction takes longer than a 256×256 one. A Whisper prediction on 2 hours of audio takes much longer than 30 seconds.

What happens if my prediction times out?

Each model has a timeout limit (varies by model, typically 5-30 minutes). If a prediction exceeds the timeout, it terminates and Replicate bills only for the time consumed.

Should I use Replicate or build my own GPU infrastructure?

Use Replicate if: Your volume is low (<1,000 predictions/day), you need quick time-to-market, or you want zero infrastructure overhead. Build in-house if: Your volume is high (>10,000 predictions/day), costs are critical, or you need custom model deployment.

How does Replicate compare to Lambda Labs?

Replicate is managed serverless inference; you pay per prediction. Lambda Labs is raw GPU rental; you pay per hour whether the GPU is used or idle. For steady-state, high-volume workloads, Lambda is cheaper. For variable or bursty workloads, Replicate is more economical.

Can I use Replicate for batch processing?

Yes. Replicate supports synchronous (request-response) and asynchronous webhooks. Async is better for batch work: submit 1,000 jobs, they execute in parallel, and Replicate POSTs results to a webhook endpoint as they complete.
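A minimal async submission sketch with the `replicate` Python client; the version hash and webhook URL are placeholders:

```python
import replicate

# Example batch: 1,000 jobs submitted without blocking. Replicate POSTs
# results to the webhook as each prediction completes.
job_inputs = [{"prompt": f"product mockup {i}"} for i in range(1000)]

for job_input in job_inputs:
    replicate.predictions.create(
        version="MODEL_VERSION_HASH",                   # placeholder version id
        input=job_input,
        webhook="https://example.com/hooks/replicate",  # placeholder endpoint
        webhook_events_filter=["completed"],            # only final results
    )
```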
