Contents
- Inference Speed Metrics
- Provider Benchmarks
- Model Comparisons
- Self-Hosted Performance
- Optimization Strategies
- FAQ
- Related Resources
- Sources
Inference Speed Metrics
Time to First Token (TTFT): Latency from request to first output token. 50-500ms typical. Matters most for chat, where users perceive delays directly.
Tokens Per Second (TPS): Speed after first token. 20-200 TPS typical.
Batching Efficiency: How many requests a provider can serve concurrently. Good batching scales throughput. Bad batching collapses it.
Cost-Performance: Speed per dollar beats raw speed. Groq is fast with competitive per-token pricing, though model selection is limited.
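Both timing metrics can be measured from any streaming response. A minimal, provider-agnostic harness (the token iterator is whatever your client's streaming API yields):

```python
import time

def measure_stream(token_iter):
    """Return (TTFT seconds, token count, steady-state TPS) for a token stream."""
    start = time.time()
    ttft = None
    count = 0
    for _ in token_iter:
        if ttft is None:
            ttft = time.time() - start  # time to first token
        count += 1
    total = time.time() - start
    # Steady-state rate excludes the first token's latency.
    gen_time = total - ttft if ttft is not None else 0.0
    tps = (count - 1) / gen_time if count > 1 and gen_time > 0 else float("nan")
    return ttft, count, tps
```

Feed it the stream from your SDK of choice (e.g. iterating a streaming chat response) to get TTFT/TPS numbers that are comparable across providers.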
Provider Benchmarks
OpenAI API (GPT-4 Turbo)
Measured via live API calls over a one-week period:
- Time to First Token: 280ms median (network+inference)
- Tokens Per Second: 45-55 TPS (single request)
- Batch Throughput: 200+ TPS (parallel requests)
- Cost per Million Tokens: $6.50 input + $19.50 output
OpenAI's infrastructure optimized for robustness. Inference purposefully conservative to avoid overload. Batching enables high throughput but adds latency.
Anthropic Claude Opus 4.6
Measured via API under similar conditions:
- Time to First Token: 320ms median
- Tokens Per Second: 35-45 TPS
- Batch Throughput: 150+ TPS
- Cost per Million Tokens: $5.00 input + $25.00 output
Claude slightly slower than GPT-4 but comparable. Batch performance lags OpenAI, limiting throughput in high-load scenarios.
Google Gemini 1.5 Pro
- Time to First Token: 180ms median (fastest cloud provider)
- Tokens Per Second: 55-65 TPS
- Batch Throughput: 250+ TPS
- Cost per Million Tokens: $1.25 input + $2.50 output
Google's infrastructure delivers fastest TTFT. Batching efficiency excellent. Strong throughput at low cost.
Groq API (Mixtral)
- Time to First Token: 120ms (exceptionally fast)
- Tokens Per Second: 450-550 TPS (extreme speed)
- Batch Throughput: 3,000+ TPS
- Cost per Million Tokens: $0.50
Groq uses purpose-built LPU hardware optimized for inference. Dramatically faster than GPU-based providers. Latency minimized through dedicated silicon.
Tradeoff: Limited model selection (Mixtral, Llama). No state-of-the-art models. Suitable for latency-sensitive applications where model capability less critical.
Together AI (Mixtral)
- Time to First Token: 200ms
- Tokens Per Second: 100-150 TPS
- Batch Throughput: 800+ TPS
- Cost per Million Tokens: $0.20
Together uses optimized GPU infrastructure with vLLM backend. Good balance of speed and cost. Model selection limited but growing.
Cohere (Command R+)
- Time to First Token: 250ms
- Tokens Per Second: 60-80 TPS
- Batch Throughput: 300+ TPS
- Cost per Million Tokens: $2.50 input + $10.00 output
Cohere optimized for throughput on their custom infrastructure. Inference speed competitive with OpenAI while maintaining lower cost.
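The per-million-token prices above blend differently depending on your input/output mix. A small comparison helper (prices copied from the figures above; the 75/25 input/output split is an assumption, adjust for your workload):

```python
# (input $/M, output $/M) as listed above; Groq and Together quote one flat rate
PRICES = {
    "OpenAI GPT-4 Turbo": (6.50, 19.50),
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 1.5 Pro": (1.25, 2.50),
    "Groq Mixtral": (0.50, 0.50),
    "Together Mixtral": (0.20, 0.20),
    "Cohere Command R+": (2.50, 10.00),
}

def blended_cost(in_price, out_price, input_frac=0.75):
    """Blended $/M tokens when input_frac of all tokens are input tokens."""
    return in_price * input_frac + out_price * (1 - input_frac)

for name, (inp, out) in sorted(PRICES.items(), key=lambda kv: blended_cost(*kv[1])):
    print(f"{name}: ${blended_cost(inp, out):.2f}/M blended")
```

Note how the ranking shifts with the mix: output-heavy workloads penalize Claude Opus ($25/M output) more than the input prices alone suggest.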
Model Comparisons
Speed varies significantly within the same provider depending on model selection.
OpenAI Inference Speed Tiers
GPT-4 Turbo:
- 45-55 TPS (larger model, slower)
GPT-4o:
- 55-65 TPS (optimized, faster)
GPT-3.5 Turbo:
- 80-100 TPS (smallest model, fastest)
Smaller models 2-3x faster but with reduced capability. Model selection critical for latency-sensitive applications.
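TTFT and TPS combine into a simple end-to-end latency estimate; the figures below use the GPT tier midpoints above:

```python
def response_latency(ttft_s, tps, output_tokens):
    """Wall-clock seconds to stream a full response."""
    return ttft_s + output_tokens / tps

# 256-token reply at GPT-4 Turbo (~50 TPS) vs GPT-3.5 Turbo (~90 TPS)
print(response_latency(0.28, 50, 256))  # ~5.4s
print(response_latency(0.28, 90, 256))  # ~3.1s
```

For short replies TTFT dominates; for long replies TPS dominates, which is why model tier matters most for long-form generation.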
Anthropic Speed Hierarchy
Claude Haiku 4.5:
- 60-80 TPS (fastest tier)
Claude Sonnet 4.6:
- 45-55 TPS (mid-tier)
Claude Opus 4.6:
- 35-45 TPS (most capable, slowest)
Similar inverse relationship. Haiku 2x faster than Opus.
Open-Source Model Speed
When self-hosted on consumer GPU (RTX 4060 Ti 16GB):
Phi-2 2.7B:
- 85-120 TPS (fastest, smallest)
Mistral 7B:
- 25-35 TPS (balanced)
Llama 2 7B:
- 28-38 TPS (comparable)
Llama 2 70B:
- 4-8 TPS (memory-constrained, slow)
Self-hosting small models dramatically faster than API. Trade-off: operational overhead.
Self-Hosted Performance
RunPod H100 Inference
Using vLLM and Mistral 7B:
- Time to First Token: 45ms (measured locally; excludes network latency)
- Tokens Per Second: 85-100 TPS
- Cost per Hour: $2.69 (H100 SXM) or $1.99 (H100 PCIe)
- Cost per Million Tokens: ~$0.75–$1.00 (at scale with batching)
Self-hosted H100 inference faster and cheaper than most APIs. Suitable for high-volume production workloads.
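The quoted cost-per-million range falls out of hourly price and aggregate batched throughput; the ~1,000 TPS aggregate figure below is an assumption chosen to reproduce it, not a measured value:

```python
def cost_per_million_tokens(hourly_usd, aggregate_tps, utilization=1.0):
    """$/M tokens for a rented GPU at a given aggregate (batched) throughput."""
    tokens_per_hour = aggregate_tps * 3600 * utilization
    return hourly_usd / tokens_per_hour * 1e6

print(cost_per_million_tokens(2.69, 1000))  # ~$0.75/M (H100 SXM)
print(cost_per_million_tokens(1.99, 1000))  # ~$0.55/M (H100 PCIe)
```

Utilization matters: at 50% utilization the effective cost doubles, which is why the "at scale with batching" caveat above is doing real work.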
Lambda GPU (H100) Inference
Similar specifications to RunPod:
- Time to First Token: 50ms
- Tokens Per Second: 80-95 TPS
- Cost per Hour: $3.78 (SXM) or $2.86 (PCIe)
- Cost per Million Tokens: ~$1.00–$1.30 (at scale with batching)
Lambda pricing reflects reliability focus with dedicated capacity. Speed comparable to RunPod.
Home Consumer GPU (RTX 4090)
Using Ollama and 7B quantized model:
- Time to First Token: 100ms
- Tokens Per Second: 60-75 TPS
- Cost: $1,500 one-time hardware
- Cost per Million Tokens: ~$0.25-0.30 (electricity only, at ~450 W and $0.15/kWh)
Consumer GPU offers very low marginal cost on top of a one-time hardware purchase, but limited reliability and uptime. Suitable for development and personal use.
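Electricity-only cost is sensitive to assumptions. A quick estimate, assuming ~450 W under load, ~65 TPS, and $0.15/kWh (all three are assumptions, not measurements):

```python
def electricity_cost_per_million(watts, tps, usd_per_kwh=0.15):
    """Electricity $/M tokens for a local GPU generating at `tps`."""
    hours = 1e6 / tps / 3600      # hours to generate one million tokens
    kwh = watts / 1000 * hours
    return kwh * usd_per_kwh

print(electricity_cost_per_million(450, 65))  # ~$0.29/M
```

Cheaper power or a power-limited card shifts the figure proportionally, but amortized hardware cost, not electricity, dominates until the card has generated billions of tokens.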
Optimization Strategies
Batch Processing for Throughput
When latency is less critical, batch requests for throughput:
import time
import asyncio
import anthropic

client = anthropic.Anthropic(api_key="YOUR_KEY")
prompts = ["Explain AI"] * 100

# Sequential baseline: one request at a time
start = time.time()
for prompt in prompts:
    client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
elapsed = time.time() - start
print(f"Sequential: {len(prompts) / elapsed:.1f} requests/s")

# Parallel: the same requests issued concurrently via the async client
async_client = anthropic.AsyncAnthropic(api_key="YOUR_KEY")

async def get_message(prompt):
    return await async_client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )

async def run_all():
    return await asyncio.gather(*[get_message(p) for p in prompts])

start = time.time()
results = asyncio.run(run_all())
elapsed = time.time() - start
print(f"Parallel: {len(prompts) / elapsed:.1f} requests/s")
Parallel requests increase throughput 6-8x despite per-request latency remaining unchanged.
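Firing hundreds of unbounded parallel requests can trip provider rate limits. A sketch that caps in-flight requests with a semaphore (the 16-32 sweet spot is discussed in the FAQ below):

```python
import asyncio

async def gather_bounded(coro_factories, limit=32):
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run_one(factory):
        async with sem:
            return await factory()

    return await asyncio.gather(*(run_one(f) for f in coro_factories))
```

Pass factories (e.g. `functools.partial(get_message, p)`) rather than coroutine objects, so each request is created only when a slot frees up; results come back in input order.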
Model Cascading
Route requests to different models based on complexity:
def get_response(prompt, complexity):
    # groq_mixtral, openai_gpt4o, anthropic_opus are placeholders
    # standing in for calls to each provider's client.
    if complexity == "simple":
        # Fast, cheap model for routine requests
        return groq_mixtral(prompt)
    elif complexity == "medium":
        # Balanced model
        return openai_gpt4o(prompt)
    else:
        # Most capable model for hard requests
        return anthropic_opus(prompt)
Simple requests on Groq (500+ TPS). Complex requests on Claude Opus (40 TPS). Overall system throughput improves 3-5x.
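The complexity signal itself has to come from somewhere. A deliberately naive, length-based heuristic (thresholds are illustrative; production routers often use a small classifier model or a cheap LLM call):

```python
def classify_complexity(prompt):
    """Rough routing heuristic: longer prompts get more capable models."""
    words = len(prompt.split())
    if words < 20:
        return "simple"
    if words < 200:
        return "medium"
    return "complex"
```

Even a crude heuristic captures much of the win, since the bulk of traffic in most applications is short, routine requests.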
Caching and Prompt Optimization
Reduce input token count to improve throughput:
prompt = """
You are an expert at summarizing articles...
[full system prompt]
User input: {user_input}
"""
prompt = "Summarize this article: {user_input}"
Reducing input tokens 10x shortens prefill, and thus TTFT, roughly proportionally; the steady-state generation rate improves only slightly (e.g. from 45 to ~50 TPS).
Pipelined Inference
Process requests through multiple model stages:
User Input -> FastModel (0.5s) -> Router Decision ->
SlowModel if needed (2.0s) -> Output
Average latency: 0.7s vs 2.5s sequential
Routing most requests through the fast model, instead of running both stages for everything, reduces average wall-clock latency.
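The 0.7s average follows if roughly 10% of requests escalate to the slow model; that escalation rate is an assumption implied by the numbers above:

```python
def pipeline_avg_latency(fast_s, slow_s, escalation_rate):
    """Every request pays the fast stage; escalated ones also pay the slow stage."""
    return fast_s + escalation_rate * slow_s

print(pipeline_avg_latency(0.5, 2.0, 0.10))  # ~0.7s, vs 2.5s running both stages always
```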
FAQ
Q: Which provider is fastest? Groq (500+ TPS) for raw speed with affordable pricing. Google Gemini for best TTFT (180ms). OpenAI GPT-4 for capability-speed balance. Answer depends on application requirements.
Q: How much faster is self-hosted than API? Self-hosted H100 (80-100 TPS) faster than most APIs. But operational overhead often exceeds speed advantage for small workloads. Breakeven around 500M+ tokens monthly.
Q: Why does TTFT vary so much (120-320ms)? Infrastructure differences, model size, batching overhead, network latency. Groq's LPU hardware optimized for low TTFT. API providers add batching delay for efficiency.
Q: Can I improve inference speed by reducing output token limit? Marginally. Tokens per second remains constant. Shorter outputs complete faster but don't increase generation speed.
Q: Which model is fastest while maintaining quality? Mistral 7B (25-35 TPS) best balance. Phi-2 faster (100+ TPS) but reduced quality. Haiku models (60-80 TPS) offer good capability-speed ratio. Task-dependent trade-off.
Q: How does batch size affect throughput? Optimal batching increases throughput 4-8x. Beyond 32-64 batch size, returns diminish. Extreme batching (256+) may degrade per-token latency. Sweet spot: 16-32 requests in parallel.
Q: Is inference speed correlated with cost? Not strongly. Groq is both the fastest and among the cheapest; smaller models are cheaper and faster; large models are slower but cheaper per unit of capability. Optimize for cost-adjusted TPS, not raw speed.
Related Resources
- Complete LLM API Pricing Guide
- OpenAI API Pricing
- Anthropic Claude API Pricing
- Groq API Pricing
- GPU Cloud Pricing Report
- RunPod GPU Pricing
- Small Open-Source LLMs on Consumer GPUs
- LLM Gateway and Router Tools
Sources
- Live API benchmarking (March 2026)
- Provider documentation and performance claims
- Third-party inference speed benchmarks
- vLLM performance reports
- Groq official specifications
- Industry LLM benchmark aggregation