Contents
- Inference Speed Metrics
- Provider Benchmarks
- Model Comparisons
- Self-Hosted Performance
- Optimization Strategies
- FAQ
- Related Resources
- Sources
Inference Speed Metrics
Time to First Token (TTFT): Latency from request to first output token. 50-500ms typical. Matters most for chat, where users perceive delays directly.
Tokens Per Second (TPS): Speed after first token. 20-200 TPS typical.
Batching Efficiency: How many requests a provider can serve concurrently. Good batching scales throughput. Bad batching collapses it.
Cost-Performance: Speed per dollar beats raw speed. Groq is fast with competitive per-token pricing, though model selection is limited.
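Both timing metrics can be measured from any streaming response. A minimal, provider-agnostic harness (the token iterator is whatever your client's streaming API yields):

```python
import time

def measure_stream(token_iter):
    """Return (TTFT seconds, token count, steady-state TPS) for a token stream."""
    start = time.time()
    ttft = None
    count = 0
    for _ in token_iter:
        if ttft is None:
            ttft = time.time() - start  # time to first token
        count += 1
    total = time.time() - start
    # Steady-state rate excludes the first token's latency.
    gen_time = total - ttft if ttft is not None else 0.0
    tps = (count - 1) / gen_time if count > 1 and gen_time > 0 else float("nan")
    return ttft, count, tps
```

Feed it the stream from your SDK of choice (e.g. iterating a streaming chat response) to get TTFT/TPS numbers that are comparable across providers.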
Provider Benchmarks
OpenAI API (GPT-4 Turbo)
Measured via live API calls over a one-week period:
- Time to First Token: 280ms median (network+inference)
- Tokens Per Second: 45-55 TPS (single request)
- Batch Throughput: 200+ TPS (parallel requests)
- Cost per Million Tokens: $6.50 input + $19.50 output
OpenAI's infrastructure optimized for robustness. Inference purposefully conservative to avoid overload. Batching enables high throughput but adds latency.
Anthropic Claude Opus 4.6
Measured via API under similar conditions:
- Time to First Token: 320ms median
- Tokens Per Second: 35-45 TPS
- Batch Throughput: 150+ TPS
- Cost per Million Tokens: $5.00 input + $25.00 output
Claude slightly slower than GPT-4 but comparable. Batch performance lags OpenAI, limiting throughput in high-load scenarios.
Google Gemini 1.5 Pro
- Time to First Token: 180ms median (fastest cloud provider)
- Tokens Per Second: 55-65 TPS
- Batch Throughput: 250+ TPS
- Cost per Million Tokens: $1.25 input + $2.50 output
Google's infrastructure delivers fastest TTFT. Batching efficiency excellent. Strong throughput at low cost.
Groq API (Mixtral)
- Time to First Token: 120ms (exceptionally fast)
- Tokens Per Second: 450-550 TPS (extreme speed)
- Batch Throughput: 3,000+ TPS
- Cost per Million Tokens: $0.50
Groq uses purpose-built LPU hardware optimized for inference. Dramatically faster than GPU-based providers. Latency minimized through dedicated silicon.
Tradeoff: Limited model selection (Mixtral, Llama). No state-of-the-art models. Suitable for latency-sensitive applications where model capability less critical.
Together AI (Mixtral)
- Time to First Token: 200ms
- Tokens Per Second: 100-150 TPS
- Batch Throughput: 800+ TPS
- Cost per Million Tokens: $0.20
Together uses optimized GPU infrastructure with vLLM backend. Good balance of speed and cost. Model selection limited but growing.
Cohere (Command R+)
- Time to First Token: 250ms
- Tokens Per Second: 60-80 TPS
- Batch Throughput: 300+ TPS
- Cost per Million Tokens: $2.50 input + $10.00 output
Cohere optimized for throughput on their custom infrastructure. Inference speed competitive with OpenAI while maintaining lower cost.
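The per-million-token prices above blend differently depending on your input/output mix. A small comparison helper (prices copied from the figures above; the 75/25 input/output split is an assumption, adjust for your workload):

```python
# (input $/M, output $/M) as listed above; Groq and Together quote one flat rate
PRICES = {
    "OpenAI GPT-4 Turbo": (6.50, 19.50),
    "Claude Opus 4.6": (5.00, 25.00),
    "Gemini 1.5 Pro": (1.25, 2.50),
    "Groq Mixtral": (0.50, 0.50),
    "Together Mixtral": (0.20, 0.20),
    "Cohere Command R+": (2.50, 10.00),
}

def blended_cost(in_price, out_price, input_frac=0.75):
    """Blended $/M tokens when input_frac of all tokens are input tokens."""
    return in_price * input_frac + out_price * (1 - input_frac)

for name, (inp, out) in sorted(PRICES.items(), key=lambda kv: blended_cost(*kv[1])):
    print(f"{name}: ${blended_cost(inp, out):.2f}/M blended")
```

Note how the ranking shifts with the mix: output-heavy workloads penalize Claude Opus ($25/M output) more than the input prices alone suggest.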
Model Comparisons
Speed varies significantly within the same provider depending on model selection.
OpenAI Inference Speed Tiers
GPT-4 Turbo:
- 45-55 TPS (larger model, slower)
GPT-4o:
- 55-65 TPS (optimized, faster)
GPT-3.5 Turbo:
- 80-100 TPS (smallest model, fastest)
Smaller models 2-3x faster but with reduced capability. Model selection critical for latency-sensitive applications.
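TTFT and TPS combine into a simple end-to-end latency estimate; the figures below use the GPT tier midpoints above:

```python
def response_latency(ttft_s, tps, output_tokens):
    """Wall-clock seconds to stream a full response."""
    return ttft_s + output_tokens / tps

# 256-token reply at GPT-4 Turbo (~50 TPS) vs GPT-3.5 Turbo (~90 TPS)
print(response_latency(0.28, 50, 256))  # ~5.4s
print(response_latency(0.28, 90, 256))  # ~3.1s
```

For short replies TTFT dominates; for long replies TPS dominates, which is why model tier matters most for long-form generation.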
Anthropic Speed Hierarchy
Claude Haiku 4.5:
- 60-80 TPS (fastest tier)
Claude Sonnet 4.6:
- 45-55 TPS (mid-tier)
Claude Opus 4.6:
- 35-45 TPS (most capable, slowest)
Similar inverse relationship. Haiku 2x faster than Opus.
Open-Source Model Speed
When self-hosted on consumer GPU (RTX 4060 Ti 16GB):
Phi-2 2.7B:
- 85-120 TPS (fastest, smallest)
Mistral 7B:
- 25-35 TPS (balanced)
Llama 2 7B:
- 28-38 TPS (comparable)
Llama 2 70B:
- 4-8 TPS (memory-constrained, slow)
Self-hosting small models dramatically faster than API. Trade-off: operational overhead.
Self-Hosted Performance
RunPod H100 Inference
Using vLLM and Mistral 7B:
- Time to First Token: 45ms (measured locally; excludes network latency)
- Tokens Per Second: 85-100 TPS
- Cost per Hour: $2.69 (H100 SXM) or $1.99 (H100 PCIe)
- Cost per Million Tokens: ~$0.75–$1.00 (at scale with batching)
Self-hosted H100 inference faster and cheaper than most APIs. Suitable for high-volume production workloads.
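The quoted cost-per-million range falls out of hourly price and aggregate batched throughput; the ~1,000 TPS aggregate figure below is an assumption chosen to reproduce it, not a measured value:

```python
def cost_per_million_tokens(hourly_usd, aggregate_tps, utilization=1.0):
    """$/M tokens for a rented GPU at a given aggregate (batched) throughput."""
    tokens_per_hour = aggregate_tps * 3600 * utilization
    return hourly_usd / tokens_per_hour * 1e6

print(cost_per_million_tokens(2.69, 1000))  # ~$0.75/M (H100 SXM)
print(cost_per_million_tokens(1.99, 1000))  # ~$0.55/M (H100 PCIe)
```

Utilization matters: at 50% utilization the effective cost doubles, which is why the "at scale with batching" caveat above is doing real work.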
Lambda GPU (H100) Inference
Similar specifications to RunPod:
- Time to First Token: 50ms
- Tokens Per Second: 80-95 TPS
- Cost per Hour: $3.78 (SXM) or $2.86 (PCIe)
- Cost per Million Tokens: ~$1.00–$1.30 (at scale with batching)
Lambda pricing reflects reliability focus with dedicated capacity. Speed comparable to RunPod.
Home Consumer GPU (RTX 4090)
Using Ollama and 7B quantized model:
- Time to First Token: 100ms
- Tokens Per Second: 60-75 TPS
- Cost: $1,500 one-time hardware
- Cost per Million Tokens: ~$0.25-0.30 (electricity only, at ~450 W and $0.15/kWh)
Consumer GPU offers very low marginal cost on top of a one-time hardware purchase, but limited reliability and uptime. Suitable for development and personal use.
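Electricity-only cost is sensitive to assumptions. A quick estimate, assuming ~450 W under load, ~65 TPS, and $0.15/kWh (all three are assumptions, not measurements):

```python
def electricity_cost_per_million(watts, tps, usd_per_kwh=0.15):
    """Electricity $/M tokens for a local GPU generating at `tps`."""
    hours = 1e6 / tps / 3600      # hours to generate one million tokens
    kwh = watts / 1000 * hours
    return kwh * usd_per_kwh

print(electricity_cost_per_million(450, 65))  # ~$0.29/M
```

Cheaper power or a power-limited card shifts the figure proportionally, but amortized hardware cost, not electricity, dominates until the card has generated billions of tokens.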
Optimization Strategies
Batch Processing for Throughput
When latency is less critical, batch requests for throughput:
import time
import asyncio
import anthropic

client = anthropic.Anthropic(api_key="YOUR_KEY")
prompts = ["Explain AI"] * 100

# Sequential baseline: one request at a time
start = time.time()
for prompt in prompts:
    client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
elapsed = time.time() - start
print(f"Sequential: {len(prompts) / elapsed:.1f} requests/s")

# Parallel: the same requests issued concurrently via the async client
async_client = anthropic.AsyncAnthropic(api_key="YOUR_KEY")

async def get_message(prompt):
    return await async_client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )

async def run_all():
    return await asyncio.gather(*[get_message(p) for p in prompts])

start = time.time()
results = asyncio.run(run_all())
elapsed = time.time() - start
print(f"Parallel: {len(prompts) / elapsed:.1f} requests/s")
Parallel requests increase throughput 6-8x despite per-request latency remaining unchanged.
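Firing hundreds of unbounded parallel requests can trip provider rate limits. A sketch that caps in-flight requests with a semaphore (the 16-32 sweet spot is discussed in the FAQ below):

```python
import asyncio

async def gather_bounded(coro_factories, limit=32):
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run_one(factory):
        async with sem:
            return await factory()

    return await asyncio.gather(*(run_one(f) for f in coro_factories))
```

Pass factories (e.g. `functools.partial(get_message, p)`) rather than coroutine objects, so each request is created only when a slot frees up; results come back in input order.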
Model Cascading
Route requests to different models based on complexity:
def get_response(prompt, complexity):
    # groq_mixtral, openai_gpt4o, anthropic_opus are placeholders
    # standing in for calls to each provider's client.
    if complexity == "simple":
        # Fast, cheap model for routine requests
        return groq_mixtral(prompt)
    elif complexity == "medium":
        # Balanced model
        return openai_gpt4o(prompt)
    else:
        # Most capable model for hard requests
        return anthropic_opus(prompt)
Simple requests on Groq (500+ TPS). Complex requests on Claude Opus (40 TPS). Overall system throughput improves 3-5x.
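The complexity signal itself has to come from somewhere. A deliberately naive, length-based heuristic (thresholds are illustrative; production routers often use a small classifier model or a cheap LLM call):

```python
def classify_complexity(prompt):
    """Rough routing heuristic: longer prompts get more capable models."""
    words = len(prompt.split())
    if words < 20:
        return "simple"
    if words < 200:
        return "medium"
    return "complex"
```

Even a crude heuristic captures much of the win, since the bulk of traffic in most applications is short, routine requests.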
Caching and Prompt Optimization
Reduce input token count to improve throughput:
prompt = """
You are an expert at summarizing articles...
[full system prompt]
User input: {user_input}
"""
prompt = "Summarize this article: {user_input}"
Reducing input tokens 10x shortens prefill, and thus TTFT, roughly proportionally; the steady-state generation rate improves only slightly (e.g. from 45 to ~50 TPS).
Pipelined Inference
Process requests through multiple model stages:
User Input -> FastModel (0.5s) -> Router Decision ->
SlowModel if needed (2.0s) -> Output
Average latency: 0.7s vs 2.5s sequential
Routing most requests through the fast model, instead of running both stages for everything, reduces average wall-clock latency.
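The 0.7s average follows if roughly 10% of requests escalate to the slow model; that escalation rate is an assumption implied by the numbers above:

```python
def pipeline_avg_latency(fast_s, slow_s, escalation_rate):
    """Every request pays the fast stage; escalated ones also pay the slow stage."""
    return fast_s + escalation_rate * slow_s

print(pipeline_avg_latency(0.5, 2.0, 0.10))  # ~0.7s, vs 2.5s running both stages always
```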
FAQ
Q: Which provider is fastest? Groq (500+ TPS) for raw speed with affordable pricing. Google Gemini for best TTFT (180ms). OpenAI GPT-4 for capability-speed balance. Answer depends on application requirements.
Q: How much faster is self-hosted than API? Self-hosted H100 (80-100 TPS) faster than most APIs. But operational overhead often exceeds speed advantage for small workloads. Breakeven around 500M+ tokens monthly.
Q: Why does TTFT vary so much (120-320ms)? Infrastructure differences, model size, batching overhead, network latency. Groq's LPU hardware optimized for low TTFT. API providers add batching delay for efficiency.
Q: Can I improve inference speed by reducing output token limit? Marginally. Tokens per second remains constant. Shorter outputs complete faster but don't increase generation speed.
Q: Which model is fastest while maintaining quality? Mistral 7B (25-35 TPS) best balance. Phi-2 faster (100+ TPS) but reduced quality. Haiku models (60-80 TPS) offer good capability-speed ratio. Task-dependent trade-off.
Q: How does batch size affect throughput? Optimal batching increases throughput 4-8x. Beyond 32-64 batch size, returns diminish. Extreme batching (256+) may degrade per-token latency. Sweet spot: 16-32 requests in parallel.
Q: Is inference speed correlated with cost? Not strongly. Groq is both the fastest and among the cheapest; smaller models are cheaper and faster; large models are slower but cheaper per unit of capability. Optimize for cost-adjusted TPS, not raw speed.
Related Resources
- Complete LLM API Pricing Guide
- OpenAI API Pricing
- Anthropic Claude API Pricing
- Groq API Pricing
- GPU Cloud Pricing Report
- RunPod GPU Pricing
- Small Open-Source LLMs on Consumer GPUs
- LLM Gateway and Router Tools
Sources
- Live API benchmarking (March 2026)
- Provider documentation and performance claims
- Third-party inference speed benchmarks
- vLLM performance reports
- Groq official specifications
- Industry LLM benchmark aggregation