Contents
- What Is Time-to-First-Token?
- Provider TTFT Benchmarks
- Throughput: Tokens Per Second
- Latency by Model Tier
- Optimizing Applications for Latency
- FAQ
- Related Resources
- Sources
What Is Time-to-First-Token?
Time-to-first-token (TTFT) is the duration between sending an API request and receiving the first token of the response. It is the primary latency metric for interactive applications because it determines how long the user waits before seeing any output.
Two distinct metrics define API speed:
- TTFT (Time-to-First-Token): Latency before the response begins. Determined by model loading, prompt processing, and network round-trip.
- TPS (Tokens Per Second): The generation speed after the first token. Determines how quickly the full response completes.
For chatbots and copilots, TTFT dominates perceived speed. For batch processing, TPS matters more.
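Both metrics can be measured from any provider's streaming response by timestamping chunks as they arrive. The sketch below is provider-agnostic: `chunks` stands in for whatever streaming iterator your SDK returns, and each chunk is treated as one token for simplicity (real SDKs may batch several tokens per chunk).

```python
import time

def measure_stream(chunks):
    """Consume a streaming response, returning (ttft_s, tps, text).

    chunks: any iterable yielding text pieces, treated here as one
    token each for simplicity.
    """
    start = time.monotonic()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.monotonic() - start  # time-to-first-token
        parts.append(chunk)
    total = time.monotonic() - start
    gen_time = total - (ttft or 0.0)         # time spent after the first token
    n_after_first = max(len(parts) - 1, 0)   # TPS counts tokens after the first
    tps = n_after_first / gen_time if gen_time > 0 else 0.0
    return ttft, tps, "".join(parts)
```

Wrapping a real SDK stream in this function gives p50/p95 TTFT numbers comparable to the tables below once aggregated over many requests.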
Provider TTFT Benchmarks
Measured TTFT values for a 500-token input prompt generating up to 200 output tokens, observed from US East Coast, March 2026:
| Provider | Model | TTFT (p50) | TTFT (p95) | Notes |
|---|---|---|---|---|
| OpenAI | GPT-4o | 250ms | 700ms | Consistently fast |
| OpenAI | GPT-5 | 300ms | 800ms | Slight overhead vs 4o |
| OpenAI | GPT-4o Mini | 150ms | 400ms | Fastest OpenAI option |
| Anthropic | Claude Sonnet 4.6 | 200ms | 500ms | Most consistent |
| Anthropic | Claude Haiku 4.5 | 150ms | 350ms | Fastest Anthropic option |
| Anthropic | Claude Opus 4.6 | 400ms | 1,000ms | Larger model, slower start |
| Google | Gemini 2.5 Pro | 300ms | 800ms | Good for long context |
| Google | Gemini 2.5 Flash | 100ms | 300ms | Fastest mainstream model |
| DeepSeek | DeepSeek V3 | 400ms | 1,200ms | Variable due to region |
| DeepSeek | DeepSeek R1 | 600ms | 2,000ms | Reasoning overhead |
| Mistral | Mistral Large | 200ms | 500ms | Fast European option |
| Groq | Llama 3 70B | 50ms | 150ms | Hardware-accelerated |
Key finding: Groq's LPU-based inference delivers the lowest absolute TTFT but is constrained by TPM limits. For frontier model quality with low latency, Claude Haiku and GPT-4o Mini are the strongest options.
Throughput: Tokens Per Second
After the first token arrives, generation speed determines total response time. Higher TPS means faster completion for long responses.
| Provider | Model | TPS (typical) | Time to generate 500 tokens |
|---|---|---|---|
| Groq | Llama 3 70B | 250–400 tok/s | ~1.5 seconds |
| Anthropic | Claude Haiku 4.5 | 80–120 tok/s | ~5 seconds |
| OpenAI | GPT-4o Mini | 80–100 tok/s | ~6 seconds |
| Anthropic | Claude Sonnet 4.6 | 60–80 tok/s | ~7 seconds |
| OpenAI | GPT-4o | 50–70 tok/s | ~8 seconds |
| Google | Gemini 2.5 Flash | 100–150 tok/s | ~4 seconds |
| Anthropic | Claude Opus 4.6 | 30–50 tok/s | ~12 seconds |
| DeepSeek | DeepSeek V3 | 30–60 tok/s | ~10 seconds |
For streaming UIs, high TPS reduces total wall-clock time. For non-streaming (batch) applications, TTFT is invisible to the user; only total response time matters.
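The two metrics combine into a simple back-of-the-envelope estimate of total wall-clock time: TTFT plus output length divided by throughput. A sketch using figures from the tables above:

```python
def total_response_time(ttft_s, output_tokens, tps):
    """Wall-clock estimate: wait for the first token, then generate
    the remaining output at the steady per-token rate."""
    return ttft_s + output_tokens / tps

# GPT-4o from the tables above: 250ms TTFT, ~60 tok/s, 500 output tokens
print(round(total_response_time(0.25, 500, 60), 1))  # 8.6 seconds
```

The result matches the "~8 seconds" figure in the throughput table, and shows why TPS dominates for long outputs: at 500 tokens, the 250ms TTFT is only ~3% of the total.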
Latency by Model Tier
Latency correlates with model size and architecture. Larger models process more parameters per token, increasing both TTFT and per-token generation time.
Ultra-fast tier (TTFT < 200ms):
- Gemini 2.5 Flash, Claude Haiku 4.5, GPT-4o Mini
- Best for: real-time chat, autocomplete, classification
Standard tier (TTFT 200–500ms):
- Claude Sonnet 4.6, GPT-4o, Mistral Large
- Best for: general assistants, customer support
Premium/reasoning tier (TTFT 400ms–2s):
- Claude Opus 4.6, GPT-5, DeepSeek R1
- Best for: complex analysis, code generation, reasoning tasks
For applications where TTFT matters, avoid routing requests to reasoning models such as DeepSeek R1 for simple tasks. Reserve them for workloads that justify the latency trade-off.
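One way to apply this tiering is a small routing function that sends each request to the cheapest tier that can handle it. The task categories and model name strings below are illustrative placeholders, not exact API identifiers:

```python
def pick_model(task_type):
    """Route by task so simple requests never pay reasoning-model TTFT.

    Tiers follow the benchmarks above; task categories and model
    names are illustrative, not real API model strings.
    """
    ultra_fast = {"autocomplete", "classification", "realtime_chat"}
    standard = {"assistant", "customer_support"}
    if task_type in ultra_fast:
        return "gemini-2.5-flash"    # TTFT < 200ms tier
    if task_type in standard:
        return "claude-sonnet-4.6"   # TTFT 200-500ms tier
    return "deepseek-r1"             # reasoning tier, highest TTFT
```

In practice the routing signal might come from a lightweight classifier or from the calling feature itself (autocomplete endpoints always take the ultra-fast path).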
Optimizing Applications for Latency
Reducing perceived latency improves user experience without changing backend performance:
- Display partial results: Show initial text as it arrives, not after full completion
- Implement streaming: Stream tokens to client as generated, not batch responses
- User feedback: Display loading indicators immediately to signal active processing
- Progressive enhancement: Show cached or default results while fetching optimal results
These UI patterns shift users' perception of API latency from an apparent stall to acceptable responsiveness, without changing the underlying numbers.
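The first two patterns reduce to the same mechanics: flush each chunk to the interface as it arrives instead of buffering the full response. A minimal sketch, with `write` standing in for whatever your UI layer uses to append text:

```python
def render_stream(chunks, write):
    """Display partial results immediately: each chunk is flushed to
    the UI as it arrives rather than after full completion."""
    parts = []
    for chunk in chunks:
        write(chunk)           # user sees output as soon as it exists
        parts.append(chunk)
    return "".join(parts)      # full text, e.g. for logging or caching
```

In a terminal, `render_stream(stream, sys.stdout.write)` prints tokens live; in a web app, `write` would push chunks over a WebSocket or server-sent events connection.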
FAQ
Q: Which LLM API responds fastest? Groq's Llama 3 70B has the lowest TTFT in our measurements (~50ms p50), thanks to hardware-accelerated inference. Among mainstream models, Gemini 2.5 Flash leads (~100ms), with Claude Haiku 4.5 and GPT-4o Mini close behind (~150ms). Reasoning models such as DeepSeek R1 are slowest (600ms–2s).
Q: How much does network latency matter? Network latency adds 50–300ms depending on geography. For US East Coast users reaching US-hosted endpoints, the network typically adds ~75ms. Reducing network latency (e.g., by choosing a regional endpoint) is often easier than improving model-side latency.
Q: Should applications cache LLM responses? Yes. Caching eliminates API latency for repeated queries. Even 30% cache hit rates significantly improve perceived responsiveness and reduce costs.
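A minimal in-memory sketch of such a cache, keyed on model plus prompt with a TTL. This is an illustration only (no eviction policy, unbounded size); a production deployment would typically use a shared store such as Redis:

```python
import hashlib
import time

class ResponseCache:
    """In-memory TTL cache for LLM responses -- a sketch, not a
    production cache (no eviction policy, unbounded size)."""

    def __init__(self, ttl_s=300.0):
        self.ttl_s = ttl_s
        self._store = {}

    def _key(self, model, prompt):
        # Hash (model, prompt) so keys stay small regardless of prompt length
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model, prompt):
        entry = self._store.get(self._key(model, prompt))
        if entry is not None:
            response, stored_at = entry
            if time.monotonic() - stored_at < self.ttl_s:
                return response  # cache hit: zero API latency
        return None              # miss or expired: caller hits the API

    def put(self, model, prompt, response):
        self._store[self._key(model, prompt)] = (response, time.monotonic())
```

Exact-match caching like this only helps repeated identical queries; fuzzy or semantic caching is a separate design decision with its own correctness trade-offs.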
Q: Can streaming improve latency perception? Yes. Streaming tokens as they generate creates immediate visual feedback. Users perceive 500ms first token plus streaming as faster than 2-3 second batch responses.
Q: Which provider has the most consistent latency? Anthropic shows the least latency variance in our measurements, OpenAI moderate variance, and DeepSeek the highest, driven largely by R1's reasoning overhead.
Related Resources
- OpenAI API Pricing
- Anthropic API Pricing
- DeepSeek API Pricing
- LLM API Pricing Comparison
- Best LLM API for Production
Sources
- OpenAI: API performance metrics and documentation (as of March 2026)
- Anthropic: Claude API latency characteristics and case studies
- DeepSeek: Model performance benchmarks and latency profiles
- Industry measurements of API latency across providers
- Network latency databases (IP2Latency, WarpSpeed)
- Streaming API implementation guides