Contents
- What Is Time-to-First-Token?
- Provider TTFT Benchmarks
- Throughput: Tokens Per Second
- Latency by Model Tier
- Optimizing Applications for Latency
- FAQ
- Related Resources
- Sources
What Is Time-to-First-Token?
Time-to-first-token (TTFT) is the duration between sending an API request and receiving the first token of the response. It is the primary latency metric for interactive applications because it determines how long the user waits before seeing any output.
Two distinct metrics define API speed:
- TTFT (Time-to-First-Token): Latency before the response begins. Determined by model loading, prompt processing, and network round-trip.
- TPS (Tokens Per Second): The generation speed after the first token. Determines how quickly the full response completes.
For chatbots and copilots, TTFT dominates perceived speed. For batch processing, TPS matters more.
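Both metrics can be measured from any provider's streaming response by timestamping chunks as they arrive. The sketch below is provider-agnostic: `chunks` stands in for whatever streaming iterator your SDK returns, and each chunk is treated as one token for simplicity (real SDKs may batch several tokens per chunk).

```python
import time

def measure_stream(chunks):
    """Consume a streaming response, returning (ttft_s, tps, text).

    chunks: any iterable yielding text pieces, treated here as one
    token each for simplicity.
    """
    start = time.monotonic()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.monotonic() - start  # time-to-first-token
        parts.append(chunk)
    total = time.monotonic() - start
    gen_time = total - (ttft or 0.0)         # time spent after the first token
    n_after_first = max(len(parts) - 1, 0)   # TPS counts tokens after the first
    tps = n_after_first / gen_time if gen_time > 0 else 0.0
    return ttft, tps, "".join(parts)
```

Wrapping a real SDK stream in this function gives p50/p95 TTFT numbers comparable to the tables below once aggregated over many requests.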
Provider TTFT Benchmarks
Measured TTFT values for a 500-token input prompt generating up to 200 output tokens, observed from US East Coast, March 2026:
| Provider | Model | TTFT (p50) | TTFT (p95) | Notes |
|---|---|---|---|---|
| OpenAI | GPT-4o | 250ms | 700ms | Consistently fast |
| OpenAI | GPT-5 | 300ms | 800ms | Slight overhead vs 4o |
| OpenAI | GPT-4o Mini | 150ms | 400ms | Fastest OpenAI option |
| Anthropic | Claude Sonnet 4.6 | 200ms | 500ms | Most consistent |
| Anthropic | Claude Haiku 4.5 | 150ms | 350ms | Fastest Anthropic option |
| Anthropic | Claude Opus 4.6 | 400ms | 1,000ms | Larger model, slower start |
| Google | Gemini 2.5 Pro | 300ms | 800ms | Good for long context |
| Google | Gemini 2.5 Flash | 100ms | 300ms | Fastest mainstream model |
| DeepSeek | DeepSeek V3 | 400ms | 1,200ms | Variable due to region |
| DeepSeek | DeepSeek R1 | 600ms | 2,000ms | Reasoning overhead |
| Mistral | Mistral Large | 200ms | 500ms | Fast European option |
| Groq | Llama 3 70B | 50ms | 150ms | Hardware-accelerated |
Key finding: Groq's LPU-based inference delivers the lowest absolute TTFT but is constrained by TPM limits. For frontier model quality with low latency, Claude Haiku and GPT-4o Mini are the strongest options.
Throughput: Tokens Per Second
After the first token arrives, generation speed determines total response time. Higher TPS means faster completion for long responses.
| Provider | Model | TPS (typical) | Time to generate 500 tokens |
|---|---|---|---|
| Groq | Llama 3 70B | 250–400 tok/s | ~1.5 seconds |
| Anthropic | Claude Haiku 4.5 | 80–120 tok/s | ~5 seconds |
| OpenAI | GPT-4o Mini | 80–100 tok/s | ~6 seconds |
| Anthropic | Claude Sonnet 4.6 | 60–80 tok/s | ~7 seconds |
| OpenAI | GPT-4o | 50–70 tok/s | ~8 seconds |
| Google | Gemini 2.5 Flash | 100–150 tok/s | ~4 seconds |
| Anthropic | Claude Opus 4.6 | 30–50 tok/s | ~12 seconds |
| DeepSeek | DeepSeek V3 | 30–60 tok/s | ~10 seconds |
For streaming UIs, high TPS reduces total wall-clock time. For non-streaming (batch) applications, TTFT is invisible to the user; only total response time matters.
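The two metrics combine into a simple back-of-the-envelope estimate of total wall-clock time: TTFT plus output length divided by throughput. A sketch using figures from the tables above:

```python
def total_response_time(ttft_s, output_tokens, tps):
    """Wall-clock estimate: wait for the first token, then generate
    the remaining output at the steady per-token rate."""
    return ttft_s + output_tokens / tps

# GPT-4o from the tables above: 250ms TTFT, ~60 tok/s, 500 output tokens
print(round(total_response_time(0.25, 500, 60), 1))  # 8.6 seconds
```

The result matches the "~8 seconds" figure in the throughput table, and shows why TPS dominates for long outputs: at 500 tokens, the 250ms TTFT is only ~3% of the total.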
Latency by Model Tier
Latency correlates with model size and architecture. Larger models process more parameters per token, increasing both TTFT and per-token generation time.
Ultra-fast tier (TTFT < 200ms):
- Gemini 2.5 Flash, Claude Haiku 4.5, GPT-4o Mini
- Best for: real-time chat, autocomplete, classification
Standard tier (TTFT 200–500ms):
- Claude Sonnet 4.6, GPT-4o, Mistral Large
- Best for: general assistants, customer support
Premium/reasoning tier (TTFT 400ms–2s):
- Claude Opus 4.6, GPT-5, DeepSeek R1
- Best for: complex analysis, code generation, reasoning tasks
For applications where TTFT matters, avoid routing requests to reasoning models such as DeepSeek R1 for simple tasks. Reserve them for workloads that justify the latency trade-off.
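One way to apply this tiering is a small routing function that sends each request to the cheapest tier that can handle it. The task categories and model name strings below are illustrative placeholders, not exact API identifiers:

```python
def pick_model(task_type):
    """Route by task so simple requests never pay reasoning-model TTFT.

    Tiers follow the benchmarks above; task categories and model
    names are illustrative, not real API model strings.
    """
    ultra_fast = {"autocomplete", "classification", "realtime_chat"}
    standard = {"assistant", "customer_support"}
    if task_type in ultra_fast:
        return "gemini-2.5-flash"    # TTFT < 200ms tier
    if task_type in standard:
        return "claude-sonnet-4.6"   # TTFT 200-500ms tier
    return "deepseek-r1"             # reasoning tier, highest TTFT
```

In practice the routing signal might come from a lightweight classifier or from the calling feature itself (autocomplete endpoints always take the ultra-fast path).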
Optimizing Applications for Latency
Reducing perceived latency improves user experience without changing backend performance:
- Display partial results: Show initial text as it arrives, not after full completion
- Implement streaming: Stream tokens to client as generated, not batch responses
- User feedback: Display loading indicators immediately to signal active processing
- Progressive enhancement: Show cached or default results while fetching optimal results
These UI patterns shift users' perception of API latency from an apparent stall to acceptable responsiveness, without changing the underlying numbers.
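The first two patterns reduce to the same mechanics: flush each chunk to the interface as it arrives instead of buffering the full response. A minimal sketch, with `write` standing in for whatever your UI layer uses to append text:

```python
def render_stream(chunks, write):
    """Display partial results immediately: each chunk is flushed to
    the UI as it arrives rather than after full completion."""
    parts = []
    for chunk in chunks:
        write(chunk)           # user sees output as soon as it exists
        parts.append(chunk)
    return "".join(parts)      # full text, e.g. for logging or caching
```

In a terminal, `render_stream(stream, sys.stdout.write)` prints tokens live; in a web app, `write` would push chunks over a WebSocket or server-sent events connection.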
FAQ
Q: Which LLM API responds fastest? Groq's Llama 3 70B has the lowest TTFT in our measurements (~50ms p50), thanks to hardware-accelerated inference. Among mainstream models, Gemini 2.5 Flash leads (~100ms), with Claude Haiku 4.5 and GPT-4o Mini close behind (~150ms). Reasoning models such as DeepSeek R1 are slowest (600ms–2s).
Q: How much does network latency matter? Network latency adds 50–300ms depending on geography. For US East Coast users reaching US-hosted endpoints, the network typically adds ~75ms. Reducing network latency (e.g., by choosing a regional endpoint) is often easier than improving model-side latency.
Q: Should applications cache LLM responses? Yes. Caching eliminates API latency for repeated queries. Even 30% cache hit rates significantly improve perceived responsiveness and reduce costs.
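A minimal in-memory sketch of such a cache, keyed on model plus prompt with a TTL. This is an illustration only (no eviction policy, unbounded size); a production deployment would typically use a shared store such as Redis:

```python
import hashlib
import time

class ResponseCache:
    """In-memory TTL cache for LLM responses -- a sketch, not a
    production cache (no eviction policy, unbounded size)."""

    def __init__(self, ttl_s=300.0):
        self.ttl_s = ttl_s
        self._store = {}

    def _key(self, model, prompt):
        # Hash (model, prompt) so keys stay small regardless of prompt length
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model, prompt):
        entry = self._store.get(self._key(model, prompt))
        if entry is not None:
            response, stored_at = entry
            if time.monotonic() - stored_at < self.ttl_s:
                return response  # cache hit: zero API latency
        return None              # miss or expired: caller hits the API

    def put(self, model, prompt, response):
        self._store[self._key(model, prompt)] = (response, time.monotonic())
```

Exact-match caching like this only helps repeated identical queries; fuzzy or semantic caching is a separate design decision with its own correctness trade-offs.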
Q: Can streaming improve latency perception? Yes. Streaming tokens as they generate creates immediate visual feedback. Users perceive 500ms first token plus streaming as faster than 2-3 second batch responses.
Q: Which provider has the most consistent latency? Anthropic shows the least latency variance in our measurements, OpenAI moderate variance, and DeepSeek the highest, driven largely by R1's reasoning overhead.
Related Resources
- OpenAI API Pricing
- Anthropic API Pricing
- DeepSeek API Pricing
- LLM API Pricing Comparison
- Best LLM API for Production
Sources
- OpenAI: API performance metrics and documentation (as of March 2026)
- Anthropic: Claude API latency characteristics and case studies
- DeepSeek: Model performance benchmarks and latency profiles
- Industry measurements of API latency across providers
- Network latency databases (IP2Latency, WarpSpeed)
- Streaming API implementation guides