LLM API Latency Comparison: Time-to-First-Token Analysis

Deploybase · June 26, 2025 · LLM Pricing

What Is Time-to-First-Token?

Time-to-first-token (TTFT) is the duration between sending an API request and receiving the first token of the response. It is the primary latency metric for interactive applications because it determines how long the user waits before seeing any output.

Two distinct metrics define API speed:

  • TTFT (Time-to-First-Token): Latency before the response begins. Determined by model loading, prompt processing, and network round-trip.
  • TPS (Tokens Per Second): The generation speed after the first token. Determines how quickly the full response completes.

For chatbots and copilots, TTFT dominates perceived speed. For batch processing, TPS matters more.
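Both metrics can be measured from a single streamed response: TTFT is the gap between sending the request and the first chunk, and TPS is the token count divided by the time spent generating after that first chunk. The sketch below measures any iterable of tokens; `fake_stream` is an illustrative stand-in for a real provider's streaming response.

```python
import time

def measure_stream(token_iter):
    """Consume a token stream and return (ttft_s, tps, n_tokens).

    token_iter: any iterable yielding response tokens, e.g. the chunks
    of a streaming API response (the stream below is simulated).
    """
    start = time.perf_counter()
    first = None
    n = 0
    for _ in token_iter:
        n += 1
        if first is None:
            first = time.perf_counter()  # first token arrived: TTFT endpoint
    end = time.perf_counter()
    ttft = (first - start) if first is not None else float("nan")
    gen_time = (end - first) if first is not None else 0.0
    # Generation speed counts tokens after the first one
    tps = (n - 1) / gen_time if gen_time > 0 else float("inf")
    return ttft, tps, n

def fake_stream(ttft_s=0.05, per_token_s=0.01, n_tokens=20):
    """Simulated provider stream: one TTFT delay, then steady generation."""
    time.sleep(ttft_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"
```

Running `measure_stream(fake_stream())` on this simulated stream reports a TTFT of roughly 50 ms and a generation rate near 100 tok/s, matching the delays configured above.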

Provider TTFT Benchmarks

Measured TTFT values for a 500-token input prompt generating up to 200 output tokens, observed from US East Coast, March 2026:

Provider | Model | TTFT (p50) | TTFT (p95) | Notes
---------|-------|------------|------------|------
OpenAI | GPT-4o | 250 ms | 700 ms | Consistently fast
OpenAI | GPT-5 | 300 ms | 800 ms | Slight overhead vs 4o
OpenAI | GPT-4o Mini | 150 ms | 400 ms | Fastest OpenAI option
Anthropic | Claude Sonnet 4.6 | 200 ms | 500 ms | Most consistent
Anthropic | Claude Haiku 4.5 | 150 ms | 350 ms | Fastest Anthropic option
Anthropic | Claude Opus 4.6 | 400 ms | 1,000 ms | Larger model, slower start
Google | Gemini 2.5 Pro | 300 ms | 800 ms | Good for long context
Google | Gemini 2.5 Flash | 100 ms | 300 ms | Fastest mainstream model
DeepSeek | DeepSeek V3 | 400 ms | 1,200 ms | Variable due to region
DeepSeek | DeepSeek R1 | 600 ms | 2,000 ms | Reasoning overhead
Mistral | Mistral Large | 200 ms | 500 ms | Fast European option
Groq | Llama 3 70B | 50 ms | 150 ms | Hardware-accelerated

Key finding: Groq's LPU-based inference delivers the lowest absolute TTFT but is constrained by TPM limits. For frontier model quality with low latency, Claude Haiku and GPT-4o Mini are the strongest options.

Throughput: Tokens Per Second

After the first token arrives, generation speed determines total response time. Higher TPS means faster completion for long responses.

Provider | Model | TPS (typical) | ~Total time, 500-token response
---------|-------|---------------|--------------------------------
Groq | Llama 3 70B | 250–400 tok/s | ~1.5 seconds
Google | Gemini 2.5 Flash | 100–150 tok/s | ~4 seconds
Anthropic | Claude Haiku 4.5 | 80–120 tok/s | ~5 seconds
OpenAI | GPT-4o Mini | 80–100 tok/s | ~6 seconds
Anthropic | Claude Sonnet 4.6 | 60–80 tok/s | ~7 seconds
OpenAI | GPT-4o | 50–70 tok/s | ~8 seconds
DeepSeek | DeepSeek V3 | 30–60 tok/s | ~10 seconds
Anthropic | Claude Opus 4.6 | 30–50 tok/s | ~12 seconds

For streaming UIs, high TPS reduces total wall-clock time. For non-streaming (batch) applications, only total response time matters, not TTFT.
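Total wall-clock time for a streamed response is simply TTFT plus the generation time for the remaining tokens. A minimal estimator, using the p50 figures from the tables above (the specific numbers plugged in are examples, not guarantees):

```python
def total_response_time(ttft_ms, tps, output_tokens):
    """Estimated wall-clock seconds for a streamed response:
    time to first token + generation time for the remaining tokens."""
    return ttft_ms / 1000 + (output_tokens - 1) / tps

# Claude Haiku 4.5 at p50: 150 ms TTFT, ~100 tok/s, 500 output tokens
print(round(total_response_time(150, 100, 500), 1))  # ~5.1 s
```

This makes the trade-off explicit: for short responses TTFT dominates the total, while for long responses TPS does.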

Latency by Model Tier

Latency correlates with model size and architecture. Larger models process more parameters per token, increasing both TTFT and per-token generation time.

Ultra-fast tier (TTFT < 200ms):

  • Gemini 2.5 Flash, Claude Haiku 4.5, GPT-4o Mini
  • Best for: real-time chat, autocomplete, classification

Standard tier (TTFT 200–500ms):

  • Claude Sonnet 4.6, GPT-4o, Mistral Large
  • Best for: general assistants, customer support

Premium/reasoning tier (TTFT 400ms–2s):

  • Claude Opus 4.6, GPT-5, DeepSeek R1
  • Best for: complex analysis, code generation, reasoning tasks

For applications where TTFT matters, avoid routing requests to reasoning models (o3, R1) for simple tasks. Reserve them for workloads that justify the latency trade-off.
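The routing advice above can be sketched as a simple tier-based dispatcher. The model identifiers and task labels here are illustrative placeholders, not any provider's actual API names:

```python
# Tiers taken from the breakdown above; model names are illustrative.
FAST = ["gemini-2.5-flash", "claude-haiku-4.5", "gpt-4o-mini"]
STANDARD = ["claude-sonnet-4.6", "gpt-4o", "mistral-large"]
REASONING = ["claude-opus-4.6", "gpt-5", "deepseek-r1"]

def pick_model(task: str, latency_budget_ms: int) -> str:
    """Route simple or latency-critical tasks to fast models; only pay
    reasoning-model TTFT when the task needs it and the budget allows."""
    if task in {"classification", "autocomplete"} or latency_budget_ms < 200:
        return FAST[0]
    if task in {"analysis", "reasoning"} and latency_budget_ms >= 1000:
        return REASONING[0]
    return STANDARD[0]
```

A real router would also consider cost, context length, and provider rate limits; this only captures the latency dimension.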

Optimizing Applications for Latency

Reducing perceived latency improves user experience without changing backend performance:

  1. Display partial results: Show initial text as it arrives, not after full completion
  2. Implement streaming: Stream tokens to client as generated, not batch responses
  3. User feedback: Display loading indicators immediately to signal active processing
  4. Progressive enhancement: Show cached or default results while fetching optimal results

These UI patterns can make the same backend latency feel substantially faster: the user sees progress within the TTFT window instead of staring at a blank screen for the full response time.
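Patterns 1 and 2 reduce to the same implementation idea: write each token as it arrives rather than buffering the full response. A minimal sketch (the in-memory buffer stands in for a terminal or UI component):

```python
import io
import sys

def render_stream(tokens, out=sys.stdout):
    """Write each token the moment it arrives and flush, so the user
    sees partial output during generation instead of a blank wait."""
    received = []
    for tok in tokens:
        out.write(tok)
        out.flush()  # don't let buffering hide partial results
        received.append(tok)
    out.write("\n")
    return "".join(received)

# Demo with an in-memory buffer standing in for the display:
buf = io.StringIO()
full = render_stream(["Str", "eam", "ing"], out=buf)
```

The same loop works with a real streaming API response in place of the list, since `render_stream` only needs an iterable of text chunks.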

FAQ

Q: Which LLM API responds fastest? Groq delivers the lowest absolute TTFT (50–150ms) thanks to hardware-accelerated inference. Among mainstream providers, Gemini 2.5 Flash (100–300ms), Claude Haiku 4.5, and GPT-4o Mini (both ~150ms p50) are fastest. Reasoning models are slowest; DeepSeek R1 ranges from 600ms to 2s.

Q: How much does network latency matter? Network latency adds 50-300ms depending on geography. For East Coast US users, network adds ~75ms. Optimizing network often proves easier than improving API latency.

Q: Should applications cache LLM responses? Yes. Caching eliminates API latency for repeated queries. Even 30% cache hit rates significantly improve perceived responsiveness and reduce costs.
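An exact-match response cache is a few lines of code. This is a minimal sketch; production caches typically add TTLs and often semantic (embedding-based) matching, which this does not attempt:

```python
import hashlib
import json

class ResponseCache:
    """Exact-match cache keyed on (model, prompt)."""

    def __init__(self):
        self._store = {}

    def _key(self, model: str, prompt: str) -> str:
        # Stable key: hash the JSON encoding of the pair
        return hashlib.sha256(json.dumps([model, prompt]).encode()).hexdigest()

    def get(self, model: str, prompt: str):
        """Return a cached response, or None on a miss (then call the API)."""
        return self._store.get(self._key(model, prompt))

    def put(self, model: str, prompt: str, response: str):
        self._store[self._key(model, prompt)] = response
```

On a hit, the effective TTFT drops to essentially zero, which is why even modest hit rates move the perceived-latency needle.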

Q: Can streaming improve latency perception? Yes. Streaming tokens as they generate creates immediate visual feedback. Users perceive 500ms first token plus streaming as faster than 2-3 second batch responses.

Q: Which provider has most consistent latency? Anthropic shows least latency variance. OpenAI shows moderate variance. DeepSeek shows highest variance due to reasoning overhead.

Sources

  • OpenAI: API performance metrics and documentation (as of March 2026)
  • Anthropic: Claude API latency characteristics and case studies
  • DeepSeek: Model performance benchmarks and latency profiles
  • Industry measurements of API latency across providers
  • Network latency databases (IP2Latency, WarpSpeed)
  • Streaming API implementation guides