LLM API Rate Limits Compared: All Providers

Deploybase · July 8, 2025 · LLM Pricing

Contents

Rate Limit Overview

Rate limits cap how fast you can call an API. Providers enforce two types:

Requests Per Minute (RPM): Max calls per 60 seconds. Matters for high-frequency, low-token work (classification, embedding).

Tokens Per Minute (TPM): Max token throughput. Matters for high-volume work (batch, parallelization).

Providers enforce both simultaneously. Hit either limit and requests are throttled.

Provider Comparisons

OpenAI GPT-4 Turbo

Free Trial:

  • RPM: 3 requests/minute
  • TPM: 40,000 tokens/minute
  • Credit: $5 of free credit, expiring 3 months after signup

Paid (Pay-as-you-go, Tier 1–4):

  • RPM: 3,500 requests/minute
  • TPM: 2,000,000 tokens/minute
  • Soft limits increase automatically with account spend history

Tier 5 (highest spend tier):

  • RPM: 10,000+ requests/minute
  • TPM: 30,000,000 tokens/minute
  • Requires $1,000+ cumulative spend

These limits accommodate continuous inference at scale and are sufficient for all but the largest production deployments.

Exceeding a limit triggers an HTTP 429 (rate limited) response. The retry-after header indicates how long to wait before retrying; OpenAI recommends exponential backoff.
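A client can honor retry-after when present and fall back to exponential backoff when it is absent. A minimal sketch (the header lookup assumes lowercase keys, as most HTTP client libraries normalize them; function name is illustrative):

```python
def wait_for_retry(headers, attempt, base=1.0, cap=60.0):
    """Return seconds to sleep after a 429: honor retry-after if present,
    otherwise fall back to capped exponential backoff."""
    retry_after = headers.get("retry-after")
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)
        except ValueError:
            pass  # non-numeric retry-after (HTTP-date form); fall through
    return min(base * (2 ** attempt), cap)
```

The cap keeps a long retry-after (or a deep retry attempt) from stalling a worker for minutes.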

Anthropic Claude 3

Free Trial:

  • RPM: 5 requests/minute
  • TPM: 40,000 tokens/minute
  • No expiration but account reviewed monthly

Paid (Pay-as-you-go):

  • RPM: 1,000 requests/minute
  • TPM: 400,000 tokens/minute

Build tier (higher spend):

  • RPM: 4,000 requests/minute
  • TPM: 400,000 tokens/minute

Enterprise:

  • Custom limits based on commitment
  • Typical: 50,000+ RPM, 4,000,000+ TPM

Anthropic's limits reflect focus on long-context applications. Limits increase by tier as cumulative spend grows.

Cohere Command R+

Free Trial:

  • RPM: 50 requests/minute
  • TPM: 100,000 tokens/minute
  • $500 free credit

Paid:

  • RPM: 1,000 requests/minute
  • TPM: 500,000 tokens/minute

Organization-level limits (all users):

  • RPM: 10,000 requests/minute
  • TPM: 5,000,000 tokens/minute

Cohere's generous free tier RPM suits testing and prototyping. Its TPM limits are higher than OpenAI's at equivalent price points.

Groq API

Groq emphasizes throughput over request count limits:

Free Tier:

  • RPM: Unlimited
  • TPM: 14,400 tokens/minute
  • No authentication required

Paid:

  • RPM: Unlimited
  • TPM: 144,000 tokens/minute

Groq removes RPM limits entirely. Business model relies on low per-token cost, not request throttling. Exceptional for high-frequency classification tasks.

Together AI

Designed for distributed inference:

Free Trial:

  • RPM: 50 requests/minute
  • TPM: 100,000 tokens/minute

Standard:

  • RPM: 200 requests/minute
  • TPM: 500,000 tokens/minute

Large:

  • RPM: 1,000+ requests/minute
  • TPM: 2,000,000+ tokens/minute

Together's RPM limits reflect focus on batch processing rather than real-time inference.

Free vs Paid Tiers

Rate Limit Strategy Matrix

Provider  | Free RPM  | Free TPM | Paid RPM  | Paid TPM | Max Tier TPM
----------|-----------|----------|-----------|----------|-----------------
OpenAI    | 3         | 40K      | 3,500     | 2M       | 30M (Tier 5)
Anthropic | 5         | 40K      | 1,000     | 400K     | 4M+ (Enterprise)
Cohere    | 50        | 100K     | 1,000     | 500K     | 5M (org-level)
Groq      | Unlimited | 14.4K    | Unlimited | 144K     | Custom
Together  | 50        | 100K     | 200       | 500K     | 2M+ (Large)

OpenAI and Anthropic maintain strict free tier limits. Cohere and Together are more permissive for testing. Groq trades TPM headroom for unlimited RPM.

Free tier upgrades are typically automatic once the trial expires and a payment method is added; no approval is required.

Request per Minute Limits

RPM Impact on Real-Time Applications

OpenAI's 3 RPM free tier barely handles a single user: three simultaneous API calls hit the limit immediately.

Paid tiers (3,500+ RPM) enable:

  • 100+ concurrent users with single request each
  • Real-time chatbot serving small user base
  • Batch processing at roughly 58 requests/second (3,500 / 60)

RPM becomes the bottleneck for high-frequency classification workloads. Example: processing 100K documents through a moderation API takes 100 minutes at Cohere's paid 1,000 RPM limit, or 500 minutes at Together's standard 200 RPM tier.
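The wall-clock arithmetic is just a ceiling division of request count by the RPM cap (function name is illustrative):

```python
import math

def batch_minutes(n_requests, rpm):
    """Wall-clock minutes to push n_requests through an RPM cap."""
    return math.ceil(n_requests / rpm)
```

For 100K documents, this gives 100 minutes at 1,000 RPM and 500 minutes at 200 RPM.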

Token per Minute Limits

TPM Impact on Throughput

TPM limits often hit before RPM in token-heavy workloads.

Example calculations (assuming 2,000 input tokens + 200 output tokens per request):

OpenAI Free (40K TPM):

  • Requests accommodated per minute: 40,000 / 2,200 = 18 requests/minute
  • At 3 RPM limit, TPM actually provides spare capacity
  • Bottleneck: RPM (3) not TPM (40K)

Anthropic Paid (400K TPM):

  • Requests accommodated per minute: 400,000 / 2,200 = 181 requests/minute
  • At 1,000 RPM limit (standard paid tier), TPM provides spare capacity
  • Bottleneck: RPM (1K) not TPM (400K)

Groq Free (14.4K TPM):

  • Requests accommodated per minute: 14,400 / 2,200 = 6.5 requests/minute
  • Groq's unlimited RPM irrelevant; TPM is actual constraint
  • TPM bottleneck prevents leveraging unlimited RPM
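These bottleneck calculations generalize: the achievable request rate is the smaller of the RPM cap and the TPM cap divided by tokens per request. A small helper (function name is illustrative; results are whole requests, so Groq's 6.5 rounds down to 6):

```python
def effective_rpm(rpm_limit, tpm_limit, tokens_per_request):
    """Requests/minute actually achievable under both caps.
    Pass None for an unlimited RPM cap (e.g. Groq)."""
    tpm_bound = tpm_limit // tokens_per_request  # whole requests the TPM cap allows
    return tpm_bound if rpm_limit is None else min(rpm_limit, tpm_bound)
```

Whichever cap produces the smaller number is the one to provision around.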

Handling Rate Limits

Exponential Backoff Strategy

Standard approach for handling 429 responses:

import anthropic
import random
import time

client = anthropic.Anthropic(api_key="YOUR_KEY")

def get_message_with_backoff(prompt, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-opus-4-6-20260901",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            return response
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: 1s, 2s, 4s, ... plus up to 0.5s
            wait_time = (2 ** attempt) + (random.random() * 0.5)
            print(f"Rate limited. Waiting {wait_time:.1f}s")
            time.sleep(wait_time)

Standard backoff: 2^attempt seconds plus random jitter. Prevents thundering herd during provider recovery.

Request Queuing

For batch workloads, queue requests and process at rate limit:

from queue import Queue
import threading
import time

request_queue = Queue()
results = []
min_interval = 60 / 1000  # per-worker throttle: 1,000 RPM = ~0.06s between requests

def worker():
    last_request_time = 0.0
    while True:
        request = request_queue.get()
        if request is None:  # sentinel: shut this worker down
            request_queue.task_done()
            break

        # Space this worker's requests at least min_interval apart
        elapsed = time.time() - last_request_time
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)

        result = api_call(request)  # api_call: provider request wrapper, defined elsewhere
        results.append(result)
        last_request_time = time.time()
        request_queue.task_done()

for doc in documents:
    request_queue.put(doc)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for _ in threads:            # one sentinel per worker so every thread exits
    request_queue.put(None)
for t in threads:
    t.join()

Four parallel workers, each throttled to 1,000 RPM, yield a combined rate of up to 4,000 RPM. Throughput scales linearly with worker count, so size the pool to stay under the provider's limit.
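One caveat with per-worker throttling is that the combined rate is workers × per-worker rate. To hold the whole pool under a single RPM cap, throttling can go through a shared limiter that hands out evenly spaced send slots — a minimal sketch (class name is illustrative):

```python
import threading
import time

class RateLimiter:
    """Hands out evenly spaced send slots so N threads share one RPM budget."""
    def __init__(self, rpm):
        self.interval = 60.0 / rpm        # seconds between consecutive requests
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def acquire(self):
        """Block until this thread's reserved slot arrives."""
        with self.lock:
            now = time.monotonic()
            slot = max(self.next_slot, now)       # earliest free slot
            self.next_slot = slot + self.interval  # reserve the one after it
        time.sleep(max(0.0, slot - now))
```

Workers call acquire() before each API request; the pool's combined rate then stays at the configured RPM regardless of worker count.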

Batch API Usage

Most providers offer batch APIs with higher effective throughput:

  • OpenAI Batch API: 50% cost reduction, 24-hour processing window
  • Anthropic Batch API: 50% cost reduction, 24-hour processing window
  • Cohere Batch API: 20% cost reduction, 4-hour processing window

Batch APIs remove per-request overhead, allowing higher throughput and lower cost simultaneously.
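OpenAI's Batch API, for example, takes a JSONL file with one request object per line. A sketch of preparing that input (the line layout follows the published format; the model name and function name here are illustrative, so verify against current docs):

```python
import json

def build_batch_file(prompts, path, model="gpt-4-turbo"):
    """Write one JSONL line per prompt in the OpenAI Batch API input format."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            line = {
                "custom_id": f"request-{i}",   # echoed back to match results to inputs
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(line) + "\n")
```

The file is then uploaded and referenced when creating the batch job; results arrive as a matching JSONL file keyed by custom_id.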

FAQ

Q: Which provider has highest rate limits? OpenAI Tier 5 offers the highest TPM (30M TPM), though it requires significant spend history. Anthropic's standard paid tier provides 400K TPM, scaling to 4M+ at enterprise level. Groq (unlimited RPM, 144K TPM paid) is best for high-frequency low-token requests.

Q: Do rate limits reset hourly or daily? Neither: all providers use sliding 60-second windows, so limits reset continuously rather than at hour or day boundaries. A 1,000 RPM limit means 1,000 requests allowed in any 60-second window.

Q: What happens when I exceed rate limits? API returns 429 HTTP status (Too Many Requests). Response includes retry-after header indicating wait duration. Requests made during cooldown fail immediately with same 429 error.

Q: Can I request higher rate limits? Yes. All providers support limit increases for committed spend or on-demand requests. OpenAI increases automatically as account spend increases. Anthropic and Cohere allow explicit limit requests via API dashboard.

Q: Should I use batch APIs for production inference? No. Batch APIs require 4-24 hour latency. Suitable for non-time-sensitive workloads (daily digests, weekly reports, offline analytics). Real-time inference requires standard APIs despite higher cost.

Q: How do I calculate TPM requirements? Estimate maximum tokens per request (input + output) and maximum requests per minute. Multiply together: (avg_input_tokens + avg_output_tokens) * max_requests_per_minute = required TPM.
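The same estimate as code (function name is illustrative):

```python
def required_tpm(avg_input_tokens, avg_output_tokens, max_requests_per_minute):
    """Minimum TPM so token throughput never throttles before the RPM cap."""
    return (avg_input_tokens + avg_output_tokens) * max_requests_per_minute
```

At 2,000 input + 200 output tokens and 100 requests/minute, that is 220K TPM — comfortably inside every paid tier above except Groq's.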

Sources

  • OpenAI API Rate Limits Documentation
  • Anthropic API Documentation
  • Cohere API Reference
  • Groq API Documentation
  • Together AI API Guidelines
  • Industry LLM API Comparison Report (March 2026)