Contents
- Rate Limit Overview
- Provider Comparisons
- Free vs Paid Tiers
- Request per Minute Limits
- Token per Minute Limits
- Handling Rate Limits
- FAQ
- Related Resources
- Sources
Rate Limit Overview
Rate limits cap how fast you can call an API. There are two types:
Requests Per Minute (RPM): Max calls per 60 seconds. Matters for high-frequency, low-token work (classification, embedding).
Tokens Per Minute (TPM): Max token throughput. Matters for high-volume work (batch, parallelization).
Providers enforce both limits at once; exceeding either one throttles your requests.
Provider Comparisons
OpenAI GPT-4 Turbo
Free Trial:
- RPM: 3 requests/minute
- TPM: 40,000 tokens/minute
- Monthly limit: $5 credit expires after 3 months
Paid (Pay-as-you-go, Tier 1–4):
- RPM: 3,500 requests/minute
- TPM: 2,000,000 tokens/minute
- Soft limits increase automatically with account spend history
Tier 5 (highest spend tier):
- RPM: 10,000+ requests/minute
- TPM: 30,000,000 tokens/minute
- Requires $1,000+ cumulative spend
These limits accommodate continuous inference at scale. Sufficient for all but the largest production deployments.
Exceeding either limit triggers an HTTP 429 status (rate limited). The `Retry-After` header indicates how long to wait before retrying. OpenAI recommends exponential backoff.
Anthropic Claude 3
Free Trial:
- RPM: 5 requests/minute
- TPM: 40,000 tokens/minute
- No expiration but account reviewed monthly
Paid (Pay-as-you-go):
- RPM: 1,000 requests/minute
- TPM: 400,000 tokens/minute
Build tier (higher spend):
- RPM: 4,000 requests/minute
- TPM: 400,000 tokens/minute
Enterprise:
- Custom limits based on commitment
- Typical: 50,000+ RPM, 4,000,000+ TPM
Anthropic's limits reflect focus on long-context applications. Limits increase by tier as cumulative spend grows.
Cohere Command R+
Free Trial:
- RPM: 50 requests/minute
- TPM: 100,000 tokens/minute
- $500 free credit
Paid:
- RPM: 1,000 requests/minute
- TPM: 500,000 tokens/minute
Organization-level limits (all users):
- RPM: 10,000 requests/minute
- TPM: 5,000,000 tokens/minute
Cohere's generous free-tier RPM suits testing and prototyping. Its TPM limits are higher than OpenAI's at equivalent price points.
Groq API
Groq emphasizes throughput over request count limits:
Free Tier:
- RPM: Unlimited
- TPM: 14,400 tokens/minute
- No authentication required
Paid:
- RPM: Unlimited
- TPM: 144,000 tokens/minute
Groq removes RPM limits entirely. Business model relies on low per-token cost, not request throttling. Exceptional for high-frequency classification tasks.
Together AI
Designed for distributed inference:
Free Trial:
- RPM: 50 requests/minute
- TPM: 100,000 tokens/minute
Standard:
- RPM: 200 requests/minute
- TPM: 500,000 tokens/minute
Large:
- RPM: 1,000+ requests/minute
- TPM: 2,000,000+ tokens/minute
Together's RPM limits reflect focus on batch processing rather than real-time inference.
Free vs Paid Tiers
Rate Limit Strategy Matrix
| Provider | Free RPM | Free TPM | Paid RPM | Paid TPM | Max Tier TPM |
|---|---|---|---|---|---|
| OpenAI | 3 | 40K | 3,500 | 2M | 30M (Tier 5) |
| Anthropic | 5 | 40K | 1,000 | 400K | 4M+ (Enterprise) |
| Cohere | 50 | 100K | 1,000 | 500K | 5M (org-level) |
| Groq | Unlimited | 14.4K | Unlimited | 144K | Custom |
| Together | 50 | 100K | 200 | 500K | 2M+ (Large) |
OpenAI and Anthropic maintain strict free tier limits. Cohere and Together more permissive for testing. Groq trades off TPM for unlimited RPM.
Free-tier upgrades are typically automatic once the trial expires and a payment method is added; no approval is required.
Request per Minute Limits
RPM Impact on Real-Time Applications
OpenAI's 3 RPM free tier barely handles a single user: three simultaneous API calls would hit the limit immediately.
Paid tiers (3,500+ RPM) enable:
- 100+ concurrent users with single request each
- Real-time chatbot serving small user base
- Batch processing at roughly 58 requests/second
RPM becomes the bottleneck for high-frequency classification workloads. Example: processing 100K documents through a moderation API (one request each) takes 100 minutes at a 1,000 RPM limit (Cohere's paid tier), but 500 minutes at Together's standard 200 RPM tier.
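The arithmetic above can be captured in a one-line estimator (the function name is illustrative):

```python
def batch_minutes(num_requests, rpm_limit):
    """Minutes needed to drain a batch when RPM is the binding constraint."""
    return num_requests / rpm_limit

# 100K single-request documents at a 1,000 RPM cap:
print(batch_minutes(100_000, 1_000))  # 100.0
```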
Token per Minute Limits
TPM Impact on Throughput
TPM limits are often hit before RPM limits in token-heavy workloads.
Example calculations (assuming 2,000 input tokens + 200 output tokens per request):
OpenAI Free (40K TPM):
- Requests accommodated per minute: 40,000 / 2,200 ≈ 18 requests/minute
- At 3 RPM limit, TPM actually provides spare capacity
- Bottleneck: RPM (3) not TPM (40K)
Anthropic Paid (400K TPM):
- Requests accommodated per minute: 400,000 / 2,200 ≈ 181 requests/minute
- The 1,000 RPM limit (standard paid tier) is never reached at this token load
- Bottleneck: TPM (400K), not RPM (1K)
Groq Free (14.4K TPM):
- Requests accommodated per minute: 14,400 / 2,200 ≈ 6.5 requests/minute
- Groq's unlimited RPM irrelevant; TPM is actual constraint
- TPM bottleneck prevents leveraging unlimited RPM
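The pattern in these three examples is the same: the effective request rate is whichever limit binds first. A small sketch (names are illustrative; `None` models an unlimited RPM like Groq's):

```python
import math

def effective_rpm(rpm_limit, tpm_limit, tokens_per_request):
    """Requests/minute actually achievable under both limits."""
    tpm_bound = tpm_limit / tokens_per_request
    if rpm_limit is None:  # e.g. Groq: no RPM cap
        return tpm_bound
    return min(rpm_limit, tpm_bound)

# OpenAI free tier: RPM binds (3 < 40,000/2,200 ≈ 18)
print(effective_rpm(3, 40_000, 2_200))  # 3
# Groq free tier: TPM binds despite unlimited RPM
print(math.floor(effective_rpm(None, 14_400, 2_200)))  # 6
```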
Handling Rate Limits
Exponential Backoff Strategy
Standard approach for handling 429 responses:
```python
import random
import time

import anthropic

client = anthropic.Anthropic(api_key="YOUR_KEY")

def get_message_with_backoff(prompt, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-opus-4-6-20260901",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return response
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential delay (1s, 2s, 4s, ...) plus up to 0.5s of jitter
            wait_time = (2 ** attempt) + (random.random() * 0.5)
            print(f"Rate limited. Waiting {wait_time:.1f}s")
            time.sleep(wait_time)
```
Standard backoff: 2^attempt seconds plus random jitter. Prevents thundering herd during provider recovery.
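When the provider supplies a `Retry-After` value on a 429 response, it can replace the computed delay. A minimal helper (the function name and fallback are illustrative, and it assumes the header carries a delay in seconds, the common provider convention):

```python
def parse_retry_after(header_value, default_seconds=1.0):
    """Return seconds to wait from a Retry-After header value.

    Falls back to a default when the header is missing or malformed.
    """
    if header_value is None:
        return default_seconds
    try:
        return max(float(header_value), 0.0)
    except ValueError:
        return default_seconds

# After a 429 response:
# wait = parse_retry_after(response.headers.get("retry-after"))
# time.sleep(wait)
```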
Request Queuing
For batch workloads, queue requests and process at rate limit:
```python
import threading
import time
from queue import Queue

request_queue = Queue()
results = []
min_interval = 60 / 1000  # 1,000 RPM per worker = one request every ~0.06s

def worker():
    last_request_time = 0.0
    while True:
        request = request_queue.get()
        if request is None:  # Sentinel value: stop this worker
            request_queue.task_done()
            break
        # Throttle: wait until this worker's per-request interval has passed
        elapsed = time.time() - last_request_time
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
        results.append(api_call(request))  # api_call: your provider request
        last_request_time = time.time()
        request_queue.task_done()

for doc in documents:
    request_queue.put(doc)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for _ in threads:
    request_queue.put(None)  # One sentinel per worker so all of them exit
for t in threads:
    t.join()
```
Each worker throttles itself to one request every ~0.06s (about 1,000 RPM), so four workers yield roughly 4,000 RPM in aggregate. Throughput scales linearly with worker count; size the interval and worker count so the combined rate stays under your account limit.
Batch API Usage
Most providers offer batch APIs with higher effective throughput:
- OpenAI Batch API: 50% cost reduction, 24-hour processing window
- Anthropic Batch API: 50% cost reduction, 24-hour processing window
- Cohere Batch API: 20% cost reduction, 4-hour processing window
Batch APIs remove per-request overhead, allowing higher throughput and lower cost simultaneously.
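As a concrete sketch, OpenAI's Batch API consumes a JSONL file in which each line is one self-contained request, with a `custom_id` to correlate results back to inputs. The helper below only builds that file (the model name and IDs are placeholders; the upload and batch-creation SDK calls are omitted):

```python
import json

def build_batch_lines(prompts, model="gpt-4-turbo"):
    """Build one JSONL line per prompt in OpenAI Batch API request format."""
    lines = []
    for i, prompt in enumerate(prompts):
        entry = {
            "custom_id": f"request-{i}",      # Correlates output to input
            "method": "POST",
            "url": "/v1/chat/completions",    # Target endpoint per request
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(entry))
    return "\n".join(lines)

# Write the result to a .jsonl file, upload it with purpose="batch",
# then create the batch with completion_window="24h" via the SDK.
```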
FAQ
Q: Which provider has highest rate limits? OpenAI Tier 5 offers the highest TPM (30M TPM), though it requires significant spend history. Anthropic's standard paid tier provides 400K TPM, scaling to 4M+ at enterprise level. Groq (unlimited RPM, 144K TPM paid) is best for high-frequency low-token requests.
Q: Do rate limits reset hourly or daily? Neither: the providers here use per-minute sliding windows, so limits reset continuously rather than at fixed intervals. A 1,000 RPM limit means at most 1,000 requests in any 60-second window.
Q: What happens when I exceed rate limits? The API returns HTTP 429 (Too Many Requests). The response includes a `Retry-After` header indicating how long to wait before retrying. Requests made during the cooldown fail immediately with the same 429 error.
Q: Can I request higher rate limits? Yes. All providers support limit increases for committed spend or on-demand requests. OpenAI increases automatically as account spend increases. Anthropic and Cohere allow explicit limit requests via API dashboard.
Q: Should I use batch APIs for production inference? No. Batch APIs require 4-24 hour latency. Suitable for non-time-sensitive workloads (daily digests, weekly reports, offline analytics). Real-time inference requires standard APIs despite higher cost.
Q: How do I calculate TPM requirements? Estimate maximum tokens per request (input + output) and maximum requests per minute. Multiply together: (avg_input_tokens + avg_output_tokens) * max_requests_per_minute = required TPM.
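A worked example of that formula (the numbers are illustrative):

```python
def required_tpm(avg_input_tokens, avg_output_tokens, max_rpm):
    """Required tokens-per-minute headroom for a target request rate."""
    return (avg_input_tokens + avg_output_tokens) * max_rpm

# A chatbot averaging 2,000 input + 200 output tokens at 100 requests/minute:
print(required_tpm(2_000, 200, 100))  # 220000 -> fits a 400K TPM paid tier
```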
Related Resources
- OpenAI API Pricing
- Anthropic Claude API Pricing
- Cohere API Pricing
- Groq API Pricing
- Complete LLM API Pricing Guide
- LLM API Gateway and Router Tools
Sources
- OpenAI API Rate Limits Documentation
- Anthropic API Documentation
- Cohere API Reference
- Groq API Documentation
- Together AI API Guidelines
- Industry LLM API Comparison Report (March 2026)