Contents
- Rate Limit Overview
- Provider Comparisons
- Free vs Paid Tiers
- Request per Minute Limits
- Token per Minute Limits
- Handling Rate Limits
- FAQ
- Related Resources
- Sources
Rate Limit Overview
Rate limits cap how fast you can call an API. There are two types:
Requests Per Minute (RPM): Max calls per 60 seconds. Matters for high-frequency, low-token work (classification, embedding).
Tokens Per Minute (TPM): Max token throughput. Matters for high-volume work (batch, parallelization).
Providers enforce both limits at once; exceeding either one throttles your requests.
Provider Comparisons
OpenAI GPT-4 Turbo
Free Trial:
- RPM: 3 requests/minute
- TPM: 40,000 tokens/minute
- Monthly limit: $5 credit expires after 3 months
Paid (Pay-as-you-go, Tier 1–4):
- RPM: 3,500 requests/minute
- TPM: 2,000,000 tokens/minute
- Soft limits increase automatically with account spend history
Tier 5 (highest spend tier):
- RPM: 10,000+ requests/minute
- TPM: 30,000,000 tokens/minute
- Requires $1,000+ cumulative spend
These limits accommodate continuous inference at scale. Sufficient for all but the largest production deployments.
Exceeding either limit triggers an HTTP 429 status (rate limited). The `Retry-After` header indicates how long to wait before retrying. OpenAI recommends exponential backoff.
Anthropic Claude 3
Free Trial:
- RPM: 5 requests/minute
- TPM: 40,000 tokens/minute
- No expiration but account reviewed monthly
Paid (Pay-as-you-go):
- RPM: 1,000 requests/minute
- TPM: 400,000 tokens/minute
Build tier (higher spend):
- RPM: 4,000 requests/minute
- TPM: 400,000 tokens/minute
Enterprise:
- Custom limits based on commitment
- Typical: 50,000+ RPM, 4,000,000+ TPM
Anthropic's limits reflect focus on long-context applications. Limits increase by tier as cumulative spend grows.
Cohere Command R+
Free Trial:
- RPM: 50 requests/minute
- TPM: 100,000 tokens/minute
- $500 free credit
Paid:
- RPM: 1,000 requests/minute
- TPM: 500,000 tokens/minute
Organization-level limits (all users):
- RPM: 10,000 requests/minute
- TPM: 5,000,000 tokens/minute
Cohere's generous free-tier RPM suits testing and prototyping. Its TPM limits are higher than OpenAI's at equivalent price points.
Groq API
Groq emphasizes throughput over request count limits:
Free Tier:
- RPM: Unlimited
- TPM: 14,400 tokens/minute
- No authentication required
Paid:
- RPM: Unlimited
- TPM: 144,000 tokens/minute
Groq removes RPM limits entirely. Business model relies on low per-token cost, not request throttling. Exceptional for high-frequency classification tasks.
Together AI
Designed for distributed inference:
Free Trial:
- RPM: 50 requests/minute
- TPM: 100,000 tokens/minute
Standard:
- RPM: 200 requests/minute
- TPM: 500,000 tokens/minute
Large:
- RPM: 1,000+ requests/minute
- TPM: 2,000,000+ tokens/minute
Together's RPM limits reflect focus on batch processing rather than real-time inference.
Free vs Paid Tiers
Rate Limit Strategy Matrix
| Provider | Free RPM | Free TPM | Paid RPM | Paid TPM | Max Tier TPM |
|---|---|---|---|---|---|
| OpenAI | 3 | 40K | 3,500 | 2M | 30M (Tier 5) |
| Anthropic | 5 | 40K | 1,000 | 400K | 4M+ (Enterprise) |
| Cohere | 50 | 100K | 1,000 | 500K | 5M (org-level) |
| Groq | Unlimited | 14.4K | Unlimited | 144K | Custom |
| Together | 50 | 100K | 200 | 500K | 2M+ (Large) |
OpenAI and Anthropic maintain strict free tier limits. Cohere and Together more permissive for testing. Groq trades off TPM for unlimited RPM.
Free-tier upgrades are typically automatic once the trial expires and a payment method is added; no approval is required.
Request per Minute Limits
RPM Impact on Real-Time Applications
OpenAI's 3 RPM free tier barely handles a single user: three simultaneous API calls would hit the limit immediately.
Paid tiers (3,500+ RPM) enable:
- 100+ concurrent users with single request each
- Real-time chatbot serving small user base
- Batch processing at roughly 58 requests/second
RPM becomes the bottleneck for high-frequency classification workloads. Example: processing 100K documents through a moderation API (one request each) takes 100 minutes at a 1,000 RPM limit (Cohere's paid tier), but 500 minutes at Together's standard 200 RPM tier.
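The arithmetic above can be captured in a one-line estimator (the function name is illustrative):

```python
def batch_minutes(num_requests, rpm_limit):
    """Minutes needed to drain a batch when RPM is the binding constraint."""
    return num_requests / rpm_limit

# 100K single-request documents at a 1,000 RPM cap:
print(batch_minutes(100_000, 1_000))  # 100.0
```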
Token per Minute Limits
TPM Impact on Throughput
TPM limits are often hit before RPM limits in token-heavy workloads.
Example calculations (assuming 2,000 input tokens + 200 output tokens per request):
OpenAI Free (40K TPM):
- Requests accommodated per minute: 40,000 / 2,200 ≈ 18 requests/minute
- At 3 RPM limit, TPM actually provides spare capacity
- Bottleneck: RPM (3) not TPM (40K)
Anthropic Paid (400K TPM):
- Requests accommodated per minute: 400,000 / 2,200 ≈ 181 requests/minute
- The 1,000 RPM limit (standard paid tier) is never reached at this token load
- Bottleneck: TPM (400K), not RPM (1K)
Groq Free (14.4K TPM):
- Requests accommodated per minute: 14,400 / 2,200 ≈ 6.5 requests/minute
- Groq's unlimited RPM irrelevant; TPM is actual constraint
- TPM bottleneck prevents leveraging unlimited RPM
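The pattern in these three examples is the same: the effective request rate is whichever limit binds first. A small sketch (names are illustrative; `None` models an unlimited RPM like Groq's):

```python
import math

def effective_rpm(rpm_limit, tpm_limit, tokens_per_request):
    """Requests/minute actually achievable under both limits."""
    tpm_bound = tpm_limit / tokens_per_request
    if rpm_limit is None:  # e.g. Groq: no RPM cap
        return tpm_bound
    return min(rpm_limit, tpm_bound)

# OpenAI free tier: RPM binds (3 < 40,000/2,200 ≈ 18)
print(effective_rpm(3, 40_000, 2_200))  # 3
# Groq free tier: TPM binds despite unlimited RPM
print(math.floor(effective_rpm(None, 14_400, 2_200)))  # 6
```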
Handling Rate Limits
Exponential Backoff Strategy
Standard approach for handling 429 responses:
```python
import random
import time

import anthropic

client = anthropic.Anthropic(api_key="YOUR_KEY")

def get_message_with_backoff(prompt, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-opus-4-6-20260901",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return response
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential delay (1s, 2s, 4s, ...) plus up to 0.5s of jitter
            wait_time = (2 ** attempt) + (random.random() * 0.5)
            print(f"Rate limited. Waiting {wait_time:.1f}s")
            time.sleep(wait_time)
```
Standard backoff: 2^attempt seconds plus random jitter. Prevents thundering herd during provider recovery.
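When the provider supplies a `Retry-After` value on a 429 response, it can replace the computed delay. A minimal helper (the function name and fallback are illustrative, and it assumes the header carries a delay in seconds, the common provider convention):

```python
def parse_retry_after(header_value, default_seconds=1.0):
    """Return seconds to wait from a Retry-After header value.

    Falls back to a default when the header is missing or malformed.
    """
    if header_value is None:
        return default_seconds
    try:
        return max(float(header_value), 0.0)
    except ValueError:
        return default_seconds

# After a 429 response:
# wait = parse_retry_after(response.headers.get("retry-after"))
# time.sleep(wait)
```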
Request Queuing
For batch workloads, queue requests and process at rate limit:
```python
import threading
import time
from queue import Queue

request_queue = Queue()
results = []
min_interval = 60 / 1000  # 1,000 RPM per worker = one request every ~0.06s

def worker():
    last_request_time = 0.0
    while True:
        request = request_queue.get()
        if request is None:  # Sentinel value: stop this worker
            request_queue.task_done()
            break
        # Throttle: wait until this worker's per-request interval has passed
        elapsed = time.time() - last_request_time
        if elapsed < min_interval:
            time.sleep(min_interval - elapsed)
        results.append(api_call(request))  # api_call: your provider request
        last_request_time = time.time()
        request_queue.task_done()

for doc in documents:
    request_queue.put(doc)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for _ in threads:
    request_queue.put(None)  # One sentinel per worker so all of them exit
for t in threads:
    t.join()
```
Each worker throttles itself to one request every ~0.06s (about 1,000 RPM), so four workers yield roughly 4,000 RPM in aggregate. Throughput scales linearly with worker count; size the interval and worker count so the combined rate stays under your account limit.
Batch API Usage
Most providers offer batch APIs with higher effective throughput:
- OpenAI Batch API: 50% cost reduction, 24-hour processing window
- Anthropic Batch API: 50% cost reduction, 24-hour processing window
- Cohere Batch API: 20% cost reduction, 4-hour processing window
Batch APIs remove per-request overhead, allowing higher throughput and lower cost simultaneously.
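As a concrete sketch, OpenAI's Batch API consumes a JSONL file in which each line is one self-contained request, with a `custom_id` to correlate results back to inputs. The helper below only builds that file (the model name and IDs are placeholders; the upload and batch-creation SDK calls are omitted):

```python
import json

def build_batch_lines(prompts, model="gpt-4-turbo"):
    """Build one JSONL line per prompt in OpenAI Batch API request format."""
    lines = []
    for i, prompt in enumerate(prompts):
        entry = {
            "custom_id": f"request-{i}",      # Correlates output to input
            "method": "POST",
            "url": "/v1/chat/completions",    # Target endpoint per request
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(entry))
    return "\n".join(lines)

# Write the result to a .jsonl file, upload it with purpose="batch",
# then create the batch with completion_window="24h" via the SDK.
```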
FAQ
Q: Which provider has highest rate limits? OpenAI Tier 5 offers the highest TPM (30M TPM), though it requires significant spend history. Anthropic's standard paid tier provides 400K TPM, scaling to 4M+ at enterprise level. Groq (unlimited RPM, 144K TPM paid) is best for high-frequency low-token requests.
Q: Do rate limits reset hourly or daily? Neither: the providers here use per-minute sliding windows, so limits reset continuously rather than at fixed intervals. A 1,000 RPM limit means at most 1,000 requests in any 60-second window.
Q: What happens when I exceed rate limits? The API returns HTTP 429 (Too Many Requests). The response includes a `Retry-After` header indicating how long to wait before retrying. Requests made during the cooldown fail immediately with the same 429 error.
Q: Can I request higher rate limits? Yes. All providers support limit increases for committed spend or on-demand requests. OpenAI increases automatically as account spend increases. Anthropic and Cohere allow explicit limit requests via API dashboard.
Q: Should I use batch APIs for production inference? No. Batch APIs require 4-24 hour latency. Suitable for non-time-sensitive workloads (daily digests, weekly reports, offline analytics). Real-time inference requires standard APIs despite higher cost.
Q: How do I calculate TPM requirements? Estimate maximum tokens per request (input + output) and maximum requests per minute. Multiply together: (avg_input_tokens + avg_output_tokens) * max_requests_per_minute = required TPM.
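A worked example of that formula (the numbers are illustrative):

```python
def required_tpm(avg_input_tokens, avg_output_tokens, max_rpm):
    """Required tokens-per-minute headroom for a target request rate."""
    return (avg_input_tokens + avg_output_tokens) * max_rpm

# A chatbot averaging 2,000 input + 200 output tokens at 100 requests/minute:
print(required_tpm(2_000, 200, 100))  # 220000 -> fits a 400K TPM paid tier
```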
Related Resources
- OpenAI API Pricing
- Anthropic Claude API Pricing
- Cohere API Pricing
- Groq API Pricing
- Complete LLM API Pricing Guide
- LLM API Gateway and Router Tools
Sources
- OpenAI API Rate Limits Documentation
- Anthropic API Documentation
- Cohere API Reference
- Groq API Documentation
- Together AI API Guidelines
- Industry LLM API Comparison Report (March 2026)