Cohere API Pricing 2026: Production LLM Costs

Deploybase · July 21, 2025 · LLM Pricing

Cohere API Pricing: Overview

Cohere charges separately for generation (Command R family), embeddings (Embed v3), and reranking (Rerank), so costs break down by model and use case.

This guide covers pricing structure, real production examples, and comparisons to OpenAI/Anthropic.

Cohere Model Lineup

Generation Models:

Command R: Production workhorse. 128K context. Good for search and retrieval.

Command R+: Premium version. Better reasoning. Same 128K context.

Legacy Command models: Older generation available for backward compatibility, but deprecated in favor of R and R+.

Embedding Models:

Embed v3 (Small, Base, Large): Three variants spanning a cost/quality tradeoff. Pick the tier your retrieval quality requires.

Embed 2: Prior generation, still available, lower cost than v3.

Reranking Models:

Rerank 3: Sorts search results. Cuts down LLM calls in RAG by pre-filtering documents.

Rerank 2: Earlier version, lower cost.

Pricing Breakdown by Model

Command R Generation Pricing (2026):

Input tokens: $0.15 per 1M tokens
Output tokens: $0.60 per 1M tokens
Context window: 128K tokens

Example calculation for single query:

  • Input: 2,000 tokens (query + retrieved documents)
  • Output: 500 tokens (generated response)
  • Cost per call: (2000 / 1,000,000 × $0.15) + (500 / 1,000,000 × $0.60) = $0.00060 per call
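The per-call arithmetic above generalizes to a small helper (a sketch; the rates shown are the Command R figures from this guide):

```python
def generation_cost(input_tokens: int, output_tokens: int,
                    input_rate: float, output_rate: float) -> float:
    """Cost in USD for one generation call; rates are USD per 1M tokens."""
    return (input_tokens / 1_000_000) * input_rate \
         + (output_tokens / 1_000_000) * output_rate

# Command R example from above: 2,000 input + 500 output tokens
print(round(generation_cost(2_000, 500, input_rate=0.15, output_rate=0.60), 6))  # 0.0006
```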

Command R+ Generation Pricing (2026):

Input tokens: $2.50 per 1M tokens
Output tokens: $10.00 per 1M tokens
Context window: 128K tokens

Example calculation:

  • Input: 2,000 tokens
  • Output: 500 tokens
  • Cost per call: (2000 / 1,000,000 × $2.50) + (500 / 1,000,000 × $10.00) = $0.01000 per call

At these rates, a Command R+ call costs roughly 17x the equivalent Command R call; the premium is worth it only when the extra reasoning quality matters.

Embed v3 Pricing (2026):

Embed v3 Small: $0.02 per 1M tokens
Embed v3 Base: $0.10 per 1M tokens
Embed v3 Large: $0.30 per 1M tokens

For 1,000-token document batch:

  • Small: 1,000 / 1,000,000 × $0.02 = $0.00002 (negligible)
  • Base: 1,000 / 1,000,000 × $0.10 = $0.0001
  • Large: 1,000 / 1,000,000 × $0.30 = $0.0003

Example RAG indexing: 10,000 documents at 500 tokens each (5M tokens total)

  • Using Embed v3 Small: 5M / 1M × $0.02 = $0.10 (one-time cost)
  • Using Embed v3 Base: 5M / 1M × $0.10 = $0.50 (one-time cost)
  • Using Embed v3 Large: 5M / 1M × $0.30 = $1.50 (one-time cost)
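The one-time indexing cost above can be sketched the same way (rates are the Embed v3 figures from this guide):

```python
def indexing_cost(num_docs: int, tokens_per_doc: int, rate_per_million: float) -> float:
    """One-time embedding cost in USD for indexing a corpus."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * rate_per_million

# 10,000 documents at 500 tokens each, per Embed v3 tier
for name, rate in [("Small", 0.02), ("Base", 0.10), ("Large", 0.30)]:
    print(f"Embed v3 {name}: ${indexing_cost(10_000, 500, rate):.2f}")
```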

Embedding costs are negligible at this scale. Pick the quality tier your application needs.

Rerank 3 Pricing (2026):

Rerank 3: $3.00 per 1M API calls

Per-call cost breakdown:

  • Single rerank request: $3.00 / 1,000,000 = $0.000003 per call
  • Reranking 10 candidate documents: $0.00003
  • Reranking 100 candidate documents: $0.0003

Dirt cheap: you pay per call, not per token.

Production Cost Examples

Scenario 1: Customer Support Chatbot

Assumptions:

  • 1,000 queries/day
  • Each query: 2,000 input tokens (chat history + retrieved docs)
  • Average response: 500 tokens
  • Run 30 days

Using Command R:

  • Daily input tokens: 1,000 × 2,000 = 2M tokens
  • Daily output tokens: 1,000 × 500 = 500K tokens
  • Daily cost: (2M / 1M × $0.15) + (500K / 1M × $0.60) = $0.60
  • Monthly cost: $0.60 × 30 = $18.00
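The monthly math above can be checked with a short helper (a sketch using the Command R rates from this guide):

```python
def monthly_generation_cost(queries_per_day: int, in_tokens: int, out_tokens: int,
                            in_rate: float, out_rate: float, days: int = 30) -> float:
    """Monthly generation spend in USD; rates are USD per 1M tokens."""
    daily = (queries_per_day * in_tokens / 1_000_000) * in_rate \
          + (queries_per_day * out_tokens / 1_000_000) * out_rate
    return daily * days

# Scenario 1: 1,000 queries/day, 2,000 in + 500 out tokens, Command R
print(round(monthly_generation_cost(1_000, 2_000, 500, 0.15, 0.60), 2))  # 18.0
```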

Budget impact: Negligible. Even scaling to 10,000 queries/day costs $180/month for generation.

Scenario 2: Document Search Platform (RAG)

Assumptions:

  • 100K documents in knowledge base (1,000 tokens average)
  • 500 queries/day
  • Each query: retrieve 20 documents → rerank → generate answer
  • Run 30 days

Costs:

Initial indexing (one-time):

  • 100K documents × 1,000 tokens = 100M tokens
  • Embed v3 Small: 100M / 1M × $0.02 = $2.00

Monthly query processing:

  • Reranking: 500 queries/day × 30 days × 20 documents × $0.000003/call = $0.90
  • Generation (Command R): 500 queries/day × 30 days, input 5,000 tokens, output 800 tokens
    • Input: (500 × 30 × 5,000) / 1,000,000 × $0.15 = $11.25
    • Output: (500 × 30 × 800) / 1,000,000 × $0.60 = $7.20
    • Generation subtotal: $18.45

Monthly total: $18.45 + $0.90 = $19.35

Scaling to 2,000 queries/day: $77.40/month
Scaling to 10,000 queries/day: $387.00/month
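Putting the pieces together, the scenario's monthly query-processing total can be sketched as (rates from this guide's tables):

```python
RERANK_RATE = 3.00 / 1_000_000    # USD per reranked document (per this guide)
CMD_R_IN, CMD_R_OUT = 0.15, 0.60  # USD per 1M tokens

def rag_monthly_cost(queries_per_day: int, docs_reranked: int,
                     in_tokens: int, out_tokens: int, days: int = 30) -> float:
    """Monthly rerank + generation spend in USD for a RAG pipeline."""
    calls = queries_per_day * days
    rerank = calls * docs_reranked * RERANK_RATE
    generation = calls * in_tokens / 1e6 * CMD_R_IN + calls * out_tokens / 1e6 * CMD_R_OUT
    return rerank + generation

# Scenario 2: 500 queries/day, 20 docs reranked, 5,000 in + 800 out tokens
print(round(rag_monthly_cost(500, 20, 5_000, 800), 2))
```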

Scenario 3: Production Content Classification

Assumptions:

  • 50,000 documents/month requiring classification
  • Each document: 2,000 tokens average
  • Use Embed v3 Base for semantic classification
  • No generation, only embeddings

Monthly cost:

  • 50,000 documents × 2,000 tokens = 100M tokens
  • Cost: 100M / 1M × $0.10 = $10.00/month

Classification is extremely cost-effective. Scaling to millions of documents remains affordable.

Scenario 4: Premium Reasoning Workload

Assumptions:

  • Research analysis platform
  • 100 complex queries/month
  • Each query: 10K input tokens (research documents + context)
  • Average output: 2,000 tokens
  • Use Command R+ for superior reasoning

Monthly cost:

  • Input: (100 × 10,000) / 1,000,000 × $2.50 = $2.50
  • Output: (100 × 2,000) / 1,000,000 × $10.00 = $2.00
  • Total: $4.50/month

Even premium reasoning at 100 queries/month costs barely more than standard chat.

Comparison to OpenAI and Anthropic

Text Generation Comparison:

Market rates as of early 2026. For detailed analysis, compare against OpenAI API pricing:

OpenAI GPT-5:

  • Input: $1.25 per 1M tokens
  • Output: $10.00 per 1M tokens

OpenAI GPT-4.1:

  • Input: $2.00 per 1M tokens
  • Output: $8.00 per 1M tokens

OpenAI GPT-5 Mini:

  • Input: $0.25 per 1M tokens
  • Output: $2.00 per 1M tokens

Anthropic Claude Opus 4.6:

  • Input: $5.00 per 1M tokens
  • Output: $25.00 per 1M tokens

Anthropic Claude Sonnet 4.6:

  • Input: $3.00 per 1M tokens
  • Output: $15.00 per 1M tokens

Anthropic Claude Haiku 4.5:

  • Input: $1.00 per 1M tokens
  • Output: $5.00 per 1M tokens

Cohere Command R:

  • Input: $0.15 per 1M tokens
  • Output: $0.60 per 1M tokens

Cohere Command R+:

  • Input: $2.50 per 1M tokens
  • Output: $10.00 per 1M tokens

Cost Comparison for Customer Support Chatbot:

1,000 queries/day, 30 days, average 2K input + 500 output tokens per query:

  • OpenAI GPT-5 Mini: $45.00/month
  • Anthropic Claude Haiku 4.5: $135.00/month
  • Cohere Command R: $18.00/month
  • Cohere Command R+: $300.00/month

At these rates, Command R costs roughly 2.5x less than GPT-5 Mini and 7.5x less than Claude Haiku 4.5 for high-volume, straightforward query patterns.
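As a sanity check, the same-workload comparison can be recomputed from the per-token rates listed above (a sketch; 1,000 queries/day × 30 days, 2,000 input + 500 output tokens per query):

```python
RATES = {  # USD per 1M tokens (input, output), from the tables above
    "GPT-5 Mini": (0.25, 2.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "Command R": (0.15, 0.60),
    "Command R+": (2.50, 10.00),
}

def monthly_cost(in_rate: float, out_rate: float,
                 queries: int = 1_000 * 30, in_tok: int = 2_000, out_tok: int = 500) -> float:
    """Monthly spend in USD for the chatbot workload above."""
    return queries * in_tok / 1e6 * in_rate + queries * out_tok / 1e6 * out_rate

for model, (i, o) in RATES.items():
    print(f"{model}: ${monthly_cost(i, o):.2f}/month")
```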

Embedding Comparison:

OpenAI text-embedding-3-small: $0.02 per 1M tokens
OpenAI text-embedding-3-large: $0.13 per 1M tokens
Cohere Embed v3 Small: $0.02 per 1M tokens
Cohere Embed v3 Large: $0.30 per 1M tokens

At the small tier the two are priced identically; at the large tier OpenAI is cheaper ($0.13 vs $0.30 per 1M tokens). However, Cohere embeddings integrate tightly with its generation and reranking models, reducing latency for RAG workflows.

Cohere Pricing Models Explained

Pay-as-You-Go:

The default billing model: billed monthly for actual API usage, with no minimum commitment. Best for prototyping and variable workloads.

Charges appear in monthly invoices. Cost scales precisely with usage.

Volume Discounts:

Cohere offers discount tiers for committed volumes. Contact sales for custom agreements.

Example thresholds (custom per customer):

  • $5K monthly spend: 10-15% discount
  • $20K monthly spend: 20-25% discount
  • $100K+ monthly spend: custom pricing

Discounts apply primarily to generation models; embeddings and reranking are sometimes excluded. For comparison, Anthropic also offers volume discounts at scale.

Dedicated Deployments:

For large teams requiring data residency, guaranteed uptime, or isolated capacity, Cohere offers on-premise or dedicated cloud instances.

Pricing: custom, typically $50K-500K/year depending on scale and support requirements.

Consider dedicated deployment if:

  • Monthly API spend exceeds $50K
  • Regulatory requirements mandate data residency
  • Latency-sensitive applications need guaranteed response times
  • Predictable monthly costs matter more than pay-as-you-go flexibility

Free Trial:

Cohere provides a free tier: $5 in API credits monthly for new accounts, sufficient for exploration and prototyping. Accounts upgrade to pay-as-you-go once credits are exhausted.

Cost Optimization Strategies

Strategy 1: Choose the Right Model Size

Command R sufficient for most tasks:

  • Customer support
  • Information retrieval
  • Classification
  • Summarization

Command R+ necessary only for:

  • Complex reasoning
  • Multi-step problem solving
  • Code generation requiring expertise
  • Highly specialized domains

At the rates above, switching an identical workload from R+ to R cuts per-call costs by roughly 94%; even migrating only part of a workload typically cuts total costs 50-70%.

Strategy 2: Optimize Input Tokens

Shorter prompts cost less. Techniques:

Include only relevant retrieved documents. If reranking limits to top 5 documents, pass only those. Exclude irrelevant context.

Use system prompts efficiently. One-sentence instructions cost less than verbose guidelines.

Batch queries when possible. Send 10 requests simultaneously rather than serially, if application logic permits.

Strategy 3: Implement Reranking for RAG

Retrieve 20-50 candidate documents, rerank to the top 5-10, then generate with only those. Reranking costs are negligible ($0.00003 per 10 documents); generation costs significantly more.

Impact: 5x fewer generation calls, roughly an 80% reduction in generation spend for these pipelines.

Example: Search platform with 100 queries/day:

  • Without reranking (generating over each candidate): 100 queries × 5 documents = 500 generation calls
  • With reranking (one final generation per query): 100 queries × 1 = 100 generation calls
  • Savings: 400 avoided generation calls ≈ $0.24/day at the $0.0006/call Command R example rate

Strategy 4: Embed Once, Retrieve Often

Embed documents during indexing (one-time cost). Retrieve via vector similarity search (free). Generation only when necessary.

Most RAG costs come from generation, not embedding; embedding spend is negligible by comparison.

Strategy 5: Use Bulk APIs

Cohere supports batch embedding requests. Sending 1,000 documents in single API call costs the same per token as single-document calls, but reduces latency variability and increases throughput.

Strategy 6: Monitor Usage

Cohere dashboard shows real-time token usage by model. Set spending alerts. Review high-cost queries monthly.

Identify outlier usage patterns:

  • Queries consuming 50K+ tokens (likely retrieval errors)
  • Low output token counts (wasted input cost)
  • Repeated identical queries (cache opportunities)

Advanced Cost Optimization Techniques

Prompt Caching:

Cohere doesn't offer built-in prompt caching like OpenAI. Implement application-level caching for repeated queries.

Example: customer support system receiving 20% duplicate queries

Without caching: process all 100 daily queries ≈ $22.50/month (Command R, assuming roughly $0.0075 per query)
With caching (20% cache hit rate): process 80 unique queries ≈ $18.00/month
Savings: $4.50/month, or $54/year

For larger platforms, cache hit rates of 30-50% translate to thousands of dollars monthly.
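Application-level caching like this can be sketched with an in-memory store (hypothetical helper names; swap the dict for Redis or Memcached in production):

```python
import hashlib

class QueryCache:
    """Cache LLM responses keyed by a normalized prompt hash."""

    def __init__(self):
        self._store: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prompt: str) -> str:
        # Normalize whitespace and case so near-identical prompts share a key
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_call(self, prompt: str, call_fn):
        """Return a cached response, or invoke call_fn (the paid API call) on a miss."""
        key = self._key(prompt)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = call_fn(prompt)
        return self._store[key]

# Usage with a stand-in for the real API call
cache = QueryCache()
fake_api = lambda p: f"answer to: {p}"
cache.get_or_call("How do I reset my password?", fake_api)
cache.get_or_call("how do I reset   my password?", fake_api)  # normalized duplicate: cache hit
print(cache.hits, cache.misses)  # 1 1
```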

Batch Processing Windows:

Group requests into off-peak hours (midnight to 6 AM) if latency permits. Some cloud providers offer off-peak discounts on compute.

Process 1,000 queries in a single batch session when a 24-hour processing window is acceptable; keep individual user-facing requests real-time.

Cost advantage: minimal unless using discounted compute (rare for API pricing).

Model-Task Matching:

Not all tasks require Command R+. Task-specific optimization:

Simple classification: Command R (sufficient for 95%+ accuracy)
Complex reasoning: Command R+ (necessary for only ~5% of workloads)

Audit all Command R+ usage. Replace with Command R where quality permits.

Result: 40-50% cost reduction for many workloads.

Data Quality Before API Calls:

Pre-filter requests to eliminate unnecessary processing.

Example: support ticket classification

Without pre-filter: process 1,000 tickets → $22.50
With pre-filter (duplicates and spam removed): process 800 tickets → $18.00
Savings: $4.50 per batch

Pre-filtering logic: simple keyword matching, regex, or shallow ML model.
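The pre-filtering step can be sketched as follows (the spam patterns are hypothetical; mine real ones from your own tickets):

```python
import re

# Hypothetical spam markers; replace with patterns tuned to your data
SPAM_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"buy now", r"limited[- ]time offer")]

def prefilter(tickets: list[str]) -> list[str]:
    """Drop exact duplicates (after case/whitespace normalization) and obvious spam."""
    seen = set()
    kept = []
    for ticket in tickets:
        normalized = " ".join(ticket.lower().split())
        if normalized in seen:
            continue  # duplicate: no need to pay for a second API call
        if any(p.search(normalized) for p in SPAM_PATTERNS):
            continue  # spam: not worth classifying at all
        seen.add(normalized)
        kept.append(ticket)
    return kept

tickets = ["Password reset fails", "password reset  fails", "BUY NOW cheap meds", "Billing question"]
print(prefilter(tickets))  # ['Password reset fails', 'Billing question']
```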

Monthly Bill Estimation

For Small Team (1-5 developers):

  • 100-500 daily API calls
  • Mix of generation, embedding, reranking
  • Typical spend: $10-50/month
  • Growth path: Ollama for experiments, Cohere for production inference

For Scaling Startup (5-50 team members):

  • 1,000-10,000 daily API calls
  • RAG-heavy workloads with reranking
  • Typical spend: $100-1,000/month
  • Growth path: Negotiate volume discounts as monthly spend approaches the $5K tier

For Enterprise (50+ team members):

  • 10,000+ daily API calls
  • Multiple applications and use cases
  • Dedicated infrastructure for compliance
  • Typical spend: $1,000-50,000+/month
  • Growth path: Enterprise contracts with custom pricing and SLAs

Bill Calculation Template:

  1. Estimate daily API calls
  2. Estimate average input tokens per call
  3. Estimate average output tokens per call
  4. Calculate monthly (Command R): (calls × input_tokens) / 1M × $0.15 + (calls × output_tokens) / 1M × $0.60
  5. Add 10% buffer for variability
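Steps 1-5 translate directly into code (a sketch; the default rates are this guide's Command R figures):

```python
def estimate_monthly_bill(daily_calls: int, avg_in_tokens: int, avg_out_tokens: int,
                          in_rate: float = 0.15, out_rate: float = 0.60,
                          days: int = 30, buffer: float = 0.10) -> float:
    """Monthly bill estimate in USD, with a buffer for usage variability."""
    monthly_calls = daily_calls * days
    base = (monthly_calls * avg_in_tokens / 1e6 * in_rate
            + monthly_calls * avg_out_tokens / 1e6 * out_rate)
    return base * (1 + buffer)

# Scenario 1 workload with the 10% buffer applied
print(round(estimate_monthly_bill(1_000, 2_000, 500), 2))  # 19.8
```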

Custom Deployment and Volume Negotiations

When to Negotiate:

  • Monthly spend exceeds $10K: contact Cohere sales for volume discounts
  • Data residency requirements: investigate on-premise deployments
  • Predictable usage patterns: consider commitment discounts
  • Multi-year contracts: lock in rates

Typical Negotiation Outcomes:

  • Volume discounts: 10-30% off list prices
  • Commitment discounts: 15-40% for annual prepayment
  • Dedicated deployments: custom pricing, typically $100K-500K/year
  • Support tier upgrades: included or discounted with large-volume contracts

Performance Considerations

Latency:

Command R: 500-1500ms typical latency
Command R+: 600-2000ms typical latency
Embeddings: 50-200ms batched, 150-400ms single
Rerank: 100-300ms

Cohere's cloud endpoints introduce slight latency compared to local inference. For applications requiring sub-500ms responses, local Ollama or GPT4All may be preferable. For non-interactive workloads (batch processing, offline analysis), cloud cost advantage outweighs latency.

API Rate Limits

Free Tier:

100 requests/minute for generation
1,000 requests/minute for embeddings

Standard Plan:

1,000 requests/minute for generation
5,000 requests/minute for embeddings

Enterprise:

Custom limits negotiated per contract

Most applications fit comfortably within standard limits. Hitting limits usually indicates need for batch APIs or request caching.

FAQ

Q: What's the cheapest way to use Cohere? A: Use Command R (not R+) for most tasks. Implement reranking to reduce generation calls. Embed documents once during indexing, retrieve freely. Estimated minimum: $10-20/month for small production systems.

Q: How does Cohere compare to OpenAI? A: At the rates above, Command R costs roughly 2.5-7.5x less than comparable OpenAI/Anthropic budget models for text generation while offering competitive quality for retrieval and search tasks. OpenAI excels at reasoning and code; Cohere is strongest at search and RAG. Choose based on task type.

Q: Are there hidden costs? A: No. Cohere charges only for API calls: generation, embeddings, reranking. No infrastructure, storage, or subscription minimums. Dashboard shows real-time usage.

Q: Can I reduce costs with caching? A: Cohere doesn't offer automatic prompt caching like OpenAI. Implement application-level caching: store results of expensive queries, reuse for similar future queries. Redis or Memcached work well.

Q: What if I exceed my budget? A: Cohere enforces spending limits. Set limits in dashboard. Once limit reached, API returns error. No surprise bills. Can increase limit anytime.

Q: Is there a way to lock in prices? A: Yes. Enterprise contracts include price-lock periods (typically 1-3 years). A minimum annual spend is usually required ($100K+).

Q: How does batch pricing work? A: Batch operations (multiple embeddings in single request) cost the same per token as individual requests, but reduce overhead and improve throughput. No special bulk discount.

Q: Can I use Cohere offline? A: No. Cohere is cloud-only. For offline inference, use open-source models with Ollama or GPT4All. Trade flexibility for cost by hosting your own inference. See also LLM API pricing comparison to evaluate all options.

Q: What about refunds or credits? A: Unused monthly credits don't roll over. Spending credits requires monthly active usage. Refunds available for billing errors within 30 days.
