Contents
- Amazon Bedrock Pricing: Overview
- On-Demand Pricing by Model (March 2026)
- Provisioned Throughput Pricing
- Claude on Bedrock vs Direct Claude API
- Llama on Bedrock vs Self-Hosted
- Cost-Per-Task Examples
- When to Use Bedrock
- Bedrock vs Direct API Pricing Matrix
- Bedrock Model Selection Guide
- Bedrock vs Self-Hosted Cost Analysis (1-Year Projection)
- Bedrock Integration Patterns
- Cost Optimization Strategies
- FAQ
- Related Resources
- Sources
Amazon Bedrock Pricing: Overview
Amazon Bedrock offers Claude (Anthropic), Llama (Meta), and Mistral models through AWS's managed inference platform. On-demand pricing ranges from $0.25 to $15 per million input tokens (prompt) and $1.25 to $120 per million output tokens (completion). Provisioned throughput (reserved capacity) costs $0.50 to $24 per hour, based on model and throughput tier.
Bedrock removes the operational overhead of running inference infrastructure. No GPU provisioning, no scaling logic, no VRAM management. The trade-off: model selection is limited to what AWS officially supports, and per-token costs are typically 1.5-2x higher than using open-source models on leased GPUs. For teams prioritizing managed simplicity over raw cost, Bedrock makes sense. For high-volume inference, direct API or self-hosted solutions are cheaper.
Compare Bedrock pricing against direct Anthropic, OpenAI, and open-source APIs on DeployBase's LLM pricing dashboard.
On-Demand Pricing by Model (March 2026)
Anthropic Claude Models
| Model | Context | Prompt $/M | Completion $/M | Best For |
|---|---|---|---|---|
| Claude Opus 4 | 200K | $15.00 | $75.00 | Complex reasoning, coding |
| Claude 3.7 Sonnet | 200K | $3.00 | $15.00 | Balanced, general-purpose |
| Claude 3.5 Haiku | 200K | $0.80 | $4.00 | Fast, cost-conscious |
Source: AWS Bedrock pricing page (March 21, 2026). Haiku is the lowest-cost Claude tier, suited to classification and summarization. Sonnet balances cost and capability. Opus handles the most complex tasks but costs roughly 19x more per token than Haiku.
Meta Llama Models
| Model | Context | Prompt $/M | Completion $/M | Best For |
|---|---|---|---|---|
| Llama 3.1 405B | 128K | $2.50 | $10.00 | Largest open-weight model |
| Llama 3.1 70B | 128K | $0.55 | $2.20 | Balanced open-source |
| Llama 3.1 8B | 128K | $0.08 | $0.32 | Lightweight, cost-effective |
Llama models are cheaper than Claude, reflecting open-weight licensing. 405B costs a fraction of Claude Opus (6x less on input) but is slower on complex reasoning. 8B is the lowest-cost option for simple tasks.
Mistral Models
| Model | Context | Prompt $/M | Completion $/M | Best For |
|---|---|---|---|---|
| Mistral Large 2 | 32K | $0.81 | $2.43 | French language, extended reasoning |
| Mistral 7B | 32K | $0.14 | $0.42 | Speed, cost, simplicity |
Mistral 7B is the cheapest Mistral model on Bedrock (though Llama 3.1 8B undercuts it). Limited context (32K vs 128K for Llama). Good for fast inference and simple tasks.
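The on-demand rates above fold naturally into a small cost helper for comparing models on a given request shape. A sketch using the prices from the tables (the model keys are illustrative labels, not Bedrock model IDs):

```python
# On-demand Bedrock prices from the tables above, USD per million tokens:
# model -> (prompt rate, completion rate)
PRICES = {
    "claude-opus-4": (15.00, 75.00),
    "claude-3.7-sonnet": (3.00, 15.00),
    "claude-3.5-haiku": (0.80, 4.00),
    "llama-3.1-405b": (2.50, 10.00),
    "llama-3.1-70b": (0.55, 2.20),
    "llama-3.1-8b": (0.08, 0.32),
    "mistral-large-2": (0.81, 2.43),
    "mistral-7b": (0.14, 0.42),
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD for one request at on-demand rates."""
    p_rate, c_rate = PRICES[model]
    return (prompt_tokens * p_rate + completion_tokens * c_rate) / 1_000_000

# A 200-token post classified with a 30-token verdict, 1M times:
haiku = request_cost("claude-3.5-haiku", 200, 30) * 1_000_000   # ≈ $280
llama8b = request_cost("llama-3.1-8b", 200, 30) * 1_000_000     # ≈ $25.60
```

The same helper reproduces every per-task example later in this guide; only the request shape changes.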
Provisioned Throughput Pricing
Provisioned throughput (reserved capacity) locks in lower per-token rates by committing to a throughput tier for 1 month. Costs are hourly (730 hours/month), not per-token.
Claude Provisioned Throughput
| Model | Tier | Throughput (tokens/min) | $/Hour | $/Month (730 hrs) |
|---|---|---|---|---|
| Claude Opus 4 | 1 | 50K in/out tokens | $2.40 | $1,752 |
| Claude Opus 4 | 2 | 100K in/out tokens | $4.80 | $3,504 |
| Claude 3.7 Sonnet | 1 | 100K in/out tokens | $0.60 | $438 |
| Claude 3.7 Sonnet | 2 | 200K in/out tokens | $1.20 | $876 |
Provisioned throughput is worthwhile when monthly token consumption justifies the reservation. Calculate: monthly savings = (monthly_tokens_in_millions × blended_on_demand_rate_per_M) − monthly_provisioned_cost.
Example: Claude Sonnet
- On-demand: $3/M input + $15/M output; with a 50/50 input/output split, a blended ~$9/M
- 100M tokens/month × $9/M = $900
- Provisioned Tier 1: $438/month
- Savings: $900 - $438 = $462/month
Provisioned is cheaper above roughly 50M tokens/month for Sonnet.
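This breakeven can be computed directly. A minimal sketch, assuming a 50/50 input/output token split (adjust `output_share` to your real mix):

```python
def blended_rate(prompt_per_m: float, completion_per_m: float,
                 output_share: float = 0.5) -> float:
    """Blended $ per million tokens for a given output-token fraction."""
    return prompt_per_m * (1 - output_share) + completion_per_m * output_share

def breakeven_tokens_per_month(provisioned_monthly: float,
                               prompt_per_m: float,
                               completion_per_m: float,
                               output_share: float = 0.5) -> float:
    """Monthly tokens (in millions) above which provisioned beats on-demand."""
    return provisioned_monthly / blended_rate(
        prompt_per_m, completion_per_m, output_share)

# Claude 3.7 Sonnet, Tier 1 at $438/month:
print(breakeven_tokens_per_month(438, 3.00, 15.00))   # ≈ 48.7M tokens/month
# Llama 70B, Tier 1 at $88/month:
print(breakeven_tokens_per_month(88, 0.55, 2.20))     # ≈ 64M tokens/month
```

A heavily output-skewed workload raises the blended rate and lowers the breakeven point, so measure your real token mix before committing.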
Llama Provisioned Throughput
| Model | Tier | Throughput (tokens/min) | $/Hour | $/Month (730 hrs) |
|---|---|---|---|---|
| Llama 405B | 1 | 50K in/out tokens | $0.48 | $350 |
| Llama 405B | 2 | 100K in/out tokens | $0.96 | $701 |
| Llama 70B | 1 | 100K in/out tokens | $0.12 | $88 |
| Llama 70B | 2 | 200K in/out tokens | $0.24 | $175 |
Llama provisioned throughput is extremely affordable. Llama 70B's Tier 1 ($88/month) pays for itself above roughly 64M tokens/month, assuming a 50/50 blended on-demand rate of $1.375/M.
Claude on Bedrock vs Direct Claude API
| Factor | Bedrock | Direct Anthropic API |
|---|---|---|
| Opus 4 Prompt | $15/M | $15.00/M |
| Opus 4 Completion | $75/M | $75.00/M |
| Sonnet (3.7) Prompt | $3.00/M | $3.00/M |
| Sonnet (3.7) Completion | $15.00/M | $15.00/M |
| Haiku Prompt | $0.80/M (3.5) | $1.00/M (4.5) |
| Haiku Completion | $4.00/M (3.5) | $5.00/M (4.5) |
Analysis:
- Opus 4 pricing is identical on Bedrock and direct API ($15/$75 per million tokens)
- Sonnet (3.7) pricing is identical
- Haiku is 20% cheaper on Bedrock (3.5 at $0.80/$4.00) than the direct API's Haiku 4.5 ($1.00/$5.00) — note these are different model generations
For Opus- and Sonnet-heavy workloads, pricing is a wash between Bedrock and the direct API. For Haiku-heavy workloads, Bedrock offers a small discount. Bedrock adds AWS infrastructure (managed scaling, VPC integration, IAM auth) — the value is integration convenience, not lower cost.
Llama on Bedrock vs Self-Hosted
Llama 70B Inference Cost Comparison
Scenario: Serve 100M tokens per month, 24/7 operation.
Bedrock On-Demand:
- Cost: 100M tokens × $1.375/M blended (($0.55 + $2.20)/2, 50/50 split) = $137.50/month
- Simplicity: Yes, zero ops
- Latency: ~500-800ms (API roundtrip included)
Self-Hosted on RunPod (1x H100):
- GPU cost: $1.99/hr × 730 = $1,453/month
- Throughput per GPU: 850 tokens/sec = ~2.2B tokens/month
- Utilization needed: 100M / 2,200M = 4.5% (oversized)
- Actual cost: $1,453 × 4.5% = $65/month
- Latency: 50-100ms (direct inference)
- Ops overhead: high (model management, scaling, monitoring)
Cost comparison: at this volume, Bedrock (~$137.50) runs about 2x the pro-rated GPU cost (~$65), though far below the $1,453/month a dedicated, mostly idle GPU would bill. Self-hosting requires ops skills but scales to massive throughput cheaply as utilization rises; Bedrock wins on operational simplicity.
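The comparison can be sketched in a few lines, assuming a 50/50 input/output split and treating the GPU figure as pro-rated (a dedicated GPU bills its full hourly rate regardless of utilization):

```python
HOURS_PER_MONTH = 730

def bedrock_monthly(tokens_m: float, prompt_per_m: float,
                    completion_per_m: float, output_share: float = 0.5) -> float:
    """On-demand cost for tokens_m million tokens per month."""
    blended = prompt_per_m * (1 - output_share) + completion_per_m * output_share
    return tokens_m * blended

def gpu_prorated_monthly(tokens_m: float, gpu_hourly: float,
                         tokens_per_sec: float) -> float:
    """Share of a dedicated GPU's monthly bill attributable to this workload."""
    capacity_m = tokens_per_sec * 3600 * HOURS_PER_MONTH / 1_000_000
    utilization = tokens_m / capacity_m
    return gpu_hourly * HOURS_PER_MONTH * utilization

# Llama 70B, 100M tokens/month:
print(bedrock_monthly(100, 0.55, 2.20))       # ≈ $137.50
print(gpu_prorated_monthly(100, 1.99, 850))   # ≈ $65
```

The pro-rated figure only materializes if other workloads absorb the remaining GPU capacity; otherwise compare against the full $1,453/month.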
Cost-Per-Task Examples
Content Moderation (Classification)
Scenario: Review 1M user-submitted posts, output: safe/unsafe classification (30 tokens output average).
Using Claude 3.5 Haiku on Bedrock (on-demand):
- Prompt: 1M × 200 tokens (post content) = 200M tokens × $0.80/M = $160
- Completion: 1M × 30 tokens = 30M tokens × $4.00/M = $120
- Total: $280
Using Llama 8B on Bedrock:
- Prompt: 1M × 200 = 200M tokens × $0.08/M = $16
- Completion: 1M × 30 = 30M tokens × $0.32/M = $9.60
- Total: $25.60
Llama 8B is 11x cheaper for simple classification. Quality may be lower; benchmark first.
Customer Support Chat (Reasoning)
Scenario: Respond to 10,000 support queries, 500 tokens input (customer message), 400 tokens output (bot response).
Using Claude 3.7 Sonnet on Bedrock (provisioned):
- Monthly allocation: 10,000 × (500 + 400) = 9M tokens
- Provisioned Tier 1 (100K tokens/min): $438/month
- Cost per query: $438 / 10,000 = $0.044
- Quality: excellent
Using Llama 70B on Bedrock (on-demand):
- Prompt: 10,000 × 500 = 5M tokens × $0.55/M = $2.75
- Completion: 10,000 × 400 = 4M tokens × $2.20/M = $8.80
- Total: $11.55
- Cost per query: $0.0012
- Quality: good but lower reasoning capability
Claude provisioned costs ~38x more per query here — mostly because the Tier 1 reservation far exceeds the 9M tokens actually consumed — but can be worth it for complex support. Llama suits simple FAQ responses.
Code Generation
Scenario: Generate code completions for 5,000 prompts (150 tokens input, 200 tokens output).
Using Claude Opus 4 on Bedrock (on-demand):
- Prompt: 5,000 × 150 × $15/M = $11.25
- Completion: 5,000 × 200 × $75/M = $75
- Total: $86.25
Using Mistral Large on Bedrock:
- Prompt: 5,000 × 150 = 750K tokens × $0.81/M = $0.61
- Completion: 5,000 × 200 = 1M tokens × $2.43/M = $2.43
- Total: $3.04
Claude Opus is 28x more expensive but produces better code (fewer errors, fewer revisions needed). Mistral is cheaper but requires more human review.
When to Use Bedrock
Bedrock Makes Sense For:
AWS-native applications. Already running on AWS, using IAM, VPC, CloudWatch. Bedrock integrates directly without additional infrastructure setup. No new layers to manage.
Managed inference at scale. Need auto-scaling without operational overhead. Bedrock handles traffic spikes automatically.
Compliance and data residency. Data stays in AWS VPC. Useful for regulated industries (finance, healthcare) requiring data locality.
Quick prototyping. Spin up a chatbot in hours, not weeks. No GPU procurement, no model serving code.
One vendor, many models. Claude, Llama, and Mistral sit behind a single AWS account, IAM policy, and bill — convenient if you're already on AWS.
Bedrock is NOT Good For:
Cost-sensitive, high-volume inference. Self-hosting with RunPod/CoreWeave is 5-20x cheaper at scale.
Custom models or fine-tuning. Bedrock doesn't support fine-tuning. Use direct APIs or self-hosted solutions.
Latency-critical applications. Bedrock's API roundtrip adds 500-800ms. Direct inference adds 50-100ms.
Exotic model selection. Limited to Anthropic, Meta, and Mistral. If developers need Grok, DeepSeek, or other models, go elsewhere.
Bedrock vs Direct API Pricing Matrix
| Use Case | Bedrock | Direct API | Winner |
|---|---|---|---|
| Low-volume testing | $0.50-$2/day | $0.50-$2/day | Tie |
| 100M tokens/month | $1,000+ | $500-$800 | Direct API |
| 1B tokens/month | $8,000+ | $4,000-$6,000 | Direct API |
| Ops simplicity | High | Low | Bedrock |
| Latency <100ms | No | Yes | Direct API |
| AWS integration | Direct | Extra config | Bedrock |
Direct APIs are 30-50% cheaper for high volume. Bedrock wins on convenience and AWS integration.
Bedrock Model Selection Guide
Claude on Bedrock
Use Opus when:
- Complex multi-step reasoning (math, logic puzzles)
- Code generation with architectural decisions
- Long-form content generation (essays, reports)
- User-facing applications where quality is paramount
Cost: $15/M input, $75/M output. Justified when quality gains (fewer revision cycles, less customer churn) outweigh the premium.
Use Sonnet when:
- General-purpose chatbots
- Content moderation and classification
- Summarization (article, email, meeting notes)
- Balanced cost and quality
Cost: $3/M input, $15/M output. 5x cheaper than Opus with 90% of Opus's capability.
Use Haiku when:
- Simple classification (spam, sentiment)
- Template-based generation (emails, messages)
- Batch processing with minimal reasoning
- Cost-constrained deployments
Cost: $0.80/M input, $4/M output. Roughly 19x cheaper than Opus. Quality drops on complex tasks.
Llama on Bedrock
Use 405B when:
- Maximum open-weight capability is required (workloads that need 405B-class reasoning)
- Cost must be lower than Claude Opus
- Multilingual or non-English-primary workloads
Cost: $2.50/M input, $10/M output. 6x cheaper than Claude Opus with comparable reasoning.
Use 70B when:
- Balanced cost and quality (better than Haiku, cheaper than Sonnet)
- Production inference at scale
Cost: $0.55/M input, $2.20/M output. Sweet spot for most teams.
Use 8B when:
- Edge deployments or low-latency requirements
- High-volume, low-complexity tasks (100M+ queries/month)
- Budget-constrained research
Cost: $0.08/M input, $0.32/M output. Lowest cost open-source option.
Bedrock vs Self-Hosted Cost Analysis (1-Year Projection)
Scenario: Chatbot for SaaS Product
Requirements:
- 50M tokens/month (conversations)
- 80% input tokens (user queries), 20% output (responses)
- 12-month contract
Bedrock (Claude 3.7 Sonnet, on-demand):
- Input cost: 50M × 0.8 × $3/M = $120/month
- Output cost: 50M × 0.2 × $15/M = $150/month
- Monthly total: $270
- Annual: $3,240
- Ops cost: ~$0 (fully managed)
Self-Hosted (Llama 70B on RunPod):
- GPU cost: 1x H100 × $1.99/hr × 730 = $1,453/month
- Throughput: 850 tok/s = 2.2B tokens/month (44x what's needed)
- Utilization: 50M / 2,200M = 2.3%
- Actual cost: $1,453 × 2.3% = $33/month
- Annual: $396
- Ops cost: ~$500/month engineer time (model management, scaling, monitoring)
- Annual ops: $6,000
- Total annual: $6,396
Verdict: Bedrock is cheaper by $3,156 (about 49%) once ops cost is factored in.
But if the engineering team already maintains GPU clusters, marginal ops cost drops to ~$100/month ($1,200/year). Then self-hosted wins: ~$396 pro-rated GPU + $1,200 ops = ~$1,596/year, with the full $17,436 GPU bill shared across many applications. Bedrock still wins when usage is light and no cluster exists.
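The projection reduces to two formulas, with ops cost as the swing variable. A sketch using the scenario's figures (the $33/month pro-rated GPU cost comes from the utilization math above):

```python
def annual_tco_bedrock(monthly_tokens_m: float, input_share: float,
                       prompt_per_m: float, completion_per_m: float) -> float:
    """Annual on-demand Bedrock cost for a fixed input/output token split."""
    monthly = monthly_tokens_m * (input_share * prompt_per_m
                                  + (1 - input_share) * completion_per_m)
    return monthly * 12

def annual_tco_self_hosted(prorated_gpu_monthly: float,
                           ops_monthly: float) -> float:
    """Annual self-hosted cost: pro-rated GPU share plus engineer time."""
    return (prorated_gpu_monthly + ops_monthly) * 12

bedrock = annual_tco_bedrock(50, 0.8, 3.00, 15.00)    # ≈ $3,240
dedicated_ops = annual_tco_self_hosted(33, 500)       # $6,396
shared_cluster = annual_tco_self_hosted(33, 100)      # $1,596
```

The crossover sits entirely in `ops_monthly`: at $500/month of engineer time Bedrock wins, at $100/month (existing cluster) self-hosting wins.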
Scenario: High-Volume Classification
Requirements:
- 1B tokens/month
- 99% input (documents), 1% output (classifications)
- 12-month contract
Bedrock (Llama 8B, provisioned):
- Tier capacity: 100K tokens/min × 60 min × 730 hrs ≈ 4.4B tokens/month — comfortably above the 1B needed
- Monthly cost: $88
- Annual: $88 × 12 = $1,056
Self-Hosted (Llama 8B on RunPod, 1x H100):
- GPU cost: $1.99/hr × 730 = $1,453/month = $17,436/year
- But utilization for 1B tokens/month: 1B / (850 tok/s × 3,600 s × 730 hrs ≈ 2.2B) ≈ 45%
- Actual cost: $17,436 × 45% ≈ $7,850/year
- Ops cost: ~$50/month (minimal for a single GPU) = $600/year
- Total: ~$8,450/year
Verdict: Bedrock is roughly 8x cheaper ($1,056 vs ~$8,450) for this high-volume task. At Llama 8B's blended on-demand rate (~$0.20/M), the $88 tier breaks even near 440M tokens/month, so provisioned throughput is comfortably economical at 1B tokens/month.
Bedrock Integration Patterns
Pattern 1: Lambda + Bedrock
AWS Lambda functions invoke Bedrock for serverless inference. Scales automatically with request volume.
Cost model: Pay for Lambda compute (usually negligible) + Bedrock token consumption.
Good for: Event-driven applications (image upload triggers tagging, user signup triggers welcome email).
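A minimal sketch of the Lambda pattern, using the Anthropic-on-Bedrock message format via boto3's `bedrock-runtime` client. The model ID and event shape are illustrative — check the Bedrock console for the IDs enabled in your region:

```python
import json

# Illustrative model ID; verify the exact ID available in your region/account.
MODEL_ID = "anthropic.claude-3-5-haiku-20241022-v1:0"

def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Request body in the Anthropic-on-Bedrock messages format."""
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def handler(event, context):
    """Lambda entry point: tag the uploaded object's description."""
    import boto3  # available in the Lambda runtime; imported lazily for testability
    client = boto3.client("bedrock-runtime")
    body = build_request(f"Tag this image description: {event['description']}")
    resp = client.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    return json.loads(resp["body"].read())
```

Lambda compute is billed separately but is usually negligible next to token cost; concurrency limits on the function double as a crude spend cap.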
Pattern 2: SageMaker + Bedrock
Use SageMaker notebooks for development, Bedrock for production inference.
Cost model: Development in SageMaker (notebook rental + storage), production on Bedrock (per-token).
Good for: Teams prototyping custom models, then switching to managed inference.
Pattern 3: EC2 + Bedrock via VPC
EC2 application servers call Bedrock over VPC, avoiding internet egress costs.
Cost model: EC2 instance rental + Bedrock tokens (no egress charges).
Good for: Applications requiring extremely low latency to Bedrock or strict data residency.
Cost Optimization Strategies
1. Batch Processing
Process requests in batches during off-peak hours. If latency tolerance is 12 hours, batch overnight.
Example: 1M classification requests processed at 10K/batch = 100 batches = 1 Bedrock API call per batch (if batching supported). Reduces API overhead.
Savings: 10-30% depending on implementation.
2. Model Downgrading
Start with Sonnet. If benchmarks show Haiku (40% cheaper) performs adequately, switch.
Example: Sentiment classification task. Benchmark: Sonnet 95% accuracy, Haiku 94% accuracy. Savings: 40% of token cost. Worth it? Depends on error cost (misclassified positive sentiment costs reputation).
Savings: 20-60% depending on task.
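Whether a downgrade pays can be estimated by pricing the extra errors. A sketch where `cost_per_error` is an assumption you must supply (the $0.05 per misclassification below is purely illustrative):

```python
def downgrade_net_savings(requests: int,
                          cost_per_request_big: float,
                          cost_per_request_small: float,
                          accuracy_big: float,
                          accuracy_small: float,
                          cost_per_error: float) -> float:
    """Positive result: the cheaper model wins despite its extra errors."""
    token_savings = requests * (cost_per_request_big - cost_per_request_small)
    extra_errors = requests * (accuracy_big - accuracy_small)
    return token_savings - extra_errors * cost_per_error

# 1M sentiment calls at 200 in / 30 out tokens: Sonnet ≈ $0.00105/request,
# Haiku ≈ $0.00028/request; 95% vs 94% accuracy as in the example above.
print(downgrade_net_savings(1_000_000, 0.00105, 0.00028,
                            0.95, 0.94, 0.05))  # ≈ $270 in favor of Haiku
```

If `cost_per_error` rises above ~$0.077 here, the sign flips and Sonnet wins — which is the whole decision in one number.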
3. Quantization for Self-Hosted
If considering self-hosting, quantize models to 4-bit or 8-bit to fit on fewer GPUs, reducing cost.
Example: Llama 70B quantized to 4-bit needs ~35GB for weights (a single H100 instead of two, leaving headroom for KV cache). Saves 50% of GPU cost with modest quality loss on most tasks.
Savings: 20-50% GPU cost (self-hosted only).
4. Provisioned Throughput for Predictable Workloads
If token consumption is predictable and >100M/month, lock in provisioned throughput.
Example: SaaS product with 100K daily active users, 100 tokens/user = 10M tokens/day = 300M/month. Provisioned throughput saves 40-60% vs on-demand.
Savings: 40-60% for high-volume, predictable workloads.
FAQ
Is Bedrock cheaper than OpenAI?
It depends on the model. OpenAI GPT-5 costs $1.25-$15/M input, $10-$120/M output; Bedrock Claude Opus costs $15/M input, $75/M output — a similar range. Bedrock Llama 70B ($0.55/$2.20), however, is cheaper than any OpenAI model.
Can I fine-tune models on Bedrock?
No. Bedrock doesn't support fine-tuning. If you need custom models, use SageMaker (AWS) or direct APIs with fine-tuning support (Anthropic, OpenAI, Mistral).
What about Bedrock's knowledge cutoff?
Claude 3.5 on Bedrock has a cutoff similar to the direct API (~April 2025 as of March 2026). Same limitations apply.
Does Bedrock support vision (images)?
Yes, Claude Opus 4 and Sonnet models support vision on Bedrock. Images are billed as input tokens scaled to image size; check the AWS pricing page for the current conversion.
Should I use provisioned throughput?
Yes, if monthly token consumption exceeds the breakeven threshold. For Claude Sonnet: ~50M tokens/month. For Llama 70B: ~64M tokens/month. Calculate before committing.
Can I switch between on-demand and provisioned?
Yes. Provisioned throughput is month-to-month. Switch models/tiers monthly. Recommended: start on-demand to measure real usage, then lock in provisioned if usage is consistent.
What if I exceed provisioned throughput capacity?
Bedrock rejects requests above your provisioned capacity with throttling errors, so latency-sensitive clients need retry logic. Increase your tier, or fall back to on-demand for burst capacity.
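Whatever throttling behavior you observe, production clients should wrap Bedrock calls in jittered exponential backoff. A generic sketch — the retryable exception type is a placeholder to swap for your SDK's throttling error:

```python
import random
import time

def with_backoff(call, max_attempts: int = 5, base_delay: float = 0.5,
                 retryable=(RuntimeError,)):
    """Retry `call` on throttling-style errors with jittered exponential backoff.

    `retryable` is a placeholder tuple; substitute your SDK's throttling
    exception class (e.g. a client-side ThrottlingException wrapper).
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential delay with ±50% jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Jitter matters: without it, many throttled clients retry in lockstep and re-saturate the same capacity window.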
Related Resources
- Amazon Bedrock Models
- Anthropic Claude Pricing
- OpenAI Pricing Comparison
- DeepSeek Pricing Guide
- LLM Pricing Calculator
Sources
- AWS Bedrock Pricing
- AWS Bedrock Model Documentation
- Anthropic Claude API Pricing
- Meta Llama Models License
- DeployBase LLM Pricing API (tracked March 21, 2026)