Contents
- Overview
- Together AI Pricing Model
- Inference Pricing Comparison
- Fine-Tuning and GPU Rental Costs
- Open-Source Model Economics
- Competitive Analysis
- Detailed Fine-Tuning Pricing & Strategies
- Dedicated Instance Pricing & SLA Options
- Hidden Fees and Gotchas
- Cost Optimization Strategies
- FAQ
- Related Resources
- Sources
Overview
Together AI is a hosting platform for open-source language models, offering Llama 3, Mistral, Phi, Code Llama, and other community-driven models via API. Unlike OpenAI's single GPT family or Anthropic's Claude lineup, Together AI gives teams choices: run smaller efficient models cheaply, or larger capable models at moderate cost. Pricing varies by model size and architecture. This guide breaks down Together's cost structure, compares it to closed-source alternatives, and identifies when open-source models make financial sense.
Pricing Overview Table
| Model | Input ($/1M) | Output ($/1M) | Size |
|---|---|---|---|
| Llama 3.1 8B | $0.10 | $0.10 | 8B |
| Llama 3.3 70B | $0.88 | $0.88 | 70B |
| Mistral 7B | $0.15 | $0.15 | 7B |
| Mistral Medium | $0.45 | $0.45 | ~12B |
| Phi-3 | $0.05 | $0.05 | 3.8B |
| Code Llama 70B | $0.90 | $0.90 | 70B |
| Llama 2 70B | $0.75 | $0.75 | 70B |
Note: Prices as of March 2026, per 1M tokens. Rates may vary slightly; verify via Together's dashboard before production commits.
Together AI Pricing Model
Together AI charges per 1M input tokens and per 1M output tokens separately, similar to OpenAI and Anthropic. No subscription, no hidden minimums. Users pay only for what they consume.
Free Tier:
- $5/month free credits upon signup
- Limited to educational and non-commercial use
- Useful for prototyping but insufficient for production
Pay-as-you-go:
- API key required, linked to payment method
- Charges accrue with usage, invoiced monthly
- No rate limiting based on account age (unlike OpenAI's gradual tier system)
- Pricing visible in real-time dashboard
Bulk Discounts (Estimated): Together AI does not publicly advertise volume discounts as of March 2026. Contact sales for commitments above $10k/month. This is a gap compared to OpenAI's structured tier discounts.
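Per-token billing makes cost estimation a one-line calculation. A minimal sketch, using the March 2026 rates from the table above (rates change; verify against Together's dashboard before relying on them):

```python
# Per-1M-token rates from the pricing table above (these models charge
# the same rate for input and output), USD.
RATES = {
    "phi-3": 0.05,
    "llama-3.1-8b": 0.10,
    "mistral-7b": 0.15,
    "llama-3.3-70b": 0.88,
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one request under per-token billing."""
    rate = RATES[model]
    return (input_tokens + output_tokens) * rate / 1_000_000

# A 500-in / 50-out classification request on Phi-3 costs $0.0000275,
# i.e. $0.0275 per 1,000 requests.
per_request = request_cost("phi-3", 500, 50)
```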
Inference Pricing Comparison
Small Models (Sub-10B)
Phi-3 ($0.05/$0.05): Smallest and cheapest on Together. Phi is Microsoft's research model, optimized for mobile and edge inference. Suitable for simple classification, summarization, and intent detection.
Example task: classify customer support tickets (500 input tokens, 50 output tokens per request). Per 1,000 requests, that is 0.5M input and 0.05M output tokens:
- Cost: (0.5 × $0.05) + (0.05 × $0.05) = $0.0275 per 1,000 requests (about $0.0000275 per request)
Mistral 7B ($0.15/$0.15): Balanced model, faster than Llama at inference, competitive on quality for generation tasks. Mistral is open-source, self-hostable, but Together hosting removes infrastructure burden.
Same task: (0.5 × $0.15) + (0.05 × $0.15) = $0.0825 per 1,000 requests
Phi-3 costs 67% less but may require more prompt engineering to match quality.
Mid-Sized Models
Mistral Medium (~$0.45/$0.45): Together's estimated rate for Mistral's larger variant (a Mistral-internal model whose architecture is not fully disclosed). Handles more complex reasoning than the 7B.
Long-form content task (5k input, 2k output per request). Per 1,000 requests:
- Phi-3: (5 × $0.05) + (2 × $0.05) = $0.35
- Mistral 7B: (5 × $0.15) + (2 × $0.15) = $1.05
- Mistral Medium: (5 × $0.45) + (2 × $0.45) = $3.15
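These comparisons reduce to a single multiplication per model; a small sketch using the per-1M-token rates quoted above (subject to change):

```python
# Per-1M-token rates quoted in this comparison (input and output priced
# the same), USD.
MODELS = {"Phi-3": 0.05, "Mistral 7B": 0.15, "Mistral Medium": 0.45}

def cost_per_1k_requests(rate_per_1m: float, in_tokens: int, out_tokens: int) -> float:
    """USD cost of 1,000 requests at rate_per_1m USD per 1M tokens."""
    return 1_000 * (in_tokens + out_tokens) * rate_per_1m / 1_000_000

# 5k input / 2k output per request:
costs = {name: cost_per_1k_requests(r, 5_000, 2_000) for name, r in MODELS.items()}
# Phi-3: $0.35, Mistral 7B: $1.05, Mistral Medium: $3.15 per 1,000 requests
```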
Mistral Medium's 9x cost premium over Phi-3 assumes a corresponding quality gap. For teams with tight budget constraints, Phi-3 plus prompt optimization often matches Mistral Medium's output at a fraction of the cost.
Large Models (70B)
Llama 3.3 70B ($0.88/$0.88): Open-source flagship, competitive with closed-source models in reasoning and factuality. Meta's open release reduces moat; pricing drops quickly toward cost of compute.
Code Llama 70B ($0.90/$0.90): Code-specialized version of Llama 2, pre-trained on code corpora. Better code generation than generic Llama. Nearly the same price, different use case.
Example: code generation task with Llama 3.3 70B (2k input context, 1k output code):
- Cost: (2 × $0.88) + (1 × $0.88) = $2.64 per 1,000 requests (about $0.00264 per request)
For 100,000 code generation requests/month: $2.64 × 100 = $264/month
Fine-Tuning and GPU Rental Costs
Together AI offers two paths for fine-tuning:
Managed Fine-Tuning (API)
Upload training data, select a base model, let Together handle the infrastructure. Pricing:
- $10/hour for GPU-hours (A100-equivalent)
- Typical fine-tune job: 1-4 hours (small datasets)
- Example: fine-tune Llama 3 70B on custom dataset = 2 hours = $20
This is cheap for experimentation. However, marginal value is low on already well-trained models. Fine-tuning is most valuable for task-specific optimization (domain adaptation, format consistency).
Dedicated GPU Rental
Reserve GPUs for custom training workflows. Pricing:
- A100 SXM: $3.99/hour
- H100 SXM: $6.99/hour
- Requires longer commitment (typically 1-month minimum)
This is more expensive than raw GPU cloud (RunPod H100 SXM: $2.69/hr; Lambda H100 SXM: $3.78/hr), but Together bundles training infrastructure with API access, valuable for teams already using the platform.
Open-Source Model Economics
The fundamental economics of open-source LLMs differ from closed-source models.
Self-Hosting Cost: Running Llama 3 70B requires 80GB VRAM minimum. Options:
- Rent GPU: A100 on RunPod $1.19/hr
- Buy GPU: Used A100 $3,000-5,000, depreciates, requires electricity and cooling
Self-hosting 24/7:
- A100 rental: $1.19 × 24 × 30 = $856.80/month
- Llama 3.3 70B on Together: $0.88 per 1M tokens (input or output)
At 100M tokens/month (a typical startup volume), Together costs 100 × $0.88 = $88/month.
Together is approximately 9.7x cheaper than a 24/7 self-hosted A100 at this volume.
Scaling Considerations: At 1B tokens/month (10x larger):
- Together: $880/month
- Self-hosted A100: $856.80/month
- Self-hosted converges with Together
At 2B tokens/month:
- Together: $1,760/month
- Self-hosted A100: still $856.80
- Self-hosted wins
The breakeven is roughly 1B tokens/month. Below that, Together is cheaper. Above that, self-hosting is more economical (assuming GPU utilization is consistent).
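A rough breakeven check, using the figures from this section (a single 24/7 A100 at RunPod's quoted rate versus Together's Llama 3.3 70B rate; real self-hosting adds engineering and redundancy costs not modeled here):

```python
API_RATE = 0.88               # USD per 1M tokens, Llama 3.3 70B on Together
GPU_MONTHLY = 1.19 * 24 * 30  # 24/7 A100 rental on RunPod ≈ $856.80/month

def cheaper_option(millions_of_tokens_per_month: float) -> str:
    """Compare pay-per-token API spend against a 24/7 self-hosted A100."""
    api_cost = millions_of_tokens_per_month * API_RATE
    return "together" if api_cost < GPU_MONTHLY else "self-host"

breakeven = GPU_MONTHLY / API_RATE  # ≈ 974M tokens/month, i.e. roughly 1B
```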
Competitive Analysis
Together vs OpenAI
GPT-5.4 ($2.50/$15 per 1M tokens):
- Input: ~2.8x more expensive than Llama 3.3 70B
- Output: ~17x more expensive than Llama 3.3 70B
When OpenAI wins:
- Reasoning complexity: GPT-5.4 outperforms Llama on multi-step analysis
- Multimodal: GPT-5.4 handles images; Llama does not
- Reliability: OpenAI's infrastructure, SLA guarantees, 99.9% uptime
When Together wins:
- Budget-sensitive: $0.88 vs $2.50 is significant at scale
- Latency control: self-host Llama internally for low-latency inference
- Data privacy: models can run on-premises, so data never leaves your servers
Together vs Anthropic
Claude Sonnet 4.6 ($3/$15 per 1M tokens):
- Input: ~3.4x more expensive than Llama 3.3 70B
- Output: ~17x more expensive than Llama 3.3 70B
Claude strengths:
- Constitutional AI: Safer, less biased outputs
- Extended thinking: Better reasoning on complex tasks
- Multimodal: Handles images natively
Llama strengths:
- Cheaper: 3-17x lower cost
- Self-hostable: Run locally, no API calls required
- Ecosystem: Massive community, countless fine-tunes
Together vs DeepSeek
DeepSeek pricing (if available through Together) typically undercuts both OpenAI and Anthropic, competing directly with Llama. See DeepSeek pricing guide for detailed comparison.
Detailed Fine-Tuning Pricing & Strategies
Together AI's fine-tuning offerings have distinct cost structures depending on approach.
Managed Fine-Tuning Service
Together's API handles data upload, training orchestration, and model deployment.
Cost Structure:
- Storage: Free (data stored in Together buckets)
- Compute: $10/hour GPU (A100 equivalent)
- Data preparation: Free (included in API)
Typical Fine-Tune Timeline:
Fine-tuning Llama 3 70B on domain-specific dataset (100k examples):
- Step 1: Data preparation (validation, formatting): 0 hours, included
- Step 2: Fine-tuning run (1-4 epochs): 2-4 hours at $10/hour = $20-40
- Step 3: Evaluation & testing: 0.5 hours = $5
- Total cost: $25-45
Compare to OpenAI fine-tuning:
- Training: $8/hour (cheaper than Together)
- But storage and API access fees add up
- OpenAI: Total cost $15-25 for similar dataset
Together is competitive for small-scale fine-tuning but not cheaper than OpenAI.
When Managed Fine-Tuning Pays Off:
- Rapid iteration: Multiple fine-tune cycles on same dataset (test Llama 3, try Mistral, revert). $25 per cycle is low friction.
- Domain adaptation: Tuning on proprietary data without managing infrastructure.
- A/B testing: Fine-tune two variants in parallel ($50 total cost for comparison).
When It's Wasteful:
- One-time fine-tune: Spend $50 in GPU time, realize task doesn't need fine-tuning. Wasted capital.
- Large datasets (1M+ examples): Takes 20+ hours, costs $200+. Self-hosting becomes cheaper.
Dedicated GPU Rental for Custom Workflows
For teams wanting full control over training scripts (custom loss functions, architectures, training loops):
Pricing:
- A100 SXM: $3.99/hour
- H100 SXM: $6.99/hour
- Minimum commitment: 1 month (240 hours)
Monthly cost:
- A100: $3.99 × 240 = $957.60
- H100: $6.99 × 240 = $1,677.60
Raw GPU cloud providers are cheaper for comparable hardware:
- RunPod A100 SXM: $1.39/hr × 240 = $333.60 (cheaper)
- Lambda H100 SXM: $3.78/hr × 240 = $907.20
Together is more expensive than raw GPU cloud. Its value proposition is bundling GPU access with API, so teams don't manage multiple infrastructure layers.
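The monthly-commitment math above, collected in one place (hourly rates as quoted in this section; all are subject to change):

```python
HOURS = 240  # Together's stated 1-month minimum commitment

hourly_rates = {  # USD per hour, as quoted in this section
    "Together A100 SXM": 3.99,
    "Together H100 SXM": 6.99,
    "RunPod A100 SXM": 1.39,
    "Lambda H100 SXM": 3.78,
}

monthly = {name: round(rate * HOURS, 2) for name, rate in hourly_rates.items()}
# Together A100: $957.60, Together H100: $1,677.60,
# RunPod A100: $333.60, Lambda H100: $907.20
```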
Real-World Scenario: Custom Mixture-of-Experts Training
A team wants to train a 3-expert MoE model (3 × 12B experts + router). Standard Llama 3 fine-tuning won't work; requires custom PyTorch code.
Using Together dedicated GPU:
- Rent H100 for 1 month: $1,677.60
- Training time: 100 hours (custom training loop)
- Total cost: $1,677.60
Using self-managed cloud (RunPod):
- Rent H100 SXM for 100 hours: $2.69 × 100 = $269 (cheaper)
- But setup, debugging, monitoring is engineering time
For teams with DevOps expertise: RunPod wins. For teams without infrastructure experience: Together's $1,677.60/month bundles support, and the engineering time saved can justify the premium.
Improving Fine-Tuning ROI
Strategy 1: Minimal fine-tuning with prompt optimization
Test if prompt engineering alone solves the problem before fine-tuning:
- Prompt iteration cost: $10/month in API calls
- Fine-tuning cost: $25-40 per cycle
If prompt engineering achieves 85% accuracy and fine-tuning achieves 92%, is 7% improvement worth $25? Only if task volume (1M+ inferences/month) makes accuracy ROI clear.
Strategy 2: Fine-tune once, use indefinitely
- Fine-tune cost: $30
- Model lifespan: 6-12 months before domain drift
- Amortized cost: $30 / 12 months = $2.50/month
- API cost savings: if the fine-tuned model reduces output tokens (more concise responses), it also saves $X/month in inference
If fine-tuning reduces output tokens by 20% at 1B output tokens/month ($0.88/1M), it saves ~$176/month and the ROI is immediate.
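The amortization argument can be framed as a payback-period calculation. A sketch with illustrative numbers from this section (a $30 fine-tune and a 20% output-token reduction at 1B output tokens/month):

```python
def payback_months(one_time_cost: float, monthly_savings: float) -> float:
    """Months until a one-time fine-tuning cost is recouped by API savings."""
    return one_time_cost / monthly_savings

# 20% fewer output tokens at 1B (1,000M) output tokens/month, $0.88/1M:
monthly_savings = 0.20 * 1_000 * 0.88          # ≈ $176/month
months = payback_months(30, monthly_savings)   # ≈ 0.17 months
```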
Dedicated Instance Pricing & SLA Options
Together AI recently introduced dedicated instances for mission-critical inference. Details:
Dedicated Instance Tiers:
Starting at $500/month (minimum commitment 3 months) for guaranteed capacity. This sits between on-demand and fully reserved capacity. No public SLA published, but reduces throttling during high load.
For teams with 500M+ tokens/month, dedicated instances are worth evaluating: on-demand Llama 3.3 70B costs $0.88 per 1M tokens (~$0.00088 per 1K tokens), so on-demand spend passes the $500/month dedicated price at roughly 568M tokens/month. Below that, a dedicated instance risks paying for unused capacity.
Hidden Fees and Gotchas
No Explicit Hidden Fees
Together AI is transparent on pricing. Input and output costs are clear. No per-request minimums, no monthly commitments (pure pay-as-you-go).
Implicit Costs
Latency and Quality Trade-offs: Llama 3.3 70B is roughly 3x cheaper on input and 17x cheaper on output than GPT-5.4, but it requires more sophisticated prompting. Engineering time to optimize prompts costs money: if prompt engineering eats 20 hours/month at $100/hr, that's $2,000/month, and the savings from the cheaper API (~$1,620/month at high volume) evaporate.
Output Token Inflation: Some models generate verbose outputs. Llama 3 tends toward longer responses than GPT-5.4 (more tokens = higher cost). Optimize with stop sequences and temperature to limit verbosity.
Model Switching Costs: Migrating from Llama 3 to another model may require prompt retuning. Early lock-in to one provider (Together's ecosystem) creates switching costs.
API Rate Limits: Together does not impose hard rate limits but may throttle requests during high load (shared infrastructure). For mission-critical inference, consider reserved capacity (if available).
Cost Optimization Strategies
1. Right-Size the Model
Start with Phi-3 ($0.05/1M), test on the task, and measure quality. Upgrade to Mistral 7B ($0.15) or Llama 3.1 8B ($0.10) only if the quality gap justifies the cost.
Example: customer support classification (500 input + 50 output tokens per request). Phi-3 achieves 92% accuracy; Llama 3.1 8B achieves 95%. At 100k requests/month (~55M tokens), Phi-3 costs ~$2.75 and Llama 3.1 8B ~$5.50. The upgrade roughly doubles token spend, but at these absolute amounts the real question is whether the 3-point accuracy gain translates to revenue (fewer escalations, better satisfaction).
2. Batch Processing for Cost Efficiency
Real-time inference charges per-request overhead. Batch inference can reduce costs by 10-20% (shared model loading, better GPU utilization).
Toggle between real-time and batch:
- Real-time queries: use small models (Phi-3, Mistral 7B)
- Batch jobs: use larger models (Llama 3 70B) for deep analysis, run overnight
3. Fine-Tune Strategically
A generic Llama 3.3 70B costs $0.88 per 1M tokens. A fine-tuned version might achieve the same quality with 50% fewer output tokens (more concise responses). Fine-tune if:
- Throughput is >100M tokens/month (break-even on optimization effort)
- Task-specific domain requires accuracy improvements
4. Implement Caching
Reuse long contexts across multiple queries. If an analyst runs 50 queries on the same document, load the document once, cache it, reuse.
Together's API supports cache tokens (charged at lower rate, ~90% discount). Caching a 100k-token document for 50 queries saves (0.1M tokens × 50 × 0.9 × $0.88) = $3.96. Minimal but compounds at scale.
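The savings arithmetic above as a small helper (assumes cached tokens are discounted ~90%, per this section; confirm the actual discount on Together's pricing page):

```python
CACHE_DISCOUNT = 0.90  # cached tokens assumed ~90% cheaper (see above)
RATE = 0.88            # USD per 1M tokens, Llama 3.3 70B

def cache_savings(doc_tokens: int, queries: int) -> float:
    """USD saved versus re-sending the full document at full price on
    every query, if cached reads get the discount."""
    return doc_tokens * queries * CACHE_DISCOUNT * RATE / 1_000_000

saved = cache_savings(100_000, 50)  # ≈ $3.96
```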
5. Hybrid Architectures
Use local open-source models for simple tasks, Together API for complex ones:
- Local Phi-3 (0 cost, latency <100ms) for classification
- Together Llama 3 70B for analysis and generation
This hybrid reduces Together API spend by 50% while improving latency on simple tasks.
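A sketch of such a router. `local_phi3` and `together_llama70b` are hypothetical placeholders, not real SDK calls; in practice they would wrap a local inference runtime and the Together API client, respectively:

```python
# Hypothetical placeholders for a local Phi-3 runtime and a Together API
# client; neither name is a real library function.
def local_phi3(prompt: str) -> str:
    return "classify:billing"       # stand-in for on-device inference

def together_llama70b(prompt: str) -> str:
    return "long-form analysis..."  # stand-in for a paid API request

SIMPLE_TASKS = {"classify", "intent", "route"}

def answer(task: str, prompt: str) -> str:
    """Send cheap, simple tasks to the free local model; escalate the
    rest to the larger (paid) hosted model."""
    if task in SIMPLE_TASKS:
        return local_phi3(prompt)
    return together_llama70b(prompt)
```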
FAQ
Can we self-host Llama 3 70B cheaper than Together's API? At roughly 1B tokens/month, the costs converge (self-hosting on A100 costs ~$857/month, Together API at $0.88/M costs ~$880). Below that breakeven, Together is cheaper. Above it, self-hosting wins — but include engineering time to manage infrastructure.
Does Together offer volume discounts? No public volume discount tiers. Contact sales for commitments >$10k/month. This is a disadvantage vs OpenAI (which offers tier discounts) and DeepSeek (which often prices lower for high volume).
Is Llama 3.1 8B cheaper than Llama 3.3 70B? Yes. Llama 3.1 8B costs $0.10/1M tokens while Llama 3.3 70B costs $0.88/1M tokens — nearly 9x cheaper per token, at the cost of model capability. Some providers offer flat-rate pricing (same price regardless of model size); Together does not.
What about RAG (Retrieval-Augmented Generation) costs? RAG involves loading large context (documents). Llama 3.3 70B at $0.88 per 1M tokens makes RAG expensive at scale. Optimize by:
- Using small models (Phi-3) for retrieval and ranking
- Using large models only for final generation
- Implementing query compression to reduce context tokens
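RAG bills are input-heavy: retrieved context usually dwarfs the question and the answer. A rough per-query estimator (the token counts are illustrative assumptions):

```python
RATE_70B = 0.88  # USD per 1M tokens, Llama 3.3 70B

def rag_query_cost(context_tokens: int, question_tokens: int,
                   answer_tokens: int, rate: float = RATE_70B) -> float:
    """Per-query USD cost; in RAG, the retrieved context dominates."""
    return (context_tokens + question_tokens + answer_tokens) * rate / 1_000_000

full = rag_query_cost(8_000, 200, 500)        # ≈ $0.00766 per query
compressed = rag_query_cost(2_000, 200, 500)  # 4x context compression ≈ $0.00238
```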
Does Together have SLA uptime guarantees? No formal SLA published as of March 2026. OpenAI and Anthropic offer 99.9% uptime; Together offers best-effort. For mission-critical systems, plan for occasional downtime.
Can we use Together's API in production? Yes, millions of requests/month are feasible. Latency is comparable to OpenAI (200-500ms average). Reliability is good but not SLA-backed. For critical systems, implement fallback to another provider.
Does fine-tuning cost justify the ROI? Fine-tuning costs $25-40 per cycle. ROI depends on task volume and accuracy improvement. For 1M+ inferences/month, 7% accuracy improvement saves $126/month (fewer retries, better user satisfaction). For small task volumes, prompt engineering may be sufficient. Test ROI with A/B testing ($25/variant) before committing.
What's the breakeven between Together's dedicated instance and on-demand? Dedicated instance: $500/month (minimum 3 months). On-demand costs $0.88/1M tokens (Llama 3.3 70B). At ~568M tokens/month, on-demand costs $500. Above that, dedicated instances are cheaper and eliminate throttling risk. Below that, on-demand is more economical.
Related Resources
- Together AI hosting platform
- OpenAI GPT-5.4 vs Llama 3 pricing comparison
- Anthropic Claude API pricing guide
- DeepSeek API pricing and models
- Browse all LLM providers
Sources
- Together AI pricing documentation: https://www.together.ai/pricing
- Together AI API documentation: https://docs.together.ai/
- Meta Llama 3 model card: https://huggingface.co/meta-llama/Llama-3-70b
- Mistral AI model documentation: https://docs.mistral.ai/
- OpenAI pricing (March 2026): https://openai.com/pricing
- Anthropic Claude pricing: https://www.anthropic.com/pricing