Contents
- AI Cost Calculator: Overview
- LLM API Cost Formula
- GPU Rental Cost Formula
- Training Cost Estimation
- Inference Cost Estimation
- Cost Optimization Examples
- Budget Planning Worksheet
- FAQ
- Related Resources
- Sources
AI Cost Calculator: Overview
Use an AI cost calculator to estimate real spending. Most teams estimate LLM API costs accurately but miss GPU infrastructure entirely, leading to 30-50% budget shortfalls. Costs fall into two buckets: LLM APIs (ChatGPT, Claude) and GPU rental (training, inference, batch processing). Estimate input tokens, output tokens, and GPU-hours; multiply each by its rate; add it up.
LLM API Cost Formula
Basic Calculation
Cost = (Input Tokens × Input Rate + Output Tokens × Output Rate) / 1,000,000
Pull input and output rates from the provider's pricing page.
Example: ChatGPT API
Using GPT-4o Mini on a customer support task.
Inputs:
- Customer message: 200 tokens
- System prompt: 150 tokens
- Context (past messages): 100 tokens
- Total input: 450 tokens
Outputs:
- Response: 200 tokens
Pricing (as of March 2026):
- Input: $0.15 per 1M tokens
- Output: $0.60 per 1M tokens
Cost = (450 × $0.15 + 200 × $0.60) / 1,000,000 = $0.0001875
Per interaction: ~$0.0002 (one-fiftieth of a cent).
Monthly at 1,000 interactions: $0.19 (negligible). Monthly at 1M interactions: $187.50.
Scaling: Multi-Step Workloads
Not all tasks are single-turn. A reasoning workflow might involve:
- Classify query (200 input + 50 output tokens)
- Retrieve context (500 input + 100 output)
- Generate answer (600 input + 400 output)
Total: 1,300 input + 550 output tokens per workflow.
Cost = (1,300 × $0.15 + 550 × $0.60) / 1,000,000 = $0.000525
At 100,000 workflows/month: $52.50.
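The formula above can be wrapped in a small helper. A minimal sketch (the function and variable names are our own; rates are the GPT-4o Mini figures quoted above):

```python
def llm_cost(input_tokens: int, output_tokens: int,
             input_rate: float, output_rate: float) -> float:
    """Cost in dollars; rates are dollars per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Single support interaction
per_call = llm_cost(450, 200, 0.15, 0.60)

# Three-step workflow: classify, retrieve, generate
steps = [(200, 50), (500, 100), (600, 400)]
per_workflow = sum(llm_cost(i, o, 0.15, 0.60) for i, o in steps)
monthly = per_workflow * 100_000  # cost at 100,000 workflows/month
```

The same helper scales to any model tier: swap in the provider's current per-1M-token rates.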
GPU Rental Cost Formula
Hourly Cost Calculation
Cost = GPU Price/Hour × Hours Rented
GPU prices vary significantly by provider, GPU model, and pricing model (on-demand vs spot).
On-Demand Pricing (as of March 2026):
- RunPod A100 PCIe: $1.19/hour
- RunPod H100 SXM: $2.69/hour
- Lambda H100 SXM: $3.78/hour
- CoreWeave H200 8x cluster: $50.44/hour (=$6.31 per GPU)
Spot Pricing (50-65% discounts):
- RunPod A100 spot: $0.42/hour
- RunPod H100 spot: $1.29/hour
Example 1: Fine-Tuning on RunPod (On-Demand)
Task: Fine-tune Mistral 7B on 100K customer examples.
Hardware: 1x RunPod A100 PCIe
- Estimated time: 8 hours
- Hourly cost: $1.19
- Total cost: 8 × $1.19 = $9.52
Scaling to 10 fine-tuning jobs/month: $95.20/month
Optimization: Use spot instances instead.
- Spot cost: 8 × $0.42 = $3.36 (65% savings)
- Account for 1 interruption: assume job restarts once, add 4 hours = 12 total hours
- New cost: 12 × $0.42 = $5.04
Monthly savings with spot: $95.20 (on-demand) vs $50.40 (spot + interruption overhead) = $44.80 saved per month (47% reduction).
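A quick way to sanity-check spot economics, including restart overhead. A sketch using the example's assumptions (one interruption adding four hours is an assumption, not a guarantee):

```python
def spot_vs_ondemand(job_hours: float, ondemand_rate: float, spot_rate: float,
                     expected_restarts: int = 1,
                     restart_overhead_hours: float = 4.0):
    """Compare one job's cost on-demand vs spot with interruption overhead."""
    ondemand = job_hours * ondemand_rate
    spot = (job_hours + expected_restarts * restart_overhead_hours) * spot_rate
    return ondemand, spot

od, sp = spot_vs_ondemand(8, 1.19, 0.42)  # fine-tuning job from above
monthly_savings = 10 * (od - sp)          # across 10 jobs/month
```

If spot interruptions are frequent on your provider, raise `expected_restarts` until the spot figure stops looking attractive; that is the honest comparison.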
Example 2: Cluster Costs (Multi-GPU)
Task: Pre-train a 70B model from scratch.
Hardware: 16x H100 SXM cluster (RunPod, on-demand)
- Per-GPU cost: $2.69/hour
- Cluster cost: 16 × $2.69/hour = $43.04/hour
- Estimated training time: 14 days (336 hours, continuous)
- Total compute cost: 336 × $43.04 = $14,461
Total AI Cost Breakdown:
- GPU rental: $14,461
- Data storage (egress): ~$200 (1TB downloaded, $0.02/GB)
- Monitoring and logging: ~$50 (CloudWatch, custom dashboards)
- Software licenses: $0 (open-source toolchain)
- Total: ~$14,711
Cost Optimization: Use on-demand for first 100 hours (get bugs out), then spot for remaining 236 hours.
- First 100 hours on-demand: 100 × $43.04 = $4,304
- Remaining 236 hours on spot at $1.29/GPU-hour: 236 × (16 × $1.29) = $4,871
- Total: ~$9,175, a 37% reduction versus all on-demand; budget extra hours for spot interruptions and restarts
Alternative: Use an A100 cluster
- 32x A100 SXM on-demand: 32 × $1.39/hour = $44.48/hour
- Training time: 21 days (504 hours) at 2/3 the throughput
- Total cost: 504 × $44.48 = $22,417
Paradox: A100 cluster is more expensive in total ($22k vs $14.5k) because it needs 2x the GPUs for longer training time. H100's higher hourly cost is offset by faster throughput. Choose H100 if time-to-training matters (product deadline). Choose A100 if cost is the constraint.
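The H100-vs-A100 trade-off reduces to one multiplication per configuration. A sketch for comparing any two cluster setups (names are illustrative):

```python
def cluster_run_cost(gpus: int, rate_per_gpu_hour: float, days: float) -> float:
    """Total rental cost for a continuous multi-GPU run."""
    return gpus * rate_per_gpu_hour * days * 24

h100_total = cluster_run_cost(16, 2.69, 14)  # faster chips, shorter run
a100_total = cluster_run_cost(32, 1.39, 21)  # cheaper per hour, more GPUs, longer run
# The faster cluster wins on total cost despite the higher hourly rate.
```

When comparing hardware, always compare total run cost (GPUs × rate × hours), never hourly rate alone.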
Training Cost Estimation
Parameter Count to Training Time Estimate
A standard approximation: training compute ≈ 6 × parameters × training tokens (FLOPs). Combined with a realistic utilization figure (~40% of peak), this is a useful heuristic for cost planning.
70B parameters on 1T tokens:
- Compute: 6 × (70×10⁹) × 10¹² ≈ 4.2×10²³ FLOPs
- Effective A100 throughput at ~40% utilization: ~1.25×10¹⁴ FLOP/s per GPU
- GPU-hours: 4.2×10²³ / 1.25×10¹⁴ / 3,600 ≈ 930,000
- On-demand cost at $1.39/GPU-hour: ~$1.3M (a 256x A100 cluster would run for roughly five months)
Validation against public benchmarks:
- DeepSeek-V3 (671B total parameters, mixture-of-experts) reported 2.788M H800 GPU-hours for pre-training on 14.8T tokens
- A dense 70B model on 1T tokens at ~1M A100 GPU-hours is consistent in order of magnitude once MoE sparsity and the much larger token count are accounted for
Industry reference: Large language model training costs have ranged from $100k (13B models) to $50M+ (500B+ models) since 2023. A 70B model typically costs $1-5M in cloud GPU rental. The exact cost depends on hardware choices (A100 vs H100 vs H200), cluster size, and optimization.
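The compute-based estimate fits in a short helper, assuming the common 6 × N × D FLOPs approximation and ~40% utilization (both are planning assumptions, not measurements; names are our own):

```python
def training_cost_estimate(params: float, tokens: float,
                           peak_flops_per_gpu: float, utilization: float,
                           rate_per_gpu_hour: float):
    """Rough pre-training cost from model size, token count, and hardware."""
    total_flops = 6 * params * tokens             # 6·N·D approximation
    effective = peak_flops_per_gpu * utilization  # sustained FLOP/s per GPU
    gpu_hours = total_flops / effective / 3600
    return gpu_hours, gpu_hours * rate_per_gpu_hour

# 70B dense model, 1T tokens, A100 (312 TFLOPS BF16 peak) at 40% MFU, $1.39/hour
hours, cost = training_cost_estimate(70e9, 1e12, 312e12, 0.40, 1.39)
```

Swapping in H100 peak FLOPS and its hourly rate gives the cross-hardware comparison directly.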
Fine-Tuning Cost (Lower Bound)
Fine-tuning a 7B model on 100K examples:
Time estimate: 8-12 hours on single A100. Cost: 10 hours × $1.19 = $11.90
Scaling to 50 models/month: $595.
Mixed Precision Training (Cost Reduction Strategy)
Using FP8 or INT8 precision instead of FP32 reduces memory usage and training time. This is critical for cost optimization at scale.
Precision Trade-offs:
- FP32 (full precision): 4 bytes/parameter, slowest, best accuracy
- TF32 (Tensor Float): 4 bytes/parameter, faster, negligible accuracy loss
- FP16 (half precision): 2 bytes/parameter, 2x faster, 1-2% accuracy loss
- FP8 (8-bit): 1 byte/parameter, 4x faster, 2-5% accuracy loss (requires calibration)
- INT8 (8-bit integer): 1 byte/parameter, 4x faster, requires post-training quantization
Cost-Accuracy Frontier:
Illustrative comparison (single H100 at $1.99/hour; the 10-hour FP32 baseline is for relative comparison, not a full 13B pre-train):
| Precision | Time | Cost @ H100 | Accuracy | Choice |
|---|---|---|---|---|
| FP32 | 10 hours | $19.90 | 100% | Baseline |
| TF32 | 8 hours | $15.92 | 99.8% | Recommended (no trade-off) |
| FP16 | 5 hours | $9.95 | 98.5% | Good for cost-sensitive |
| FP8 | 2.5 hours | $4.98 | 96.5% | Risky for accuracy-critical |
Real-world impact: Training a 70B model on 1T tokens with TF32 instead of FP32:
- Savings: ~20% compute time, which translates directly into ~20% off the GPU rental bill
- Accuracy: negligible difference (<0.2%)
TF32 is the sweet spot. FP8 is only acceptable for inference or when fine-tuning (not pre-training).
Inference Cost Estimation
Example 1: Batch Inference (High Throughput)
Task: Process 1M customer reviews with sentiment labels (positive/negative).
Setup A: Self-hosted on RunPod L40S
Hardware: RunPod L40S (48GB VRAM, $0.79/hour)
- Model: Mistral 7B quantized (fits in 48GB easily)
- Batch size: 32 reviews at a time
- Throughput: ~1,000 samples/hour (a conservative ~3.6 seconds per review)
- Total time: 1,000 hours
- GPU cost: 1,000 × $0.79 = $790
- Additional costs:
- Data storage (1M reviews @ ~500 bytes each ≈ 500MB): a few cents per month on S3, negligible
- Monitoring and logging: $20
- Total: ~$810
Setup B: LLM API (Claude Haiku)
- Input: 1M reviews at ~200 tokens each = 200M input tokens
- Output: 1 token per review (sentiment label) = 1M output tokens
- Cost: (200M × $1 + 1M × $5) / 1,000,000 = $200 + $5 = $205
- Latency: 100-200ms per review (acceptable for batch)
- Total: $205
LLM API Advantage: ~4x cheaper ($205 vs ~$810), no infrastructure to manage, immediate results.
Self-hosted Advantage: At the conservative throughput assumed here, the API wins at any volume. Self-hosting becomes economical only if you raise GPU utilization substantially (larger batches, an optimized serving stack) or if data cannot leave your infrastructure.
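For planning, the two setups reduce to a pair of formulas. A sketch assuming the conservative 1,000 samples/hour figure from the example (an optimized serving stack would raise it considerably):

```python
def batch_cost_selfhost(samples: int, samples_per_hour: float,
                        gpu_rate: float) -> float:
    """GPU rental cost to process a batch at a given throughput."""
    return samples / samples_per_hour * gpu_rate

def batch_cost_api(samples: int, in_tokens: int, out_tokens: int,
                   in_rate: float, out_rate: float) -> float:
    """API cost for the same batch; rates are dollars per 1M tokens."""
    return samples * (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000

gpu = batch_cost_selfhost(1_000_000, 1_000, 0.79)
api = batch_cost_api(1_000_000, 200, 1, 1.00, 5.00)
# Throughput at which self-hosting matches the API bill:
breakeven_throughput = 1_000_000 * 0.79 / api  # samples/hour
```

If your measured throughput beats `breakeven_throughput`, the GPU wins; otherwise the API does.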
Example 2: Real-Time Inference (Low Latency Chat API)
Task: Serve a chat API to 100 concurrent users in production.
Requirement: <1 second response latency (P95), 99.9% uptime.
Setup A: Self-hosted on RunPod GPUs
Hardware: 2x RunPod H100 PCIe ($1.99/hour each)
- Cost per hour: 2 × $1.99 = $3.98
- Monthly cost (24/7 operation): $3.98 × 730 = $2,905
- Additional costs:
- Data egress: 50,000 responses/month × ~500 bytes ≈ 25MB, negligible
- Monitoring + alert system: $50/month
- Total monthly: ~$2,955
Setup B: LLM API alternative (GPT-4o Mini)
- Assume 500 messages/user/month = 50,000 messages
- Average 250 input + 150 output tokens per message
- Token cost: (50,000 × (250 × $0.15 + 150 × $0.60)) / 1,000,000 = $6.38
- Latency: 200-500ms (meets requirement)
- Uptime: 99.9%+ (provider SLA)
- Total monthly: $6.38
Comparison: The LLM API is roughly 460x cheaper than self-hosted GPUs ($6.38 vs ~$2,955). At these budget-tier rates, break-even sits in the billions of tokens per month. Below that, the LLM API wins decisively. Above it, self-hosted becomes economical but operationally complex.
Real-world decision: Most teams under 1000 concurrent users use LLM APIs. The operational overhead of self-hosting (on-call engineering, infrastructure scaling, compliance) outweighs the cost savings.
Cost Optimization Examples
Example 1: Classification Pipeline
Task: Classify 10M customer emails weekly (10M per week = 1.43M/day).
Option A: Fine-tuned GPT-4o Mini (LLM API)
Fine-tuning cost: $2,000 (one-time). Inference cost: Per email, 500 input tokens + 10 output tokens.
Cost: (10M × (500 × $0.15 + 10 × $0.60)) / 1,000,000 = $810/week
Annual: ~$42,000
Option B: Local Mistral 7B on GPU rental
Fine-tuning: 20 hours on A100 = $23.80. Inference: RunPod L40S at 1K samples/hour means 10,000 GPU-hours/week, i.e. roughly 60 GPUs running in parallel.
Cost: 10,000 × $0.79 = $7,900/week
Annual: ~$411,000
Option C: Batch API (50% discount)
Same as Option A but use OpenAI's Batch API for a 50% discount.
Cost: $810 × 0.5 = $405/week
Annual: ~$21,000 (half of A, a fraction of B)
Recommendation: Batch API if latency tolerance is 12-24 hours. Otherwise, the fine-tuned LLM API; at this conservative throughput, GPU rental is not competitive.
Example 2: Multi-Stage RAG System
Task: 100,000 document queries/month with context retrieval.
Stage 1: Embedding + retrieval (1 embedding, 10 documents retrieved). Stage 2: Summarization of retrieved docs. Stage 3: Answer generation.
Token Count Per Query:
- Stage 1 (embedding): 2,000 tokens input, 1,500 output (summaries).
- Stage 2 (summarization): 5,000 tokens input, 300 output.
- Stage 3 (generation): 6,000 tokens input, 400 output.
- Total: 13,000 input + 2,200 output tokens/query.
Cost Calculation (Claude Haiku at $1/$5):
Cost per query: (13,000 × $1 + 2,200 × $5) / 1,000,000 = $0.024
100,000 queries: 100,000 × $0.024 = $2,400/month
Optimization: Caching
Assume 20% of queries share the same 5,000 tokens of context. Cached tokens cost 90% less.
Savings per cached query: 5,000 × $1 × 0.9 / 1,000,000 = $0.0045
At 20% of 100,000 = 20,000 cached queries: 20,000 × $0.0045 = $90 savings.
New monthly cost: $2,400 - $90 = $2,310 (3.75% savings).
Second optimization: Output length
Reduce answer length from 400 to 100 tokens, saving 300 output tokens per query in Stage 3.
Original Stage 3 cost: (6,000 × $1 + 400 × $5) / 1,000,000 = $0.008
New Stage 3 cost: (6,000 × $1 + 100 × $5) / 1,000,000 = $0.0065
Per query savings: $0.0015
At 100,000 queries: $150 savings.
New monthly cost: $2,310 - $150 = $2,160 (10% total reduction).
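The three-stage arithmetic, including both optimizations, can be recomputed in a few lines (Haiku's $1/$5 rates assumed; helper names are our own):

```python
def stage_cost(inp: int, out: int, in_rate: float = 1.0,
               out_rate: float = 5.0) -> float:
    """Per-query cost of one pipeline stage; rates are dollars per 1M tokens."""
    return (inp * in_rate + out * out_rate) / 1_000_000

stages = [(2_000, 1_500), (5_000, 300), (6_000, 400)]  # embed, summarize, generate
per_query = sum(stage_cost(i, o) for i, o in stages)
base_monthly = per_query * 100_000

# Optimization 1: 20% of queries hit a 5,000-token cached prefix (90% cheaper)
cache_savings = 0.20 * 100_000 * (5_000 * 1.0 * 0.9) / 1_000_000

# Optimization 2: trim answers from 400 to 100 output tokens in Stage 3
trim_savings = 100_000 * (300 * 5.0) / 1_000_000

optimized_monthly = base_monthly - cache_savings - trim_savings
```

Structuring the pipeline as a list of (input, output) pairs makes it easy to test further optimizations (shorter summaries, fewer retrieved documents) before committing to them.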
Budget Planning Worksheet
Step 1: Estimate Monthly Workload
| Task | Volume | Input Tokens/Task | Output Tokens/Task |
|---|---|---|---|
| Chat | 50,000 | 500 | 200 |
| Classification | 1,000,000 | 300 | 20 |
| Summarization | 100,000 | 2,000 | 500 |
| Code Generation | 10,000 | 1,500 | 800 |
Step 2: Calculate Total Tokens
Chat: 50,000 × (500 input + 200 output) = 25M input, 10M output
Classification: 1,000,000 × (300 + 20) = 300M input, 20M output
Summarization: 100,000 × (2,000 + 500) = 200M input, 50M output
Code Gen: 10,000 × (1,500 + 800) = 15M input, 8M output
Total: 540M input tokens, 88M output tokens
Step 3: Select Models by Task Tier
- Chat: GPT-4o Mini ($0.15/$0.60)
- Classification: Gemini 1.5 Flash ($0.075/$0.30)
- Summarization: Claude Sonnet 4.6 ($3/$15)
- Code Gen: GPT-4o ($2.50/$10)
Step 4: Calculate Cost Per Task
Chat: 25M × $0.15/1M + 10M × $0.60/1M = $9.75
Classification: 300M × $0.075/1M + 20M × $0.30/1M = $28.50
Summarization: 200M × $3/1M + 50M × $15/1M = $1,350
Code Gen: 15M × $2.50/1M + 8M × $10/1M = $117.50
Total Monthly: $1,505.75
Summarization dominates because it uses expensive Sonnet. Downgrade to Claude Haiku ($1/$5):
Summarization: 200M × $1/1M + 50M × $5/1M = $450
Revised Total: $605.75/month
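The worksheet steps can be automated. A sketch using the volumes, token counts, and rates above, with summarization already downgraded to the Haiku tier:

```python
# Per task: (volume, input_tokens, output_tokens, input_rate, output_rate)
# Rates are dollars per 1M tokens.
workload = {
    "chat":           (50_000,    500,   200, 0.15,  0.60),
    "classification": (1_000_000, 300,   20,  0.075, 0.30),
    "summarization":  (100_000,   2_000, 500, 1.00,  5.00),  # Haiku tier
    "code_gen":       (10_000,    1_500, 800, 2.50,  10.00),
}

def task_cost(volume, inp, out, in_rate, out_rate):
    """Monthly cost of one task line in the worksheet."""
    return volume * (inp * in_rate + out * out_rate) / 1_000_000

total_monthly = sum(task_cost(*spec) for spec in workload.values())
```

Keeping the workload as data rather than a spreadsheet makes model-swap experiments (e.g., trying Sonnet-tier rates for summarization) a one-line change.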
FAQ
How accurate is a cost estimate?
Within 20% if you have real data (actual token counts). If guessing, expect 30-50% variance. Token count is the biggest unknown. Real-world input varies by task (short emails vs long documents).
Should I account for errors and retries?
Yes. If a model fails 5% of the time and you retry, budget 5% extra. If context lengths vary, budget worst-case (longest documents).
What about GPU overhead costs (storage, networking)?
GPU rental typically includes storage. Network ingress is usually free; egress costs $0.02-0.10 per GB. For a 70B model output at 100 tokens/sec for 1 hour: ~10MB, negligible cost.
Add 5-10% for monitoring, logging, and operational overhead.
Is buying GPUs cheaper than renting?
For continuous workloads, yes. A100 80GB costs $10,000-15,000 new; at $1.19/hr rental, break-even is 8,400-12,600 hours (11-17 months of continuous use). If running 24/7 for 18+ months, buy. If sporadic or under 18 months, rent.
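The rent-vs-buy arithmetic in one helper (730 hours/month, purchase price only; power, cooling, and hosting would lengthen the true break-even):

```python
def rent_vs_buy_breakeven_months(purchase_price: float, hourly_rate: float,
                                 hours_per_month: float = 730) -> float:
    """Months of continuous rental that equal the GPU purchase price."""
    return purchase_price / hourly_rate / hours_per_month

low = rent_vs_buy_breakeven_months(10_000, 1.19)   # cheap end of A100 pricing
high = rent_vs_buy_breakeven_months(15_000, 1.19)  # expensive end
```

If the GPU will sit idle part of the time, divide your expected utilization into `hours_per_month` before comparing; sporadic use pushes break-even out fast.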
How do I forecast cost growth as volume scales?
Cost is linear with volume. If 100k queries cost $2,310, then 1M queries cost $23,100. The only exception: volume discounts (Anthropic at $1M+/month) break linearity.
Model 10x volume growth, then negotiate discounts at 3x and 8x volume targets.
What if models change pricing?
Impossible to predict. OpenAI has dropped prices 10x since GPT-3. Build cost estimates with a 20% variance buffer. Review quarterly.
Related Resources
- GPU Pricing Comparison
- Spot GPU Pricing Guide
- GPU Cloud Cost Comparison
- LLM Token Cost Comparison
- OpenAI API Pricing