Contents
- Overview
- Azure OpenAI Pricing Models
- PTU Costs
- Pay-as-You-Go Costs
- PTU vs On-Demand
- Data Residency
- Cost Optimization
- Real-World Examples
- FAQ
- Related Resources
- Sources
Overview
Azure OpenAI has two pricing modes: PTU (Provisioned Throughput Units) and pay-as-you-go.
PTU requires an upfront commitment but cuts per-token cost by 30-50% at high volume. Pay-as-you-go has zero commitment and suits variable loads. Both offer the same models as the OpenAI API (GPT-4 Turbo, GPT-4o, GPT-4.1).
Production workloads with stable throughput? PTU wins. Experimental or bursty? Pay-as-you-go. Azure's per-token rates match the OpenAI API (regional variation is typically within ±5%), but compliance, data residency, and bulk pricing make it the default choice for regulated industries.
Azure OpenAI Pricing Models
Pay-as-You-Go (On-Demand)
Same as OpenAI API pricing but hosted on Azure infrastructure.
| Model | Input $/M | Output $/M | Context | Max Output |
|---|---|---|---|---|
| GPT-4 Turbo | $10.00 | $30.00 | 128K | 4K |
| GPT-4o | $2.50 | $10.00 | 128K | 16K |
| GPT-4o Mini | $0.15 | $0.60 | 128K | 16K |
| GPT-4.1 | $2.00 | $8.00 | 1.05M | 32K |
| GPT-4.1 Mini | $0.40 | $1.60 | 1.05M | 32K |
No minimums. Pay per token (prompt and completion metered separately).
Monthly cost example: 100M tokens (60M prompt, 40M completion).
- GPT-4o: (60 × $2.50) + (40 × $10.00) = $150 + $400 = $550/month
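The arithmetic above can be wrapped in a small helper. A sketch using the rates from the table; these are this article's figures, not an official price list:

```python
# Assumed per-million-token rates from the table above (this article's figures,
# not an official price list).
RATES = {  # model: (input $/M, output $/M)
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4.1": (2.00, 8.00),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Pay-as-you-go cost in dollars: prompt and completion metered separately."""
    input_rate, output_rate = RATES[model]
    return input_tokens / 1e6 * input_rate + output_tokens / 1e6 * output_rate

# 100M tokens split 60M prompt / 40M completion on GPT-4o:
print(monthly_cost("gpt-4o", 60_000_000, 40_000_000))  # 550.0
```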
Provisioned Throughput Units (PTU)
Reserve capacity upfront. Lower per-token rate in exchange for minimum commitment.
PTU: Unit of throughput. 1 PTU roughly = 1,000 tokens per minute (TPM).
| Model | PTU Cost/Hour | Throughput | Monthly Cost (730 hrs) |
|---|---|---|---|
| GPT-4 Turbo | $100 | 1M TPM | $73,000 |
| GPT-4o | $40 | 1M TPM | $29,200 |
| GPT-4o Mini | $10 | 1M TPM | $7,300 |
| GPT-4.1 | $80 | 1M TPM | $58,400 |
| GPT-4.1 Mini | $20 | 1M TPM | $14,600 |
PTU pricing is per hour, not per token: a flat rate regardless of actual usage. Reserve only what you need; unused PTU capacity is wasted spend.
Minimum commitment: 1 hour (a deployment can be deleted after the first hour, though production deployments rarely are).
PTU Costs
Understanding Throughput
1 PTU = ~1,000 tokens per minute (TPM) sustained throughput.
Effective throughput depends on the workload profile:
- Batch inference (summarization, classification): high throughput, variable latency
- Interactive chat (low batch size): lower throughput, strict latency SLA
Throughput calculation:
- Tokens/minute = average request size × requests/minute
- Example: 500 token requests, 100 requests/minute = 50,000 TPM = 50 PTUs
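The sizing calculation above is a one-liner. A minimal sketch, assuming the article's 1 PTU ≈ 1,000 TPM rule of thumb:

```python
import math

TPM_PER_PTU = 1_000  # article's rule of thumb: 1 PTU ≈ 1,000 tokens/minute

def ptus_needed(avg_tokens_per_request: int, requests_per_minute: int) -> int:
    """Round required sustained throughput up to whole PTUs."""
    tpm = avg_tokens_per_request * requests_per_minute
    return max(1, math.ceil(tpm / TPM_PER_PTU))  # deployments are at least 1 PTU

print(ptus_needed(500, 100))  # 50
```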
Cost Structure
PTU pricing includes:
- Base hourly rate (varies by model)
- No per-token overage (use as much as reserved capacity allows)
- Quota increase (scale to higher TPM by adding PTUs)
Once you exceed the PTU reservation, requests are throttled (delayed) rather than rejected. You can add PTUs at any time to raise capacity, without downtime.
PTU Sizing
| Use Case | Throughput | PTUs Needed | Cost/Month |
|---|---|---|---|
| Low (one user, async) | 100 TPM | 1 (minimum) | $29,200 |
| Medium (100 users, async) | 10K TPM | 10 | $292,000 |
| High (1000 users, real-time) | 500K TPM | 500 | $14.6M |
| Extreme (multi-region, massive) | 2M TPM | 2000 | $58.4M |
PTU costs scale linearly with reserved throughput. Whether PTU beats pay-as-you-go depends on utilization: the flat rate wins only when the reserved capacity stays busy (see the break-even analysis below).
Pay-as-You-Go Costs
Per-Token Pricing
Based on OpenAI API rates, with minor regional variation (typically within ±5%).
Example Workload: Document Classification (550M tokens/month)
Task: Classify 1M documents (500 token input, 50 token output).
- Inputs: 1M × 500 = 500M tokens
- Outputs: 1M × 50 = 50M tokens
Cost on Azure OpenAI (GPT-4o):
- Input: 500M × $2.50/M = $1,250
- Output: 50M × $10.00/M = $500
- Total: $1,750/month
Cost on OpenAI API (GPT-4o):
- Same rate
- Total: $1,750/month
Azure and the OpenAI API cost the same for pay-as-you-go. The Azure advantage lies in PTU and compliance features.
Overages and Limits
No overage charges, but Azure enforces rate limits (TPM quotas). If you exceed the quota, requests are throttled:
- Default quota: 40,000 TPM
- On exceeding it: requests are queued, delayed, or rejected
Raise the quota through an Azure support request, or move to PTU for reserved capacity.
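The standard way to handle this throttling client-side is retry with exponential backoff. A sketch; `RateLimited` is a hypothetical stand-in for whatever 429 error your client library raises:

```python
import random
import time

class RateLimited(Exception):
    """Hypothetical stand-in for the 429/throttling error your client raises."""

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry a zero-argument callable with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited:
            # Doubling delay spreads retries out; jitter avoids thundering herds.
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
    return call()  # final attempt; let the exception propagate
```

Wrap your request function in a lambda or `functools.partial` and pass it as `call`.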
PTU vs On-Demand
Break-Even Analysis
PTU becomes cheaper once monthly token usage × the on-demand rate exceeds the PTU cost.
Example: GPT-4o, using the 60/40 input/output split from earlier (a blended rate of ~$5.50/M tokens) against the $29,200/month PTU figure above.
| Monthly Tokens | On-Demand Cost | PTU Cost ($40/hr) | Winner |
|---|---|---|---|
| 10M | $55 | $29,200 | On-Demand |
| 100M | $550 | $29,200 | On-Demand |
| 1B | $5,500 | $29,200 | On-Demand |
| 2B | $11,000 | $29,200 | On-Demand |
| 5B | $27,500 | $29,200 | On-Demand (near break-even) |
| 10B | $55,000 | $29,200 | PTU |
PTU break-even: ~5.3B tokens/month for GPT-4o at this blend ($29,200 ÷ $5.50/M).
For teams processing roughly 5B+ tokens monthly, PTU saves money. Below that, on-demand is cheaper.
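The break-even point can be computed directly. A sketch assuming the $29,200/month PTU figure and the blended GPT-4o rate implied by a 60/40 input/output split at $2.50/$10.00 per M tokens:

```python
PTU_MONTHLY = 29_200.0   # article's GPT-4o figure: $40/hr × 730 hrs
BLENDED_RATE = 5.50      # $/M tokens: 0.6 × $2.50 + 0.4 × $10.00

def breakeven_tokens_billions() -> float:
    """Monthly volume (billions of tokens) where PTU matches on-demand."""
    return PTU_MONTHLY / BLENDED_RATE / 1_000  # $ ÷ $/M → M tokens → B tokens

def cheaper_option(monthly_tokens_millions: float) -> str:
    """Which mode costs less at a given monthly volume (in millions of tokens)."""
    on_demand = monthly_tokens_millions * BLENDED_RATE
    return "PTU" if on_demand > PTU_MONTHLY else "On-Demand"

print(round(breakeven_tokens_billions(), 2))  # 5.31
print(cheaper_option(10_000))                 # 10B tokens/month -> PTU
```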
Workload Types
On-Demand Best For:
- R&D and experimentation (unpredictable volume)
- Bursty loads (high traffic for 1-2 hours, then idle)
- Low-volume production (<500M tokens/month)
- Cost-sensitive early-stage startups
- One-off projects
PTU Best For:
- Predictable, high-volume production (roughly 5B+ tokens/month)
- SaaS platforms serving thousands of users
- Content generation at scale
- 24/7 always-on services
- Teams with forecasted budget
Switching Between Models
PTU reservations are per-model. If you need both GPT-4o and GPT-4o Mini:
- Separate PTUs: One for GPT-4o ($40/hr), one for GPT-4o Mini ($10/hr)
- Total cost: $50/hr
Alternatively, use on-demand for one and PTU for the other (hybrid).
Data Residency
Azure Advantage
Azure OpenAI lets you pin data storage and processing to specific regions, unlike the OpenAI API's US-centric default.
Availability by Region (as of March 2026):
- East US, West US: United States
- West Europe: European Union (GDPR-compliant)
- Southeast Asia: Singapore
- Japan East: Japan
- UK South: United Kingdom
Data stays in the specified region. OpenAI API defaults to US, with limited options.
Compliance Benefits
GDPR (EU): If processing EU resident data, Azure EU regions are preferred (Microsoft guarantees compliance).
HIPAA (Healthcare): Azure OpenAI supports HIPAA business associate agreements (BAA). OpenAI API does not.
SOC 2 Type II: Azure certified. OpenAI API has SOC 2, but Azure adds stronger guarantees.
FedRAMP (US Government): Azure Government Cloud available. OpenAI API not available for government use.
Cost Impact of Residency
Regional pricing is similar (within 5%). EU regions may cost slightly more.
| Region | GPT-4o Input $/M | GPT-4o Output $/M |
|---|---|---|
| East US | $2.50 | $10.00 |
| West Europe | $2.70 | $10.80 |
| Southeast Asia | $2.50 | $10.00 |
The compliance/residency benefit is worth the cost for regulated industries (healthcare, finance, government).
Cost Optimization
Strategy 1: Right-Size PTU Reservation
Avoid over-provisioning. Monitor actual usage (Azure Portal, PTU utilization metrics). Right-size quarterly.
Current: 50 PTUs provisioned
Actual peak usage: 30 PTUs (60% utilization)
Waste: 20 PTUs × $40/hr × 730 hrs = $584,000/month
Action: Reduce to 35 PTUs (peak plus headroom), saving 15 PTUs ≈ $438,000/month
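In code, the utilization check is straightforward. A sketch using the article's assumed $40/PTU-hour GPT-4o rate:

```python
PTU_HOURLY = 40.0        # article's assumed GPT-4o per-PTU rate
HOURS_PER_MONTH = 730

def monthly_waste(provisioned_ptus: int, peak_used_ptus: int) -> float:
    """Dollars spent each month on PTUs that never serve traffic."""
    idle = max(provisioned_ptus - peak_used_ptus, 0)
    return idle * PTU_HOURLY * HOURS_PER_MONTH

print(monthly_waste(50, 30))  # 584000.0
```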
Strategy 2: Use Cheaper Models When Possible
GPT-4o Mini is over 90% cheaper than GPT-4 Turbo or GPT-4o on most tasks. Benchmark:
- Classification, summarization, simple Q&A: Use Mini. Save 90% cost.
- Math, complex reasoning, creative writing: Use Turbo or GPT-4.1. Worth the cost.
Example: Classify 1M documents.
- Using GPT-4 Turbo: 500M inputs × $10/M + 50M outputs × $30/M = $5,000 + $1,500 = $6,500
- Using GPT-4o Mini: 500M inputs × $0.15/M + 50M outputs × $0.60/M = $75 + $30 = $105
Mini saves $6,395/month (~98% cheaper).
Strategy 3: Batch Processing
Azure OpenAI Batch API (preview) offers 50% discounts for non-real-time processing.
- Submit: 1M token requests in batch
- Wait: Up to 24 hours for results
- Save: 50% per token
Example cost: 100M GPT-4o input tokens × $2.50/M × 50% = $125 (vs $250 on-demand).
Caveat: Only for workloads that can tolerate 24-hour latency (reports, analysis, not chat).
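The discount math, as a sketch (per-million rates come from the tables above; the 50% figure is the preview discount quoted in this section):

```python
BATCH_DISCOUNT = 0.5  # preview Batch API discount quoted above

def batch_cost(input_m_tokens: float, output_m_tokens: float,
               input_rate: float, output_rate: float) -> float:
    """On-demand cost (rates in $/M tokens) times the batch discount."""
    full = input_m_tokens * input_rate + output_m_tokens * output_rate
    return full * BATCH_DISCOUNT

# 100M GPT-4o input tokens: $250 on-demand -> $125 in batch
print(batch_cost(100, 0, 2.50, 10.00))  # 125.0
```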
Strategy 4: Hybrid Model Mix
Use different models for different tasks:
| Task | Model | Cost |
|---|---|---|
| Chat | GPT-4o Mini | $0.15/$0.60 per M tokens |
| Classification | Mini | $0.15/$0.60 |
| Creative writing | GPT-4o | $2.50/$10.00 |
| Math/reasoning | GPT-4.1 | $2.00/$8.00 |
Most workloads can use Mini; only 10-20% of requests need premium models. Overall cost lands 60-70% below running everything on a premium model.
Strategy 5: Caching (Preview)
Azure OpenAI Prompt Caching reduces costs for repeated prompts.
- First request: Full cost
- Subsequent requests (same prompt prefix): 90% discount on cached portion
Example: System prompt (1K tokens) + user prompt (100 tokens).
- First request: 1100 tokens @ normal rate
- Requests 2-100: 100 tokens @ normal rate + 1K tokens @ 10% cost (cached)
Savings compound for repetitive workloads (summarization, classification with fixed instructions).
Real-World Examples
Example 1: SaaS Content Generation Platform
A writing assistant generating summaries and rewrites for 10K daily active users.
Load:
- 10K users × 5 requests/day × 200 tokens avg = 10M tokens/day = 300M tokens/month
- Split: 200M input, 100M output (GPT-4o)
On-Demand Cost:
- Input: 200M × $2.50/M = $500
- Output: 100M × $10.00/M = $1,000
- Total: $1,500/month
PTU Cost:
- Need: 10M tokens/day ÷ 1440 min = 6,944 TPM ≈ 7 PTUs
- Cost: 7 × $40/hr × 730 = $204,400/month
- Total: $204,400/month
Winner: On-demand. PTU is overkill at this volume; revisit it only if you scale to 50K+ users. Batch processing could cut costs further if latency allows.
Example 2: Production Legal Document Review
A law firm reviewing 100K documents/month (500 tokens each, 50 token extraction).
Load:
- 100K × 500 = 50M input tokens
- 100K × 50 = 5M output tokens = 55M tokens total/month
On-Demand Cost (GPT-4o Mini):
- Mini: 50M × $0.15/M + 5M × $0.60/M = $7.50 + $3.00
- Total: $10.50/month
PTU Cost:
- 55M tokens ÷ 30 days ÷ 1,440 min ≈ 1,273 TPM ≈ 2 PTUs
- 2 PTUs × $40/hr × 730 hrs = $58,400/month
- Total: $58,400/month
Alternative: Batch Processing
- 50M inputs × $0.15/M × 50% (batch discount) = $3.75
- 5M outputs × $0.60/M × 50% = $1.50
- Total: $5.25/month (latency: up to 24 hours)
Winner: Batch processing. Half the on-demand cost ($5.25 vs $10.50/month) in exchange for up to 24-hour turnaround. PTU is wildly oversized for this volume.
Example 3: Multilingual Customer Support
A support platform handling 1M customer queries/month in 10 languages. Task: translate + classify intent.
Load:
- 1M queries × 200 tokens = 200M input
- Classification + 50 token response = 50M output
- Total: 250M tokens/month
Costs:
- On-demand GPT-4o Mini: (200M × $0.15/M) + (50M × $0.60/M) = $30 + $30 = $60/month
- PTU: 250M ÷ 30 days ÷ 1,440 min ≈ 5,787 TPM ≈ 6 PTUs; 6 × $40/hr × 730 hrs = $175,200/month
Winner: On-demand, by a wide margin ($60 vs $175,200/month). At this volume and at Mini's rates, PTU cannot compete; revisit only if volume and model mix change dramatically.
FAQ
How much can I save with PTU?
30-50% per token at very high volume (roughly 5B+ tokens/month on GPT-4o). Below that, on-demand is cheaper (no PTU commitment).
Is Azure OpenAI slower than OpenAI API?
No. Same models, same latency. Azure hosting may vary slightly by region (EU regions might add 5-10ms), but negligible.
Can I use Azure OpenAI with on-demand billing?
Yes. Pay-as-you-go is the default. PTU is optional, only for high-volume production.
What happens if I exceed my PTU allocation?
Requests are throttled (delayed) until you add more PTU capacity or usage decreases. No overage charges. No hard rejection.
Can I cancel PTU anytime?
PTU requires a 1-hour minimum commitment. After the first hour, you can delete the deployment at any time; billing stops once it's deleted.
Do I need to use Azure infrastructure elsewhere?
No. Azure OpenAI is standalone. You don't need to migrate other services to Azure to use Azure OpenAI (though integration is easier if you already use Azure).
What about RBAC and API key management?
Azure OpenAI integrates with Azure Active Directory (AD). You can assign roles (admin, user, reader) to team members. Finer-grained access control than OpenAI API's simple key model.
Can I use multiple regions for failover?
Yes. Set up API endpoints in multiple regions. Requests route to primary, fail over to secondary if primary is down. Adds latency but improves reliability.
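The failover pattern above can be sketched without any SDK specifics. Endpoint URLs here are hypothetical; `send` stands for your own request function:

```python
# Hypothetical regional endpoints; the pattern, not the URLs, is the point.
ENDPOINTS = [
    "https://my-resource-eastus.openai.azure.com",
    "https://my-resource-westeurope.openai.azure.com",
]

def call_with_failover(send, endpoints=ENDPOINTS):
    """Try each regional endpoint in order.

    `send(endpoint)` is your own request function (e.g. wrapping an HTTP
    client); it should raise on outage or timeout.
    """
    last_err = None
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except Exception as err:  # in practice, catch your client's specific errors
            last_err = err
    raise last_err
```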
How does data residency cost compare to OpenAI API?
Similar per-token rates. Azure advantage is compliance guarantees (GDPR, HIPAA), not raw cost. For regulated data, Azure is standard. For general use, OpenAI API is competitive.
Can I mix on-demand and PTU?
Yes. Use on-demand for bursty loads, PTU for predictable base load. Some teams run both simultaneously (hybrid).
Related Resources
- Azure OpenAI Service Documentation
- OpenAI API Documentation
- OpenAI Pricing Guide
- Anthropic Claude Pricing
- DeepSeek Pricing Guide