Azure OpenAI Pricing: PTU vs On-Demand Comparison

Deploybase · February 3, 2026 · LLM Pricing

Overview

Azure OpenAI has two pricing modes: PTU (Provisioned Throughput Units) and pay-as-you-go (on-demand).

PTU requires an upfront commitment but saves 30-50% per token at high volume. Pay-as-you-go has zero commitment and suits variable loads. Both serve the same models as the OpenAI API (GPT-4, GPT-4 Turbo, GPT-4o).

Production workloads with stable throughput? PTU wins. Experimental or bursty? Pay-as-you-go. Azure's per-token rates track the OpenAI API closely (within about 5% by region), but compliance, data residency, and bulk pricing matter for regulated industries.


Azure OpenAI Pricing Models

Pay-as-You-Go (On-Demand)

Same as OpenAI API pricing but hosted on Azure infrastructure.

Model | Input $/M | Output $/M | Context | Max Output
GPT-4 Turbo | $10.00 | $30.00 | 128K | 4K
GPT-4o | $2.50 | $10.00 | 128K | 16K
GPT-4o Mini | $0.15 | $0.60 | 128K | 16K
GPT-4.1 | $2.00 | $8.00 | 1.05M | 32K
GPT-4.1 Mini | $0.40 | $1.60 | 1.05M | 32K

No minimums. Pay per token (prompt and completion metered separately).

Monthly cost example: 100M tokens (60M prompt, 40M completion).

  • GPT-4o: (60 × $2.50) + (40 × $10.00) = $150 + $400 = $550/month
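As a sketch, the per-token arithmetic can be wrapped in a small helper. The rate table below hard-codes the article's $/M figures; treat them as illustrative, not live Azure prices.

```python
# On-demand cost arithmetic, using the $/M-token rates from the table above
# (illustrative figures, not live Azure prices).
RATES = {  # model: (input $/M tokens, output $/M tokens)
    "gpt-4-turbo": (10.00, 30.00),
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def on_demand_cost(model: str, input_m: float, output_m: float) -> float:
    """Monthly cost for input_m / output_m million tokens."""
    in_rate, out_rate = RATES[model]
    return input_m * in_rate + output_m * out_rate

print(on_demand_cost("gpt-4o", 60, 40))  # 60M prompt + 40M completion -> 550.0
```

The same function reproduces the other worked examples in this article by swapping the model key and token volumes.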

Provisioned Throughput Units (PTU)

Reserve capacity upfront. Lower per-token rate in exchange for minimum commitment.

PTU: Unit of throughput. 1 PTU roughly = 1,000 tokens per minute (TPM).

Model | Cost per PTU-Hour | Throughput per PTU | Cost per PTU-Month (730 hrs)
GPT-4 Turbo | $100 | 1K TPM | $73,000
GPT-4o | $40 | 1K TPM | $29,200
GPT-4o Mini | $10 | 1K TPM | $7,300
GPT-4.1 | $80 | 1K TPM | $58,400
GPT-4.1 Mini | $20 | 1K TPM | $14,600

PTU pricing is per hour, not per token: a flat rate regardless of actual usage. Reserve only what you need; unused PTU capacity is pure waste.

Minimum commitment: 1 hour (a deployment can be deleted after 1 hour, though hourly churn is rare in production).


PTU Costs

Understanding Throughput

1 PTU = ~1,000 tokens per minute (TPM) sustained throughput.

Achievable throughput per PTU depends on workload shape:

  • Batch inference (summarization, classification): high throughput, variable latency
  • Interactive chat (low batch size): lower throughput, strict latency SLA

Throughput calculation:

  • Tokens/minute = average request size × requests/minute
  • Example: 500 token requests, 100 requests/minute = 50,000 TPM = 50 PTUs
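The sizing rule above can be sketched as a function, assuming the article's 1 PTU ≈ 1,000 TPM and rounding up to whole PTUs (with a 1-PTU floor):

```python
import math

TPM_PER_PTU = 1_000  # the article's working assumption: 1 PTU ~= 1,000 TPM

def ptus_needed(avg_request_tokens: int, requests_per_minute: int) -> int:
    """Convert sustained load into whole PTUs, rounding up (minimum 1)."""
    tpm = avg_request_tokens * requests_per_minute
    return max(1, math.ceil(tpm / TPM_PER_PTU))

print(ptus_needed(500, 100))  # 50,000 TPM -> 50 PTUs
```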

Cost Structure

PTU pricing includes:

  1. Base hourly rate (varies by model)
  2. No per-token overage (use as much as reserved capacity allows)
  3. Elastic capacity (scale to higher TPM by adding PTUs)

Once you exceed the PTU reservation, requests are throttled (delayed) rather than rejected. You can add PTUs at any time (instant, no downtime).

PTU Sizing

Use Case | Throughput | PTUs Needed | Cost/Month
Low (one user, async) | 100 TPM | 0.1 | $2,920
Medium (100 users, async) | 10K TPM | 10 | $292,000
High (1,000 users, real-time) | 500K TPM | 500 | $14.6M
Extreme (multi-region, massive) | 2M TPM | 2,000 | $58.4M

Costs assume GPT-4o's $40 per PTU-hour.

PTU costs scale linearly with reserved capacity. At very high sustained throughput, the flat PTU rate undercuts pay-as-you-go, which would otherwise run into the millions.


Pay-as-You-Go Costs

Per-Token Pricing

Based on OpenAI API rates, with minor regional variation (Azure might be ±5%).

Example Workload: Document Classification (550M tokens/month)

Task: Classify 1M documents (500 token input, 50 token output).

  • Inputs: 1M × 500 = 500M tokens
  • Outputs: 1M × 50 = 50M tokens

Cost on Azure OpenAI (GPT-4o):

  • Input: 500M × $2.50/M = $1,250
  • Output: 50M × $10.00/M = $500
  • Total: $1,750/month

Cost on OpenAI API (GPT-4o):

  • Same rate
  • Total: $1,750/month

Azure and the OpenAI API cost the same for pay-as-you-go. The Azure advantage lies in PTU bulk pricing and compliance features.

Overages and Limits

No overage charges, but Azure enforces rate limits (TPM quotas). If you exceed your quota, requests are throttled:

  • Quota: 40,000 TPM by default
  • Exceed: Requests queued, delayed, or rejected

Raise the limit by requesting a quota increase through Azure support, or by moving to PTU.
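A common way to live with TPM throttling is client-side retry with exponential backoff. The sketch below is generic: `send_request` is a placeholder callable, and `RuntimeError` stands in for whatever throttling error your client library raises (Azure signals throttling with HTTP 429).

```python
import time

def call_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry a throttled call with exponential backoff.

    `send_request` is any zero-arg callable. The RuntimeError check is a
    stand-in for a real 429/throttle exception -- adapt it to your client.
    """
    for attempt in range(max_retries):
        try:
            return send_request()
        except RuntimeError:  # placeholder for a throttling error
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Backoff smooths brief quota spikes; it does not substitute for a quota increase if you are persistently over the limit.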


PTU vs On-Demand

Break-Even Analysis

PTU becomes cheaper when monthly token usage × on-demand rate exceeds PTU cost.

Example: GPT-4o

Monthly Tokens | On-Demand Cost | PTU Cost ($40/hr) | Winner
10M | $200 | $29,200 | On-Demand
100M | $2,000 | $29,200 | On-Demand
500M | $10,000 | $29,200 | On-Demand
1B | $20,000 | $29,200 | On-Demand
2B | $40,000 | $29,200 | PTU
5B | $100,000 | $29,200 | PTU

PTU break-even: between 1B and 2B tokens/month for GPT-4o at the table's implied ~$20/M blended rate ($29,200 ÷ $20/M ≈ 1.46B).

For teams processing roughly 1.5B+ tokens monthly, PTU saves money. Below that, on-demand is cheaper.
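The break-even arithmetic reduces to one division. The sketch takes the table's $29,200/month flat PTU cost and its implied ~$20/M blended on-demand rate as given:

```python
def break_even_tokens_m(ptu_monthly_cost: float, blended_rate_per_m: float) -> float:
    """Monthly token volume (in millions) where flat PTU cost equals
    on-demand spend at the given blended $/M rate."""
    return ptu_monthly_cost / blended_rate_per_m

# GPT-4o: $29,200/month flat PTU vs the table's ~$20/M blended rate
print(break_even_tokens_m(29_200, 20.0))  # -> 1460.0, i.e. ~1.46B tokens/month
```

Re-run with your own blended rate (your actual input/output mix) before committing; the crossover moves substantially as the mix changes.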

Workload Types

On-Demand Best For:

  • R&D and experimentation (unpredictable volume)
  • Bursty loads (high traffic for 1-2 hours, then idle)
  • Low-volume production (<500M tokens/month)
  • Cost-sensitive early-stage startups
  • One-off projects

PTU Best For:

  • Predictable, high-volume production (2B+ tokens/month)
  • SaaS platforms serving thousands of users
  • Content generation at scale
  • 24/7 always-on services
  • Teams with forecasted budget

Switching Between Models

PTU reservations are per model. If you need both GPT-4o and GPT-4o Mini:

  • Separate PTUs: One for GPT-4o ($40/hr), one for GPT-4o Mini ($10/hr)
  • Total cost: $50/hr

Alternatively, use on-demand for one and PTU for the other (hybrid).


Data Residency

Azure Advantage

Azure OpenAI stores data in the region you choose, unlike the OpenAI API's US-centric default.

Availability by Region (as of early 2026):

  • East US, West US: United States
  • West Europe: European Union (GDPR-compliant)
  • Southeast Asia: Singapore
  • Japan East: Japan
  • UK South: United Kingdom

Data stays in the specified region. OpenAI API defaults to US, with limited options.

Compliance Benefits

GDPR (EU): If processing EU resident data, Azure EU regions are preferred (Microsoft guarantees compliance).

HIPAA (Healthcare): Azure OpenAI supports HIPAA business associate agreements (BAA). OpenAI API does not.

SOC 2 Type II: Azure certified. OpenAI API has SOC 2, but Azure adds stronger guarantees.

FedRAMP (US Government): Azure Government Cloud available. OpenAI API not available for government use.

Cost Impact of Residency

Regional pricing is similar (within 5%). EU regions may cost slightly more.

Region | GPT-4o Input $/M | GPT-4o Output $/M
East US | $2.50 | $10.00
West Europe | $2.70 | $10.80
Southeast Asia | $2.50 | $10.00

The compliance/residency benefit is worth the cost for regulated industries (healthcare, finance, government).


Cost Optimization

Strategy 1: Right-Size PTU Reservation

Avoid over-provisioning. Monitor actual usage (Azure Portal, PTU utilization metrics). Right-size quarterly.

Current: Provisioned 50 PTUs
Actual peak usage: 30 PTUs (60% utilization)
Waste: 20 PTUs × $40/hr × 730 hrs = $584,000/month

Action: Reduce to 35 PTUs (peak plus headroom), saving 15 PTUs × $40/hr × 730 hrs = $438,000/month
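A minimal right-sizing sketch based on the numbers above; the 15% headroom default is an illustrative choice, not an Azure recommendation:

```python
import math

HOURS_PER_MONTH = 730  # the article's 730-hour month

def right_size(provisioned: int, peak_used: int, rate_per_hour: float,
               headroom: float = 0.15):
    """Return (monthly $ of idle capacity, suggested PTU count).

    `headroom` pads the observed peak so transient spikes still fit.
    """
    waste = (provisioned - peak_used) * rate_per_hour * HOURS_PER_MONTH
    suggested = math.ceil(peak_used * (1 + headroom))
    return waste, suggested

waste, target = right_size(provisioned=50, peak_used=30, rate_per_hour=40.0)
print(waste, target)  # 584000.0 35
```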

Strategy 2: Use Cheaper Models When Possible

GPT-4o Mini is roughly 90% cheaper than GPT-4 Turbo on most tasks. Benchmark both:

  • Classification, summarization, simple Q&A: Use Mini. Save 90% cost.
  • Math, complex reasoning, creative writing: Use Turbo or GPT-4.1. Worth the cost.

Example: Classify 1M documents (500 input tokens, 50 output tokens each).

  • Using GPT-4 Turbo: 500M inputs × $10/M + 50M outputs × $30/M = $6,500
  • Using GPT-4o Mini: 500M inputs × $0.15/M + 50M outputs × $0.60/M = $105

Mini saves $6,395/month (98% cheaper).
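Recomputing the comparison directly from the pricing table's rates:

```python
def workload_cost(in_m, out_m, in_rate, out_rate):
    """Cost of in_m/out_m million tokens at the given $/M rates."""
    return in_m * in_rate + out_m * out_rate

turbo = workload_cost(500, 50, 10.00, 30.00)  # GPT-4 Turbo rates
mini = workload_cost(500, 50, 0.15, 0.60)     # GPT-4o Mini rates
print(turbo, mini, round(100 * (1 - mini / turbo)))  # 6500.0 105.0 98
```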

Strategy 3: Batch Processing

Azure OpenAI Batch API (preview) offers 50% discounts for non-real-time processing.

  • Submit: 1M token requests in batch
  • Wait: Up to 24 hours for results
  • Save: 50% per token

Cost: 100M input tokens × $2.50/M × 50% = $125 (vs $250 on-demand, GPT-4o).

Caveat: Only for workloads that can tolerate 24-hour latency (reports, analysis, not chat).

Strategy 4: Hybrid Model Mix

Use different models for different tasks:

Task | Model | Cost (Input / Output $/M)
Chat | GPT-4o Mini | $0.15 / $0.60
Classification | GPT-4o Mini | $0.15 / $0.60
Creative writing | GPT-4o | $2.50 / $10.00
Math/reasoning | GPT-4.1 | $2.00 / $8.00

Most workloads can use Mini; only 10-20% need premium models. Overall: 60-70% cheaper than running everything on GPT-4 Turbo.

Strategy 5: Caching (Preview)

Azure OpenAI Prompt Caching reduces costs for repeated prompts.

  • First request: Full cost
  • Subsequent requests (same prompt prefix): 90% discount on cached portion

Example: System prompt (1K tokens) + user prompt (100 tokens).

  • First request: 1100 tokens @ normal rate
  • Requests 2-100: 100 tokens @ normal rate + 1K tokens @ 10% cost (cached)

Savings compound for repetitive workloads (summarization, classification with fixed instructions).
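The caching math above can be sketched as a function, assuming the article's model: full price on request 1, then the shared prefix billed at 10% of the normal rate.

```python
def cached_workload_cost(prefix_tokens, variable_tokens, n_requests,
                         rate_per_m, cached_discount=0.90):
    """Total input cost when a shared prompt prefix is cached after request 1."""
    rate = rate_per_m / 1_000_000  # per-token rate
    first = (prefix_tokens + variable_tokens) * rate
    rest = (n_requests - 1) * (
        variable_tokens * rate
        + prefix_tokens * rate * (1 - cached_discount)  # cached prefix at 10%
    )
    return first + rest

# 1K-token system prompt + 100-token user prompt, 100 requests, GPT-4o input rate
with_cache = cached_workload_cost(1_000, 100, 100, 2.50)
without = (1_000 + 100) * 100 * 2.50 / 1_000_000
print(with_cache, without)
```

For this shape of workload the cached cost works out to roughly a fifth of the uncached cost; the larger the fixed prefix relative to the variable part, the bigger the win.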


Real-World Examples

Example 1: SaaS Content Generation Platform

A writing assistant generating summaries and rewrites for 10K daily active users.

Load:

  • 10K users × 5 requests/day × 200 tokens avg = 10M tokens/day = 300M tokens/month
  • Split: 200M input, 100M output (GPT-4o)

On-Demand Cost:

  • Input: 200M × $2.50/M = $500
  • Output: 100M × $10.00/M = $1,000
  • Total: $1,500/month

PTU Cost:

  • Need: 10M tokens/day ÷ 1440 min = 6,944 TPM ≈ 7 PTUs
  • Cost: 7 × $40/hr × 730 = $204,400/month
  • Total: $204,400/month

Winner: On-demand. PTU is overkill. Consider PTU only if you scale to 50K+ users or shift to batch processing.

Example 2: Legal Document Review

A law firm reviewing 100K documents/month (500 tokens each, 50-token extraction).

Load:

  • 100K × 500 = 50M input tokens
  • 100K × 50 = 5M output tokens = 55M tokens total/month

On-Demand Cost (GPT-4o Mini):

  • Mini: 50M × $0.15/M + 5M × $0.60/M = $7.50 + $3.00 = $10.50
  • Total: ~$10.50/month

PTU Cost:

  • 55M tokens ÷ 30 days ÷ 1,440 min ≈ 1,273 TPM ≈ 1.3 PTUs, rounded up to 2
  • 2 PTUs × $10/hr (Mini rate) × 730 = $14,600/month
  • Total: $14,600/month

Alternative: Batch Processing

  • 50M inputs × $0.15/M × 50% (batch discount) = $3.75
  • 5M outputs × $0.60/M × 50% = $1.50
  • Total: ~$5.25/month (latency: up to 24 hours)

Winner: Batch processing, at roughly half the on-demand cost, if a 24-hour turnaround is acceptable. At this volume either bill is trivial; the real lesson is that PTU's hourly minimum is orders of magnitude oversized for the workload.

Example 3: Multilingual Customer Support

A support platform handling 1M customer queries/month in 10 languages. Task: translate + classify intent.

Load:

  • 1M queries × 200 tokens = 200M input
  • Classification + 50 token response = 50M output
  • Total: 250M tokens/month

Costs:

  • On-demand GPT-4o Mini: (200M × $0.15/M) + (50M × $0.60/M) = $30 + $30 = $60/month
  • PTU: 250M ÷ 30 days ÷ 1,440 min ≈ 5,787 TPM ≈ 6 PTUs; at Mini's $10/hr: 6 × $10 × 730 = $43,800/month

Winner: On-demand, by a wide margin. At 250M tokens/month on Mini, reserved capacity cannot pay for itself. Revisit PTU only if volume grows by orders of magnitude.


FAQ

How much can I save with PTU?

30-50% per token at high, sustained volume. Below the break-even point (very roughly 1.5-2B tokens/month for GPT-4o), on-demand is cheaper because it carries no commitment.

Is Azure OpenAI slower than OpenAI API?

No. Same models, same latency. Azure hosting may vary slightly by region (EU regions might add 5-10ms), but negligible.

Can I use Azure OpenAI with on-demand billing?

Yes. Pay-as-you-go is the default. PTU is optional, only for high-volume production.

What happens if I exceed my PTU allocation?

Requests are throttled (delayed) until you add more PTU capacity or usage decreases. No overage charges. No hard rejection.

Can I cancel PTU anytime?

PTU requires 1-hour minimum commitment. After 1 hour, you can delete it anytime. Azure doesn't bill for unused PTU hours if deleted within the same hour.

Do I need to use Azure infrastructure elsewhere?

No. Azure OpenAI is standalone. You don't need to migrate other services to Azure to use Azure OpenAI (though integration is easier if you already use Azure).

What about RBAC and API key management?

Azure OpenAI integrates with Azure Active Directory (AD). You can assign roles (admin, user, reader) to team members. Finer-grained access control than OpenAI API's simple key model.

Can I use multiple regions for failover?

Yes. Set up API endpoints in multiple regions. Requests route to primary, fail over to secondary if primary is down. Adds latency but improves reliability.
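A minimal failover wrapper might look like the sketch below; the endpoint URLs are hypothetical resource names, and `send` is whatever request function your client uses.

```python
def with_failover(endpoints, send):
    """Try each regional endpoint in order; return the first success.

    `send` is a callable (endpoint -> response) that raises on failure.
    """
    last_error = None
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except Exception as exc:  # sketch: any failure triggers failover
            last_error = exc
    raise last_error if last_error else RuntimeError("no endpoints configured")

# Hypothetical resource names -- substitute your own deployments.
REGIONS = [
    "https://myapp-eastus.openai.azure.com",
    "https://myapp-westeurope.openai.azure.com",
]
```

Ordering the list by proximity keeps the added latency to the failover case only.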

How does data residency cost compare to OpenAI API?

Similar per-token rates. Azure advantage is compliance guarantees (GDPR, HIPAA), not raw cost. For regulated data, Azure is standard. For general use, OpenAI API is competitive.

Can I mix on-demand and PTU?

Yes. Use on-demand for bursty loads, PTU for predictable base load. Some teams run both simultaneously (hybrid).


