Contents
- H100 AWS: EC2 p5 for AI Training and Inference
- AWS p5 Instance Pricing
- AWS EC2 p5 Instance Specifications
- Spot Instance Strategy and Interruption Handling
- Launching AWS p5 Instances
- Performance Benchmarks on AWS p5
- Cost Optimization Strategies for AWS p5
- Comparing AWS to Dedicated GPU Providers
- Production Deployment Patterns
- FAQ
- Sources
H100 AWS: EC2 p5 for AI Training and Inference
H100 on AWS ships as the p5.48xlarge instance: 8x H100 GPUs with 192 vCPUs, 1.1TB of RAM, and 3,200 Gbps EFA networking. On-demand runs $55.04/hr as of March 2026. The price buys more than GPUs: CPU, memory, networking, and tight integration with SageMaker, CloudWatch, and IAM. Useful if the team needs managed infrastructure. Less useful if developers just want cheap GPU hours.
This covers AWS p5 pricing, when to use Spot or Reserved instances, optimization strategies, and cost comparison to dedicated providers.
AWS p5 Instance Pricing
AWS prices H100 compute exclusively as 8-GPU clusters via the p5.48xlarge instance. Individual H100 instances are not available.
p5.48xlarge Pricing Breakdown and Monthly Analysis
| Pricing Model | Hourly | Monthly (730 hrs) | Annual | Per-GPU Cost | Per-Hour Per-GPU |
|---|---|---|---|---|---|
| On-Demand | $55.04 | $40,179 | $482,150 | $6.88 | $6.88 |
| 1-Year Reserved | $27.52 | $20,090 | $241,075 | $3.44 | $3.44 |
| 3-Year Reserved | $24.02 | $17,535 | $210,415 | $3.00 | $3.00 |
| Spot (typical) | $16.51 | $12,052 | $144,629 | $2.06 | $2.06 |
Per-GPU on-demand: $6.88/hr ($55.04 / 8 GPUs). RunPod H100 SXM runs $2.69/hr per GPU, or $21.52/hr for eight. But AWS bundles 192 vCPUs, 1.1TB of RAM, and 3,200 Gbps EFA networking into the price, which changes the math.
True Cost Comparison (Including CPU, Memory, Networking)
| Component | Included in p5.48xlarge | Standalone Cost (est.) | Hourly Equivalent |
|---|---|---|---|
| 8x H100 GPU | $55.04/hr on-demand | 8x RunPod H100 at $2.69/hr | $21.52/hr |
| 192 vCPU | Included | $800-1,000/month | $1.10-1.37/hr |
| 1,152GB RAM | Included | $200-300/month | $0.27-0.41/hr |
| 3,200 Gbps EFA | Included | $500-1,000/month | $0.68-1.37/hr |
| Standalone total | - | - | ~$23.57-24.67/hr |
| AWS premium over piecemeal | - | - | ~$30.37-31.47/hr |
The AWS premium makes sense only if developers need managed services, IAM, and automated infrastructure. Otherwise you overpay.
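The per-GPU arithmetic above is easy to script; a small helper (using the rates quoted in this article) for comparing effective per-GPU costs across pricing models:

```python
def per_gpu_rate(instance_hourly: float, gpus: int = 8) -> float:
    """Effective hourly cost per GPU for a multi-GPU instance."""
    return round(instance_hourly / gpus, 2)

# Rates from the pricing table above
print(per_gpu_rate(55.04))  # on-demand -> 6.88
print(per_gpu_rate(27.52))  # 1-year Savings Plan -> 3.44
print(per_gpu_rate(16.51))  # typical Spot -> 2.06
```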
Savings Plans
AWS offers Compute Savings Plans providing roughly 50% discounts on on-demand pricing for 1-year or 3-year commitments. A p5.48xlarge under a 1-year Savings Plan costs ~$27.52/hr, or $3.44 per GPU, well below CoreWeave's $6.16/GPU on-demand equivalent.
AWS EC2 p5 Instance Specifications
Hardware Configuration
| Component | Specification |
|---|---|
| GPU | 8x H100 SXM |
| GPU Memory | 640GB (80GB per GPU) |
| CPU | 192 vCPU (3rd Gen AMD EPYC) |
| System Memory | 1,152GB RAM |
| Networking | 3,200 Gbps EFA (Elastic Fabric Adapter) |
| Storage | 8x 3.84TB local NVMe SSD, plus EBS |
| Interconnect | NVLink 900GB/s between GPUs |
The p5's CPUs are 3rd-generation AMD EPYC, and 192 vCPUs leave ample headroom for CPU-bound pre/post-processing such as tokenization and data loading.
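A sanity check worth doing before launch: FP16 weights take 2 bytes per parameter, so the 640GB of GPU memory in the table bounds what fits without offloading (activations and KV cache need headroom on top):

```python
def fp16_weight_gb(params_billions: float) -> float:
    """GPU memory consumed by FP16 weights alone, in GB (2 bytes/parameter)."""
    return params_billions * 2

print(fp16_weight_gb(70))   # Llama-2 70B -> 140 GB, comfortably inside 640 GB
print(fp16_weight_gb(320))  # ~320B parameters -> 640 GB, the weights-only ceiling
```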
EFA Networking
EFA gives each p5 instance up to 3,200 Gbps of aggregate bandwidth for GPU-to-GPU communication across instances. Much better for distributed training:
- Standard EC2 networking: 25 Gbps on many instance types
- p5 EFA: 3,200 Gbps aggregate (128x)
- NVLink (within one node, on AWS or CoreWeave): 900 GB/s between GPUs, but only inside a single instance
Spot Instance Strategy and Interruption Handling
AWS Spot Pricing Dynamics
AWS's Spot pricing typically discounts p5.48xlarge to $16-20/hr (65-70% savings), making it competitive with dedicated providers for interruptible workloads. Savings calculation:
| Scenario | On-Demand Cost | Spot Cost | Savings |
|---|---|---|---|
| 24-hour training | $1,320.96 | $396.29 | $924.67 (70%) |
| 7-day training | $9,246.72 | $2,774.02 | $6,472.70 (70%) |
| 30-day training | $39,628.80 | $11,888.64 | $27,740.16 (70%) |
For resumable training with checkpointing, Spot provides exceptional cost savings.
Spot Interruption Rates and Capacity
p5 Spot capacity and interruption rates vary by region and Availability Zone, and AWS publishes no uptime guarantee for Spot. AWS does provide a 2-minute interruption warning, allowing graceful shutdown and checkpoint saving:
```python
import signal
import sys

def handle_spot_termination(signum, frame):
    """Save a checkpoint and exit cleanly on a termination signal."""
    print("Received Spot termination notice. Saving checkpoint...")
    save_model_checkpoint()  # user-defined: persist model/optimizer state
    sys.exit(0)

# Assumes something in your stack (e.g. the job runner) converts the
# Spot interruption notice into a SIGTERM to this process
signal.signal(signal.SIGTERM, handle_spot_termination)

for epoch in range(100):
    for step, batch in enumerate(train_loader):
        # Training code
        if step % 100 == 0:
            save_model_checkpoint()  # every 100 steps (~5 minutes at typical throughput)
```
With 100-step checkpoint frequency and 2-minute warning window, maximum loss is one checkpoint cycle (~5 minutes of training).
Checkpointing Strategy
The signal handler above covers the shutdown path; the checkpoint cadence covers everything else. Save continuously during training so an interruption costs at most one checkpoint cycle, and resume from the latest checkpoint on the replacement instance. This approach enables p5 Spot usage for long-running training at ~$16-20/hr effective cost.
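SIGTERM only arrives if something delivers it; on a bare EC2 instance the interruption notice surfaces through the instance metadata service instead. A minimal polling sketch (the IMDS path is real; the surrounding wiring is illustrative):

```python
import json
import urllib.request
import urllib.error

# Real IMDS path for Spot interruption notices; returns 404 until one is issued.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def fetch_instance_action(url=SPOT_ACTION_URL, timeout=2):
    """Return the instance-action JSON body, or None if no notice is pending."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode()
    except (urllib.error.URLError, OSError):
        return None  # 404 (no notice yet) or not running on EC2

def parse_instance_action(body):
    """Extract (action, time) from a notice like
    {"action": "terminate", "time": "2026-03-01T12:00:00Z"}."""
    if not body:
        return None
    doc = json.loads(body)
    return doc.get("action"), doc.get("time")
```

Poll between training steps; on a ("terminate", ...) result, save a checkpoint and exit before the 2-minute window closes.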
Launching AWS p5 Instances
From the AWS Console
- Access AWS EC2 dashboard at https://console.aws.amazon.com/ec2/
- Click "Launch Instances"
- Search for instance type: "p5.48xlarge"
- Select most recent Deep Learning AMI (provided by AWS for GPU optimizations)
- Configure instance:
- VPC: Default or custom
- Subnet: Select region closest to data
- Auto-assign public IP: Enable
- Storage: 100GB EBS minimum for OS/dependencies
- Configure security group: Allow SSH (port 22) from your IP only
- Add tags for cost allocation and identification
- Review and launch (provisioning takes 5-10 minutes)
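The console steps above map directly onto the EC2 API. A hedged boto3 sketch (the AMI ID, key name, and security group below are placeholders you'd substitute; building the request dict separately keeps it testable without an AWS account):

```python
def build_run_instances_args(ami_id, key_name, security_group_id, use_spot=False):
    """Assemble kwargs for boto3's ec2.run_instances() for a p5.48xlarge launch."""
    args = {
        "ImageId": ami_id,  # e.g. a current Deep Learning AMI for your region
        "InstanceType": "p5.48xlarge",
        "MinCount": 1,
        "MaxCount": 1,
        "KeyName": key_name,
        "SecurityGroupIds": [security_group_id],
        # 100GB root EBS volume, matching the console steps above
        "BlockDeviceMappings": [
            {"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": 100, "VolumeType": "gp3"}}
        ],
    }
    if use_spot:
        args["InstanceMarketOptions"] = {"MarketType": "spot"}
    return args

# With boto3 installed and credentials configured (placeholder IDs):
# ec2 = boto3.client("ec2")
# ec2.run_instances(**build_run_instances_args("ami-0123456789abcdef0", "my-key", "sg-0123456789"))
```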
AWS-Specific Advantages
AWS p5 integrates with SageMaker, CloudWatch, IAM, and S3 out of the box. That matters if teams:
- Run distributed training across multiple instances (SageMaker handles orchestration)
- Need per-team cost allocation and access control (IAM)
- Store datasets in S3 and want low-friction access
- Require audit logging for compliance
For solo engineers or small teams? Just noise. Dedicated providers are cheaper and simpler.
Multi-Instance Cluster Training
Multiple p5 instances can be launched as a single distributed training cluster through SageMaker:
```python
import sagemaker
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train.py',
    role=sagemaker.get_execution_role(),
    instance_type='ml.p5.48xlarge',
    instance_count=4,
    framework_version='2.0',
    py_version='py310',
)
estimator.fit('s3://my-bucket/training-data/')  # placeholder S3 prefix holding the dataset
```
This enables training models too large for a single instance (hundreds of billions of parameters) with coordinated compute.
Performance Benchmarks on AWS p5
H100 Inference Benchmarks
AWS p5.48xlarge achieves standard H100 throughput:
| Model | Batch Size | Throughput | Cost/1M Tokens (at $55.04/hr) |
|---|---|---|---|
| 70B Llama-2 | 1 | 50 tokens/sec | $305.78 |
| 70B Llama-2 | 8 | 280 tokens/sec | $54.60 |
| 70B Llama-2 | 64 | 400 tokens/sec | $38.22 |
EFA networking provides negligible latency improvement for single-instance inference.
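Cost per million tokens follows mechanically from aggregate throughput and the instance's hourly rate; a small helper for reproducing or extending such comparisons:

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_sec: float) -> float:
    """Dollars per 1M tokens at a given instance rate and aggregate throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return round(hourly_rate / (tokens_per_hour / 1_000_000), 2)

print(cost_per_million_tokens(55.04, 400))  # -> 38.22
```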
Distributed Training Performance
Multi-instance p5 clusters show scaling:
| Configuration | Throughput (70B Model) | All-Reduce Overhead | Per-Instance Cost |
|---|---|---|---|
| 1x p5 | 1,000 tokens/sec | N/A | $55.04/hr |
| 2x p5 | 1,950 tokens/sec | 2.5% | $110.08/hr |
| 4x p5 | 3,850 tokens/sec | 3.2% | $220.16/hr |
Near-linear scaling (96-97% efficiency) demonstrates EFA effectiveness for distributed training.
Cost Optimization Strategies for AWS p5
Savings Plan Selection and ROI Analysis
For sustained workloads exceeding 3 months, 1-year Savings Plans provide a ~50% discount over on-demand ($27.52/hr effective). For longer commitments, 3-year plans add marginal additional savings (~13% more, reaching $24.02/hr).
Calculate break-even for Savings Plan purchase:
| Duration | On-Demand Cost | 1-Year Savings Plan | Savings |
|---|---|---|---|
| 3 months (2,190 hrs) | $120,538 | $60,269 | $60,269 |
| 6 months (4,380 hrs) | $241,075 | $120,538 | $120,537 |
| 12 months (8,760 hrs) | $482,150 | $241,075 | $241,075 |
Because a 1-year Savings Plan bills for all 8,760 hours, it breaks even against on-demand at roughly 50% utilization (~4,380 on-demand hours per year). Workloads running more than half the year justify the commitment.
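The break-even point is easy to sanity-check: divide the plan's annual commitment by the on-demand rate to get the number of on-demand hours that would cost the same.

```python
ON_DEMAND = 55.04  # $/hr, p5.48xlarge on-demand
PLAN_RATE = 27.52  # $/hr, 1-year Savings Plan (billed for all 8,760 hrs)

annual_commitment = PLAN_RATE * 8760        # $241,075.20 per year
breakeven_hours = annual_commitment / ON_DEMAND
print(round(breakeven_hours))               # 4380 hours, i.e. ~50% utilization
```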
Spot Plus Reserved Hybrid
Combine reserved instances for baseline load with Spot for burst:
- Reserve 1x p5.48xlarge (1-year Savings Plan) = $27.52/hr baseline cost
- Launch additional p5.48xlarge Spot instances on demand (~$16-20/hr each)
- For predictable variable load (e.g., 2-4x capacity during peak hours), hybrid approach optimizes cost
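A quick blended-rate check for the hybrid pattern (rates are this article's figures; the peak schedule is a made-up example):

```python
def blended_hourly(reserved_rate, spot_rate, spot_instances_by_hour):
    """Average hourly cost for 1 reserved instance plus a varying Spot fleet.

    spot_instances_by_hour: Spot instance count for each hour of the day.
    """
    total = sum(reserved_rate + n * spot_rate for n in spot_instances_by_hour)
    return round(total / len(spot_instances_by_hour), 2)

# 1 reserved at $27.52/hr, plus 2 Spot instances (~$18/hr each) during 8 peak hours
schedule = [2] * 8 + [0] * 16
print(blended_hourly(27.52, 18.00, schedule))  # -> 39.52
```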
Data Transfer Optimization
AWS egress charges (~$0.09/GB to the internet for the first 10TB/month) create hidden costs when downloading training datasets or model exports. Use:
- S3 Transfer Acceleration for faster downloads
- VPC endpoints for AWS service access without egress charges
- CloudFront distribution for repeated dataset access
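Egress costs scale linearly with data moved, so a one-line estimator is enough to decide whether repeated exports justify a CloudFront layer (the rate is a parameter; check current AWS pricing for your region):

```python
def egress_cost(gigabytes: float, rate_per_gb: float) -> float:
    """Estimated charge for moving `gigabytes` out of AWS at a given $/GB rate."""
    return round(gigabytes * rate_per_gb, 2)

# e.g. pulling a 500GB checkpoint archive at $0.09/GB:
print(egress_cost(500, 0.09))  # -> 45.0
```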
Comparing AWS to Dedicated GPU Providers
Raw GPU Cost
| Provider | H100 Cost (Single GPU) | 8x Cluster |
|---|---|---|
| RunPod | $2.69/hr SXM | $21.52/hr |
| Lambda | $3.78/hr SXM | $30.24/hr |
| CoreWeave | $6.16/hr (8x cluster) | $49.28/hr |
| AWS | $6.88/hr on-demand | $55.04/hr |
AWS costs 2.6x more per GPU than RunPod on-demand, or slightly above CoreWeave.
Total Cost of Ownership
AWS's advantage emerges when including:
- 192 vCPU compute (worth ~$400/month separately)
- 1,152GB system RAM
- 3,200 Gbps EFA networking
- Managed service integration
- Data transfer and storage ecosystem
For teams requiring integrated infrastructure, AWS becomes cost-competitive at 1-year Savings Plan rates ($27.52/hr → $3.44/GPU equivalent).
Compare AWS to Lambda Labs H100 pricing for cheaper multi-GPU options. Check CoreWeave H100 if teams want lower GPU-only costs with better networking.
Production Deployment Patterns
SageMaker Training
Use SageMaker for multi-instance training. It handles distributed setup automatically:
```python
from sagemaker import Session, get_execution_role
from sagemaker.pytorch import PyTorch

session = Session()
role = get_execution_role()

pytorch_estimator = PyTorch(
    entry_point='train.py',
    role=role,
    instance_type='ml.p5.48xlarge',
    instance_count=2,
    framework_version='2.0',
    py_version='py310',
    hyperparameters={'epochs': 10, 'batch_size': 32},
)
pytorch_estimator.fit(training_data)  # e.g. an S3 URI such as 's3://bucket/train/'
```
SageMaker handles distributed training setup, parameter server coordination, and fault tolerance automatically.
Model Serving with SageMaker
Deploy trained models through SageMaker Endpoints with automatic scaling:
```python
predictor = pytorch_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.p5.48xlarge',
)
```
This integrates model serving with monitoring, logging, and A/B testing infrastructure.
FAQ
Is AWS p5 worth the cost premium versus RunPod or Lambda?
AWS excels for teams with 6+ member data science orgs requiring integrated IAM, logging, and managed services. For solo engineers or small teams, RunPod ($2.69/hr SXM) or Lambda ($3.78/hr SXM) provide better unit economics. AWS becomes cost-effective at annual Savings Plan rates ($27.52/hr for 8xH100) for sustained production workloads.
How does AWS Spot pricing for p5.48xlarge compare to permanent reserved instances?
Spot averages $16-20/hr (65-70% savings) versus on-demand $55.04/hr. For resumable workloads with checkpointing, Spot is optimal. For continuous serving, 1-year Savings Plan ($27.52/hr) offers better availability at lower cost.
What's the minimum configuration to run large models on AWS p5?
A single p5.48xlarge with 8x H100 and 640GB VRAM serves models up to roughly 300B parameters in FP16 (2 bytes per parameter for weights, plus headroom for KV cache and activations). For models exceeding this, launch 2x p5.48xlarge instances (16 H100s); SageMaker handles the distributed setup.
Should I use p5 on-demand versus Spot versus Reserved for experimentation?
For temporary experiments (< 1 week): Use Spot at ~$16-20/hr to minimize cost. Implement checkpointing for interruption tolerance. For medium-term (1-3 months): Use 1-year Savings Plan for ~50% discount and guaranteed capacity. For one-time quick tests (< 4 hours): On-demand is acceptable despite higher rate.
How does AWS p5 TCO compare when factoring in managed services versus bare RunPod instances?
AWS p5 ($55.04/hr on-demand) includes CPU, memory, networking, and managed service integration. Equivalent capability on RunPod (8x single H100 instances at $2.69/hr each = $21.52/hr) requires manual infrastructure, Docker/Kubernetes setup, and monitoring. AWS premium (~$33/hr) is justified for teams with >5 engineers who avoid writing infrastructure code. For small teams or researchers, RunPod provides better cost per GPU despite higher management overhead.
Sources
- AWS EC2 p5 Pricing: https://aws.amazon.com/ec2/pricing/on-demand/
- AWS EC2 p5 Instance Details: https://aws.amazon.com/ec2/instance-types/p5/
- AWS SageMaker Documentation: https://docs.aws.amazon.com/sagemaker/
- NVIDIA H100 Specifications: https://www.nvidia.com/en-us/data-center/h100/