Contents
- GB200 GPU Overview
- AWS GB200 Pricing
- How to Rent GB200 on AWS
- Performance Benchmarks
- Competing Alternatives
- FAQ
- Related Resources
- Sources
GB200 GPU Overview
This guide covers GB200 pricing and rental options on AWS. The GB200 is NVIDIA's Grace Blackwell superchip: one Grace CPU (72 Arm Neoverse V2 cores) paired with two B200 GPUs, connected via NVLink-C2C. The GB200 NVL72 rack-scale system links 36 Grace CPUs and 72 B200 GPUs. Each B200 GPU carries 192GB of HBM3e memory at ~8.0 TB/s bandwidth, which makes the platform well suited to mega-scale inference and long-context workloads (1M+ tokens).
Specs:
- Memory: 192GB HBM3e (per B200 GPU)
- Memory Bandwidth: ~8.0 TB/s (per B200 GPU)
- CUDA Cores: 26,624
- FP8 peak: ~9 PFLOPS with sparsity (per B200 GPU)
- Grace CPU: 72 Neoverse V2 cores, 480GB LPDDR5x
- NVLink-C2C: 900 GB/s (Grace-to-Blackwell)
- Partitionable (multi-instance GPU mode)
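The memory figures above translate directly into which models fit on a single GPU. Here is a rough back-of-the-envelope sketch; the bytes-per-parameter values are standard for each precision, while the 10% overhead factor for KV cache and activations is an illustrative assumption, not an AWS or NVIDIA figure.

```python
# Rough memory-fit check against the 192GB-per-B200 figure above.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}
HBM_PER_B200_GB = 192

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]

def fits_on_one_b200(params_billions: float, precision: str,
                     overhead: float = 1.10) -> bool:
    """True if weights plus an assumed 10% runtime overhead fit in 192GB."""
    return weight_memory_gb(params_billions, precision) * overhead <= HBM_PER_B200_GB

print(weight_memory_gb(70, "fp8"))    # 70.0 GB of weights
print(fits_on_one_b200(70, "fp8"))    # True: fits with room for context
print(fits_on_one_b200(405, "fp8"))   # False: ~405GB spans multiple GPUs
```

The same arithmetic explains why 405B-class models are sharded across NVLink-connected GPUs rather than served from a single B200.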
For related GPUs, see B200 specs and H200 specs.
AWS GB200 Pricing
GB200 on AWS runs roughly $6-8/hour on-demand, depending on region. A 1-year reserved term takes 30-40% off; a 3-year term takes 50-60% off. Spot instances save 50-70% but can be interrupted.
Data transfer out costs $0.02/GB; inbound transfer is free. Some regions require a capacity reservation before you can launch.
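The tiers above can be compared with a quick estimate. The $7/hr rate and the discount fractions below are midpoints of the ranges quoted in this article, not official AWS prices; exact figures vary by region.

```python
# Monthly cost estimate for the pricing tiers quoted above.
ON_DEMAND_PER_HR = 7.00          # midpoint of the $6-8/hr range (assumption)
DISCOUNTS = {                    # fraction off on-demand, range midpoints
    "reserved_1yr": 0.35,        # 30-40% off
    "reserved_3yr": 0.55,        # 50-60% off
    "spot": 0.60,                # 50-70% savings
}

def monthly_cost(plan: str, hours: float = 730) -> float:
    """Estimated monthly USD for one instance running 730 hours."""
    rate = ON_DEMAND_PER_HR * (1 - DISCOUNTS.get(plan, 0.0))
    return round(rate * hours, 2)

print(monthly_cost("on_demand"))     # 5110.0
print(monthly_cost("spot"))          # 2044.0
print(monthly_cost("reserved_3yr"))  # 2299.5
```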
Compare with Google Cloud and Azure if cost matters.
How to Rent GB200 on AWS
- Log into AWS Console.
- Go to EC2 → Launch Instance.
- Search "gb200" (usually p5 family).
- Pick Ubuntu 22.04 or a deep learning AMI (CUDA/PyTorch pre-installed).
- Configure security groups and SSH.
- Launch. Takes 3-5 minutes.
AWS deep learning AMIs save setup time. Just SSH in and start training or serving.
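The console steps above can also be scripted. Below is a minimal sketch of the launch request in the shape boto3's `ec2.run_instances(**params)` expects; every value is a placeholder, not a real ID, so look up the actual GB200 instance type, AMI, key pair, and security group in your own account before launching.

```python
# EC2 RunInstances parameters mirroring the console steps above.
# All IDs are placeholders: substitute real values from your console.
launch_params = {
    "ImageId": "ami-xxxxxxxxxxxxxxxxx",       # a Deep Learning AMI (placeholder)
    "InstanceType": "<gb200-instance-type>",  # placeholder: search the console
    "MinCount": 1,
    "MaxCount": 1,
    "KeyName": "my-ssh-key",                  # hypothetical SSH key pair name
    "SecurityGroupIds": ["sg-xxxxxxxxxxxx"],  # group that allows SSH (port 22)
}

# With boto3 installed and credentials configured, launching would be:
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   ec2.run_instances(**launch_params)   # incurs charges once running
print(sorted(launch_params))
```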
Common uses: large-scale inference serving, fine-tuning massive models, long-context document processing, multi-model orchestration.
Also check RunPod and Lambda pricing if you want alternatives.
Performance Benchmarks
Inference (single GB200):
- Llama 2 70B (FP8): 500-700 tokens/sec
- Llama 3.1 405B (FP8): 200-300 tokens/sec
Note on memory: at FP8 (one byte per parameter), a 405B model needs roughly 405GB for weights alone, which exceeds a single B200's 192GB; models that size are sharded across NVLink-connected GPUs. A single B200 fits 70B-class FP8 models with ample room for context windows.
Training: with optimization, 405B-class models train at 500-700 tokens/sec; mixed precision runs 30-40% faster than FP32.
Context: sequences up to 1M tokens are feasible; around 256K tokens works without special tuning.
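Combining the throughput figures above with the on-demand rate gives a rough serving cost per token. The $7/hr midpoint and the per-model throughput points are taken from the ranges in this article.

```python
# Cost-per-million-tokens estimate from the benchmark and pricing numbers above.
def cost_per_million_tokens(usd_per_hour: float, tokens_per_sec: float) -> float:
    """USD to generate one million tokens at a sustained rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return round(usd_per_hour / tokens_per_hour * 1_000_000, 2)

# Llama 2 70B at ~600 tok/s, $7/hr on-demand midpoint:
print(cost_per_million_tokens(7.0, 600))   # 3.24
# Llama 3.1 405B at ~250 tok/s:
print(cost_per_million_tokens(7.0, 250))   # 7.78
```

Spot pricing (50-70% off) scales these figures down proportionally.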
Competing Alternatives
| Provider | GPU | $/hr | Memory | Best For |
|---|---|---|---|---|
| AWS | GB200 | $6-8 | 192GB | Large-scale production |
| Google Cloud | H100 | $3-5 | 80GB | TPU bundle |
| Azure | H200 | $4-6 | 141GB | Windows shops |
| CoreWeave | GB200 | Variable | 192GB | Direct GPU rental |
| Lambda | B200 SXM | $6.08 | 192GB | Community support |
AWS wins on maturity and compliance. CoreWeave and Lambda are simpler if you just want GPUs.
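One way to normalize the table above is dollars per GB of GPU memory per hour, since memory capacity often gates which models you can serve. Prices use each row's midpoint; CoreWeave is omitted because its rate is listed as "Variable".

```python
# Price-per-GB-of-HBM comparison from the table above (midpoint rates).
providers = {
    "AWS GB200":   {"usd_hr": 7.00, "mem_gb": 192},   # $6-8 midpoint
    "GCP H100":    {"usd_hr": 4.00, "mem_gb": 80},    # $3-5 midpoint
    "Azure H200":  {"usd_hr": 5.00, "mem_gb": 141},   # $4-6 midpoint
    "Lambda B200": {"usd_hr": 6.08, "mem_gb": 192},
}

for name, p in providers.items():
    per_gb = p["usd_hr"] / p["mem_gb"]
    print(f"{name}: ${per_gb:.4f} per GB-hour")
```

By this metric the higher-memory GPUs are competitive despite higher sticker prices; it ignores compute throughput, so treat it as one axis, not a verdict.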
FAQ
Can I run GB200 instances in spot mode? Yes, spot GB200 instances cost 50-70% less but risk interruption.
Does AWS provide fractional GB200 access? The GB200 supports multi-instance GPU (MIG) mode, which partitions a GPU into smaller isolated instances.
What is the minimum monthly commitment? On-demand instances have no minimum. Reserved instances require 1 or 3-year terms.
How long does GB200 provisioning take? Most instances launch within 3-5 minutes.
What regions offer GB200? Availability varies. US regions (us-east, us-west) typically have capacity.
Related Resources
Sources
- NVIDIA GB200 Grace Blackwell Superchip Specifications
- AWS EC2 Instance Types Documentation (official)
- NVIDIA CUDA Toolkit & cuDNN Documentation