GB200 on AWS: Pricing, Specs & How to Rent

Deploybase · October 7, 2025 · GPU Pricing

GB200 GPU Overview

This guide covers GB200 pricing on AWS. The GB200 is NVIDIA's Grace Blackwell superchip: one Grace CPU (72 Arm Neoverse V2 cores) paired with two B200 GPUs, connected via NVLink-C2C. The rack-scale GB200 NVL72 system links 36 Grace CPUs and 72 B200 GPUs into a single NVLink domain. Each B200 GPU carries 192GB of HBM3e at roughly 8.0 TB/s of bandwidth, making it well suited to mega-scale inference and long-context workloads (1M+ tokens).

Specs:

  • Memory: 192GB HBM3e (per B200 GPU)
  • Memory Bandwidth: ~8.0 TB/s (per B200 GPU)
  • CUDA Cores: 26,624
  • FP8 peak: ~9 PFLOPS (per B200 GPU)
  • Grace CPU: 72 Neoverse V2 cores, 480GB LPDDR5x
  • NVLink-C2C: 900 GB/s (Grace-to-Blackwell)
  • Partitionable (multi-instance GPU mode)

For related GPUs, see B200 specs and H200 specs.

AWS GB200 Pricing

GB200 on AWS runs about $6-8/hour on-demand, depending on region. One-year reserved pricing takes roughly 30-40% off, three-year reserved 50-60% off, and spot instances save 50-70%.

Data transfer out: $0.02/GB. Inbound free. Some regions require capacity reservation signup.
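The rates above translate into monthly spend with simple arithmetic. A back-of-envelope sketch using the midpoint of the ranges quoted here (indicative assumptions, not actual AWS billing, which varies by region and term):

```python
# Back-of-envelope monthly cost for one GB200 instance, using the
# indicative rates above (assumptions, not quoted AWS prices).
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(hourly_rate: float, discount: float = 0.0) -> float:
    """Monthly cost in USD after a fractional discount (0.35 = 35% off)."""
    return hourly_rate * (1 - discount) * HOURS_PER_MONTH

on_demand = monthly_cost(7.00)          # midpoint of the $6-8/hr range
reserved_1y = monthly_cost(7.00, 0.35)  # middle of the 30-40% discount
spot = monthly_cost(7.00, 0.60)         # middle of the 50-70% savings

print(f"on-demand:    ${on_demand:,.0f}/mo")
print(f"1yr reserved: ${reserved_1y:,.0f}/mo")
print(f"spot:         ${spot:,.0f}/mo")
```

Even at the low end, a month of on-demand GB200 time runs into the thousands of dollars, which is why reserved and spot pricing matter at this tier.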

Compare with Google Cloud and Azure if cost matters.

How to Rent GB200 on AWS

  1. Log into AWS Console.
  2. Go to EC2 → Launch Instance.
  3. Search "gb200" among instance types (the P6e-GB200 family; the P5 family is H100-based).
  4. Pick Ubuntu 22.04 or a deep learning AMI (CUDA/PyTorch pre-installed).
  5. Configure security groups and SSH.
  6. Launch. Takes 3-5 minutes.

AWS deep learning AMIs save setup time. Just SSH in and start training or serving.
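The console steps above can also be scripted with boto3. A minimal sketch that assembles the `run_instances` arguments; the instance type, AMI ID, and key name below are placeholders, so confirm the actual GB200 instance-type name and a current Deep Learning AMI for your region before launching:

```python
# Sketch of launching via boto3 rather than the console. The instance
# type and AMI ID are PLACEHOLDERS -- look up the real GB200 instance
# family and a current Deep Learning AMI in your region first.
def build_launch_params(instance_type: str, ami_id: str, key_name: str) -> dict:
    """Assemble keyword arguments for ec2.run_instances()."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "MinCount": 1,
        "MaxCount": 1,
        "TagSpecifications": [{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "gb200-dev"}],
        }],
    }

params = build_launch_params("p6e-gb200.xlarge",       # placeholder type
                             "ami-0123456789abcdef0",  # placeholder AMI
                             "my-key")
# With credentials configured, the actual launch would be:
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   response = ec2.run_instances(**params)
print(params["InstanceType"])
```

Remember to attach a security group that restricts SSH to your own IP range rather than leaving port 22 open.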

Common uses: large-scale inference serving, fine-tuning massive models, long-context document processing, multi-model orchestration.

Also check RunPod and Lambda pricing if you want alternatives.

Performance Benchmarks

Inference (single GB200):

  • Llama 2 70B (FP8): 500-700 tokens/sec
  • Llama 3 405B (FP8): 200-300 tokens/sec

At FP8 (1 byte per parameter), Llama 3 405B needs roughly 405GB for weights alone, so it spans multiple B200s; the NVL72's unified NVLink domain is what makes serving it practical. Llama 2 70B (~70GB at FP8) fits on a single 192GB GPU with room left for context.
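A back-of-envelope on weight memory makes the fit question concrete (assuming 1 byte per parameter at FP8 and ignoring KV cache and activation memory):

```python
# Which model weights fit in a single B200's 192GB at FP8?
# 1 byte/param, so N billion params ~= N GB of weights.
B200_HBM_GB = 192

def weight_gb(params_billion: float, bytes_per_param: float = 1.0) -> float:
    """Weight memory in GB; ignores KV cache and activations."""
    return params_billion * bytes_per_param

for name, size in [("Llama 2 70B", 70), ("Llama 3 405B", 405)]:
    need = weight_gb(size)
    verdict = "fits" if need <= B200_HBM_GB else "needs multiple GPUs"
    print(f"{name}: ~{need:.0f} GB at FP8 -> {verdict}")
```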

Training: 405B models run at 500-700 tokens/sec with optimization. Mixed precision is 30-40% faster than FP32.

Context: sequences up to 1M tokens are feasible; contexts up to roughly 256K typically work without special tuning.
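KV-cache growth is what makes long contexts memory-hungry. A rough sizing sketch, assuming a Llama-2-70B-like layout (80 layers, 8 KV heads via GQA, head_dim 128) and a 1-byte FP8 cache; these architecture numbers are illustrative assumptions, not measurements:

```python
# Rough KV-cache sizing for long contexts. Architecture values below
# (80 layers, 8 KV heads, head_dim 128) are illustrative assumptions.
def kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128,
                             bytes_per_elem=1):
    # Two tensors per layer (K and V), each kv_heads * head_dim elements.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token()  # ~160 KB per token
for tokens in (256_000, 1_000_000):
    gb = per_token * tokens / 1e9
    print(f"{tokens:>9,} tokens -> ~{gb:.0f} GB of KV cache")
```

Under these assumptions a 1M-token cache alone approaches the 192GB of a single B200, which is why million-token serving is pitched at the NVL72 scale rather than a lone GPU.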

Competing Alternatives

| Provider | GPU | $/hr | Memory | Best For |
| --- | --- | --- | --- | --- |
| AWS | GB200 | $6-8 | 192GB | Large-scale production |
| Google Cloud | H100 | $3-5 | 80GB | TPU bundle |
| Azure | H200 | $4-6 | 141GB | Windows shops |
| CoreWeave | GB200 | Variable | 192GB | Direct GPU rental |
| Lambda | B200 SXM | $6.08 | 192GB | Community support |

AWS wins on maturity and compliance. CoreWeave and Lambda are simpler if you just want GPUs.
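Raw $/hr hides the memory differences between these GPUs. Normalizing by HBM capacity (using midpoints of the indicative rates in the table, which are assumptions rather than quoted prices) gives another angle on value:

```python
# Price per GB of HBM per hour, using midpoints of the table's
# indicative rates (rough assumptions, not quotes).
providers = [
    ("AWS GB200", 7.0, 192),          # midpoint of $6-8
    ("Google Cloud H100", 4.0, 80),   # midpoint of $3-5
    ("Azure H200", 5.0, 141),         # midpoint of $4-6
    ("Lambda B200 SXM", 6.08, 192),
]
for name, rate, mem_gb in sorted(providers, key=lambda p: p[1] / p[2]):
    print(f"{name}: ${rate / mem_gb:.4f}/hr per GB of HBM")
```

By this metric the 192GB parts look cheaper per byte of memory than their headline rates suggest, which matters when memory capacity, not raw compute, is the bottleneck.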

FAQ

Can I run GB200 instances in spot mode? Yes, spot GB200 instances cost 50-70% less but risk interruption.

Does AWS provide fractional GB200 access? The GB200 supports multi-instance GPU (MIG) mode, which partitions a GPU into smaller isolated allocations.

What is the minimum monthly commitment? On-demand instances have no minimum. Reserved instances require 1 or 3-year terms.

How long does GB200 provisioning take? Most instances launch within 3-5 minutes.

What regions offer GB200? Availability varies. US regions (us-east, us-west) typically have capacity.

Sources

  • NVIDIA GB200 Grace Blackwell Superchip Specifications
  • AWS EC2 Instance Types Documentation (official)
  • NVIDIA CUDA Toolkit & cuDNN Documentation