Contents
- NVIDIA GB200 NVL72: Overview
- What Is the GB200 NVL72?
- Hardware Specs & Architecture
- Performance Metrics & Benchmarks
- Cloud Pricing & Availability
- Use Cases & Deployment Scenarios
- GB200 NVL72 vs H100 vs B200
- FAQ
- Related Resources
- Sources
NVIDIA GB200 NVL72: Overview
The NVIDIA GB200 NVL72 represents the cutting edge of hyperscaler AI infrastructure, combining the Grace CPU with the Blackwell GPU in a liquid-cooled, 72-GPU rack system. This architecture delivers massive parallelism for trillion-parameter model training and inference at scale. Understanding the GB200 NVL72 specs and cloud pricing helps teams evaluate whether this system fits their training requirements and operational budgets.
The NVL72 is purpose-built for sovereign AI initiatives and hyperscaler workloads that demand extreme density. Unlike traditional GPU clusters, this system integrates compute, memory, and cooling into a single optimized unit, eliminating many deployment bottlenecks.
| Specification | Value |
|---|---|
| GB200 Superchips per Rack | 36 |
| Total B200 GPUs per Rack | 72 |
| Total Grace CPUs per Rack | 36 |
| GPU Type | NVIDIA Blackwell (B200) |
| Peak FP4 Performance | 1.4 exaFLOPS |
| Total HBM3e Memory | 13.5TB |
| Connectivity | InfiniBand / NVLink |
| Power Per Rack | ~120kW |
| Cooling | Liquid-cooled only |
| Form Factor | Full-rack system |
What Is the GB200 NVL72?
The GB200 NVL72 is not just a GPU. It's a unified computing platform pairing NVIDIA's Grace CPU with Blackwell GPUs. Each GB200 Superchip pairs one 72-core Grace CPU with two B200 GPUs via NVLink-C2C. The NVL72 contains 36 GB200 Superchips per rack, totaling 72 B200 GPUs and 36 Grace CPUs connected via NVLink Switch.
Grace CPU + Blackwell GPU Pairing
The Grace processor is a 72-core Arm Neoverse V2 CPU optimized for AI and data-center workloads. It shares unified memory with the two adjacent B200 GPUs via NVLink-C2C at 900 GB/s, enabling zero-copy data sharing. This tight coupling eliminates the PCIe bottlenecks that constrain traditional CPU-GPU systems.
Each B200 GPU has 192GB of HBM3e DRAM. Each Grace CPU contributes 480GB of LPDDR5X memory. Combined, each GB200 Superchip has 192GB × 2 + 480GB = 864GB addressable memory, with 36 Superchips per rack giving approximately 31TB total system memory (13.5TB HBM3e GPU memory across 72 GPUs).
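These totals can be sanity-checked with a few lines of arithmetic, using the nominal figures above ("TB" here is binary, 1024 GB):

```python
# Nominal memory figures from the spec table above (GB; binary TB = 1024 GB).
HBM3E_PER_GPU_GB = 192       # B200 HBM3e
LPDDR5X_PER_CPU_GB = 480     # Grace LPDDR5X
GPUS_PER_SUPERCHIP = 2
SUPERCHIPS_PER_RACK = 36

# Addressable memory per GB200 Superchip: two GPUs plus one Grace CPU.
superchip_gb = HBM3E_PER_GPU_GB * GPUS_PER_SUPERCHIP + LPDDR5X_PER_CPU_GB

# Rack-level totals.
rack_hbm_tb = HBM3E_PER_GPU_GB * GPUS_PER_SUPERCHIP * SUPERCHIPS_PER_RACK / 1024
rack_total_tb = superchip_gb * SUPERCHIPS_PER_RACK / 1024

print(superchip_gb, rack_hbm_tb, round(rack_total_tb, 1))  # 864 13.5 30.4
```

The 864GB-per-Superchip and 13.5TB HBM figures fall out exactly; the full-rack total lands at ~30.4TB, which the text rounds to ~31TB.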
NVL72 Rack Architecture
The NVL72 organizes 36 GB200 Superchips (72 GPUs) in a liquid-cooled rack. Within the rack, all 72 GPUs are joined through fifth-generation NVLink via NVLink Switch trays into a single NVLink domain; Quantum InfiniBand or Spectrum-X Ethernet handles rack-to-rack traffic, with in-network computing offloading reductions. The fabric sustains microsecond-scale all-reduce operations essential for distributed training.
This tight density comes with tradeoffs. Developers cannot buy individual GB200 GPUs. Hyperscalers must purchase entire racks or negotiate smaller configurations. Entry cost is substantial. There's no per-card pricing model.
Design Philosophy
NVIDIA positioned the GB200 NVL72 for a specific segment: companies training models at trillion-parameter scale with massive budgets. The system assumes:
- Liquid cooling infrastructure already exists
- Power delivery can handle ~120kW continuous draw
- Networking engineers can orchestrate InfiniBand fabrics
- Workloads can tolerate tightly coupled, immobile hardware
This is not a flexible multi-cloud GPU offering. This is infrastructure for dedicated AI labs and hyperscalers building their own data centers.
Hardware Specs & Architecture
GPU Specifications
NVIDIA Blackwell B200
- Streaming Multiprocessors: dual-die design, with the two dies joined by a 10 TB/s chip-to-chip link and presented as a single GPU
- CUDA Cores: NVIDIA has not published a complete per-die SM/CUDA-core breakdown for B200
- Tensor Performance: ~9 petaFLOPS FP8 per GPU; ~1.4 exaFLOPS FP4 per NVL72 rack
- Memory Bandwidth: 8.0 TB/s per GPU
- Memory Capacity: 192GB HBM3e per GPU
- Max Power: ~1,000W per GPU at full utilization
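The per-GPU figures above roll up to rack level as follows (a sketch; the per-GPU FP8 number is simply the rack figure divided by 72):

```python
# Per-GPU figures from the list above; rack-level roll-up.
GPUS_PER_RACK = 72
FP8_PFLOPS_PER_GPU = 10.0    # 720 PFLOPS/rack / 72 GPUs (GB200-configured B200)
HBM_BW_TBPS_PER_GPU = 8.0
GPU_MAX_POWER_W = 1000

rack_fp8_pflops = FP8_PFLOPS_PER_GPU * GPUS_PER_RACK       # 720 PFLOPS
rack_hbm_bw_tbps = HBM_BW_TBPS_PER_GPU * GPUS_PER_RACK     # 576 TB/s aggregate
gpus_only_kw = GPU_MAX_POWER_W * GPUS_PER_RACK / 1000      # 72 kW for GPUs alone
# Grace CPUs, NVLink Switch trays, and fans push the full rack well past
# the GPU-only figure, toward the ~120 kW rack budget discussed elsewhere.
```

Note that the GPUs alone account for ~72kW; the gap between that and the full-rack power budget is CPUs, switches, and cooling infrastructure.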
The Blackwell architecture introduces fifth-generation Tensor Cores supporting FP4, FP8, and mixed-precision formats. Native FP4 support enables higher inference throughput than previous generations at equivalent hardware cost.
System-Level Architecture
Memory Subsystem
- 13.5TB total HBM3e distributed across 72 GPUs
- Unified memory space between each Grace CPU and B200 pair
- Coherent caches enable lock-free synchronization primitives
Interconnect Fabric
- 72 GPUs connected via fifth-generation NVLink (up to 1.8 TB/s per GPU) through NVLink Switch trays
- InfiniBand Quantum or Spectrum-X for rack-to-rack communication
- Latency: microsecond-scale collective operations on a single rack
Thermal Management
- Liquid cooling mandatory (no passive or air-cooled variants)
- Dual-loop cooling for redundancy
- Cold plates on each compute and switch tray, fed by coolant distribution units (CDUs)
- ~120kW total thermal load per full rack
Power Distribution
- Requires three-phase 415V input (in many regions)
- Redundant PSUs with N+1 failover
- PDU integration for smart power metering
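As a sanity check on feeder sizing, the standard balanced three-phase formula gives the line current a rack of this class draws. The ~120kW load matches the rack budget above; the voltage and power factor are assumptions:

```python
import math

def three_phase_line_current_a(power_w: float, line_voltage_v: float = 415.0,
                               power_factor: float = 0.95) -> float:
    """Balanced three-phase line current: I = P / (sqrt(3) * V_LL * PF)."""
    return power_w / (math.sqrt(3) * line_voltage_v * power_factor)

rack_amps = three_phase_line_current_a(120_000)  # ~176 A for a ~120 kW rack
```

At that draw, each rack needs roughly a 200A three-phase feed with headroom, which is why NVL72 deployments assume purpose-built power distribution.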
Performance Metrics & Benchmarks
Compute Density
The NVL72 delivers extreme density in a single rack.
- FP4 Performance: 1.4 exaFLOPS per rack (confirmed via NVIDIA datasheet)
- FP8 Performance: 720 PFLOPS (0.72 exaFLOPS) per rack (dense)
- Higher-precision formats (BF16, TF32): proportionally reduced throughput in exchange for accuracy
For reference, an 8x H100 node delivers roughly 32 petaFLOPS of sparse FP8 compute; at 720 petaFLOPS FP8, the NVL72 packs roughly 20x that into a single rack. Density matters when training trillion-parameter models.
Memory Bandwidth
Each GPU sustains 8.0 TB/s from its local HBM3e. Under perfect scaling, a full rack sees approximately 576 TB/s aggregate memory bandwidth across all 72 GPUs. Real-world collective operations (all-reduce, broadcast) typically realize ~70-80% of this due to synchronization overhead.
Collective Communication Performance
Distributed training requires moving weight gradients between GPUs. The NVL72's InfiniBand interconnect achieves microsecond-scale all-reduce latency.
Latency vs. H100 Clusters:
- H100 cluster (8-GPU nodes + Ethernet): 100-500 microseconds per collective
- NVL72 (single rack): 5-20 microseconds per collective
This 10-50x latency reduction compounds across millions of training steps, translating to 5-15% wall-clock speedup in large-batch distributed training.
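A back-of-envelope model shows how collective latency compounds. The latencies are the midpoints of the ranges quoted above; the step count and collectives-per-step are assumptions for illustration:

```python
# Cumulative time spent waiting on collectives over a long training run.
# Step count and collectives-per-step are assumed; latencies are the
# midpoints of the ranges quoted above.
STEPS = 2_000_000
COLLECTIVES_PER_STEP = 50          # e.g. one all-reduce per gradient bucket
H100_LATENCY_S = 300e-6            # midpoint of 100-500 us
NVL72_LATENCY_S = 12e-6            # midpoint of 5-20 us

def collective_hours(latency_s: float) -> float:
    return STEPS * COLLECTIVES_PER_STEP * latency_s / 3600

saved_hours = collective_hours(H100_LATENCY_S) - collective_hours(NVL72_LATENCY_S)
# ~8 hours of pure latency removed, before counting reduced straggler
# effects and better communication/computation overlap
```

Pure latency savings are only part of the quoted 5-15% wall-clock gain; the rest comes from overlap and reduced stragglers.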
Thermal Envelope
The liquid-cooled design enables sustained maximum clocks without thermal throttling. GPU Boost can hold peak frequencies longer than in air-cooled systems, yielding consistent performance across training workloads.
Sparsity & MoE Support
Blackwell retains hardware support for 2:4 structured sparsity and pairs it with low-precision formats that suit Mixture-of-Experts (MoE) models, where the router activates only a subset of experts per token and the remaining computation is skipped at the software level.
For a 1 trillion parameter MoE model with 4-of-16 routing (1/4 of experts active per token), per-token compute drops to roughly a quarter of the dense cost. The rack-level throughput ceilings are:
- Dense FP8: 720 PFLOPS per rack
- Sparse FP4: 1.44 exaFLOPS per rack
This is critical for inference and training at extreme scale. H100 also offers 2:4 structured sparsity but lacks native FP4, so comparable low-precision serving requires software quantization workarounds.
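The effective-compute arithmetic for the 4-of-16 routing example above is straightforward (a sketch; expert counts are from the example, not a fixed property of the hardware):

```python
# Effective compute for sparsely-activated MoE routing (software-level sparsity).
TOTAL_EXPERTS = 16
ACTIVE_EXPERTS = 4                 # 4-of-16 routing from the example above

active_fraction = ACTIVE_EXPERTS / TOTAL_EXPERTS

def moe_effective_pflops(dense_pflops: float) -> float:
    """FLOPs actually executed when only active_fraction of experts run."""
    return dense_pflops * active_fraction

# A step that would cost 720 PFLOPS dense needs only ~180 PFLOPS of
# executed work with this routing.
```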
Precision & Quantization
B200 supports multiple numerical formats natively:
| Precision | Performance | Use Case |
|---|---|---|
| FP4 | 1.44 exaFLOPS (sparse) / 720 PFLOPS (dense) | Inference, ultra-low memory |
| FP8 | 720 PFLOPS (dense) | Training, inference |
| BF16 | ~360 petaFLOPS | Inference, mixed-precision |
| TF32 | ~180 petaFLOPS | Validation, high accuracy |
Native FP4 is a game changer for trillion-parameter inference. The previous generation (H100) required external quantization libraries, adding latency overhead; B200 handles FP4 arithmetic in hardware.
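The practical upshot of the precision table is memory footprint. A quick weights-only calculation (a sketch; real deployments add KV cache, activations, and framework overhead):

```python
# Weight footprint per precision for a 1T-parameter model (weights only).
BYTES_PER_PARAM = {"FP32": 4.0, "BF16": 2.0, "FP8": 1.0, "FP4": 0.5}
PARAMS = 1_000_000_000_000

weight_tb = {fmt: PARAMS * b / 1e12 for fmt, b in BYTES_PER_PARAM.items()}
# FP32: 4.0 TB, BF16: 2.0 TB, FP8: 1.0 TB, FP4: 0.5 TB -- at FP4, a
# 1T-parameter model's weights occupy a small fraction of the rack's
# 13.5 TB of HBM3e.
```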
Cloud Pricing & Availability
As of March 2026, the GB200 NVL72 is not available through mainstream cloud providers as an on-demand, pay-per-hour service. NVIDIA released systems to select hyperscalers (AWS, Google Cloud, Azure) with limited general availability.
Current Availability Status
AWS: Tentative Q2 2026 availability under the "p5e" instance family. Reserved instances will likely be offered at a 30-50% discount to on-demand rates.
Google Cloud: Announced but not yet available. Expected under "a4-megagpu" family. Pricing expected to align with H100 pricing multiplied by 1.3-1.5x (GB200 provides ~2.5x H100 performance in FP8).
Azure: Azure Quantum Elements includes GB200 via partnerships but not directly. Pricing through partners expected to be premium.
CoreWeave & Lambda: Specialty cloud providers with limited GB200 stock are onboarding systems. CoreWeave GPU pricing may include GB200 nodes as of Q2 2026. Expect hourly rates around $40-80 per GPU per hour (vs. $2-4 for H100).
Pricing Model Expectations
When available, pricing will likely follow this structure:
Per-GPU Hourly Rate
- Estimated $50-75/hour per B200 GPU (on-demand)
- $30-45/hour per GPU (1-year commitment)
- $20-30/hour per GPU (3-year commitment)
Full-Rack Pricing
- Estimated $3,600-5,400/hour for 72-GPU rack
- Monthly: $2.6-3.9M per rack
- Annual: $31.5-47.3M per rack at on-demand rates
These are estimates based on H100 pricing and announced NVIDIA positioning. Actual pricing depends on hyperscaler demand and capacity constraints.
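The full-rack figures above follow mechanically from the per-GPU estimates; a small calculator makes the roll-up explicit (all rates are the estimates from this section, not published prices):

```python
def rack_cost_usd(per_gpu_hourly: float, gpus: int = 72,
                  hours_per_month: float = 730.0) -> dict:
    """Roll an assumed per-GPU hourly rate up to rack-level costs."""
    hourly = per_gpu_hourly * gpus
    monthly = hourly * hours_per_month
    return {"hourly": hourly, "monthly": monthly, "annual": monthly * 12}

low, high = rack_cost_usd(50.0), rack_cost_usd(75.0)
# low:  $3,600/hour, ~$2.6M/month, ~$31.5M/year
# high: $5,400/hour, ~$3.9M/month, ~$47.3M/year
```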
Comparison to Alternatives
| System | Cost/Hour | Relative FP8 Perf per GPU | Availability |
|---|---|---|---|
| H100 SXM | $3-4 | 1.0x baseline | Wide |
| B200 (standalone) | $5-7 | ~2.25x | Limited |
| GB200 NVL72 | $50-75/GPU | ~2.5x (plus rack-scale NVLink) | Very limited |
The GB200 NVL72 premium reflects not just raw performance but system reliability, interconnect quality, and hyperscaler margin. A single failed GPU in a 72-GPU rack is manageable. Thermal management and InfiniBand configuration are complex; hyperscalers charge accordingly.
Use Cases & Deployment Scenarios
Trillion-Parameter Model Training
The primary use case for GB200 NVL72 is training models in the 1-100 trillion parameter range. Companies like Google, Meta, and OpenAI are already using comparable systems internally.
Example: Training a 10 trillion parameter MoE model with 32 expert groups requires extreme memory and bandwidth. A single NVL72 rack can hold the low-precision model weights; optimizer states and activations are sharded across Grace memory and additional racks. Distributed training across 8 racks makes the 10-100 trillion parameter range feasible.
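Why multiple racks? Training state dwarfs the weights themselves. A rough budget, assuming a common mixed-precision recipe (FP8 weights, BF16 gradients, FP32 master copy and Adam moments; the recipe and rack capacity are assumptions):

```python
# Rough training-state budget for a 10T-parameter model.
PARAMS = 10e12
BYTES_PER_PARAM = {
    "weights_fp8": 1,
    "grads_bf16": 2,
    "master_fp32": 4,
    "adam_moments_fp32": 8,   # two FP32 moments per parameter
}

state_tb = sum(PARAMS * b / 1e12 for b in BYTES_PER_PARAM.values())  # 150 TB
RACK_TB = 30.4                        # HBM3e + Grace LPDDR5X per rack
racks_for_state = state_tb / RACK_TB  # ~5 racks before counting activations
```

With activations and headroom on top of the ~150TB of optimizer state, an 8-rack deployment for models at this scale is plausible.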
Sovereign AI Initiatives
National AI programs in Europe, Japan, and others are purchasing GB200 systems to build independent large-scale training infrastructure. The systems stay in-country and offer strategic independence from US-based hyperscaler APIs.
Foundation Model Research
Academic institutions and research labs with sufficient funding can lease GB200 time from AWS or Google Cloud to benchmark novel training algorithms. The system's density and performance enable rapid iteration on large-scale experiments.
Real-Time Inference at Scale
While primarily for training, GB200 supports inference workloads requiring extreme parallelism. A single NVL72 rack can serve thousands of concurrent requests for moderately large models (10-100B parameters) at sub-100ms latency.
GB200 NVL72 vs H100 vs B200
Raw Performance Comparison
FP8 Throughput (Peak, with sparsity)
- H100 SXM: ~4 petaFLOPS per GPU
- B200: ~9 petaFLOPS per GPU (~2.25x H100)
- NVL72 rack: 720 petaFLOPS (72 GB200-configured B200s at ~10 petaFLOPS each)
Memory Bandwidth per GPU
- H100 SXM: 3.35 TB/s
- B200: 8.0 TB/s
- NVL72: ~576 TB/s aggregate (72 × 8.0 TB/s per rack)
Practical Training Performance
Wall-clock time to train a 10B parameter model on 100M tokens:
- Single H100: ~8 hours
- 8x H100 (distributed): ~1.5 hours (5.3x; sub-linear due to synchronization overhead)
- Single B200: ~5.6 hours
- 8x B200 (distributed): ~0.9 hours (6.2x; sub-linear due to synchronization overhead)
- NVL72 rack (72 B200): ~0.12 hours (70x improvement over single H100)
These numbers assume optimized training code. Real-world gains are 60-80% of theoretical due to synchronization, gradient all-reduce, and I/O overhead.
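The relationship between GPU count and realized speedup is just ideal scaling times an efficiency factor; the efficiency values below are chosen to reproduce the figures above:

```python
def realized_speedup(n_gpus: int, scaling_efficiency: float) -> float:
    """Ideal linear speedup degraded by a scaling-efficiency factor (0-1)."""
    return n_gpus * scaling_efficiency

eight_way = realized_speedup(8, 0.66)    # ~5.3x, matching the 8x figures above
full_rack = realized_speedup(72, 0.70)   # ~50x at 70% efficiency
```

At rack scale, even 70% efficiency leaves a ~50x speedup, consistent with the 40-55x real-world range quoted later in this article.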
Cost-Effectiveness
Cost per petaFLOPS-hour of FP8 throughput (all figures are the estimates from above):
- H100 cloud pricing ($3/hour, ~4 petaFLOPS): ~$0.75 per petaFLOPS-hour
- B200 cloud pricing ($6/hour, ~9 petaFLOPS): ~$0.67 per petaFLOPS-hour
- NVL72 estimated pricing ($60/GPU/hour x 72 GPUs, 720 petaFLOPS per rack): ~$6.00 per petaFLOPS-hour
On raw cost per FLOP, the NVL72 carries a substantial premium over standalone B200 or H100 instances. That premium buys the rack-scale NVLink domain: for workloads that cannot scale efficiently on loosely coupled clusters, shorter wall-clock time and higher utilization can justify it, but for exploratory work B200 or H100 remains cheaper.
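These comparisons are easy to get wrong on units, so here is the arithmetic spelled out per petaFLOPS-hour, with every input an estimate from this article:

```python
# Normalized cost per petaFLOPS-hour of FP8; every input is an estimate.
systems = {
    "H100 SXM": {"usd_per_hour": 3.0,        "fp8_pflops": 4.0},
    "B200":     {"usd_per_hour": 6.0,        "fp8_pflops": 9.0},
    "NVL72":    {"usd_per_hour": 60.0 * 72,  "fp8_pflops": 720.0},  # whole rack
}

usd_per_pflops_hour = {
    name: s["usd_per_hour"] / s["fp8_pflops"] for name, s in systems.items()
}
# H100 ~$0.75, B200 ~$0.67, NVL72 ~$6.00 per PFLOPS-hour
```

The key detail is that NVL72 cost must be taken per rack ($60 x 72 GPUs) against per-rack throughput; dividing a per-GPU rate by rack throughput understates the cost by a factor of 72.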
When to Choose Each
Use H100 if:
- Training models under 1 trillion parameters
- Need flexibility and spot pricing
- Budget constraints are tight
- Multi-cloud strategy is important
Use B200 if:
- Training 1-10 trillion parameter models
- Can commit to full-time utilization
- Cost-per-compute dominates over flexibility
Use NVL72 if:
- Training 10+ trillion parameter models
- Distributed training is non-negotiable
- Hyperscaler infrastructure exists in-country
- Willing to commit capital or long-term contracts
Training Time Comparison (Detailed)
Wall-clock training time for a 30B parameter model on 10B tokens (typical research-scale experiment):
| System | Batch Size | Time to Train |
|---|---|---|
| Single H100 | 32 | 72 hours |
| 8x H100 cluster (DDP) | 256 | 14 hours |
| 8x H100 cluster (TP) | 512 | 12 hours |
| Single B200 | 64 | 48 hours |
| 8x B200 cluster (DDP) | 512 | 8 hours |
| NVL72 (72x B200) | 4096 | 1.2 hours |
The NVL72 completes in 1.2 hours vs. 72 hours for single H100 (60x faster). Real-world improvements are 40-55x after accounting for synchronization overhead and I/O bottlenecks.
FAQ
Q: Can I buy a single GB200 GPU? No. The GB200 is sold as part of the NVL72 rack system. NVIDIA does not offer individual GB200 cards. The tight integration with Grace CPU makes standalone operation impractical.
Q: Does GB200 work with existing AI frameworks? Yes. PyTorch, JAX, and TensorFlow support GB200 through standard CUDA builds. No rewriting is needed, though optimizing collective operations for the NVLink/InfiniBand fabric is recommended.
Q: How much power does an NVL72 rack consume? Approximately 120kW under full utilization. Data-center planning requires on the order of 175A of 415V three-phase capacity per rack, plus cooling capacity for the matching thermal dissipation.
Q: What is the expected lifespan of GB200? NVIDIA typically supports GPUs for 3-5 years with security updates and driver improvements. The B200 was announced in March 2024, so support should extend through roughly 2029-2031.
Q: Can I use GB200 NVL72 for inference? Yes, but it's overkill for most inference workloads. A single NVL72 can serve 1,000-10,000 concurrent users of a 10B model with sub-100ms latency. Hyperscalers use smaller systems (single B200, or H100 clusters) for inference cost efficiency.
Q: How does NVL72 cooling work? Liquid cooling is mandatory. Coolant distribution units feed cold plates on each compute and switch tray. Facility cooling must supply water within NVIDIA's specified inlet temperature range; hyperscalers typically use facility chilled- or warm-water loops.
Q: What is the interconnect latency for all-reduce? Low single-digit microseconds within a rack over the NVLink domain. Rack-to-rack communication over Quantum InfiniBand adds 5-10 microseconds per hop. This overhead is manageable for model-parallel training but becomes expensive for extreme-scale pipeline parallelism.
Q: How does GB200 compare to custom ASIC systems (like TPU v6)? Custom ASICs (TPU, Cerebras, Graphcore) offer higher peak throughput but less flexibility. GB200 is general-purpose with software support across PyTorch, JAX, and TensorFlow. TPU requires XLA/JAX for peak performance. For rapid research iteration, GB200 wins on flexibility. For production-optimized systems, custom silicon may win on cost-per-FLOP.
Q: Can I split a single NVL72 rack between multiple teams? Not easily. NVIDIA sells the NVL72 as a single unit. Partitioning would require administrative overhead (SLURM job scheduling, quota enforcement). Most hyperscalers treat each NVL72 as a single tenant or allocate via time-slicing (e.g., Team A gets 12 hours, Team B gets 12 hours daily). This reduces per-team performance due to context resets. Some advanced operators use NVIDIA MIG (Multi-Instance GPU) partitioning, but this hasn't reached production on NVL72 yet.
Q: What software frameworks are optimized for GB200 NVL72? PyTorch (via CUDA 12.5+), JAX, TensorFlow/XLA, Megatron-LM, DeepSpeed, and FSDP all support GB200. Most frameworks require minimal code changes. The bigger optimization opportunity is in gradient compression and communication overlap, which are handled by distributed training libraries (DeepSpeed's ZeRO, FSDP's fully-sharded mode).
Q: How does maintenance/downtime work for NVL72 systems? Hyperscalers typically perform rolling maintenance on subset of capacity while keeping other racks operational. A single NVL72 outage impacts all 72 GPUs. Hyperscalers aim for 99%+ uptime via redundant cooling, power, and networking. For critical training runs, users can implement checkpointing every 1-2 hours to minimize data loss if hardware fails unexpectedly.
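The checkpoint-interval advice above can be derived from the Young/Daly first-order approximation; the checkpoint cost and rack MTBF below are illustrative assumptions:

```python
import math

def young_daly_interval_h(mtbf_h: float, checkpoint_cost_h: float) -> float:
    """First-order optimal checkpoint interval: T = sqrt(2 * C * MTBF)."""
    return math.sqrt(2 * checkpoint_cost_h * mtbf_h)

# Assumed: a 2-minute checkpoint write and a 50-hour effective MTBF
# for a densely packed 72-GPU rack.
interval_h = young_daly_interval_h(50.0, 2 / 60)   # ~1.8 hours
```

With those inputs the formula lands in the 1-2 hour range quoted above; cheaper checkpoints or less reliable hardware both push the optimal interval shorter.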
Q: What is the environmental impact of an NVL72 rack? Roughly 120kW rack power times a PUE of 1.3-1.5 gives ~155-180kW total facility draw per rack. A 1,000-hour training run therefore consumes on the order of 160-180 MWh; at a grid intensity of ~0.4 tons CO2 per MWh, that is roughly 65-70 tons of CO2 (strongly electricity-source dependent). Carbon-aware scheduling (training during low-carbon hours) can reduce emissions by 40-60%. Some hyperscalers offset via renewable energy credits.
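The energy and emissions arithmetic, spelled out (every input is an assumption: rack power as discussed above, a mid-range PUE, and a generic grid carbon intensity):

```python
# Energy and emissions for a 1,000-hour run; every input is an assumption.
RACK_KW = 120.0               # fully configured NVL72 rack
PUE = 1.4                     # facility overhead multiplier (1.3-1.5 typical)
RUN_HOURS = 1000.0
GRID_TCO2_PER_MWH = 0.4       # grid carbon intensity; varies widely by region

energy_mwh = RACK_KW * PUE * RUN_HOURS / 1000.0   # 168 MWh
emissions_t = energy_mwh * GRID_TCO2_PER_MWH      # ~67 t CO2
```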
Related Resources
- NVIDIA Blackwell Architecture Deep Dive
- B200 vs H100: Performance & Cost Analysis
- Distributed Training on GPU Clusters
- InfiniBand vs Ethernet for AI Workloads
- Liquid Cooling for Data Centers
Sources
- NVIDIA. "Blackwell and GB200 Specifications." nvidia.com/en-us/data-center/blackwell/
- NVIDIA. "NVL72 System Architecture." nvidia.com/en-us/data-center/nvl72/
- NVIDIA Hopper to Blackwell Product Brief (October 2024)
- AWS EC2 p5/p5e Instance Documentation (2026)
- Google Cloud TPU and GPU Pricing (March 2026)