Contents
- NVIDIA GB200 NVL72: Overview
- What Is the GB200 NVL72?
- Hardware Specs & Architecture
- Performance Metrics & Benchmarks
- Cloud Pricing & Availability
- Use Cases & Deployment Scenarios
- GB200 NVL72 vs H100 vs B200
- FAQ
- Related Resources
- Sources
NVIDIA GB200 NVL72: Overview
The NVIDIA GB200 NVL72 represents the cutting edge of hyperscaler AI infrastructure, combining the Grace CPU with the Blackwell GPU in a liquid-cooled, 72-GPU rack system. This architecture delivers massive parallelism for trillion-parameter model training and inference at scale. Understanding the GB200 NVL72 specs and cloud pricing helps teams evaluate whether this system fits their training requirements and operational budgets.
The NVL72 is purpose-built for sovereign AI initiatives and hyperscaler workloads that demand extreme density. Unlike traditional GPU clusters, this system integrates compute, memory, and cooling into a single optimized unit, eliminating many deployment bottlenecks.
| Specification | Value |
|---|---|
| GB200 Superchips per Rack | 36 |
| Total B200 GPUs per Rack | 72 |
| Total Grace CPUs per Rack | 36 |
| GPU Type | NVIDIA Blackwell (B200) |
| Peak FP4 Performance | 1.4 exaFLOPS |
| Total HBM3e Memory | 13.5TB |
| Connectivity | InfiniBand / NVLink |
| Power Per Rack | ~120kW |
| Cooling | Liquid-cooled only |
| Form Factor | Full-rack system |
What Is the GB200 NVL72?
The GB200 NVL72 is not just a GPU. It's a unified computing platform pairing NVIDIA's Grace CPU with Blackwell GPUs. Each GB200 Superchip pairs one 72-core Grace CPU with two B200 GPUs via NVLink-C2C. The NVL72 contains 36 GB200 Superchips per rack, totaling 72 B200 GPUs and 36 Grace CPUs connected via NVLink Switch.
Grace CPU + Blackwell GPU Pairing
The Grace processor is a 72-core Arm Neoverse V2 CPU optimized for AI and data-center workloads. It shares unified memory with the two adjacent B200 GPUs via NVLink-C2C at 900 GB/s, enabling zero-copy data sharing. This tight coupling eliminates the PCIe bottlenecks that constrain traditional CPU-GPU systems.
Each B200 GPU has 192GB of HBM3e DRAM. Each Grace CPU contributes 480GB of LPDDR5X memory. Combined, each GB200 Superchip has 192GB × 2 + 480GB = 864GB addressable memory, with 36 Superchips per rack giving approximately 31TB total system memory (13.5TB HBM3e GPU memory across 72 GPUs).
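These totals can be sanity-checked with a few lines of arithmetic, using the nominal figures above ("TB" here is binary, 1024 GB):

```python
# Nominal memory figures from the spec table above (GB; binary TB = 1024 GB).
HBM3E_PER_GPU_GB = 192       # B200 HBM3e
LPDDR5X_PER_CPU_GB = 480     # Grace LPDDR5X
GPUS_PER_SUPERCHIP = 2
SUPERCHIPS_PER_RACK = 36

# Addressable memory per GB200 Superchip: two GPUs plus one Grace CPU.
superchip_gb = HBM3E_PER_GPU_GB * GPUS_PER_SUPERCHIP + LPDDR5X_PER_CPU_GB

# Rack-level totals.
rack_hbm_tb = HBM3E_PER_GPU_GB * GPUS_PER_SUPERCHIP * SUPERCHIPS_PER_RACK / 1024
rack_total_tb = superchip_gb * SUPERCHIPS_PER_RACK / 1024

print(superchip_gb, rack_hbm_tb, round(rack_total_tb, 1))  # 864 13.5 30.4
```

The 864GB-per-Superchip and 13.5TB HBM figures fall out exactly; the full-rack total lands at ~30.4TB, which the text rounds to ~31TB.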
NVL72 Rack Architecture
The NVL72 organizes 36 GB200 Superchips (72 GPUs) in a liquid-cooled rack. Within the rack, all 72 GPUs are joined through fifth-generation NVLink via NVLink Switch trays into a single NVLink domain; Quantum InfiniBand or Spectrum-X Ethernet handles rack-to-rack traffic, with in-network computing offloading reductions. The fabric sustains microsecond-scale all-reduce operations essential for distributed training.
This tight density comes with tradeoffs. Developers cannot buy individual GB200 GPUs. Hyperscalers must purchase entire racks or negotiate smaller configurations. Entry cost is substantial. There's no per-card pricing model.
Design Philosophy
NVIDIA positioned the GB200 NVL72 for a specific segment: companies training models at trillion-parameter scale with massive budgets. The system assumes:
- Liquid cooling infrastructure already exists
- Power delivery can handle ~120kW continuous draw
- Networking engineers can orchestrate InfiniBand fabrics
- Workloads can tolerate tightly coupled, immobile hardware
This is not a flexible multi-cloud GPU offering. This is infrastructure for dedicated AI labs and hyperscalers building their own data centers.
Hardware Specs & Architecture
GPU Specifications
NVIDIA Blackwell B200
- Streaming Multiprocessors: dual-die design, with the two dies joined by a 10 TB/s chip-to-chip link and presented as a single GPU
- CUDA Cores: NVIDIA has not published a complete per-die SM/CUDA-core breakdown for B200
- Tensor Performance: ~9 petaFLOPS FP8 per GPU; ~1.4 exaFLOPS FP4 per NVL72 rack
- Memory Bandwidth: 8.0 TB/s per GPU
- Memory Capacity: 192GB HBM3e per GPU
- Max Power: ~1,000W per GPU at full utilization
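The per-GPU figures above roll up to rack level as follows (a sketch; the per-GPU FP8 number is simply the rack figure divided by 72):

```python
# Per-GPU figures from the list above; rack-level roll-up.
GPUS_PER_RACK = 72
FP8_PFLOPS_PER_GPU = 10.0    # 720 PFLOPS/rack / 72 GPUs (GB200-configured B200)
HBM_BW_TBPS_PER_GPU = 8.0
GPU_MAX_POWER_W = 1000

rack_fp8_pflops = FP8_PFLOPS_PER_GPU * GPUS_PER_RACK       # 720 PFLOPS
rack_hbm_bw_tbps = HBM_BW_TBPS_PER_GPU * GPUS_PER_RACK     # 576 TB/s aggregate
gpus_only_kw = GPU_MAX_POWER_W * GPUS_PER_RACK / 1000      # 72 kW for GPUs alone
# Grace CPUs, NVLink Switch trays, and fans push the full rack well past
# the GPU-only figure, toward the ~120 kW rack budget discussed elsewhere.
```

Note that the GPUs alone account for ~72kW; the gap between that and the full-rack power budget is CPUs, switches, and cooling infrastructure.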
The Blackwell architecture introduces fifth-generation Tensor Cores supporting FP4, FP8, and mixed-precision formats. Native FP4 support enables higher inference throughput than previous generations at equivalent hardware cost.
System-Level Architecture
Memory Subsystem
- 13.5TB total HBM3e distributed across 72 GPUs
- Unified memory space between each Grace CPU and B200 pair
- Coherent caches enable lock-free synchronization primitives
Interconnect Fabric
- 72 GPUs connected via fifth-generation NVLink (up to 1.8 TB/s per GPU) through NVLink Switch trays
- InfiniBand Quantum or Spectrum-X for rack-to-rack communication
- Latency: microsecond-scale collective operations on a single rack
Thermal Management
- Liquid cooling mandatory (no passive or air-cooled variants)
- Dual-loop cooling for redundancy
- Cold plates on each compute and switch tray, fed by coolant distribution units (CDUs)
- ~120kW total thermal load per full rack
Power Distribution
- Requires three-phase 415V input (in many regions)
- Redundant PSUs with N+1 failover
- PDU integration for smart power metering
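As a sanity check on feeder sizing, the standard balanced three-phase formula gives the line current a rack of this class draws. The ~120kW load matches the rack budget above; the voltage and power factor are assumptions:

```python
import math

def three_phase_line_current_a(power_w: float, line_voltage_v: float = 415.0,
                               power_factor: float = 0.95) -> float:
    """Balanced three-phase line current: I = P / (sqrt(3) * V_LL * PF)."""
    return power_w / (math.sqrt(3) * line_voltage_v * power_factor)

rack_amps = three_phase_line_current_a(120_000)  # ~176 A for a ~120 kW rack
```

At that draw, each rack needs roughly a 200A three-phase feed with headroom, which is why NVL72 deployments assume purpose-built power distribution.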
Performance Metrics & Benchmarks
Compute Density
The NVL72 delivers extreme density in a single rack.
- FP4 Performance: 1.4 exaFLOPS per rack (confirmed via NVIDIA datasheet)
- FP8 Performance: 720 PFLOPS (0.72 exaFLOPS) per rack (dense)
- Higher-precision formats (BF16, TF32): proportionally reduced throughput in exchange for accuracy
For reference, an 8x H100 node delivers roughly 32 petaFLOPS of sparse FP8 compute; at 720 petaFLOPS FP8, the NVL72 packs roughly 20x that into a single rack. Density matters when training trillion-parameter models.
Memory Bandwidth
Each GPU sustains 8.0 TB/s from its local HBM3e. Under perfect scaling, a full rack sees approximately 576 TB/s aggregate memory bandwidth across all 72 GPUs. Real-world collective operations (all-reduce, broadcast) typically realize ~70-80% of this due to synchronization overhead.
Collective Communication Performance
Distributed training requires moving weight gradients between GPUs. The NVL72's InfiniBand interconnect achieves microsecond-scale all-reduce latency.
Latency vs. H100 Clusters:
- H100 cluster (8-GPU nodes + Ethernet): 100-500 microseconds per collective
- NVL72 (single rack): 5-20 microseconds per collective
This 10-50x latency reduction compounds across millions of training steps, translating to 5-15% wall-clock speedup in large-batch distributed training.
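A back-of-envelope model shows how collective latency compounds. The latencies are the midpoints of the ranges quoted above; the step count and collectives-per-step are assumptions for illustration:

```python
# Cumulative time spent waiting on collectives over a long training run.
# Step count and collectives-per-step are assumed; latencies are the
# midpoints of the ranges quoted above.
STEPS = 2_000_000
COLLECTIVES_PER_STEP = 50          # e.g. one all-reduce per gradient bucket
H100_LATENCY_S = 300e-6            # midpoint of 100-500 us
NVL72_LATENCY_S = 12e-6            # midpoint of 5-20 us

def collective_hours(latency_s: float) -> float:
    return STEPS * COLLECTIVES_PER_STEP * latency_s / 3600

saved_hours = collective_hours(H100_LATENCY_S) - collective_hours(NVL72_LATENCY_S)
# ~8 hours of pure latency removed, before counting reduced straggler
# effects and better communication/computation overlap
```

Pure latency savings are only part of the quoted 5-15% wall-clock gain; the rest comes from overlap and reduced stragglers.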
Thermal Envelope
The liquid-cooled design enables sustained maximum clocks without thermal throttling. GPU Boost can hold peak frequencies longer than in air-cooled systems, yielding consistent performance across training workloads.
Sparsity & MoE Support
Blackwell retains hardware support for 2:4 structured sparsity and pairs it with low-precision formats that suit Mixture-of-Experts (MoE) models, where the router activates only a subset of experts per token and the remaining computation is skipped at the software level.
For a 1 trillion parameter MoE model with 4-of-16 routing (1/4 of experts active per token), per-token compute drops to roughly a quarter of the dense cost. The rack-level throughput ceilings are:
- Dense FP8: 720 PFLOPS per rack
- Sparse FP4: 1.44 exaFLOPS per rack
This is critical for inference and training at extreme scale. H100 also offers 2:4 structured sparsity but lacks native FP4, so comparable low-precision serving requires software quantization workarounds.
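The effective-compute arithmetic for the 4-of-16 routing example above is straightforward (a sketch; expert counts are from the example, not a fixed property of the hardware):

```python
# Effective compute for sparsely-activated MoE routing (software-level sparsity).
TOTAL_EXPERTS = 16
ACTIVE_EXPERTS = 4                 # 4-of-16 routing from the example above

active_fraction = ACTIVE_EXPERTS / TOTAL_EXPERTS

def moe_effective_pflops(dense_pflops: float) -> float:
    """FLOPs actually executed when only active_fraction of experts run."""
    return dense_pflops * active_fraction

# A step that would cost 720 PFLOPS dense needs only ~180 PFLOPS of
# executed work with this routing.
```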
Precision & Quantization
B200 supports multiple numerical formats natively:
| Precision | Performance | Use Case |
|---|---|---|
| FP4 | 1.44 exaFLOPS (sparse) / 720 PFLOPS (dense) | Inference, ultra-low memory |
| FP8 | 720 PFLOPS (dense) | Training, inference |
| BF16 | ~360 petaFLOPS | Inference, mixed-precision |
| TF32 | ~180 petaFLOPS | Validation, high accuracy |
Native FP4 is a game changer for trillion-parameter inference. The previous generation (H100) required external quantization libraries, adding latency overhead; B200 handles FP4 arithmetic in hardware.
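The practical upshot of the precision table is memory footprint. A quick weights-only calculation (a sketch; real deployments add KV cache, activations, and framework overhead):

```python
# Weight footprint per precision for a 1T-parameter model (weights only).
BYTES_PER_PARAM = {"FP32": 4.0, "BF16": 2.0, "FP8": 1.0, "FP4": 0.5}
PARAMS = 1_000_000_000_000

weight_tb = {fmt: PARAMS * b / 1e12 for fmt, b in BYTES_PER_PARAM.items()}
# FP32: 4.0 TB, BF16: 2.0 TB, FP8: 1.0 TB, FP4: 0.5 TB -- at FP4, a
# 1T-parameter model's weights occupy a small fraction of the rack's
# 13.5 TB of HBM3e.
```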
Cloud Pricing & Availability
As of March 2026, the GB200 NVL72 is not available through mainstream cloud providers as an on-demand, pay-per-hour service. NVIDIA released systems to select hyperscalers (AWS, Google Cloud, Azure) with limited general availability.
Current Availability Status
AWS: Tentative Q2 2026 availability under the "p5e" instance family. Reserved instances will likely be offered at a 30-50% discount to on-demand rates.
Google Cloud: Announced but not yet available. Expected under "a4-megagpu" family. Pricing expected to align with H100 pricing multiplied by 1.3-1.5x (GB200 provides ~2.5x H100 performance in FP8).
Azure: Azure Quantum Elements includes GB200 via partnerships but not directly. Pricing through partners expected to be premium.
CoreWeave & Lambda: Specialty cloud providers with limited GB200 stock are onboarding systems. CoreWeave GPU pricing may include GB200 nodes as of Q2 2026. Expect hourly rates around $40-80 per GPU per hour (vs. $2-4 for H100).
Pricing Model Expectations
When available, pricing will likely follow this structure:
Per-GPU Hourly Rate
- Estimated $50-75/hour per B200 GPU (on-demand)
- $30-45/hour per GPU (1-year commitment)
- $20-30/hour per GPU (3-year commitment)
Full-Rack Pricing
- Estimated $3,600-5,400/hour for 72-GPU rack
- Monthly: $2.6-3.9M per rack
- Annual: $31.5-47.3M per rack at on-demand rates
These are estimates based on H100 pricing and announced NVIDIA positioning. Actual pricing depends on hyperscaler demand and capacity constraints.
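The full-rack figures above follow mechanically from the per-GPU estimates; a small calculator makes the roll-up explicit (all rates are the estimates from this section, not published prices):

```python
def rack_cost_usd(per_gpu_hourly: float, gpus: int = 72,
                  hours_per_month: float = 730.0) -> dict:
    """Roll an assumed per-GPU hourly rate up to rack-level costs."""
    hourly = per_gpu_hourly * gpus
    monthly = hourly * hours_per_month
    return {"hourly": hourly, "monthly": monthly, "annual": monthly * 12}

low, high = rack_cost_usd(50.0), rack_cost_usd(75.0)
# low:  $3,600/hour, ~$2.6M/month, ~$31.5M/year
# high: $5,400/hour, ~$3.9M/month, ~$47.3M/year
```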
Comparison to Alternatives
| System | Cost/Hour | Relative FP8 Perf per GPU | Availability |
|---|---|---|---|
| H100 SXM | $3-4 | 1.0x baseline | Wide |
| B200 (standalone) | $5-7 | ~2.25x | Limited |
| GB200 NVL72 | $50-75/GPU | ~2.5x (plus rack-scale NVLink) | Very limited |
The GB200 NVL72 premium reflects not just raw performance but system reliability, interconnect quality, and hyperscaler margin. A single failed GPU in a 72-GPU rack is manageable. Thermal management and InfiniBand configuration are complex; hyperscalers charge accordingly.
Use Cases & Deployment Scenarios
Trillion-Parameter Model Training
The primary use case for GB200 NVL72 is training models in the 1-100 trillion parameter range. Companies like Google, Meta, and OpenAI are already using comparable systems internally.
Example: Training a 10 trillion parameter MoE model with 32 expert groups requires extreme memory and bandwidth. A single NVL72 rack can hold the low-precision model weights; optimizer states and activations are sharded across Grace memory and additional racks. Distributed training across 8 racks makes the 10-100 trillion parameter range feasible.
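Why multiple racks? Training state dwarfs the weights themselves. A rough budget, assuming a common mixed-precision recipe (FP8 weights, BF16 gradients, FP32 master copy and Adam moments; the recipe and rack capacity are assumptions):

```python
# Rough training-state budget for a 10T-parameter model.
PARAMS = 10e12
BYTES_PER_PARAM = {
    "weights_fp8": 1,
    "grads_bf16": 2,
    "master_fp32": 4,
    "adam_moments_fp32": 8,   # two FP32 moments per parameter
}

state_tb = sum(PARAMS * b / 1e12 for b in BYTES_PER_PARAM.values())  # 150 TB
RACK_TB = 30.4                        # HBM3e + Grace LPDDR5X per rack
racks_for_state = state_tb / RACK_TB  # ~5 racks before counting activations
```

With activations and headroom on top of the ~150TB of optimizer state, an 8-rack deployment for models at this scale is plausible.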
Sovereign AI Initiatives
National AI programs in Europe, Japan, and others are purchasing GB200 systems to build independent large-scale training infrastructure. The systems stay in-country and offer strategic independence from US-based hyperscaler APIs.
Foundation Model Research
Academic institutions and research labs with sufficient funding can lease GB200 time from AWS or Google Cloud to benchmark novel training algorithms. The system's density and performance enable rapid iteration on large-scale experiments.
Real-Time Inference at Scale
While primarily for training, GB200 supports inference workloads requiring extreme parallelism. A single NVL72 rack can serve thousands of concurrent requests for moderately large models (10-100B parameters) at sub-100ms latency.
GB200 NVL72 vs H100 vs B200
Raw Performance Comparison
FP8 Throughput (Peak, with sparsity)
- H100 SXM: ~4 petaFLOPS per GPU
- B200: ~9 petaFLOPS per GPU (~2.25x H100)
- NVL72 rack: 720 petaFLOPS (72 GB200-configured B200s at ~10 petaFLOPS each)
Memory Bandwidth per GPU
- H100 SXM: 3.35 TB/s
- B200: 8.0 TB/s
- NVL72: ~576 TB/s aggregate (72 × 8.0 TB/s per rack)
Practical Training Performance
Wall-clock time to train a 10B parameter model on 100M tokens:
- Single H100: ~8 hours
- 8x H100 (distributed): ~1.5 hours (5.3x; sub-linear due to synchronization overhead)
- Single B200: ~5.6 hours
- 8x B200 (distributed): ~0.9 hours (6.2x; sub-linear due to synchronization overhead)
- NVL72 rack (72 B200): ~0.12 hours (70x improvement over single H100)
These numbers assume optimized training code. Real-world gains are 60-80% of theoretical due to synchronization, gradient all-reduce, and I/O overhead.
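The relationship between GPU count and realized speedup is just ideal scaling times an efficiency factor; the efficiency values below are chosen to reproduce the figures above:

```python
def realized_speedup(n_gpus: int, scaling_efficiency: float) -> float:
    """Ideal linear speedup degraded by a scaling-efficiency factor (0-1)."""
    return n_gpus * scaling_efficiency

eight_way = realized_speedup(8, 0.66)    # ~5.3x, matching the 8x figures above
full_rack = realized_speedup(72, 0.70)   # ~50x at 70% efficiency
```

At rack scale, even 70% efficiency leaves a ~50x speedup, consistent with the 40-55x real-world range quoted later in this article.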
Cost-Effectiveness
Cost per petaFLOPS-hour of FP8 throughput (all figures are the estimates from above):
- H100 cloud pricing ($3/hour, ~4 petaFLOPS): ~$0.75 per petaFLOPS-hour
- B200 cloud pricing ($6/hour, ~9 petaFLOPS): ~$0.67 per petaFLOPS-hour
- NVL72 estimated pricing ($60/GPU/hour x 72 GPUs, 720 petaFLOPS per rack): ~$6.00 per petaFLOPS-hour
On raw cost per FLOP, the NVL72 carries a substantial premium over standalone B200 or H100 instances. That premium buys the rack-scale NVLink domain: for workloads that cannot scale efficiently on loosely coupled clusters, shorter wall-clock time and higher utilization can justify it, but for exploratory work B200 or H100 remains cheaper.
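These comparisons are easy to get wrong on units, so here is the arithmetic spelled out per petaFLOPS-hour, with every input an estimate from this article:

```python
# Normalized cost per petaFLOPS-hour of FP8; every input is an estimate.
systems = {
    "H100 SXM": {"usd_per_hour": 3.0,        "fp8_pflops": 4.0},
    "B200":     {"usd_per_hour": 6.0,        "fp8_pflops": 9.0},
    "NVL72":    {"usd_per_hour": 60.0 * 72,  "fp8_pflops": 720.0},  # whole rack
}

usd_per_pflops_hour = {
    name: s["usd_per_hour"] / s["fp8_pflops"] for name, s in systems.items()
}
# H100 ~$0.75, B200 ~$0.67, NVL72 ~$6.00 per PFLOPS-hour
```

The key detail is that NVL72 cost must be taken per rack ($60 x 72 GPUs) against per-rack throughput; dividing a per-GPU rate by rack throughput understates the cost by a factor of 72.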
When to Choose Each
Use H100 if:
- Training models under 1 trillion parameters
- Need flexibility and spot pricing
- Budget constraints are tight
- Multi-cloud strategy is important
Use B200 if:
- Training 1-10 trillion parameter models
- Can commit to full-time utilization
- Cost-per-compute dominates over flexibility
Use NVL72 if:
- Training 10+ trillion parameter models
- Distributed training is non-negotiable
- Hyperscaler infrastructure exists in-country
- Willing to commit capital or long-term contracts
Training Time Comparison (Detailed)
Wall-clock training time for a 30B parameter model on 10B tokens (typical research-scale experiment):
| System | Batch Size | Time to Train |
|---|---|---|
| Single H100 | 32 | 72 hours |
| 8x H100 cluster (DDP) | 256 | 14 hours |
| 8x H100 cluster (TP) | 512 | 12 hours |
| Single B200 | 64 | 48 hours |
| 8x B200 cluster (DDP) | 512 | 8 hours |
| NVL72 (72x B200) | 4096 | 1.2 hours |
The NVL72 completes in 1.2 hours vs. 72 hours for single H100 (60x faster). Real-world improvements are 40-55x after accounting for synchronization overhead and I/O bottlenecks.
FAQ
Q: Can I buy a single GB200 GPU? No. The GB200 is sold as part of the NVL72 rack system. NVIDIA does not offer individual GB200 cards. The tight integration with Grace CPU makes standalone operation impractical.
Q: Does GB200 work with existing AI frameworks? Yes. PyTorch, JAX, and TensorFlow support GB200 through standard CUDA builds. No rewriting is needed, though optimizing collective operations for the NVLink/InfiniBand fabric is recommended.
Q: How much power does an NVL72 rack consume? Approximately 120kW under full utilization. Data-center planning requires on the order of 175A of 415V three-phase capacity per rack, plus cooling capacity for the matching thermal dissipation.
Q: What is the expected lifespan of GB200? NVIDIA typically supports GPUs for 3-5 years with security updates and driver improvements. The B200 was announced in March 2024, so support should extend through roughly 2029-2031.
Q: Can I use GB200 NVL72 for inference? Yes, but it's overkill for most inference workloads. A single NVL72 can serve 1,000-10,000 concurrent users of a 10B model with sub-100ms latency. Hyperscalers use smaller systems (single B200, or H100 clusters) for inference cost efficiency.
Q: How does NVL72 cooling work? Liquid cooling is mandatory. Coolant distribution units feed cold plates on each compute and switch tray. Facility cooling must supply water within NVIDIA's specified inlet temperature range; hyperscalers typically use facility chilled- or warm-water loops.
Q: What is the interconnect latency for all-reduce? Low single-digit microseconds within a rack over the NVLink domain. Rack-to-rack communication over Quantum InfiniBand adds 5-10 microseconds per hop. This overhead is manageable for model-parallel training but becomes expensive for extreme-scale pipeline parallelism.
Q: How does GB200 compare to custom ASIC systems (like TPU v6)? Custom ASICs (TPU, Cerebras, Graphcore) offer higher peak throughput but less flexibility. GB200 is general-purpose with software support across PyTorch, JAX, and TensorFlow. TPU requires XLA/JAX for peak performance. For rapid research iteration, GB200 wins on flexibility. For production-optimized systems, custom silicon may win on cost-per-FLOP.
Q: Can I split a single NVL72 rack between multiple teams? Not easily. NVIDIA sells the NVL72 as a single unit. Partitioning would require administrative overhead (SLURM job scheduling, quota enforcement). Most hyperscalers treat each NVL72 as a single tenant or allocate via time-slicing (e.g., Team A gets 12 hours, Team B gets 12 hours daily). This reduces per-team performance due to context resets. Some advanced operators use NVIDIA MIG (Multi-Instance GPU) partitioning, but this hasn't reached production on NVL72 yet.
Q: What software frameworks are optimized for GB200 NVL72? PyTorch (via CUDA 12.5+), JAX, TensorFlow/XLA, Megatron-LM, DeepSpeed, and FSDP all support GB200. Most frameworks require minimal code changes. The bigger optimization opportunity is in gradient compression and communication overlap, which are handled by distributed training libraries (DeepSpeed's ZeRO, FSDP's fully-sharded mode).
Q: How does maintenance/downtime work for NVL72 systems? Hyperscalers typically perform rolling maintenance on subset of capacity while keeping other racks operational. A single NVL72 outage impacts all 72 GPUs. Hyperscalers aim for 99%+ uptime via redundant cooling, power, and networking. For critical training runs, users can implement checkpointing every 1-2 hours to minimize data loss if hardware fails unexpectedly.
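The checkpoint-interval advice above can be derived from the Young/Daly first-order approximation; the checkpoint cost and rack MTBF below are illustrative assumptions:

```python
import math

def young_daly_interval_h(mtbf_h: float, checkpoint_cost_h: float) -> float:
    """First-order optimal checkpoint interval: T = sqrt(2 * C * MTBF)."""
    return math.sqrt(2 * checkpoint_cost_h * mtbf_h)

# Assumed: a 2-minute checkpoint write and a 50-hour effective MTBF
# for a densely packed 72-GPU rack.
interval_h = young_daly_interval_h(50.0, 2 / 60)   # ~1.8 hours
```

With those inputs the formula lands in the 1-2 hour range quoted above; cheaper checkpoints or less reliable hardware both push the optimal interval shorter.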
Q: What is the environmental impact of an NVL72 rack? Roughly 120kW rack power times a PUE of 1.3-1.5 gives ~155-180kW total facility draw per rack. A 1,000-hour training run therefore consumes on the order of 160-180 MWh; at a grid intensity of ~0.4 tons CO2 per MWh, that is roughly 65-70 tons of CO2 (strongly electricity-source dependent). Carbon-aware scheduling (training during low-carbon hours) can reduce emissions by 40-60%. Some hyperscalers offset via renewable energy credits.
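The energy and emissions arithmetic, spelled out (every input is an assumption: rack power as discussed above, a mid-range PUE, and a generic grid carbon intensity):

```python
# Energy and emissions for a 1,000-hour run; every input is an assumption.
RACK_KW = 120.0               # fully configured NVL72 rack
PUE = 1.4                     # facility overhead multiplier (1.3-1.5 typical)
RUN_HOURS = 1000.0
GRID_TCO2_PER_MWH = 0.4       # grid carbon intensity; varies widely by region

energy_mwh = RACK_KW * PUE * RUN_HOURS / 1000.0   # 168 MWh
emissions_t = energy_mwh * GRID_TCO2_PER_MWH      # ~67 t CO2
```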
Related Resources
- NVIDIA Blackwell Architecture Deep Dive
- B200 vs H100: Performance & Cost Analysis
- Distributed Training on GPU Clusters
- InfiniBand vs Ethernet for AI Workloads
- Liquid Cooling for Data Centers
Sources
- NVIDIA. "Blackwell and GB200 Specifications." nvidia.com/en-us/data-center/blackwell/
- NVIDIA. "NVL72 System Architecture." nvidia.com/en-us/data-center/nvl72/
- NVIDIA Hopper to Blackwell Product Brief (October 2024)
- AWS EC2 p5/p5e Instance Documentation (2026)
- Google Cloud TPU and GPU Pricing (March 2026)