Contents
- H100 SXM vs PCIe
- Form Factor: SXM vs PCIe
- Memory and Bandwidth Specifications
- NVLink Architecture Differences
- Power Consumption Profiles
- Compute Performance Comparison
- Multi-GPU Scaling Efficiency
- Cloud Pricing Comparison
- Real-World Performance Benchmarks
- When to Choose Each Variant
- FAQ
- Related Resources
- Sources
H100 SXM vs PCIe
H100 SXM vs PCIe is the focus of this guide. SXM: NVLink 4 at 900 GB/s. Required for efficient multi-GPU training. Cost: $2.69/hr (RunPod).
PCIe: PCIe 5.0 at 128 GB/s. Cheaper, suited to single- or dual-GPU work. Cost: $1.99/hr (RunPod).
Same GH100 silicon. Different connectivity and configuration. SXM is 35% more expensive. The question: does NVLink bandwidth justify the cost for your workload?
Form Factor: SXM vs PCIe
SXM: Requires DGX servers or compatible OEM hardware. Integrated NVLink on motherboard. Coupled thermal management with chassis. No flexibility.
PCIe: Standard x16 slots. Works in any server with PCIe 5.0. Modular. Passive cooling. Flexible deployment.
In the cloud, both arrive pre-configured, so the form factor itself doesn't matter. What matters: bandwidth and price.
Memory and Bandwidth Specifications
Different Memory Types
H100 SXM and PCIe variants use different memory types with different bandwidth:
H100 SXM:
- 80GB HBM3 memory
- 5 active HBM3 stacks of 16GB each (one of the die's six stack sites is disabled)
- Total memory bandwidth: 3.35 TB/s
H100 PCIe:
- 80GB HBM2e memory
- Total memory bandwidth: 2.0 TB/s
This is a significant difference: the SXM variant's HBM3 provides 67% more memory bandwidth than the PCIe variant's HBM2e. For memory-bound workloads such as large language model inference, this bandwidth gap translates directly to throughput differences.
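A back-of-the-envelope sketch of why this matters for memory-bound decoding: each generated token must stream the model weights from HBM roughly once, so throughput scales with sustained bandwidth. The sustained-bandwidth fraction and FP8 weights below are illustrative assumptions, not measurements:

```python
# Rough model of memory-bound decode throughput: generating one token
# streams the model weights from HBM once, so tokens/s ~ bandwidth / bytes.
# The sustained-bandwidth fraction (0.75) is an illustrative assumption.

def decode_tokens_per_sec(model_params_b: float, bytes_per_param: int,
                          peak_bw_tbs: float, sustained_frac: float = 0.75) -> float:
    model_bytes = model_params_b * 1e9 * bytes_per_param
    sustained_bw = peak_bw_tbs * 1e12 * sustained_frac
    return sustained_bw / model_bytes

# Llama 70B with FP8 weights (1 byte/param), single request:
sxm = decode_tokens_per_sec(70, 1, 3.35)   # HBM3
pcie = decode_tokens_per_sec(70, 1, 2.0)   # HBM2e
print(f"SXM ~{sxm:.0f} tok/s, PCIe ~{pcie:.0f} tok/s, ratio {sxm/pcie:.2f}x")
```

The roughly 1.67x throughput ratio mirrors the 3.35/2.0 TB/s bandwidth ratio directly, since model size and the efficiency factor cancel out of the comparison.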
GPU-Internal Bandwidth
The SXM variant's HBM3 reaches 3.35 TB/s peak bandwidth while the PCIe variant's HBM2e reaches 2.0 TB/s. Practical sustained bandwidth on the SXM variant reaches 2.5-3 TB/s for dense workloads; the PCIe variant sustains roughly 1.5-1.8 TB/s. This bandwidth difference is material for large-batch inference and training workloads.
GPU-to-GPU Connectivity
GPU-to-GPU bandwidth differs dramatically between variants.
H100 SXM GPU-to-GPU Bandwidth:
- NVLink 4 interconnect: 900 GB/s aggregate bidirectional (450 GB/s per direction)
- 18 fourth-generation NVLink links per GPU (50 GB/s bidirectional each)
- Eight-GPU SXM cluster provides 1.8 TB/s theoretical all-reduce bandwidth
- Practical sustained rates: 1.2-1.5 TB/s (all-reduce efficiency 65-85%)
H100 PCIe GPU-to-GPU Bandwidth:
- PCIe 5.0 x16 connection: ~64 GB/s per direction (128 GB/s bidirectional)
- In switched topologies, GPUs share upstream x16 links, adding contention
- Eight-GPU PCIe cluster maximum aggregate: ~256 GB/s practical
- Practical sustained rates: 150-200 GB/s (PCIe switch contention reduces peak)
The difference is stark: 1.5 TB/s (SXM) versus 0.2 TB/s (PCIe) represents 7.5x bandwidth advantage for SXM clusters. This single metric explains nearly all performance differences in multi-GPU configurations.
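The gap can be made concrete with the standard ring all-reduce lower bound, under which each GPU transfers 2(N-1)/N of the gradient buffer. The bandwidth figures here are rough assumptions (NVLink per-direction bandwidth for SXM; an assumed effective per-GPU share of a contended PCIe fabric), not measurements:

```python
# Ring all-reduce lower bound: each GPU transfers 2*(N-1)/N of the buffer,
# so sync time ~ 2*(N-1)/N * bytes / per-GPU bandwidth.
# Bandwidth values are illustrative assumptions, not measurements.

def allreduce_seconds(grad_bytes: float, n_gpus: int, bw_bytes_per_s: float) -> float:
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes / bw_bytes_per_s

bucket = 1e9                                   # a 1 GB gradient bucket
sxm = allreduce_seconds(bucket, 8, 450e9)      # NVLink, ~450 GB/s per direction
pcie = allreduce_seconds(bucket, 8, 25e9)      # assumed ~25 GB/s effective on shared PCIe
print(f"SXM ~{sxm*1e3:.1f} ms, PCIe ~{pcie*1e3:.0f} ms per 1 GB all-reduce")
```

Even this idealized lower bound puts PCIe an order of magnitude behind, before accounting for switch contention.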
NVLink Architecture Differences
NVLink 4 in H100 SXM
NVLink 4 is a proprietary interconnect designed specifically for high-speed GPU-to-GPU communication. Each H100 SXM exposes 18 NVLink 4 links, for 900 GB/s of aggregate bidirectional bandwidth. How that bandwidth is realized depends on the system topology:
GPU pairs (2 GPUs) can connect directly over NVLink at the full 900 GB/s per GPU.
Eight-GPU HGX/DGX H100 baseboards route all links through four NVSwitch chips, so any GPU reaches any other GPU at full NVLink bandwidth. This switched topology is what allows all-reduce operations to sustain 1.2-1.5 TB/s aggregate with careful algorithm mapping.
Scaling beyond a single eight-GPU chassis means leaving the NVLink domain: inter-node traffic crosses NICs attached over PCIe, reducing cross-node all-reduce bandwidth to roughly 0.3-0.5 TB/s.
NVLink 4 also benefits from microsecond-scale latency. Direct GPU-to-GPU communication bypasses system memory, the host processor, and network switches entirely; messages transit through specialized silicon in approximately 1-5 microseconds, versus the 10-20 microsecond latencies typical of PCIe paths.
PCIe 5.0 Ecosystem
PCIe 5.0 signals at 32 GT/s per lane, roughly 4 GB/s of usable bandwidth per lane per direction; an x16 link therefore provides about 64 GB/s per direction, or 128 GB/s bidirectional. This assumes a direct connection without intermediate switching.
In multi-GPU PCIe clusters, GPUs typically connect through PCIe switch fabrics. A switch's internal crossbar can carry more traffic than any single x16 link, but every GPU still funnels through shared x16 upstream connectivity, and practical implementations with 8 GPUs achieve roughly 200-250 GB/s aggregate.
PCIe 5.0 latency is higher than NVLink: approximately 200-500 nanoseconds base latency plus switching overhead. For large messages (kilobytes or megabytes), bandwidth dominates and latency matters less. For small control messages (cache coherency, synchronization), latency becomes more significant.
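A minimal link model illustrates this crossover: transfer time is latency plus size over bandwidth. The latency figures below are the rough estimates quoted above, not measurements:

```python
# Simple link model: transfer_time = latency + bytes / bandwidth.
# Below the crossover size (~latency * bandwidth), latency dominates;
# above it, bandwidth dominates. Latency values are rough estimates.

def transfer_time(nbytes: float, latency_s: float, bw_bytes_per_s: float) -> float:
    return latency_s + nbytes / bw_bytes_per_s

nvlink_lat, nvlink_bw = 2e-6, 450e9   # ~2 us, 450 GB/s per direction
pcie_lat, pcie_bw = 10e-6, 64e9       # ~10 us incl. switching, 64 GB/s per direction

for size in (4e3, 1e6, 256e6):        # 4 KB sync message, 1 MB, 256 MB gradient bucket
    nv = transfer_time(size, nvlink_lat, nvlink_bw) * 1e6
    pc = transfer_time(size, pcie_lat, pcie_bw) * 1e6
    print(f"{size/1e6:8.3f} MB: NVLink {nv:8.1f} us, PCIe {pc:8.1f} us")
```

For the 4 KB message both links are latency-bound (a ~5x gap from latency alone); for the 256 MB bucket the gap is set almost entirely by bandwidth.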
Power Consumption Profiles
H100 PCIe Power Specifications
H100 PCIe specifies maximum power consumption of 350 watts. This rating represents sustained peak performance under full utilization. Practical workloads typically consume:
- Peak GPU utilization: 320-350 watts
- Typical training: 300-330 watts
- Inference (large batch): 280-310 watts
- Inference (small batch, memory-bound): 150-200 watts
The relatively modest power consumption reflects PCIe targeting both data centers and non-hyperscale environments where power density constraints are less severe.
The 350-watt thermal design power (TDP) rating assumes a case temperature of approximately 70-75°C. In deployments where case temperature exceeds 80°C, throttling may reduce sustained performance by 5-10%.
H100 SXM Power Specifications
H100 SXM specifies maximum power consumption of 700 watts. This 2x increase reflects additional system integration:
- GPU package: 350-400 watts (similar to PCIe)
- HBM3 memory: increased power due to higher clock rates in SXM variants
- NVLink signaling: additional power for high-bandwidth interconnect
- Integrated power delivery: additional conversion losses in the module itself
SXM modules typically ship on eight-GPU HGX baseboards: 5,600 watts of GPU power alone, with a full DGX H100 system rated at roughly 10 kW. This creates substantial power infrastructure requirements; DGX H100 systems need high-amperage 208V or higher power delivery.
The higher power consumption creates infrastructure implications. Data centers require dedicated power monitoring, larger power supplies, and more sophisticated cooling. Single GPUs or small dual-GPU deployments rarely justify this infrastructure investment.
Compute Performance Comparison
Single-GPU Performance: Near Parity
Both variants are built on the same GH100 die, but the PCIe card ships with fewer active SMs and lower clocks, so its datasheet peaks are somewhat lower:
- FP32 peak: 67 TFLOPS (SXM) vs 51 TFLOPS (PCIe)
- TF32 tensor: 989 TFLOPS with sparsity (SXM) vs 756 TFLOPS (PCIe)
- FP8 tensor: 3,958 TFLOPS with sparsity (SXM) vs 3,026 TFLOPS (PCIe)
In practice the single-GPU gap is smaller than these peaks suggest: memory-bound workloads such as Llama 70B inference are limited mostly by HBM bandwidth, and both variants reach 85-90% of their respective theoretical peaks for the dense matrix multiplication typical of training and large-batch inference.
Multi-GPU Performance Differences
Multi-GPU scaling efficiency differs substantially:
H100 SXM eight-GPU cluster:
- Per-GPU peak: 67 TFLOPS FP32
- Aggregate theoretical: 536 TFLOPS
- All-reduce efficiency: 75-80%
- Practical aggregate: 400-430 TFLOPS sustained during synchronized training
H100 PCIe eight-GPU cluster:
- Per-GPU peak: 51 TFLOPS FP32
- Aggregate theoretical: 408 TFLOPS
- All-reduce efficiency: 15-25% due to the PCIe bottleneck
- Practical aggregate: 60-100 TFLOPS sustained during synchronized training
The efficiency gap stems mostly from gradient-synchronization bandwidth. SXM's ~1.5 TB/s aggregate bandwidth enables rapid all-reduce operations; PCIe's ~200 GB/s limits synchronization speed, forcing training setups to reduce synchronization frequency.
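A toy step-time model reproduces these efficiency figures: efficiency is compute time divided by compute plus synchronization time. The 50 ms compute step and the two sync times are illustrative assumptions chosen to match the ranges above:

```python
# Toy step-time model: efficiency = compute / (compute + all-reduce sync).
# All timings are illustrative assumptions, not measurements.

def scaling_efficiency(compute_ms: float, sync_ms: float) -> float:
    return compute_ms / (compute_ms + sync_ms)

compute_ms = 50.0                                   # forward+backward per step (assumed)
eff_sxm = scaling_efficiency(compute_ms, 15.0)      # fast NVLink all-reduce
eff_pcie = scaling_efficiency(compute_ms, 200.0)    # slow PCIe all-reduce
print(f"SXM efficiency ~{eff_sxm:.0%}, PCIe ~{eff_pcie:.0%}")
```

With these assumed timings the model lands at roughly 77% and 20%, inside the quoted ranges; the point is that sync time comparable to or exceeding compute time collapses scaling efficiency.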
Multi-GPU Scaling Efficiency
Linear Scaling Prerequisites
Both systems achieve near-linear scaling (efficiency > 90%) when scaling from 1 to 4 GPUs within a single system:
- 1 GPU: baseline
- 2 GPUs: 1.9x speedup (95% efficiency)
- 4 GPUs: 3.8x speedup (SXM); PCIe sees diminishing returns beyond 2 GPUs (~2.0x)
The difference: 4-GPU SXM configurations keep synchronization traffic on NVLink at full bandwidth. PCIe systems hit bandwidth bottlenecks earlier because all GPUs share the PCIe fabric.
Scaling Beyond Four GPUs
SXM scaling to 8 GPUs:
- Efficiency: 75-85% (practical all-reduce takes 1.5-2x longer than theoretical minimum)
- Speedup: 6.5-6.8x relative to single GPU
PCIe scaling to 8 GPUs:
- Efficiency: 20-30% (all-reduce becomes dominant bottleneck)
- Speedup: 1.6-2.4x relative to single GPU
At eight-GPU scale, SXM achieves 3-4x faster distributed training than PCIe. The difference justifies cloud pricing premiums when training large models requiring 8+ GPU-hours.
Communication Topology Optimization
SXM systems benefit from sophisticated communication patterns exploiting NVLink topology. Researchers have developed algorithms (like those used in NCCL optimizations) that avoid full-mesh communication, instead using ring-reduce and tree-reduce patterns that saturate NVLink bandwidth while minimizing contention.
PCIe systems can implement similar algorithms but are bandwidth-limited regardless of topology. Even optimal ring-reduce patterns are limited by PCIe 5.0 x16 bandwidth.
Cloud Pricing Comparison
Hourly Rate Analysis (March 2026)
RunPod Pricing:
- H100 PCIe: $1.99/hour
- H100 SXM: $2.69/hour
- SXM premium: 35%
Lambda Labs Pricing:
- H100 PCIe: $2.86/hour
- H100 SXM: $3.78/hour
- SXM premium over PCIe: 32%
The pricing spread varies by provider. RunPod shows a 35% SXM premium; Lambda Labs shows a 32% SXM premium. For both providers, SXM is more expensive than PCIe.
Cost Per Unit Throughput
Throughput cost analysis depends on workload:
Large distributed training (8 GPUs, batch size 512):
- PCIe cluster: 8 × $1.99 = $15.92/hour, training throughput limited by communication to ~2.4 GPU-equivalents effective
- SXM cluster: 8 × $2.69 = $21.52/hour, training throughput ~6.8 GPU-equivalents effective
- Cost per GPU-equivalent throughput: PCIe $6.63, SXM $3.16 (SXM is 52% cheaper per unit throughput)
Single-GPU inference:
- PCIe: $1.99/hour
- SXM: $2.69/hour
- SXM is 35% more expensive with identical performance
The break-even point occurs around 2-4 GPUs for distributed training. Single-GPU or 2-GPU clusters favor PCIe; clusters of 4+ GPUs favor SXM.
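The break-even can be sketched by dividing cluster price by effective GPU-equivalents after scaling losses. The 4-GPU efficiency figures below are interpolated assumptions; the 8-GPU figures follow the ranges discussed earlier:

```python
# Cost per effective GPU-hour: cluster hourly price divided by effective
# GPU-equivalents (n_gpus * scaling efficiency). Efficiency values for
# 4 GPUs are assumed interpolations, not benchmarks.

def cost_per_effective_gpu(rate_per_gpu: float, n_gpus: int, efficiency: float) -> float:
    return (rate_per_gpu * n_gpus) / (n_gpus * efficiency)

cases = [(1, 1.00, 1.00),   # single GPU: no scaling loss
         (4, 0.90, 0.50),   # assumed mid-scale efficiencies (SXM, PCIe)
         (8, 0.80, 0.25)]   # figures from the scaling section
for n, eff_sxm, eff_pcie in cases:
    sxm = cost_per_effective_gpu(2.69, n, eff_sxm)
    pcie = cost_per_effective_gpu(1.99, n, eff_pcie)
    print(f"{n} GPUs: SXM ${sxm:.2f}, PCIe ${pcie:.2f} per effective GPU-hour")
```

At one GPU, PCIe wins outright; under these assumptions the ordering flips by 4 GPUs, which is where the break-even lands.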
Total Cost of Ownership
For teams provisioning private infrastructure (not cloud):
PCIe 8-GPU system:
- Hardware: ~$80,000 (8 GPUs at ~$10k each)
- Server: ~$15,000
- Cooling: ~$8,000
- Power supply: ~$4,000
- 3-year depreciation: ~$107,000/3 ≈ $35,700/year
- Annual electricity (24/7): 8 × 350W × 8,760 hours × $0.12/kWh ≈ $2,943/year
- Total annual cost: ~$38,600
SXM 8-GPU system (DGX H100 equivalent):
- Hardware: ~$398,000 (DGX H100 list price)
- 3-year depreciation: $398,000/3 ≈ $132,667/year
- Annual electricity (24/7): 8 × 700W × 8,760 hours × $0.12/kWh ≈ $5,887/year
- Total annual cost: ~$138,550
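A quick sanity check of the depreciation-plus-electricity arithmetic, assuming 24/7 operation at $0.12/kWh:

```python
# Annual cost = straight-line depreciation + electricity for continuous
# operation. Hardware prices are the rough estimates used in this section.

def annual_cost(hw_cost: float, years: int, n_gpus: int, watts_per_gpu: float,
                usd_per_kwh: float = 0.12) -> float:
    depreciation = hw_cost / years
    kwh_per_year = n_gpus * watts_per_gpu / 1000 * 8760   # 24/7 operation
    return depreciation + kwh_per_year * usd_per_kwh

pcie = annual_cost(107_000, 3, 8, 350)
sxm = annual_cost(398_000, 3, 8, 700)
print(f"PCIe ~${pcie:,.0f}/yr, SXM ~${sxm:,.0f}/yr")
```

Note that at these power levels electricity is a small fraction of the budget; depreciation dominates both totals.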
Depreciation dominates both budgets. For training workloads where SXM provides roughly 3x effective throughput, annualized cost per unit of throughput is broadly comparable between the two. The absolute capital investment, however, is nearly 4x higher, making SXM suitable only for teams planning sustained, high-utilization deployments.
Real-World Performance Benchmarks
Training Throughput Measurements
Llama 7B fine-tuning (8-GPU cluster, batch size 64 per GPU):
- H100 SXM: 6,200 tokens/second (all-reduce takes 2ms per step)
- H100 PCIe: 1,200 tokens/second (all-reduce takes 12ms per step, limits frequency to ~80 steps/second)
- Ratio: 5.2x advantage to SXM
Llama 70B fine-tuning (8-GPU cluster, batch size 8 per GPU):
- H100 SXM: 800 tokens/second
- H100 PCIe: 200 tokens/second
- Ratio: 4x advantage to SXM
These ratios fall short of the raw bandwidth ratio (1.5 TB/s vs 0.2 TB/s = 7.5x) because PCIe systems use gradient accumulation and reduced synchronization frequency, recovering some efficiency.
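A sketch of how accumulation recovers efficiency: synchronizing every k micro-steps amortizes one slow all-reduce over more compute. All timings here are illustrative assumptions, not benchmarks:

```python
# Gradient accumulation: run k micro-steps of compute per all-reduce,
# amortizing a slow synchronization. Timings and token counts are
# illustrative assumptions, not benchmarks.

def tokens_per_sec(compute_ms: float, sync_ms: float, accum_steps: int,
                   tokens_per_microbatch: int) -> float:
    total_ms = accum_steps * compute_ms + sync_ms
    return accum_steps * tokens_per_microbatch * 1000 / total_ms

for k in (1, 4, 16):
    rate = tokens_per_sec(compute_ms=10, sync_ms=70, accum_steps=k,
                          tokens_per_microbatch=512)
    print(f"accumulate {k:2d}: ~{rate:,.0f} tok/s")
```

Throughput rises with k because the fixed sync cost is spread over more tokens; the trade-off is a larger effective batch size, which can affect convergence.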
Inference Throughput
Llama 70B serving at 100 concurrent requests (40K tokens average context):
- H100 SXM: 900 tokens/second output throughput
- H100 PCIe (single GPU): 900 tokens/second
- Single-request decode throughput is bound by per-GPU resources; SXM's interconnect provides no advantage here
For inference, multi-GPU setups primarily improve tail latency (through batching) and capacity (more concurrent requests served). PCIe serves this role adequately.
When to Choose Each Variant
Select H100 PCIe When
Choose PCIe for single-GPU or dual-GPU deployments. The cost savings (35% less per hour) are substantial for small clusters. Performance is identical.
PCIe is appropriate for inference services not requiring distributed serving. Inference workloads rarely bottleneck on GPU-to-GPU communication; single-GPU instances or small 2-GPU clusters handle most request volumes.
PCIe makes sense when infrastructure already emphasizes modularity and flexibility. Custom servers, cloud deployments, and non-standardized environments benefit from PCIe compatibility with diverse host platforms.
Select H100 SXM When
Choose SXM for 4+ GPU training clusters. The throughput advantage (3-4x for 8 GPUs) justifies the 35% cost premium. Break-even occurs around 4-GPU clusters for training workloads.
SXM is appropriate for teams prioritizing scalability. If training frequently uses 8+ GPUs, SXM provides consistent 75%+ efficiency at scale versus 20% efficiency for PCIe.
SXM makes sense for teams invested in NVIDIA's DGX ecosystem. If existing infrastructure uses DGX systems, additional SXM modules work efficiently together.
Hybrid Approaches
Large teams often deploy mixed clusters: SXM systems for training (high-utilization, large batches) and PCIe systems for inference and development (lower utilization, flexible scaling). This balances cost and performance characteristics.
Cloud providers benefit from offering both. Customers self-select based on workload: training teams choose SXM, inference teams choose PCIe.
FAQ
Can PCIe clusters train large models efficiently?
Yes, but with caveats. Gradient accumulation and reduced synchronization frequency recover some efficiency. 8-GPU PCIe training achieves approximately 20-30% efficiency versus theoretical 100%, while SXM achieves 75-85%.
Is the NVLink PCIe bridge in DGX systems reliable?
Yes, but it becomes a bandwidth bottleneck. Inter-system communication is limited to PCIe x16 speeds (~128 GB/s bidirectional), roughly a 7x reduction relative to intra-system NVLink. Avoid inter-system gradient synchronization for best performance.
Can I upgrade PCIe H100 to SXM later?
No. The form factors and motherboards are completely different. Migration requires purchasing new systems.
What about power infrastructure requirements?
PCIe H100 at 350W per GPU creates modest requirements. SXM at 700W per GPU requires substantial power delivery: an 8-GPU SXM system draws 5,600W for the GPUs alone (roughly 10 kW at the system level), so most data centers need dedicated power distribution.
Which is better for inference at scale?
For single-GPU inference throughput, both are identical. For serving thousands of concurrent requests, multiple smaller PCIe systems often cost less than fewer SXM systems while providing similar aggregate throughput.
How does cooling differ between variants?
PCIe requires external cooling (heatsinks, fans); SXM includes integrated cooling into the server chassis. For data centers with sophisticated cooling, both work well. For edge deployments, PCIe's external cooling is easier to maintain.
Related Resources
Review comprehensive H100 specifications at NVIDIA H100 GPU models.
Understand H100 pricing trends in H100 cloud pricing analysis.
Compare with predecessor generation in A100 vs H100 benchmarks.