Contents
- H100 SXM vs PCIe
- Form Factor: SXM vs PCIe
- Memory and Bandwidth Specifications
- NVLink Architecture Differences
- Power Consumption Profiles
- Compute Performance Comparison
- Multi-GPU Scaling Efficiency
- Cloud Pricing Comparison
- Real-World Performance Benchmarks
- When to Choose Each Variant
- FAQ
- Related Resources
- Sources
H100 SXM vs PCIe
H100 SXM vs PCIe is the focus of this guide. SXM: NVLink 4 at 900 GB/s. Required for efficient multi-GPU training. Cost: $2.69/hr (RunPod).
PCIe: PCIe 5.0 at 128 GB/s. Cheaper, suited to single- or dual-GPU work. Cost: $1.99/hr (RunPod).
Same GH100 silicon. Different connectivity and configuration. SXM is 35% more expensive. The question: does NVLink bandwidth justify the cost for your workload?
Form Factor: SXM vs PCIe
SXM: Requires DGX servers or compatible OEM hardware. Integrated NVLink on motherboard. Coupled thermal management with chassis. No flexibility.
PCIe: Standard x16 slots. Works in any server with PCIe 5.0. Modular. Passive cooling. Flexible deployment.
In the cloud, both arrive pre-configured, so the form factor itself doesn't matter. What matters: bandwidth and price.
Memory and Bandwidth Specifications
Different Memory Types
H100 SXM and PCIe variants use different memory types with different bandwidth:
H100 SXM:
- 80GB HBM3 memory
- 5 active HBM3 stacks of 16GB each (one of the die's six stack sites is disabled)
- Total memory bandwidth: 3.35 TB/s
H100 PCIe:
- 80GB HBM2e memory
- Total memory bandwidth: 2.0 TB/s
This is a significant difference: the SXM variant's HBM3 provides 67% more memory bandwidth than the PCIe variant's HBM2e. For memory-bound workloads such as large language model inference, this bandwidth gap translates directly to throughput differences.
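A back-of-the-envelope sketch of why this matters for memory-bound decoding: each generated token must stream the model weights from HBM roughly once, so throughput scales with sustained bandwidth. The sustained-bandwidth fraction and FP8 weights below are illustrative assumptions, not measurements:

```python
# Rough model of memory-bound decode throughput: generating one token
# streams the model weights from HBM once, so tokens/s ~ bandwidth / bytes.
# The sustained-bandwidth fraction (0.75) is an illustrative assumption.

def decode_tokens_per_sec(model_params_b: float, bytes_per_param: int,
                          peak_bw_tbs: float, sustained_frac: float = 0.75) -> float:
    model_bytes = model_params_b * 1e9 * bytes_per_param
    sustained_bw = peak_bw_tbs * 1e12 * sustained_frac
    return sustained_bw / model_bytes

# Llama 70B with FP8 weights (1 byte/param), single request:
sxm = decode_tokens_per_sec(70, 1, 3.35)   # HBM3
pcie = decode_tokens_per_sec(70, 1, 2.0)   # HBM2e
print(f"SXM ~{sxm:.0f} tok/s, PCIe ~{pcie:.0f} tok/s, ratio {sxm/pcie:.2f}x")
```

The roughly 1.67x throughput ratio mirrors the 3.35/2.0 TB/s bandwidth ratio directly, since model size and the efficiency factor cancel out of the comparison.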
GPU-Internal Bandwidth
The SXM variant's HBM3 reaches 3.35 TB/s peak bandwidth while the PCIe variant's HBM2e reaches 2.0 TB/s. Practical sustained bandwidth on the SXM variant reaches 2.5-3 TB/s for dense workloads; the PCIe variant sustains roughly 1.5-1.8 TB/s. This bandwidth difference is material for large-batch inference and training workloads.
GPU-to-GPU Connectivity
GPU-to-GPU bandwidth differs dramatically between variants.
H100 SXM GPU-to-GPU Bandwidth:
- NVLink 4 interconnect: 900 GB/s aggregate bidirectional (450 GB/s per direction)
- 18 fourth-generation NVLink links per GPU (50 GB/s bidirectional each)
- Eight-GPU SXM cluster provides 1.8 TB/s theoretical all-reduce bandwidth
- Practical sustained rates: 1.2-1.5 TB/s (all-reduce efficiency 65-85%)
H100 PCIe GPU-to-GPU Bandwidth:
- PCIe 5.0 x16 connection: ~64 GB/s per direction (128 GB/s bidirectional)
- In switched topologies, GPUs share upstream x16 links, adding contention
- Eight-GPU PCIe cluster maximum aggregate: ~256 GB/s practical
- Practical sustained rates: 150-200 GB/s (PCIe switch contention reduces peak)
The difference is stark: 1.5 TB/s (SXM) versus 0.2 TB/s (PCIe) represents 7.5x bandwidth advantage for SXM clusters. This single metric explains nearly all performance differences in multi-GPU configurations.
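The gap can be made concrete with the standard ring all-reduce lower bound, under which each GPU transfers 2(N-1)/N of the gradient buffer. The bandwidth figures here are rough assumptions (NVLink per-direction bandwidth for SXM; an assumed effective per-GPU share of a contended PCIe fabric), not measurements:

```python
# Ring all-reduce lower bound: each GPU transfers 2*(N-1)/N of the buffer,
# so sync time ~ 2*(N-1)/N * bytes / per-GPU bandwidth.
# Bandwidth values are illustrative assumptions, not measurements.

def allreduce_seconds(grad_bytes: float, n_gpus: int, bw_bytes_per_s: float) -> float:
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes / bw_bytes_per_s

bucket = 1e9                                   # a 1 GB gradient bucket
sxm = allreduce_seconds(bucket, 8, 450e9)      # NVLink, ~450 GB/s per direction
pcie = allreduce_seconds(bucket, 8, 25e9)      # assumed ~25 GB/s effective on shared PCIe
print(f"SXM ~{sxm*1e3:.1f} ms, PCIe ~{pcie*1e3:.0f} ms per 1 GB all-reduce")
```

Even this idealized lower bound puts PCIe an order of magnitude behind, before accounting for switch contention.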
NVLink Architecture Differences
NVLink 4 in H100 SXM
NVLink 4 is a proprietary interconnect designed specifically for high-speed GPU-to-GPU communication. Each H100 SXM exposes 18 NVLink 4 links, for 900 GB/s of aggregate bidirectional bandwidth. How that bandwidth is realized depends on the system topology:
GPU pairs (2 GPUs) can connect directly over NVLink at the full 900 GB/s per GPU.
Eight-GPU HGX/DGX H100 baseboards route all links through four NVSwitch chips, so any GPU reaches any other GPU at full NVLink bandwidth. This switched topology is what allows all-reduce operations to sustain 1.2-1.5 TB/s aggregate with careful algorithm mapping.
Scaling beyond a single eight-GPU chassis means leaving the NVLink domain: inter-node traffic crosses NICs attached over PCIe, reducing cross-node all-reduce bandwidth to roughly 0.3-0.5 TB/s.
NVLink 4 also benefits from microsecond-scale latency. Direct GPU-to-GPU communication bypasses system memory, the host processor, and network switches entirely; messages transit through specialized silicon in approximately 1-5 microseconds, versus the 10-20 microsecond latencies typical of PCIe paths.
PCIe 5.0 Ecosystem
PCIe 5.0 signals at 32 GT/s per lane, roughly 4 GB/s of usable bandwidth per lane per direction; an x16 link therefore provides about 64 GB/s per direction, or 128 GB/s bidirectional. This assumes a direct connection without intermediate switching.
In multi-GPU PCIe clusters, GPUs typically connect through PCIe switch fabrics. A switch's internal crossbar can carry more traffic than any single x16 link, but every GPU still funnels through shared x16 upstream connectivity, and practical implementations with 8 GPUs achieve roughly 200-250 GB/s aggregate.
PCIe 5.0 latency is higher than NVLink: approximately 200-500 nanoseconds base latency plus switching overhead. For large messages (kilobytes or megabytes), bandwidth dominates and latency matters less. For small control messages (cache coherency, synchronization), latency becomes more significant.
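A minimal link model illustrates this crossover: transfer time is latency plus size over bandwidth. The latency figures below are the rough estimates quoted above, not measurements:

```python
# Simple link model: transfer_time = latency + bytes / bandwidth.
# Below the crossover size (~latency * bandwidth), latency dominates;
# above it, bandwidth dominates. Latency values are rough estimates.

def transfer_time(nbytes: float, latency_s: float, bw_bytes_per_s: float) -> float:
    return latency_s + nbytes / bw_bytes_per_s

nvlink_lat, nvlink_bw = 2e-6, 450e9   # ~2 us, 450 GB/s per direction
pcie_lat, pcie_bw = 10e-6, 64e9       # ~10 us incl. switching, 64 GB/s per direction

for size in (4e3, 1e6, 256e6):        # 4 KB sync message, 1 MB, 256 MB gradient bucket
    nv = transfer_time(size, nvlink_lat, nvlink_bw) * 1e6
    pc = transfer_time(size, pcie_lat, pcie_bw) * 1e6
    print(f"{size/1e6:8.3f} MB: NVLink {nv:8.1f} us, PCIe {pc:8.1f} us")
```

For the 4 KB message both links are latency-bound (a ~5x gap from latency alone); for the 256 MB bucket the gap is set almost entirely by bandwidth.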
Power Consumption Profiles
H100 PCIe Power Specifications
H100 PCIe specifies maximum power consumption of 350 watts. This rating represents sustained peak performance under full utilization. Practical workloads typically consume:
- Peak GPU utilization: 320-350 watts
- Typical training: 300-330 watts
- Inference (large batch): 280-310 watts
- Inference (small batch, memory-bound): 150-200 watts
The relatively modest power consumption reflects PCIe targeting both data centers and non-hyperscale environments where power density constraints are less severe.
The 350-watt thermal design power (TDP) rating assumes a case temperature of approximately 70-75°C. In deployments where case temperature exceeds 80°C, throttling may reduce sustained performance by 5-10%.
H100 SXM Power Specifications
H100 SXM specifies maximum power consumption of 700 watts. This 2x increase reflects additional system integration:
- GPU package: 350-400 watts (similar to PCIe)
- HBM3 memory: increased power due to higher clock rates in SXM variants
- NVLink signaling: additional power for high-bandwidth interconnect
- Integrated power delivery: additional conversion losses in the module itself
SXM modules typically ship on eight-GPU HGX baseboards: 5,600 watts of GPU power alone, with a full DGX H100 system rated at roughly 10 kW. This creates substantial power infrastructure requirements; DGX H100 systems need high-amperage 208V or higher power delivery.
The higher power consumption creates infrastructure implications. Data centers require dedicated power monitoring, larger power supplies, and more sophisticated cooling. Single GPUs or small dual-GPU deployments rarely justify this infrastructure investment.
Compute Performance Comparison
Single-GPU Performance: Near Parity
Both variants are built on the same GH100 die, but the PCIe card ships with fewer active SMs and lower clocks, so its datasheet peaks are somewhat lower:
- FP32 peak: 67 TFLOPS (SXM) vs 51 TFLOPS (PCIe)
- TF32 tensor: 989 TFLOPS with sparsity (SXM) vs 756 TFLOPS (PCIe)
- FP8 tensor: 3,958 TFLOPS with sparsity (SXM) vs 3,026 TFLOPS (PCIe)
In practice the single-GPU gap is smaller than these peaks suggest: memory-bound workloads such as Llama 70B inference are limited mostly by HBM bandwidth, and both variants reach 85-90% of their respective theoretical peaks for the dense matrix multiplication typical of training and large-batch inference.
Multi-GPU Performance Differences
Multi-GPU scaling efficiency differs substantially:
H100 SXM eight-GPU cluster:
- Per-GPU peak: 67 TFLOPS FP32
- Aggregate theoretical: 536 TFLOPS
- All-reduce efficiency: 75-80%
- Practical aggregate: 400-430 TFLOPS sustained during synchronized training
H100 PCIe eight-GPU cluster:
- Per-GPU peak: 51 TFLOPS FP32
- Aggregate theoretical: 408 TFLOPS
- All-reduce efficiency: 15-25% due to the PCIe bottleneck
- Practical aggregate: 60-100 TFLOPS sustained during synchronized training
The efficiency gap stems mostly from gradient-synchronization bandwidth. SXM's ~1.5 TB/s aggregate bandwidth enables rapid all-reduce operations; PCIe's ~200 GB/s limits synchronization speed, forcing training setups to reduce synchronization frequency.
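A toy step-time model reproduces these efficiency figures: efficiency is compute time divided by compute plus synchronization time. The 50 ms compute step and the two sync times are illustrative assumptions chosen to match the ranges above:

```python
# Toy step-time model: efficiency = compute / (compute + all-reduce sync).
# All timings are illustrative assumptions, not measurements.

def scaling_efficiency(compute_ms: float, sync_ms: float) -> float:
    return compute_ms / (compute_ms + sync_ms)

compute_ms = 50.0                                   # forward+backward per step (assumed)
eff_sxm = scaling_efficiency(compute_ms, 15.0)      # fast NVLink all-reduce
eff_pcie = scaling_efficiency(compute_ms, 200.0)    # slow PCIe all-reduce
print(f"SXM efficiency ~{eff_sxm:.0%}, PCIe ~{eff_pcie:.0%}")
```

With these assumed timings the model lands at roughly 77% and 20%, inside the quoted ranges; the point is that sync time comparable to or exceeding compute time collapses scaling efficiency.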
Multi-GPU Scaling Efficiency
Linear Scaling Prerequisites
Both systems achieve near-linear scaling (efficiency > 90%) when scaling from 1 to 4 GPUs within a single system:
- 1 GPU: baseline
- 2 GPUs: 1.9x speedup (95% efficiency)
- 4 GPUs: 3.8x speedup (SXM); PCIe sees diminishing returns beyond 2 GPUs (~2.0x)
The difference: 4-GPU SXM configurations keep synchronization traffic on NVLink at full bandwidth. PCIe systems hit bandwidth bottlenecks earlier because all GPUs share the PCIe fabric.
Scaling Beyond Four GPUs
SXM scaling to 8 GPUs:
- Efficiency: 75-85% (practical all-reduce takes 1.5-2x longer than theoretical minimum)
- Speedup: 6.5-6.8x relative to single GPU
PCIe scaling to 8 GPUs:
- Efficiency: 20-30% (all-reduce becomes dominant bottleneck)
- Speedup: 1.6-2.4x relative to single GPU
At eight-GPU scale, SXM achieves 3-4x faster distributed training than PCIe. The difference justifies cloud pricing premiums when training large models requiring 8+ GPU-hours.
Communication Topology Optimization
SXM systems benefit from sophisticated communication patterns exploiting NVLink topology. Researchers have developed algorithms (like those used in NCCL optimizations) that avoid full-mesh communication, instead using ring-reduce and tree-reduce patterns that saturate NVLink bandwidth while minimizing contention.
PCIe systems can implement similar algorithms but are bandwidth-limited regardless of topology. Even optimal ring-reduce patterns are limited by PCIe 5.0 x16 bandwidth.
Cloud Pricing Comparison
Hourly Rate Analysis (March 2026)
RunPod Pricing:
- H100 PCIe: $1.99/hour
- H100 SXM: $2.69/hour
- SXM premium: 35%
Lambda Labs Pricing:
- H100 PCIe: $2.86/hour
- H100 SXM: $3.78/hour
- SXM premium over PCIe: 32%
The pricing spread varies by provider. RunPod shows a 35% SXM premium; Lambda Labs shows a 32% SXM premium. For both providers, SXM is more expensive than PCIe.
Cost Per Unit Throughput
Throughput cost analysis depends on workload:
Large distributed training (8 GPUs, batch size 512):
- PCIe cluster: 8 × $1.99 = $15.92/hour, training throughput limited by communication to ~2.4 GPU-equivalents effective
- SXM cluster: 8 × $2.69 = $21.52/hour, training throughput ~6.8 GPU-equivalents effective
- Cost per GPU-equivalent throughput: PCIe $6.63, SXM $3.16 (SXM is 52% cheaper per unit throughput)
Single-GPU inference:
- PCIe: $1.99/hour
- SXM: $2.69/hour
- SXM is 35% more expensive with identical performance
The break-even point occurs around 2-4 GPUs for distributed training. Single-GPU or 2-GPU clusters favor PCIe; clusters of 4+ GPUs favor SXM.
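The break-even can be sketched by dividing cluster price by effective GPU-equivalents after scaling losses. The 4-GPU efficiency figures below are interpolated assumptions; the 8-GPU figures follow the ranges discussed earlier:

```python
# Cost per effective GPU-hour: cluster hourly price divided by effective
# GPU-equivalents (n_gpus * scaling efficiency). Efficiency values for
# 4 GPUs are assumed interpolations, not benchmarks.

def cost_per_effective_gpu(rate_per_gpu: float, n_gpus: int, efficiency: float) -> float:
    return (rate_per_gpu * n_gpus) / (n_gpus * efficiency)

cases = [(1, 1.00, 1.00),   # single GPU: no scaling loss
         (4, 0.90, 0.50),   # assumed mid-scale efficiencies (SXM, PCIe)
         (8, 0.80, 0.25)]   # figures from the scaling section
for n, eff_sxm, eff_pcie in cases:
    sxm = cost_per_effective_gpu(2.69, n, eff_sxm)
    pcie = cost_per_effective_gpu(1.99, n, eff_pcie)
    print(f"{n} GPUs: SXM ${sxm:.2f}, PCIe ${pcie:.2f} per effective GPU-hour")
```

At one GPU, PCIe wins outright; under these assumptions the ordering flips by 4 GPUs, which is where the break-even lands.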
Total Cost of Ownership
For teams provisioning private infrastructure (not cloud):
PCIe 8-GPU system:
- Hardware: ~$80,000 (8 GPUs at ~$10k each)
- Server: ~$15,000
- Cooling: ~$8,000
- Power supply: ~$4,000
- 3-year depreciation: ~$107,000/3 ≈ $35,700/year
- Annual electricity (24/7): 8 × 350W × 8,760 hours × $0.12/kWh ≈ $2,943/year
- Total annual cost: ~$38,600
SXM 8-GPU system (DGX H100 equivalent):
- Hardware: ~$398,000 (DGX H100 list price)
- 3-year depreciation: $398,000/3 ≈ $132,667/year
- Annual electricity (24/7): 8 × 700W × 8,760 hours × $0.12/kWh ≈ $5,887/year
- Total annual cost: ~$138,550
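A quick sanity check of the depreciation-plus-electricity arithmetic, assuming 24/7 operation at $0.12/kWh:

```python
# Annual cost = straight-line depreciation + electricity for continuous
# operation. Hardware prices are the rough estimates used in this section.

def annual_cost(hw_cost: float, years: int, n_gpus: int, watts_per_gpu: float,
                usd_per_kwh: float = 0.12) -> float:
    depreciation = hw_cost / years
    kwh_per_year = n_gpus * watts_per_gpu / 1000 * 8760   # 24/7 operation
    return depreciation + kwh_per_year * usd_per_kwh

pcie = annual_cost(107_000, 3, 8, 350)
sxm = annual_cost(398_000, 3, 8, 700)
print(f"PCIe ~${pcie:,.0f}/yr, SXM ~${sxm:,.0f}/yr")
```

Note that at these power levels electricity is a small fraction of the budget; depreciation dominates both totals.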
Depreciation dominates both budgets. For training workloads where SXM provides roughly 3x effective throughput, annualized cost per unit of throughput is broadly comparable between the two. The absolute capital investment, however, is nearly 4x higher, making SXM suitable only for teams planning sustained, high-utilization deployments.
Real-World Performance Benchmarks
Training Throughput Measurements
Llama 7B fine-tuning (8-GPU cluster, batch size 64 per GPU):
- H100 SXM: 6,200 tokens/second (all-reduce takes 2ms per step)
- H100 PCIe: 1,200 tokens/second (all-reduce takes 12ms per step, limits frequency to ~80 steps/second)
- Ratio: 5.2x advantage to SXM
Llama 70B fine-tuning (8-GPU cluster, batch size 8 per GPU):
- H100 SXM: 800 tokens/second
- H100 PCIe: 200 tokens/second
- Ratio: 4x advantage to SXM
These ratios fall short of the raw bandwidth ratio (1.5 TB/s vs 0.2 TB/s = 7.5x) because PCIe systems use gradient accumulation and reduced synchronization frequency, recovering some efficiency.
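A sketch of how accumulation recovers efficiency: synchronizing every k micro-steps amortizes one slow all-reduce over more compute. All timings here are illustrative assumptions, not benchmarks:

```python
# Gradient accumulation: run k micro-steps of compute per all-reduce,
# amortizing a slow synchronization. Timings and token counts are
# illustrative assumptions, not benchmarks.

def tokens_per_sec(compute_ms: float, sync_ms: float, accum_steps: int,
                   tokens_per_microbatch: int) -> float:
    total_ms = accum_steps * compute_ms + sync_ms
    return accum_steps * tokens_per_microbatch * 1000 / total_ms

for k in (1, 4, 16):
    rate = tokens_per_sec(compute_ms=10, sync_ms=70, accum_steps=k,
                          tokens_per_microbatch=512)
    print(f"accumulate {k:2d}: ~{rate:,.0f} tok/s")
```

Throughput rises with k because the fixed sync cost is spread over more tokens; the trade-off is a larger effective batch size, which can affect convergence.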
Inference Throughput
Llama 70B serving at 100 concurrent requests (40K tokens average context):
- H100 SXM: 900 tokens/second output throughput
- H100 PCIe (single GPU): 900 tokens/second
- Single-request decode throughput is bound by per-GPU resources; SXM's interconnect provides no advantage here
For inference, multi-GPU setups primarily improve tail latency (through batching) and capacity (more concurrent requests served). PCIe serves this role adequately.
When to Choose Each Variant
Select H100 PCIe When
Choose PCIe for single-GPU or dual-GPU deployments. The cost savings (35% less per hour) are substantial for small clusters. Performance is identical.
PCIe is appropriate for inference services not requiring distributed serving. Inference workloads rarely bottleneck on GPU-to-GPU communication; single-GPU instances or small 2-GPU clusters handle most request volumes.
PCIe makes sense when infrastructure already emphasizes modularity and flexibility. Custom servers, cloud deployments, and non-standardized environments benefit from PCIe compatibility with diverse host platforms.
Select H100 SXM When
Choose SXM for 4+ GPU training clusters. The throughput advantage (3-4x for 8 GPUs) justifies the 35% cost premium. Break-even occurs around 4-GPU clusters for training workloads.
SXM is appropriate for teams prioritizing scalability. If training frequently uses 8+ GPUs, SXM provides consistent 75%+ efficiency at scale versus 20% efficiency for PCIe.
SXM makes sense for teams invested in NVIDIA's DGX ecosystem. If existing infrastructure uses DGX systems, additional SXM modules work efficiently together.
Hybrid Approaches
Large teams often deploy mixed clusters: SXM systems for training (high-utilization, large batches) and PCIe systems for inference and development (lower utilization, flexible scaling). This balances cost and performance characteristics.
Cloud providers benefit from offering both. Customers self-select based on workload: training teams choose SXM, inference teams choose PCIe.
FAQ
Can PCIe clusters train large models efficiently?
Yes, but with caveats. Gradient accumulation and reduced synchronization frequency recover some efficiency. 8-GPU PCIe training achieves approximately 20-30% efficiency versus theoretical 100%, while SXM achieves 75-85%.
Is the NVLink PCIe bridge in DGX systems reliable?
Yes, but it becomes a bandwidth bottleneck. Inter-system communication is limited to PCIe x16 speeds (~128 GB/s bidirectional), roughly a 7x reduction relative to intra-system NVLink. Avoid inter-system gradient synchronization for best performance.
Can I upgrade PCIe H100 to SXM later?
No. The form factors and motherboards are completely different. Migration requires purchasing new systems.
What about power infrastructure requirements?
PCIe H100 at 350W per GPU creates modest requirements. SXM at 700W per GPU requires substantial power delivery: an 8-GPU SXM system draws 5,600W for the GPUs alone (roughly 10 kW at the system level), so most data centers need dedicated power distribution.
Which is better for inference at scale?
For single-GPU inference throughput, both are identical. For serving thousands of concurrent requests, multiple smaller PCIe systems often cost less than fewer SXM systems while providing similar aggregate throughput.
How does cooling differ between variants?
PCIe requires external cooling (heatsinks, fans); SXM includes integrated cooling into the server chassis. For data centers with sophisticated cooling, both work well. For edge deployments, PCIe's external cooling is easier to maintain.
Related Resources
Review comprehensive H100 specifications at NVIDIA H100 GPU models.
Understand H100 pricing trends in H100 cloud pricing analysis.
Compare with predecessor generation in A100 vs H100 benchmarks.