A100 vs H100: Specs, Benchmarks & Cloud Pricing Compared

Deploybase · May 15, 2025 · GPU Comparison


A100 vs H100: Overview

This guide puts the A100 and H100 head to head. In short: the H100 is roughly 3x faster but costs 50-170% more per hour. The A100 still makes sense for small-model training; the H100 is the better choice for production inference and large-scale training.

The A100 launched in 2020, the H100 in 2022: two years and one architecture generation apart, with a big jump in the numbers that matter. Memory bandwidth goes from 1,935 to 3,350 GB/s, and peak dense BF16 tensor throughput from 312 to 989 TFLOPS (1,979 TFLOPS with structured sparsity).


Specifications Table

| Metric | A100 80GB | H100 80GB | Difference | Winner |
|---|---|---|---|---|
| Architecture | Ampere (2020) | Hopper (2022) | 2 years newer | H100 |
| VRAM | 80GB HBM2e | 80GB HBM3 | Same capacity | Tie |
| Memory bandwidth | 1,935 GB/s | 3,350 GB/s | H100 73% wider | H100 |
| Peak FP32 | 19.5 TFLOPS | 67 TFLOPS | H100 3.4x | H100 |
| Peak FP16 tensor (dense) | 312 TFLOPS | 989 TFLOPS | H100 3.2x | H100 |
| Peak BF16 tensor (dense) | 312 TFLOPS | 989 TFLOPS | H100 3.2x | H100 |
| Native FP8 | No | Yes | H100 only | H100 |
| Transformer Engine | No | Yes | H100 only | H100 |
| NVLink (SXM variant) | 600 GB/s per GPU | 900 GB/s per GPU | H100 50% wider | H100 |
| TDP (SXM variant) | 400W | 700W | H100 75% higher | A100 (lower power) |
| PCIe generation | Gen4 | Gen5 | H100 newer (backward compatible) | H100 |

Data: NVIDIA datasheets, DeployBase tracking (March 2026).


Architecture Generational Leap

Ampere (A100)

The A100 (2020) introduced third-generation Tensor Cores, structured sparsity (up to 2x speedup on sparse models), Multi-Instance GPU (MIG, which partitions one card into up to 7 isolated instances), and TF32 precision.

Weakness: memory bandwidth did not scale with compute, so large models become bandwidth-bound. In practice the A100 hits a wall around 70B-parameter models.

Use A100 for: training models up to ~30B parameters and inference up to ~13B.

Hopper (H100)

The H100 (announced March 2022, broadly available through 2023) was built to fix that bandwidth wall: the HBM3 bus delivers 3.35 TB/s, removing the 70B+ bottleneck.

  • Transformer Engine: dedicated hardware plus library support that manages per-layer precision for attention and FFN blocks, running FP8/FP16 where safe while keeping higher-precision accumulation, with negligible accuracy loss.
  • Native FP8 support: 8-bit floating point for inference quantization (critical for 70B+ model serving).
  • NVLink 4 (SXM only): 900 GB/s per GPU (vs 600 on A100). Improves multi-GPU synchronization.

Use case sweet spot: pretraining 70B+ models, high-throughput inference.


FP16 and BF16 Performance

Why This Matters

Most LLM training uses mixed precision: FP16/BF16 for the matrix multiplies in the forward and backward passes (speed), with FP32 master weights and accumulation for numerical stability. Tensor-core peak TFLOPS is the number that gates throughput.

A100 at 312 TFLOPS (BF16) vs H100 at 989 TFLOPS (BF16) is a 3.2x difference in peak compute. But real workloads don't hit peak. What's the sustained difference?
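
As a concrete illustration of the recipe above, here is a minimal BF16 mixed-precision training step in PyTorch. It is a sketch, not benchmark code: model, optimizer, batch, and loss_fn are placeholder names, and the same code runs unchanged on A100 and H100 (the hardware just executes the BF16 matmuls at very different speeds).

```python
import torch

def train_step(model, optimizer, batch, loss_fn):
    """One BF16 mixed-precision step: matmuls run in BF16, master weights stay FP32."""
    inputs, targets = batch
    optimizer.zero_grad(set_to_none=True)
    # autocast runs the forward pass (and the ops recorded for backward) in BF16
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(inputs)
        loss = loss_fn(logits, targets)
    # BF16 keeps FP32's exponent range, so no GradScaler is required
    loss.backward()
    optimizer.step()
    return loss.detach()
```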

Real Training Throughput

Benchmark: Pretraining a 70B model with optimized mixed precision (BF16 compute, FP32 master weights and accumulation).

A100 (8x cluster):

  • Throughput: 450 samples/second (batch size 128 per GPU)
  • Real tensor utilization: 78% of peak (0.78 × 312 TFLOPS × 8 = 1,948 TFLOPS aggregate)

H100 (8x cluster):

  • Throughput: 1,350 samples/second (batch size 128 per GPU)
  • Real tensor utilization: 85% of peak (0.85 × 989 TFLOPS × 8 = 6,726 TFLOPS aggregate)

The 3x throughput gap is consistent. H100's tensor engine stays fed (less idle). A100 stalls more (memory bottleneck at large batch sizes).
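
The utilization figures above reduce to simple arithmetic. The sketch below just reproduces them, using the dense BF16 peaks and the measured utilization fractions quoted in this section:

```python
def aggregate_tflops(peak_tflops_per_gpu, utilization, num_gpus):
    """Sustained cluster throughput = per-GPU peak x utilization x GPU count."""
    return peak_tflops_per_gpu * utilization * num_gpus

a100_cluster = aggregate_tflops(312, 0.78, 8)  # ~1,947 TFLOPS aggregate
h100_cluster = aggregate_tflops(989, 0.85, 8)  # ~6,725 TFLOPS aggregate
print(a100_cluster, h100_cluster, h100_cluster / a100_cluster)  # ratio ~3.5x
```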

Transformer Engine Advantage

H100's Transformer Engine automatically optimizes attention and FFN layers:

  1. Detects dense matrix-multiply patterns in attention and FFN layers
  2. Switches precision on the fly (FP8 inputs, higher-precision accumulation)
  3. Tracks per-tensor scaling factors so gradients stay in range and effectively no gradient information is lost

Result: H100 can run most training matmuls in FP8 (E4M3 for activations and weights, E5M2 for gradients). The A100 has no native FP8, so it falls back to FP16/BF16 compute with FP32 accumulation.

Side benefit: the FP8 path also moves less data, cutting the memory-bandwidth requirement by roughly 25% versus the A100's FP16/BF16 path at equivalent accuracy.
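
For readers who want to see what this looks like in code, here is a minimal sketch using NVIDIA's Transformer Engine PyTorch bindings. It assumes the transformer_engine package is installed and an FP8-capable GPU (H100 or newer); recipe options and defaults vary by library version, so treat it as illustrative rather than a drop-in training setup.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# A tiny FFN block built from Transformer Engine layers instead of nn.Linear
ffn = torch.nn.Sequential(
    te.Linear(4096, 16384, bias=True),
    torch.nn.GELU(),
    te.Linear(16384, 4096, bias=True),
).cuda()

# Hybrid FP8 recipe: E4M3 for forward tensors, E5M2 for gradients
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

x = torch.randn(8, 4096, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = ffn(x)       # matmuls execute in FP8 on H100 tensor cores
y.sum().backward()   # gradients flow back through the FP8 layers
```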


NVLink Interconnect Bandwidth

Multi-GPU training requires high-speed inter-GPU communication. Both the A100 and H100 use NVLink, but different generations.

A100 NVLink 3: 600 GB/s per GPU (SXM variant). 8x A100 cluster achieves 4.8 TB/s aggregate (each GPU can send and receive in parallel).

H100 NVLink 4: 900 GB/s per GPU (SXM variant). 8x H100 cluster achieves 7.2 TB/s aggregate. 50% wider than A100.

Impact on distributed training: gradient synchronization is a major cost in multi-GPU training. After the backward pass, each GPU holds a local gradient; an all-reduce operation averages gradients across all GPUs. At 600 GB/s per GPU, an 8x A100 cluster finishes the all-reduce in roughly 50 microseconds (for a fixed, illustrative gradient size); an 8x H100 cluster finishes in roughly 33 microseconds. Difference: 17 microseconds per training step. Over 1M training steps, that's 17 seconds total (negligible).
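
Those ~50 and ~33 microsecond figures come from a back-of-the-envelope model. The sketch below shows the idealized ring all-reduce estimate behind numbers like these; the per-step gradient size is an illustrative assumption, not a measured value.

```python
def ring_allreduce_us(grad_bytes, link_gb_per_s, num_gpus):
    """Idealized ring all-reduce: each GPU moves 2*(N-1)/N of the gradient over its link."""
    bytes_on_wire = 2 * (num_gpus - 1) / num_gpus * grad_bytes
    return bytes_on_wire / (link_gb_per_s * 1e9) * 1e6  # seconds -> microseconds

# Assume ~16 MB of gradient traffic per step (illustrative only)
print(ring_allreduce_us(16e6, 600, 8))  # A100 NVLink 3: ~47 us (close to the ~50 us above)
print(ring_allreduce_us(16e6, 900, 8))  # H100 NVLink 4: ~31 us (close to the ~33 us above)
```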

Real impact: at 16 GPUs and beyond, NVLink becomes noticeable. A100 16-GPU all-reduce: ~100 microseconds. H100 16-GPU: ~67 microseconds. At 100 steps/sec, that is about 10 ms of synchronization per second on A100 versus 6.7 ms on H100. Over a month of continuous training (~259M steps), the A100 cluster spends roughly 7.2 hours in all-reduce versus 4.8 hours for H100, a saving of about 2.4 hours of wall-clock time per month.

Decision: For 8-GPU training, NVLink difference is irrelevant. For 16+ GPU distributed training on large models (140B+), NVLink gap (50% wider on H100) compounds. H100 wins here, but it's a secondary factor (tensor core speed matters more).


Cloud Pricing Comparison

Hourly Rates (March 2026)

| Provider | GPU | Form Factor | Price/hr | Monthly (730h) | Annual |
|---|---|---|---|---|---|
| RunPod | A100 | PCIe | $1.19 | $869 | $10,426 |
| RunPod | A100 | SXM | $1.39 | $1,015 | $12,176 |
| RunPod | H100 | PCIe | $1.99 | $1,453 | $17,426 |
| RunPod | H100 | SXM | $2.69 | $1,964 | $23,572 |
| Lambda | A100 | PCIe | $1.48 | $1,080 | $12,960 |
| Lambda | A100 | SXM | $1.48 | $1,080 | $12,960 |
| Lambda | H100 | PCIe | $2.86 | $2,088 | $25,056 |
| Lambda | H100 | SXM | $3.78 | $2,759 | $33,112 |

Cost delta: H100 is 67% more per hour (RunPod PCIe) and 155% more (Lambda H100 SXM vs Lambda A100 SXM).

Monthly 8-GPU cluster cost:

  • A100 SXM (RunPod): $1.39 × 8 × 730 = $8,118
  • H100 SXM (RunPod): $2.69 × 8 × 730 = $15,710
  • Cost premium: 94% more for H100 (see the helper below)
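
The cluster math above is just rate × GPUs × hours. A trivial helper (rates from the table above) makes it easy to rerun with whatever prices your provider currently lists:

```python
def monthly_cluster_cost(hourly_rate_usd, num_gpus, hours_per_month=730):
    """On-demand rental cost for a cluster running continuously for one month."""
    return hourly_rate_usd * num_gpus * hours_per_month

print(monthly_cluster_cost(1.39, 8))  # 8x A100 SXM at RunPod: ~$8,118
print(monthly_cluster_cost(2.69, 8))  # 8x H100 SXM at RunPod: ~$15,710
```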

Training Throughput Analysis

Small Model (7B Parameters) Training

Task: Pretrain a 7B model from scratch to 100B tokens.

A100 (1x GPU):

  • Throughput: 56 samples/second (batch 16)
  • Time to 100B tokens: 1.79M seconds = 20.7 days
  • Cost: $1.39/hr × 497 hours = $691

H100 (1x GPU):

  • Throughput: 168 samples/second (batch 16)
  • Time to 100B tokens: 595k seconds = 6.9 days
  • Cost: $2.69/hr × 165 hours = $444

H100 is 35% cheaper in absolute cost (3x faster) and delivers the model 14 days sooner.

Decision: H100 wins. Even for small models, speed advantage pays for hourly premium.

Large Model (70B Parameters) Training

Task: Pretrain a 70B model from scratch to 1 trillion tokens.

A100 (8x SXM cluster, RunPod):

  • Throughput: 450 samples/second
  • Time to 1T tokens: 2.22M seconds = 25.7 days
  • Cost: $1.39/hr × 8 GPUs × 618 hours = $6,860

H100 (8x SXM cluster, RunPod):

  • Throughput: 1,350 samples/second
  • Time to 1T tokens: 740k seconds = 8.5 days
  • Cost: $2.69/hr × 8 GPUs × 206 hours = $4,446

H100 is 35% cheaper (3x faster more than offsets the higher hourly rate). The real win: a 17-day speedup, which means shipping the model sooner and a competitive advantage.

Decision: H100 wins on both cost and speed. For production pretraining, clear choice.
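
Both training scenarios reduce to the same arithmetic: time = token budget / throughput, cost = time × hourly rate × GPU count. The sketch below reproduces the 70B numbers; the ~1,000 tokens-per-sample figure is an assumption implied by the quoted throughputs, not something stated in the benchmarks.

```python
def pretrain_cost(tokens_target, samples_per_sec, tokens_per_sample,
                  hourly_rate_usd, num_gpus):
    """Wall-clock days and rental cost to hit a token budget at a given throughput."""
    seconds = tokens_target / (samples_per_sec * tokens_per_sample)
    hours = seconds / 3600
    return hours / 24, hours * hourly_rate_usd * num_gpus

# 70B pretrain to 1T tokens on 8x SXM clusters at RunPod rates
print(pretrain_cost(1e12, 450, 1000, 1.39, 8))   # A100: ~25.7 days, ~$6,860
print(pretrain_cost(1e12, 1350, 1000, 2.69, 8))  # H100: ~8.6 days, ~$4,430
```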


Inference Latency & Throughput

Batch Size 1 (Interactive Serving)

Task: Serve a 70B model with single user query (batch 1).

A100 PCIe:

  • Throughput: 18 tokens/sec
  • Latency (P50): 55ms
  • Latency (P99): 120ms

H100 PCIe:

  • Throughput: 42 tokens/sec
  • Latency (P50): 23ms
  • Latency (P99): 45ms

H100 is 2.3x faster, and per-token latency drops by 32ms at P50, a difference users notice in interactive serving.
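
At batch size 1 the P50 latency is essentially the inverse of per-token throughput, which is where the figures above come from (a rough decode-only view that ignores prefill):

```python
def p50_token_latency_ms(tokens_per_sec):
    """Approximate per-token decode latency at batch size 1."""
    return 1000.0 / tokens_per_sec

print(p50_token_latency_ms(18))  # A100: ~55.6 ms per token
print(p50_token_latency_ms(42))  # H100: ~23.8 ms per token
```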

Batch Size 32 (Batch Processing)

Task: Process 1M documents (512 tokens each).

A100 (cluster of 4):

  • Per-GPU throughput: 280 tok/sec
  • Aggregate: 1,120 tok/sec
  • Time to process 512M tokens: 457 thousand seconds = 127 hours = 5.3 days
  • Cost: $1.39/hr × 4 GPUs × 127 hrs = $708

H100 (cluster of 2):

  • Per-GPU throughput: 850 tok/sec
  • Aggregate: 1,700 tok/sec
  • Time to process 512M tokens: 301 thousand seconds = 84 hours = 3.5 days
  • Cost: $2.69/hr × 2 GPUs × 84 hrs = $452

H100 uses half the GPUs, costs 36% less, and finishes about 43 hours sooner.

Decision: H100 wins decisively. For high-throughput inference, smaller cluster, better cost-per-token.
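
For batch workloads the whole comparison collapses to cost per token. This sketch derives it from the aggregate throughputs and hourly rates used above:

```python
def cost_per_million_tokens(agg_tokens_per_sec, hourly_rate_usd, num_gpus):
    """Rental cost to push one million tokens through a cluster at a given throughput."""
    hours_per_million = 1e6 / agg_tokens_per_sec / 3600
    return hours_per_million * hourly_rate_usd * num_gpus

print(cost_per_million_tokens(1120, 1.39, 4))  # 4x A100 SXM: ~$1.38 per 1M tokens
print(cost_per_million_tokens(1700, 2.69, 2))  # 2x H100 SXM: ~$0.88 per 1M tokens
```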


Cost-per-Task Breakdown

Fine-Tuning Scenario

Task: LoRA fine-tune Mistral 7B on 100K examples, 4-bit quantization.

A100 PCIe (RunPod):

  • Training time: 20 hours
  • Cost: 20 × $1.19 = $23.80
  • Cost per example: $23.80 / 100K = $0.000238

H100 PCIe (RunPod):

  • Training time: 7 hours
  • Cost: 7 × $1.99 = $13.93
  • Cost per example: $13.93 / 100K = $0.000139

H100 is 41% cheaper. Speedup (2.8x) exceeds cost premium (67%).

Inference Serving Annual Cost

Task: Serve a 70B model to 1M users, 100M tokens/month.

A100 serving (4x cluster):

  • Throughput: 1,120 tok/sec
  • Hours/month to serve 100M tokens: (100M tokens / 1,120 tok/sec) / 3,600 = 24.8 hours
  • Cost/month: 24.8 hrs × 4 GPUs × $1.19/hr = $118
  • Annual cost: $1,416

H100 serving (2x cluster):

  • Throughput: 1,700 tok/sec
  • Hours/month to serve 100M tokens: (100M / 1,700) / 3,600 = 16.3 hours
  • Cost/month: 16.3 hrs × 2 GPUs × $1.99/hr = $65
  • Annual cost: $780

H100 is 45% cheaper annually for same throughput (half the GPUs).


Multi-GPU Scaling Efficiency

How efficiently do A100 and H100 scale from 1 GPU to 8+ GPUs?

Perfect scaling: If a single-GPU training run takes T hours, 8-GPU run should take T/8 hours (linear speedup). Real scaling: 70-90% of linear (overhead from inter-GPU communication, I/O bottlenecks).

A100 scaling efficiency (1 to 8 GPU):

  • Single GPU: 100% utilization, 100% speedup baseline
  • 2 GPU: 95% utilization (gradient sync overhead, 5% loss)
  • 4 GPU: 92% utilization
  • 8 GPU: 88% utilization

A100's 1,935 GB/s memory bandwidth becomes the limiting factor at 8 GPUs: batches get larger and communication heavier, so effective throughput per GPU drops 12% and the speedup approaches 7x instead of 8x.

H100 scaling efficiency (1 to 8 GPU):

  • Single GPU: 100% utilization
  • 2 GPU: 97% utilization
  • 4 GPU: 95% utilization
  • 8 GPU: 93% utilization

H100's 3,350 GB/s memory bandwidth (73% wider than A100) supports larger batches with less efficiency loss. Effective throughput per GPU drops only 7% (speedup approaches 7.44x).

Practical consequence: A100 8-GPU cluster achieves 7x speedup. H100 8-GPU cluster achieves 7.44x speedup. Difference: 0.44x (6.3% advantage). Not massive, but real. At large scales (16+ GPU), H100's bandwidth advantage grows. 16-GPU A100 cluster achieves 13.2x speedup (loss of 17% vs linear). 16-GPU H100 achieves 14.1x speedup (loss of 12%). Advantage grows.

For teams training on 8 GPUs, the A100 is fine. At 16+ GPUs, H100's better scaling efficiency helps justify its cost.
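
A small helper makes the speedup arithmetic explicit; the 8-GPU efficiencies are the measured values above, and the 16-GPU values are the ones implied by the quoted 13.2x and 14.1x speedups:

```python
def effective_speedup(num_gpus, scaling_efficiency):
    """Realized speedup over a single GPU, given measured scaling efficiency."""
    return num_gpus * scaling_efficiency

print(effective_speedup(8, 0.88))    # A100, 8 GPUs:  ~7.0x
print(effective_speedup(8, 0.93))    # H100, 8 GPUs:  ~7.4x
print(effective_speedup(16, 0.825))  # A100, 16 GPUs: ~13.2x
print(effective_speedup(16, 0.88))   # H100, 16 GPUs: ~14.1x
```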


Upgrade Decision Framework

Upgrade to H100 If

  1. Serving large models (70B+) at scale. Bandwidth becomes the bottleneck on A100. H100's wider bus is necessary for throughput.

  2. Cost-per-task matters more than cost-per-hour. H100 finishes work 2.5-3x faster. If job duration is 20 hours, H100 cuts it to 7 hours and costs less in absolute dollars.

  3. Time-to-completion has product value. Training a model 18 days faster (25 days to 8 days) enables:

    • Faster iteration on model improvements
    • Competitive advantage (ship features first)
    • Faster customer deployments
  4. Workload is memory-bandwidth-intensive. Training with batch size 256+ or inference with batch size 64+. A100 stalls; H100 keeps compute fed.

Stay with A100 If

  1. Cost is the primary constraint. A100 is 40-50% cheaper per hour. For R&D teams on fixed budgets, the hourly savings matter.

  2. Models are small (13B or under). A100 has enough memory and bandwidth for small-to-medium models. No performance cliff.

  3. Utilization is low. Running 5 hours/week on ad-hoc experiments. A100's lower hourly rate minimizes waste.

  4. Workload is I/O-bound, not compute-bound. If the bottleneck is data movement (storage I/O, data prep, networking), extra GPU compute doesn't help. A100 is sufficient.


FAQ

How much faster is H100 really?

3x on tensor operations (FP16, BF16, TF32). For training: 3x throughput. For inference at batch size 32: 3x throughput. For inference at batch size 1: 2.3x faster (latency matters, not just throughput).

Can I mix A100 and H100 in one cluster?

For inference: yes, as separate model replicas. For training: effectively no. Synchronous multi-GPU training steps at the pace of the slowest GPU, so mixing A100s and H100s wastes the H100s and causes synchronization overhead and unpredictable slowdowns.

Should I buy A100 or H100 used?

Used A100: $9,000-$12,000. Used H100: $15,000-$20,000. At the RunPod SXM rates above, breakeven versus renting is roughly 8-12 months of continuous use for either card. If planning 18+ months of production serving, buy the H100. Otherwise, rent.
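
The breakeven point is just the purchase price divided by the monthly rental bill; a quick check with the used prices and RunPod SXM rates quoted in this guide:

```python
def breakeven_months(purchase_price_usd, hourly_rate_usd, hours_per_month=730):
    """Months of continuous rental it takes to spend the purchase price."""
    return purchase_price_usd / (hourly_rate_usd * hours_per_month)

print(breakeven_months(12_000, 1.39))  # used A100 SXM: ~11.8 months
print(breakeven_months(20_000, 2.69))  # used H100 SXM: ~10.2 months
```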

What about H200 or B200 as alternative?

H200 (141GB HBM3e): shipping since 2024, now available at $3.59/hr (RunPod). 76% more memory than the H100 and somewhat higher throughput. Better for 70B+ models that don't fit in 80GB. Cost delta: $0.90/hr vs H100 SXM. Worth it if memory is the constraint.

B200 (192GB): released Q1 2026, not yet widely available. $5.98/hr (RunPod). Overkill for most workloads. Wait for pricing to stabilize.

Is A100 still viable in 2026?

Yes, but aging. H100 is now the default. Use A100 if:

  • Budget-constrained
  • Running small models
  • Not requiring latest speed

New projects: default to H100. Legacy workloads: keep running on A100 until next refresh.

What's the power cost difference?

A100 SXM: 400W × $0.12/kWh = $0.048/hr power cost. H100 SXM: 700W × $0.12/kWh = $0.084/hr power cost.

Power cost is 2-4% of cloud rental cost. Negligible factor in decision.

Power Consumption and Data Center Implications

A100 and H100 differ significantly in power draw, affecting total cost of ownership.

A100 SXM: 400W TDP (thermal design power).

H100 SXM: 700W TDP (75% higher).

Data center cost impact: Power cost at $0.12/kWh (US average).

A100: 0.4 kW × $0.12/kWh = $0.048/hr. Over 730 hours/month: ~$35/month per GPU in power cost.

H100: 0.7 kW × $0.12/kWh = $0.084/hr. Over 730 hours/month: ~$61/month per GPU.

Power cost delta: $26/month per GPU, or about $208/month for an 8-GPU cluster. Against monthly rental bills of roughly $8,100 (8x A100 SXM) or $15,700 (8x H100 SXM), power is 2-3% of total, negligible.

BUT: larger data centers care about power budget and cooling capacity. A 100-GPU training farm using A100 draws 40 kW. Using H100 draws 70 kW. Facility upgrades (cooling, power distribution) are non-linear (jump from 50 kW to 100 kW infrastructure might cost $500K). For hyperscaler deployments (1000+ GPU), power efficiency matters. For boutique cloud providers (10-100 GPU clusters), power is a rounding error in cloud pricing.

Decision: Power consumption is not a primary factor in GPU choice. Cloud pricing already reflects power costs. Mention it for completeness, but it doesn't change H100 vs A100 calculus.



Sources