Contents
- H200 Price Overview
- Cloud Provider Pricing
- When H200 Makes Sense vs H100
- H200 Specifications and Architecture
- H200 vs H100 Performance
- H200 vs H100 Pricing
- Form Factor Variants
- Spot vs On-Demand Pricing
- Single GPU vs Multi-GPU Clusters
- Monthly and Annual Costs
- Cost Per Gigabyte
- Use Case Cost Estimates
- H200 Availability
- Total Cost of Ownership
- Nvidia H200 Price: Pricing Outlook Through 2026
- FAQ
- Related Resources
- Sources
H200 Price Overview
NVIDIA H200 cloud pricing ranges from $3.59 per GPU-hour on RunPod to $50.44 per hour for an 8-GPU cluster on CoreWeave, as of March 2026. The wide range reflects different deployment models: single-GPU rentals versus multi-GPU production clusters with NVLink interconnect and managed infrastructure.
The H200 carries 141GB of HBM3e memory with 4.8 TB/s bandwidth. For teams hitting the 80GB ceiling on the H100, the H200 nearly doubles capacity without a proportional cost increase. That makes it cost-effective for inference workloads and fine-tuning jobs that previously required model parallelism across multiple GPUs. For detailed specifications and comparisons with other GPUs, see the GPU specifications guide.
Cloud Provider Pricing
| Deployment | Provider | $/hr | Total VRAM | Form Factor |
|---|---|---|---|---|
| Single GPU | RunPod H200 | $3.59 | 141GB | 1x |
| Single GPU | Lambda GH200 | $1.99 | 141GB | 1x |
| 8-GPU Cluster | CoreWeave H200 | $50.44 | 1,128GB | 8x |
The Lambda GH200 at $1.99/hr is technically a different chip (NVIDIA's Grace Hopper CPU-GPU combo), but it occupies the same market slot for inference. Both carry 141GB of HBM3e; the H200 delivers 4.8 TB/s of GPU-side bandwidth versus roughly 4.0 TB/s on the GH200 (which also benefits from its Grace CPU's large LPDDR5X memory pool). However, GH200 availability is limited, and optimizing workloads for the Grace CPU cores involves a learning curve.
RunPod's single H200 at $3.59/hr offers straightforward pricing. Rent by the hour, no multi-year commitments, no cluster complexity. For batch processing, fine-tuning, and single-instance inference, that's the entry point across major cloud providers as of March 2026.
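As a quick sanity check, the table's hourly rates convert to monthly spend like this; a minimal sketch (rates are the March 2026 quotes above, and 730 hours approximates a month of 24/7 operation):

```python
# Estimate monthly spend from the hourly rates quoted above.
# Treat these as illustrative snapshots, not live prices.
RATES = {
    "RunPod H200 (1x)": 3.59,
    "Lambda GH200 (1x)": 1.99,
    "CoreWeave H200 (8x cluster)": 50.44,
}

def monthly_cost(rate_per_hr: float, hours: float = 730) -> float:
    """Cost for one month at the given hourly rate (730 hrs ~ 24/7)."""
    return rate_per_hr * hours

for name, rate in RATES.items():
    print(f"{name}: ${monthly_cost(rate):,.0f}/month at 24/7")
```

Running at partial utilization scales linearly: pass `hours=100` for a light research workload.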
When H200 Makes Sense vs H100
The decision tree for H200 vs H100:
Use H100 if:
- The models are under 70B parameters
- Batch size 1-2 fits the use case
- Budget is the primary constraint (the roughly $1.60/hr gap is significant at scale)
- Teams are doing inference with short sequences (<2K tokens)
Use H200 if:
- Teams are currently hitting OOM errors on H100
- Teams need batch sizes > 4 for standard 7B models
- The models exceed 70B parameters in full precision
- Teams are doing long-context inference (2K+ tokens per request)
- Teams are fine-tuning large models and want to avoid multi-GPU synchronization
Choose H200 when simplicity, latency, or throughput matters more than raw cost per token: its real value is eliminating complexity (no multi-GPU setup, no model parallelism tuning). For cost-per-token optimization on simple workloads, H100 is still the better value.
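The decision tree above can be encoded directly. A sketch (the thresholds are the ones listed; the function name and signature are illustrative, not from any library):

```python
def pick_gpu(model_params_b: float, batch_size: int,
             context_tokens: int, hitting_oom: bool) -> str:
    """Apply the H100-vs-H200 heuristics from the decision tree above.

    model_params_b: model size in billions of parameters (full precision).
    """
    if hitting_oom:
        return "H200"   # already past the 80GB ceiling
    if model_params_b > 70:
        return "H200"   # full-precision 70B+ needs the 141GB
    if context_tokens >= 2000:
        return "H200"   # long-context KV cache is memory-hungry
    if batch_size > 4:
        return "H200"   # large batches on 7B-class models
    return "H100"       # fits in 80GB; cheaper per hour
```

For example, a 7B model at batch 2 with short prompts lands on H100, while the same model at batch 8 tips over to H200.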
H200 Specifications and Architecture
The NVIDIA H200 is NVIDIA's answer to the memory ceiling. The H100 maxes out at 80GB. Real-world LLM workloads are pushing past that: full-precision 70B models, long-context retrieval-augmented generation, and large batch inference all hit the 80GB wall.
The H200 specification sheet as of March 2026:
- Memory: 141GB HBM3e
- Bandwidth: 4.8 TB/s
- Compute: 3,958 TFLOPS FP8 (with sparsity), 1,979 TFLOPS BF16 (with sparsity)
- NVLink: 900 GB/s per GPU (NVLink 4.0)
- TDP: 700W (SXM variant)
- Form Factor: SXM5 module and PCIe card variants
The 141GB is the key feature. HBM3e (next-generation high bandwidth memory) is denser and more power-efficient than the H100's HBM3. The jump from 80GB to 141GB is +76%, not a doubling, but enough to eliminate most model-parallelism headaches for models under 100B parameters.
The bandwidth increase (4.8 TB/s vs H100's 3.35 TB/s) helps less than the memory increase. Bandwidth matters for long-sequence inference (4K+ token context). For standard batch processing, the memory is the constraint.
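A rough way to sanity-check whether a model's weights fit in 141GB is bytes-per-parameter arithmetic. This is a back-of-envelope sketch; it ignores KV cache, activations, and framework overhead, which all demand real headroom on top of the weights:

```python
def weights_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB: billions of params x bytes/param."""
    return params_billions * bytes_per_param

H100_GB, H200_GB = 80, 141

for params, dtype, bpp in [(70, "FP16", 2), (70, "FP8", 1), (13, "FP16", 2)]:
    need = weights_gb(params, bpp)
    print(f"{params}B @ {dtype}: {need:.0f}GB weights -> "
          f"H100 {'fits' if need < H100_GB else 'OOM'}, "
          f"H200 {'fits' if need < H200_GB else 'OOM'}")
```

Note how a 70B model at FP16 (140GB of weights) lands just inside the H200's 141GB but far past the H100's 80GB, which is exactly the "80GB wall" described above.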
H200 vs H100 Performance
Inference Throughput
On LLM inference (Mistral 7B, batch size 4):
- H100 (80GB): ~180 tokens/sec
- H200 (141GB): ~185 tokens/sec
Throughput gain: 3% (negligible).
Batch Size Support
- H100 PCIe: batch size 2-4 on 7B models, batch size 1-2 on 13B models
- H200: batch size 4-8 on 7B models, batch size 2-4 on 13B models
The memory advantage translates to batch size gains, not single-request speed. This matters for inference serving where throughput (total tokens/second across all users) is the optimization target.
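One way to compare the two cards for serving is aggregate tokens per dollar at the batch sizes each supports. The sketch below assumes the ~180-185 tok/s figures above are aggregate throughput at batch 4, and that aggregate throughput scales near-linearly when the H200 doubles the batch to 8; that is an idealization for memory-bound serving, not a benchmark result:

```python
# Aggregate tokens-per-dollar at each card's supported batch size.
# Assumption: the quoted tok/s figures are aggregate at batch 4, and
# aggregate throughput scales ~linearly to batch 8 on the H200.

def tokens_per_dollar(agg_tok_per_sec: float, rate_hr: float) -> float:
    return agg_tok_per_sec * 3600 / rate_hr

h100 = tokens_per_dollar(180, 1.99)       # batch 4, its memory ceiling
h200 = tokens_per_dollar(185 * 2, 3.59)   # batch 8, ~2x aggregate
print(f"H100: {h100:,.0f} tok/$  H200: {h200:,.0f} tok/$")
```

Under these assumptions the H200 edges ahead on tokens per dollar despite the higher hourly rate, which is the batching argument made above in numeric form.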
Training Speed
Training a 7B parameter model with full precision:
- H100 (80GB): requires 2 GPUs with model parallelism
- H200 (141GB): fits on single GPU without parallelism
Training on a single H200 versus two H100s with model parallelism: the H200 is 10-15% faster in wall-clock time because the synchronization overhead disappears.
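Combining that wall-clock penalty with the hourly rates gives a simple cost comparison. A sketch, taking the midpoint (12%) of the 10-15% synchronization penalty cited above and RunPod's listed rates:

```python
# Compare wall-clock and cost for a job needing ~140GB of state:
# one H200 vs two H100s with model parallelism. The 12% sync penalty
# is the midpoint of the 10-15% range cited above; rates are RunPod's.

H200_RATE = 3.59      # $/hr, single GPU
H100_RATE = 1.99      # $/hr per GPU, PCIe
SYNC_PENALTY = 0.12   # extra wall-clock from 2-GPU synchronization

def compare(h200_hours: float) -> dict:
    h100_hours = h200_hours * (1 + SYNC_PENALTY)
    return {
        "h200_cost": h200_hours * H200_RATE,
        "2xh100_cost": h100_hours * H100_RATE * 2,
        "h100_extra_hours": h100_hours - h200_hours,
    }

print(compare(20))  # a 20-hour fine-tuning run
```

For a 20-hour run, the single H200 is both cheaper and finishes about 2.4 hours sooner under these assumptions.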
H200 vs H100 Pricing
H200 runs roughly 80% more expensive than H100 on the same provider. At RunPod: H100 PCIe $1.99/hr vs H200 $3.59/hr. The premium is justified when 141GB solves a real memory constraint. When 80GB is sufficient, H100 is the better value.
Form Factor Variants
H200 ships in SXM5 (the primary datacenter form factor) and a PCIe variant, which NVIDIA markets as the H200 NVL.
- Single GPU inference: H200 PCIe available from boutique providers
- Multi-GPU training: H200 SXM5 available on CoreWeave in 8-GPU NVLink clusters
- Large batch inference: H200 single GPU beats H100 single GPU on memory; distributed 8x H200 SXM clusters offer full NVLink interconnect for training
Spot vs On-Demand Pricing
Spot pricing for H200 is not yet standardized across cloud providers. RunPod has not published spot discount rates. AWS and GCP don't offer H200 yet.
When spot pricing lands (expected Q2-Q3 2026), expect 30-50% discounts similar to H100 spot rates.
Provisional estimate (not confirmed): H200 spot on RunPod might reach $1.80-$2.00/hr if they follow H100's discount pattern.
Single GPU vs Multi-GPU Clusters
Single-GPU Economics
RunPod H200 at $3.59/hr for 730 monthly hours (24/7 continuous operation):
- Monthly: $2,621
- Annual: $31,450
- 3-year: $94,350
This covers inference serving, fine-tuning with batch size 1-4, and research experiments. No orchestration complexity, no NVLink coordination overhead: just rent GPU time by the hour and pay for what you use.
Multi-GPU Cluster Economics
CoreWeave 8x H200 at $50.44/hr for 730 monthly hours:
- Monthly: $36,821
- Annual: $441,852
- 3-year: $1,325,556
This is for distributed training. Eight H200 GPUs connected via NVLink at 900 GB/s per GPU. The aggregate 1,128GB of memory allows training 70B+ parameter models without reshuffling weights across GPUs, and the NVLink bandwidth lets all eight GPUs keep the interconnect saturated during gradient synchronization.
The per-GPU cost ($50.44 / 8 = $6.31/GPU-hr) looks expensive next to RunPod's single-GPU price, but the comparison misleads. CoreWeave's rate includes NVLink, power conditioning, SLA guarantees, and load-balanced networking. A standalone GPU without that support rents for $3.59/hr; eight coordinated GPUs with NVLink and power redundancy are worth significantly more per GPU-hour.
Monthly and Annual Costs
Single H200 (RunPod)
| Hours/Month | Monthly Cost | Annual Cost |
|---|---|---|
| 100 hrs | $359 | $4,308 |
| 250 hrs | $898 | $10,770 |
| 730 hrs (24/7) | $2,621 | $31,450 |
8x H200 Cluster (CoreWeave)
| Hours/Month | Monthly Cost | Annual Cost |
|---|---|---|
| 100 hrs | $5,044 | $60,528 |
| 250 hrs | $12,610 | $151,320 |
| 730 hrs (24/7) | $36,821 | $441,852 |
Cost Per Gigabyte
H200 memory premium versus older GPUs:
| GPU | Memory | $/hr | Cost/GB |
|---|---|---|---|
| H100 PCIe (80GB) | 80GB | $1.99 | $0.025/GB |
| H200 (141GB) | 141GB | $3.59 | $0.025/GB |
| GH200 (141GB) | 141GB | $1.99 | $0.014/GB |
The H200 and H100 have nearly identical cost per gigabyte on RunPod. The GH200 is notably cheaper per GB, but availability is sparse. For teams evaluating bang-for-buck on pure memory capacity, the H200 doesn't command a per-gigabyte premium; you simply pay proportionally for the extra capacity.
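The cost-per-GB column is just the hourly rate divided by capacity; a quick check against the table's figures:

```python
# Cost per GB of VRAM per hour, from the rates and capacities above.
gpus = {
    "H100 PCIe": (1.99, 80),    # ($/hr, GB)
    "H200":      (3.59, 141),
    "GH200":     (1.99, 141),
}

for name, (rate, gb) in gpus.items():
    print(f"{name}: ${rate / gb:.3f}/GB-hr")
```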
Use Case Cost Estimates
Fine-Tuning 70B Mistral (Full Training, 20 hours)
RunPod H200 single GPU: 20 hours × $3.59 = $71.80
Full fine-tuning a 70B model on H100 (80GB) requires model parallelism across 2-4 GPUs. On H200 (141GB), a single GPU usually handles it with medium batch sizes. Two H100s at $5.38/hr combined would need roughly 22 hours for the same job once synchronization overhead is counted (about $120), so the single H200 finishes sooner and costs roughly 40% less.
Inference Serving (2M tokens/day, ~4 hrs GPU/day)
RunPod H200 at $3.59/hr: roughly 4 hrs × $3.59 × 30 days = $431/month
The 141GB capacity handles any open-source LLM under 70B parameters at batch size 4-8. Quantized 70B models run at batch 16+. For long-running inference that previously required model parallelism, the H200 is a cost win despite the higher hourly rate.
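The monthly figure above can be backed out from the token budget. A sketch, where the 140 tok/s effective serving rate is an assumption on my part (batching gaps, padding, and idle time eat into the ~185 tok/s peak quoted earlier):

```python
# Back out GPU-hours and monthly cost from a daily token budget.
# The 140 tok/s effective rate is an assumed sustained figure,
# below the ~185 tok/s peak cited earlier in this article.

def monthly_serving_cost(tokens_per_day: float, eff_tok_per_sec: float,
                         rate_hr: float, days: int = 30) -> float:
    gpu_hours_per_day = tokens_per_day / eff_tok_per_sec / 3600
    return gpu_hours_per_day * rate_hr * days

cost = monthly_serving_cost(2_000_000, 140, 3.59)
hrs = 2_000_000 / 140 / 3600
print(f"~{hrs:.1f} GPU-hrs/day, ${cost:,.0f}/month")
```

At 2M tokens/day this lands near 4 GPU-hours a day, matching the estimate above; double the traffic and the cost doubles with it.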
Batch Workload (Document Processing, 500 tasks)
RunPod H200 single GPU, estimated 40 hours of processing: 40 × $3.59 = $143.60
Compare to running the same job on H100 split across two GPUs: 40 hours × ($2.69 × 2) = $215.20. The H200 saves $71.60 on this batch, a 33% reduction in cost.
H200 Availability
H200 entered production in Q4 2025 and ramped through early 2026. Availability across major cloud providers as of March 2026:
- RunPod: Full availability, all regions
- Lambda: GH200 only (similar specs, different CPU), limited availability
- CoreWeave: Available in select regions, 8-GPU clusters
- AWS: Expected Q2 2026
- Google Cloud: Expected Q2-Q3 2026
- Azure: Expected Q3 2026
Boutique providers (ORI, Nebius, Hyperstack) are not yet advertising H200 on public pricing pages. Expect rollout through Q2-Q3 2026.
Total Cost of Ownership
Single H200 (3-Year Deployment)
- Purchase cost (H200 PCIe): $35,000
- Power (500W × 24/7 × 1,095 days at $0.12/kWh): $1,577
- Cooling/Facility (estimated $5,000/year): $15,000
- Depreciation (assume 3-year total loss): $0 (sunk cost)

Total 3-year cost: $51,577
Equivalent monthly ownership cost: $1,433 (continuous 24/7 operation)
At RunPod's $3.59/hr:
- 730 hrs/month × $3.59 × 36 months = $94,345
- TCO crossover: roughly 20 months of continuous operation

For teams renting for less than ~20 months, or running at less than 24/7 utilization, cloud rental is cheaper. For teams planning 3+ years of continuous use, ownership comes out ahead even after power and cooling.
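The breakeven arithmetic can be sketched directly from the assumptions above (500W continuous draw, $0.12/kWh, $5,000/year facility cost, $3.59/hr cloud rate); this compares cumulative 24/7 rental against the full 3-year ownership bill:

```python
# Ownership-vs-cloud breakeven, using the TCO assumptions above.
PURCHASE = 35_000
POWER_KW = 0.5            # 500W continuous draw
KWH_RATE = 0.12           # $/kWh
FACILITY_PER_YEAR = 5_000
CLOUD_RATE = 3.59         # $/GPU-hr
HOURS_PER_MONTH = 730

power_3yr = POWER_KW * HOURS_PER_MONTH * 36 * KWH_RATE
total_ownership_3yr = PURCHASE + power_3yr + FACILITY_PER_YEAR * 3
monthly_cloud = HOURS_PER_MONTH * CLOUD_RATE

breakeven_months = total_ownership_3yr / monthly_cloud
print(f"3-yr ownership: ${total_ownership_3yr:,.0f}; "
      f"crossover after ~{breakeven_months:.0f} months of 24/7 rental")
```

The breakeven moves out quickly at lower utilization: at 50% utilization the monthly cloud bill halves while the ownership bill barely changes.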
Nvidia H200 Price: Pricing Outlook Through 2026
H100 and H200 pricing is shifting through early 2026. Predictions for Q2-Q4 2026:
H200 availability: Ramp from 3 providers (RunPod, Lambda, CoreWeave) to 10+ by Q3 2026. Increased supply will drive modest price competition (expect 5-10% decrease by Q4).
H100 pricing pressure: As H200 adoption grows, H100 demand drops. Expect H100 rates to drift down 10-20% through 2026. This isn't because H100s become cheaper to operate; it's because older-generation supply sits unsold.
Major cloud provider response: AWS and GCP have announced H200 support for Q2 2026. Pricing will likely be 10-15% premium to boutique providers (RunPod, Lambda), consistent with their historical pattern. Expect AWS H200 at $4-4.50/hr.
B200 impact: NVIDIA's B200 (192GB HBM3e, already available in cloud from Q1 2026) will further pressure H200 and H100 pricing. Teams with budgets will wait for B200 or negotiate deeper H100 discounts.
FAQ
Is H200 worth the premium over H100? If batch sizes or sequence lengths push past the 80GB ceiling, yes. The H200 avoids model parallelism overhead and eliminates multi-GPU synchronization. For single-GPU workloads hitting 80GB limits, the roughly 80% hourly premium is worth it. For workloads that fit comfortably on H100, stick with H100.
When will H200 be available on AWS and Google Cloud? AWS expects to launch by June 2026, Google Cloud by September 2026. As of March 2026, H200 availability is limited to RunPod and CoreWeave (Lambda offers the related GH200). Expect major-provider rollout through Q2-Q3 2026.
Can I use H200 for training? Yes, but not cost-efficiently as a single GPU. Full H200 clusters (8 GPUs) are available on CoreWeave. The NVLink interconnect makes distributed training possible, but at $50.44/hr for the cluster, it's roughly equivalent to H100 clusters. Use H200 clusters only if you need the memory capacity for model parallelism at 70B+ scale.
How much faster is H200 compared to H100 for my workload? For inference: 3-5% faster wall-clock due to higher bandwidth; the real win is batching, since the H200 handles larger batches without OOM errors. For training: 10-15% faster when it eliminates model parallelism; a full fine-tune of a 70B model that required two H100s now fits on one H200 with no synchronization overhead. For memory-constrained tasks: up to 50% faster because you're no longer shuffling data between GPU and system RAM.
Will H200 price drop in 2026? Unlikely to drop below $3.00/hr on single-GPU rentals. More likely: price stays stable while hyperscalers (AWS, GCP) enter the market at $3.50-4.00/hr in Q2-Q3. Competitive pressure will be modest. Core drivers (NVIDIA supply, datacenter costs) don't support sharp discounts.
What's the return on investment for buying H200 hardware? Breakeven on ownership is roughly 20 months of continuous operation at cloud prices. For teams planning 2+ years of steady inference or training, owning becomes cost-competitive. For projects under 18 months, renting is cheaper and carries less capital risk.
How does H200 compare to L40S? L40S is inference-focused with 48GB memory. H200 is general-purpose with 141GB. For dense transformer serving, L40S is cheaper per GPU-hour. For fine-tuning, batch processing, and anything requiring more than 48GB, H200 is the right choice.
What's the H200 memory bandwidth advantage? 4.8 TB/s vs H100's 3.35 TB/s. That's 43% higher throughput. In practice, this matters for long-sequence inference (4K+ token context) where memory bandwidth becomes the bottleneck. For standard batch processing and short sequences, the difference is unnoticeable.
Related Resources
Sources
- NVIDIA H200 Product Brief
- NVIDIA H200 Datasheet
- RunPod Pricing
- Lambda Cloud H200 Pricing
- CoreWeave Pricing
- DeployBase GPU Pricing Tracker (H200 rates observed March 21, 2026)