Contents
- H200 on RunPod: Overview
- RunPod H200 Pricing
- How to Rent H200 on RunPod
- H200 vs H100 on RunPod
- FAQ
- Related Resources
- Sources
H200 on RunPod: Overview
The H200 on RunPod rents for $3.59/hour. It packs 141GB of HBM3e memory, more than any other Hopper-generation GPU, and builds on the H100 architecture for LLM inference at scale.
FP8 throughput reaches 3,958 TFLOPS (with sparsity), FP32 reaches 67 TFLOPS, and memory bandwidth is 4.8 TB/s. That is enough to serve Llama 70B or Mixtral 8x22B with minimal quantization.
Long-context workloads love this: 100K+ token sequences fit comfortably.
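As a rough sanity check on that long-context claim, KV-cache growth is usually what limits context length. A back-of-envelope sketch, assuming Llama 3 70B's published attention shape (80 layers, 8 grouped-query KV heads, head dimension 128) and an FP16 cache:

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Approximate KV-cache footprint per token: one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama 3 70B uses grouped-query attention: 80 layers, 8 KV heads, head dim 128.
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
context_gb = per_token * 100_000 / 1e9  # 100K-token sequence, FP16 cache
print(f"{per_token} bytes/token, ~{context_gb:.1f} GB for 100K tokens")
```

Roughly 33GB of cache for a single 100K-token sequence: combined with quantized or FP8 weights, that fits comfortably inside 141GB.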
RunPod H200 Pricing
RunPod's H200 pricing as of March 2026 sits at $3.59 per hour for on-demand rental, reflecting the H200's position at the top of NVIDIA's Hopper lineup. The rate assumes standard US region availability; international regions may vary by 15 to 30 percent.
Compared to the H100 PCIe at $1.99 per hour on RunPod, the H200 costs 80 percent more per hour. However, the additional 41GB of memory often eliminates the need for multiple H100s, potentially reducing overall costs for large model workloads.
RunPod charges by the minute, with a one-minute minimum billing increment. Users pay only for actual runtime, with no reserved capacity charges or standby fees. Network egress to the public internet costs $0.10 per GB beyond the first GB per month per instance.
Long-running inference services can take advantage of RunPod's spot pricing, typically offering 50 to 70 percent discounts compared to on-demand rates. Spot instances suit workloads that tolerate occasional interruptions.
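The billing rules above can be folded into a small cost estimator. A sketch using only figures quoted in this article (the $3.59/hour rate, per-minute billing with a one-minute minimum, and $0.10/GB egress beyond the first free GB); the spot discount is whatever the market shows when you launch:

```python
import math

H200_ON_DEMAND = 3.59  # $/hour, on-demand rate from this article
EGRESS_PER_GB = 0.10   # $/GB beyond the first free GB per month

def pod_cost(runtime_minutes, egress_gb=0.0, spot_discount=0.0):
    """Per-minute billing with a one-minute minimum, plus metered egress."""
    billed_minutes = max(1, math.ceil(runtime_minutes))
    compute = billed_minutes * (H200_ON_DEMAND / 60) * (1 - spot_discount)
    egress = max(0.0, egress_gb - 1.0) * EGRESS_PER_GB
    return round(compute + egress, 2)

pod_cost(90)                     # 1.5 hours on-demand
pod_cost(90, spot_discount=0.6)  # same runtime at a 60% spot discount
```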
How to Rent H200 on RunPod
Renting an H200 on RunPod takes just a few clicks. First, sign up at runpod.io and verify your email address. Then add a payment method through the account settings page.
Navigate to the RunPod console and select "Create Pod." Search for "H200" in the available GPU filter. The interface displays hourly rates for on-demand and spot instances side by side.
Select your preferred region, machine type, and volume configuration. RunPod offers persistent volumes for storing models and datasets separately from the compute instance. A 50GB volume for model storage is typical for most inference workloads.
Choose a container image. RunPod provides pre-built images for common frameworks like vLLM, Ollama, and HuggingFace TGI. Custom images can be specified from Docker Hub or private registries.
After creating the pod, RunPod assigns a public IP address and port number for SSH access. Download the connection script or connect directly using SSH with the provided key. Start the inference server through the pod's command terminal.
Connect to the inference endpoint using standard REST APIs. Most users proxy requests through a local application or integrate directly into their pipeline.
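As an illustration, vLLM's OpenAI-compatible server exposes a /v1/completions endpoint. A minimal stdlib-only client sketch with a hypothetical pod address (substitute the IP and port RunPod assigns to your pod, and your own model name):

```python
import json
import urllib.request

def build_completion_request(host, port, model, prompt, max_tokens=256):
    """Build an HTTP request for an OpenAI-compatible /v1/completions endpoint.
    Host and port come from your pod's connection details."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }).encode()
    return urllib.request.Request(
        f"http://{host}:{port}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Hypothetical pod address for illustration only.
req = build_completion_request("203.0.113.7", 8000,
                               "meta-llama/Meta-Llama-3-70B-Instruct",
                               "Summarize the H200 in one sentence.")
# urllib.request.urlopen(req) sends it; the response is JSON with a "choices" list.
```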
H200 vs H100 on RunPod
The decision between H200 and H100 depends on memory requirements and budget constraints. The H100 at $1.99 per hour suits smaller models or heavily quantized variants. Most models under 20 billion parameters run efficiently on H100 with proper quantization.
H200's 141GB memory supports full-precision inference for larger models without quantization. This preservation of model precision often improves output quality for creative tasks and complex reasoning. The trade-off is higher hourly cost.
For batch inference with moderate batch sizes, H100 often provides better throughput-per-dollar. Multiple H100s can be orchestrated across RunPod's network for distributed inference. H200 shines for single-GPU performance on massive models.
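Throughput-per-dollar is easy to quantify once you have measured tokens/second on each card. A sketch with hypothetical throughput numbers (benchmark your own workload; only the hourly rates come from this article):

```python
def tokens_per_dollar(tokens_per_second, hourly_rate):
    """Sustained tokens generated per dollar of GPU time."""
    return tokens_per_second * 3600 / hourly_rate

# Hypothetical measured throughputs for one mid-size model; measure your own.
h100 = tokens_per_dollar(tokens_per_second=2400, hourly_rate=1.99)
h200 = tokens_per_dollar(tokens_per_second=3100, hourly_rate=3.59)
# Even with higher raw throughput, the H200's price premium can make the
# H100 the better value per dollar for models that fit comfortably in 80GB.
```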
Detailed H100 specifications and benchmarks provide additional context for comparison, and the H200 specification sheet outlines the exact performance differences between the two processors.
FAQ
Can I run Llama 3 70B on H200 without quantization? Yes. Llama 3 70B's native BF16 weights total roughly 140GB, so the model fits in the H200's 141GB, though headroom for KV cache and batching is tight at full precision. Running without quantization preserves model accuracy.
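The arithmetic behind that answer, as a rough sketch that counts weights only (KV cache and activations come on top):

```python
def weight_gb(params_billion, bytes_per_param):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param

weight_gb(70, 2)  # BF16/FP16: 140 GB, a tight fit against 141GB of HBM3e
weight_gb(70, 1)  # FP8: 70 GB, leaving ample headroom for KV cache and batching
```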
What inference frameworks work on RunPod H200? vLLM, HuggingFace Text Generation Inference, and Ollama all support H200. These frameworks handle optimization and batching automatically.
How does H200 compare to renting multiple H100s? A single H200 often outperforms two H100s for large model inference because model weights stay resident in memory rather than being split across GPUs.
What is RunPod's uptime SLA for H200 instances? RunPod does not publish formal SLAs for spot or on-demand instances. Production deployments should use monitoring and failover strategies.
Can I use H200 for training? While possible, H200's primary design targets inference. Training on H200 works but offers no significant advantage over H100.
Related Resources
H100 RunPod Pricing and Availability covers the previous-generation flagship GPU.
Inference Optimization Strategies explains techniques to maximize throughput on any GPU.
GPU Pricing Comparison Guide shows H200 rates across all providers including Lambda and CoreWeave.