Contents
- GPU vs CPU for AI: Architecture Differences
- Core Design Philosophy
- CUDA Cores vs CPU Cores
- Memory Bandwidth and Throughput
- Parallelism: The Core Advantage
- Real Benchmark Data
- Training Performance Comparison
- Inference Performance Comparison
- Cloud GPU Pricing
- Power Consumption and Efficiency
- When CPUs Are Sufficient
- When GPUs Are Essential
- Cost Analysis: GPU vs CPU Rental
- FAQ
- Related Resources
- Sources
GPU vs CPU for AI: Architecture Differences
GPU vs CPU for AI workloads comes down to parallelism: CPUs execute instructions sequentially, one after another. A modern CPU has 8-16 cores, each running at 3-5 GHz. Each core has dedicated cache (L1, L2, L3) and branch prediction logic. CPUs excel at general-purpose computing: decision trees, control flow, irregular memory access patterns.
GPUs (Graphics Processing Units) execute instructions in parallel across thousands of smaller cores. An NVIDIA H100 has 16,896 CUDA cores, each running at lower clock speed (1.5-1.8 GHz). Each core is simpler: less cache, no branch prediction. GPUs trade sequential speed for massive parallelism.
For AI, this architectural difference is decisive. Training Llama 2 70B requires matrix multiplication throughput on the order of 10^18 operations per second across a cluster (exascale). A single CPU peaks at a few trillion FP32 operations per second. A single H100 GPU hits 67 trillion FP32 operations per second (67 TFLOPS).
CPUs: optimized for low-latency instruction execution. GPUs: optimized for high-throughput parallel computation.
This distinction shapes everything about AI infrastructure.
Core Design Philosophy
CPU Design
CPUs prioritize latency. Run one task as fast as possible. Clock speeds are high (3-5 GHz). Branch prediction, out-of-order execution, and deep caches reduce pipeline stalls. A CPU cache miss costs 200+ clock cycles of wasted compute.
This matters for web servers, databases, and business logic where response time (latency) is critical. Serving a web request needs < 100ms turnaround. CPUs deliver this.
CPUs assume irregular, complex workflows. Code is likely to branch (if/else statements), access memory unpredictably, and call different functions. Cache and prediction logic handle this gracefully.
GPU Design
GPUs prioritize throughput. Run many simple tasks in parallel. Clock speeds are lower (1-2 GHz) because each core is simpler. No branch prediction. No large caches per core (shared cache only).
This works for AI because training is regular and predictable (matrix ops repeat). Throughput (tasks/second) matters more than latency (microseconds). Massive parallelism amortizes the per-core simplicity. Forward pass through a 100-layer transformer looks like: matrix multiply, activation, repeat 100 times. This predictability enables GPU efficiency.
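This regularity is easy to see in code. Below is a minimal, illustrative sketch (NumPy, toy shapes and random weights, not a real transformer) of the matmul-activation loop:

```python
import numpy as np

# Illustrative only: 100 identical layers of dense matmul + activation.
# This regular, branch-free pattern is exactly what GPUs parallelize well.
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 512)).astype(np.float32)  # batch of activations

# Weights scaled so activations stay numerically stable over 100 layers.
layers = [0.0625 * rng.standard_normal((512, 512)).astype(np.float32)
          for _ in range(100)]

for w in layers:
    x = np.maximum(x @ w, 0.0)  # matrix multiply, ReLU activation, repeat
```

Every iteration runs the same dense, branch-free kernel, which is precisely the shape of work a GPU's thousands of cores can tile in parallel.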
CUDA Cores vs CPU Cores
What is a CUDA Core?
NVIDIA's CUDA (Compute Unified Device Architecture) core is the functional unit for floating-point math on GPUs. Each CUDA core can execute one FP32 (32-bit float) fused multiply-add, two floating-point operations, per clock cycle.
NVIDIA H100: 16,896 CUDA cores × ~2 GHz × 2 ops per cycle (FMA) ≈ 67 trillion FP32 operations per second (peak).
Modern Intel/AMD CPU: 16 cores × 5 GHz × 32 FP32 ops per cycle (AVX-512 with FMA) = 2.56 trillion FP32 operations per second (peak; a single core tops out at 160 billion ops/sec).
The GPU has roughly 26x more peak throughput. But the comparison is deceptive.
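These peak figures fall out of simple arithmetic. A quick sanity check in Python (the ops-per-cycle factors, FMA on the GPU and AVX-512 FMA on the CPU, are standard assumptions rather than vendor formulas):

```python
def peak_fp32_tflops(cores: int, clock_ghz: float, ops_per_cycle: int) -> float:
    """Peak FP32 throughput in TFLOPS: cores * clock (GHz) * FP32 ops per cycle."""
    return cores * clock_ghz * ops_per_cycle / 1_000

# H100: 16,896 CUDA cores at ~2 GHz boost, 2 FLOPs/cycle (fused multiply-add)
gpu_tflops = peak_fp32_tflops(16_896, 1.98, 2)    # ≈ 67 TFLOPS

# 16-core CPU at 5 GHz, AVX-512 FMA: 16 FP32 lanes * 2 ops = 32 ops/cycle
cpu_tflops = peak_fp32_tflops(16, 5.0, 32)        # = 2.56 TFLOPS

print(f"{gpu_tflops:.0f} vs {cpu_tflops:.2f} TFLOPS, ratio {gpu_tflops/cpu_tflops:.0f}x")
```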
CPU Cores Are Smarter
A CPU core is more capable. It can branch based on computed values (conditional logic), access memory in any pattern (caches absorb part of the cost), speculate past branches (predict the direction, discard work on a mispredict), and prefetch predictable memory access streams.
A CUDA core is simple. It executes one op, then waits for the next instruction. Complex logic, memory branching, and unpredictable access patterns slow GPUs down.
The Relevant Metric: FLOPs per Watt
H100: 67 TFLOPS peak at 700W ≈ 96 GFLOPS per watt. Intel Xeon CPU: 2.56 TFLOPS peak at 250W ≈ 10 GFLOPS per watt.
On peak specs the GPU is roughly 9x more efficient for dense matrix math, and the gap widens on sustained real workloads, because CPUs rarely keep their SIMD units fully fed.
For sparse operations, irregular access, or control flow, CPU efficiency improves because the GPU can't sustain parallelism.
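The efficiency comparison in code, using the peak-spec figures above (sustained efficiency on real workloads will differ):

```python
# Efficiency = throughput / power, using peak-spec numbers from this article.
gpu_gflops_per_watt = 67_000 / 700    # H100: ≈ 96 GFLOPS/W
cpu_gflops_per_watt = 2_560 / 250     # Xeon: ≈ 10 GFLOPS/W

print(f"GPU is ~{gpu_gflops_per_watt / cpu_gflops_per_watt:.0f}x more "
      f"power-efficient on peak dense-matrix throughput")
```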
Memory Bandwidth and Throughput
Memory bandwidth is the critical bottleneck in modern AI.
CPU Memory Bandwidth:
Intel Xeon Platinum (per socket of a 4-socket system): roughly 400 GB/s aggregate memory bandwidth, shared across 32 cores. Per-core bandwidth: 400 GB/s / 32 = 12.5 GB/s.
A core running at 5 GHz consuming one FP32 float per cycle needs 20 GB/s of memory bandwidth (each 32-bit float is 4 bytes; 5 GHz × 4 bytes = 20 GB/s). The CPU is memory-bound. Caches help, but not by much for streaming data that is touched once.
GPU Memory Bandwidth:
NVIDIA H100 HBM3: 3.35 TB/s (3,350 GB/s) shared across 16,896 CUDA cores. Per core, that is about 198 MB/s, less than the CPU's 12.5 GB/s, but the aggregate bandwidth is more than 8x a CPU socket's.
The GPU compensates with latency hiding: each streaming multiprocessor juggles thousands of threads, so while some warps wait on memory, others compute. In aggregate, the GPU doesn't stall waiting for memory.
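Dividing the aggregate bandwidth figures by core counts makes the trade-off explicit (a rough sketch; real per-core bandwidth depends on access patterns and caching):

```python
# Article's aggregate figures divided by core counts.
cpu_bw_gbs, cpu_cores = 400.0, 32
gpu_bw_gbs, gpu_cores = 3_350.0, 16_896

cpu_per_core = cpu_bw_gbs / cpu_cores            # 12.5 GB/s per core
gpu_per_core = gpu_bw_gbs / gpu_cores * 1_000    # ≈ 198 MB/s per core
aggregate_ratio = gpu_bw_gbs / cpu_bw_gbs        # ≈ 8.4x in GPU's favor

print(f"CPU {cpu_per_core:.1f} GB/s/core, GPU {gpu_per_core:.0f} MB/s/core, "
      f"aggregate {aggregate_ratio:.1f}x for GPU")
```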
Implication for Training:
Transformer training alternates compute-heavy and memory-heavy phases. The forward pass multiplies Q (queries) by K (keys), producing attention weights; this matrix multiply is dense (compute-heavy). The backward pass is memory-heavy (gradient updates scatter across model weights).
H100's 3.35 TB/s enables fast gradient updates without stalling. CPUs at 400 GB/s total (not per-core) can't sustain the memory traffic. Training on CPU is 50-100x slower than GPU.
Parallelism: The Core Advantage
GPUs exploit data parallelism. Process 10,000 examples simultaneously.
Batch Processing on GPU:
Forward pass (batch size 256):
- Input: 256 × 4,096 tokens = 1M tokens
- Model: 70B parameters
- Operation: matrix multiply (1M tokens × 70B params)
- GPU parallelism: the multiply is tiled across 16,896 cores, each executing hundreds of millions of multiply-accumulate operations
- Time: ~100 milliseconds
The same batch on CPU:
- 16 cores × 5 GHz: too few cores to spread the matrix rows across effectively
- Effective throughput: ~100x lower
- Time: ~10,000 milliseconds
This scaling is why GPUs are non-negotiable for training.
Inference Difference:
Single token inference (batch size 1):
- Input: 1 token
- Output: 1 next token
- Operation: 1 × 70B params = 70B multiply-accumulate operations
GPU: 70B ops / 67 TFLOPS ≈ 1.0 millisecond. CPU: 70B ops / 2.5 TFLOPS ≈ 28 milliseconds. (Compute-only estimates; real single-token decode is bounded by memory bandwidth, so actual latencies are higher on both.)
The inference gap is ~27x (not 100x) because batch size 1 doesn't parallelize well. But the GPU still dominates on latency and throughput.
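The same estimate as a small helper (compute-only; it ignores the memory-bandwidth bound that dominates real decoding):

```python
def token_ms(ops: float, tflops: float) -> float:
    """Milliseconds for one decode step, compute-bound estimate."""
    return ops / (tflops * 1e12) * 1e3

gpu_ms = token_ms(70e9, 67)    # ≈ 1.0 ms per token
cpu_ms = token_ms(70e9, 2.5)   # = 28 ms per token

print(f"GPU {gpu_ms:.2f} ms vs CPU {cpu_ms:.0f} ms, ~{cpu_ms/gpu_ms:.0f}x gap")
```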
Real Benchmark Data
H100 vs A100 Performance
The H100 delivers approximately 3.4x the FP32 performance of the A100:
- A100 80GB SXM: 19.5 TFLOPS FP32
- H100 80GB SXM: 67 TFLOPS FP32
- Performance ratio: 3.4x advantage to H100
Why the difference: Hopper architecture packs 132 streaming multiprocessors with 128 FP32 CUDA cores each (16,896 total). Ampere's A100 has 108 SMs × 64 cores = 6,912 FP32 cores.
Practical Inference Throughput Impact
Independent testing shows H100 delivers 1.5-2x the inference throughput of A100 on large NLP models:
- A100: ~130 tokens/sec for 13B-70B models
- H100: ~250-300 tokens/sec for same models
- Cost implication: H100 serves roughly 2x the request volume per GPU; at a ~1.7x higher hourly rate, that is a modest cost-per-token win and half the fleet size
Training Speed Benchmarks
Fine-tuning 7B model on 100K examples (as of March 2026):
- A100 GPU: 12-14 hours, $14-17 total
- CPU cluster (40-core Xeon): 800-1,000 hours, $400-500 total
- Speed ratio: 50-70x faster on GPU
Pre-training 70B model on 1 trillion tokens:
- 8x H100 cluster: ~1.1 days continuous, $568 total
- 128-core CPU cluster: ~267 days continuous, $12,840 total
- Speed ratio: 240x faster on GPU
Training Performance Comparison
Fine-Tuning a 7B Model (LoRA)
Hardware Setup:
- A100 GPU: 80GB VRAM, $1.19/hr (RunPod, as of March 2026)
- CPU (Xeon 40-core): 1TB RAM, $0.50/hr
Workload: Fine-tune Mistral 7B on 100K examples.
GPU (A100):
- Time: 12-14 hours
- Cost: $14-17
- Throughput: 7,000-8,000 examples/hour
CPU (40-core Xeon, no GPU):
- Time: 800-1,000 hours (estimate)
- Cost: $400-500
- Throughput: 100-125 examples/hour
GPU is 50-70x faster. Cost per fine-tuning job: $14-17 (GPU) vs $400-500 (CPU). CPU makes no sense here.
Pre-training Llama 2 70B
Hardware Setup:
- 8x H100 SXM cluster: $2.69/hr per GPU = $21.52/hr cluster (RunPod)
- 128x CPU cores (distributed): $2.00/hr
Workload: Pre-train 70B model from random weights on 1 trillion tokens.
GPU Cluster (8x H100):
- Training throughput: 1,350 samples/second per GPU × 8 = 10,800 samples/sec
- Time to 1T tokens: ~93,000 seconds = ~1.1 days continuous
- Cost: $21.52/hr × 26.4 hrs ≈ $568
CPU Cluster (128 cores):
- Training throughput: ~50 samples/second (distributed)
- Time to 1T tokens: ~23 million seconds = ~267 days continuous
- Cost: $2.00/hr × 6,420 hrs = ~$12,840
GPU is 240x faster. Cost: $568 (GPU) vs $12,840 (CPU). CPU is ~23x more expensive.
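The cost math in both scenarios reduces to hours times hourly rate:

```python
def job_cost(hours: float, rate_per_hr: float) -> float:
    """Total rental cost for a job of a given duration."""
    return hours * rate_per_hr

gpu_cost = job_cost(26.4, 21.52)   # 8x H100 cluster, ~1.1 days: ≈ $568
cpu_cost = job_cost(6_420, 2.00)   # 128-core cluster, ~267 days: $12,840

print(f"${gpu_cost:,.0f} vs ${cpu_cost:,.0f}, ratio {cpu_cost/gpu_cost:.0f}x")
```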
Inference Performance Comparison
Batch Inference (Processing 1M Documents)
Task: Summarize 1M customer documents (500 tokens each) = 500M tokens total.
GPU (H100 PCIe, $1.99/hr):
- Throughput: 850 tokens/sec (batch size 32)
- Time: 500M tokens / 850 tok/sec = 588,000 seconds = 163 hours
- Cost: $325
CPU (Xeon 40-core):
- Throughput: 15 tokens/sec (parallelized, vectorized)
- Time: 500M tokens / 15 tok/sec = 33M seconds = 383 days
- Cost: ~$8/hr × 24 hrs ≈ $191/day; $191/day × 383 days ≈ $73,000
GPU is ~57x faster (850 vs 15 tokens/sec) and 224x cheaper.
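A helper to reproduce the batch-job arithmetic (the ~$7.96/hr CPU rate is inferred from the $191/day figure above, an assumption rather than a quoted price):

```python
def batch_job(total_tokens: float, tok_per_sec: float, rate_per_hr: float):
    """Return (hours, dollars) to process a fixed token count."""
    hours = total_tokens / tok_per_sec / 3_600
    return hours, hours * rate_per_hr

gpu_hours, gpu_cost = batch_job(500e6, 850, 1.99)   # ≈ 163 h, ≈ $325
cpu_hours, cpu_cost = batch_job(500e6, 15, 7.96)    # ≈ 9,259 h, ≈ $73,700
```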
Real-Time Inference (Chat API)
Scenario: Serve ChatGPT-like chat to 1,000 concurrent users. Target: 100ms latency per response.
GPU Approach (1x H100):
- Batch 32 requests, infer in 50ms per batch
- Throughput: 640 requests/second
- Cost: $1.99/hr for 24/7 uptime = ~$48/day
CPU Approach (8-socket Xeon, 256 cores):
- Sequential inference: ~230ms per request
- Throughput: ~4 requests/second
- Latency: 230ms (violates 100ms target)
- Cost: ~$150/day for 24/7 uptime
GPU hits latency targets. CPUs don't. For real-time services, GPUs are essential.
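The serving arithmetic behind those numbers, as a sketch:

```python
# Batched GPU serving: 32 requests per 50 ms batch window.
batch_size, batch_ms = 32, 50
gpu_req_per_sec = batch_size / (batch_ms / 1_000)   # 640 req/s

# Sequential CPU serving: one 230 ms request at a time per pipeline.
cpu_req_per_sec = 1 / 0.230                         # ≈ 4.3 req/s

daily_gpu_cost = 1.99 * 24                          # ≈ $48/day for 24/7 uptime
print(f"{gpu_req_per_sec:.0f} vs {cpu_req_per_sec:.1f} req/s; "
      f"GPU ${daily_gpu_cost:.0f}/day")
```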
Cloud GPU Pricing
Based on DeployBase API data (March 21, 2026):
RunPod (Single GPU, most competitive):
| GPU Model | VRAM | Price/hr |
|---|---|---|
| RTX 3090 | 24GB | $0.22 |
| RTX 4090 | 24GB | $0.34 |
| L40 | 48GB | $0.69 |
| A100 PCIe | 80GB | $1.19 |
| H100 PCIe | 80GB | $1.99 |
| H100 SXM | 80GB | $2.69 |
| H200 | 141GB | $3.59 |
| B200 | 192GB | $5.98 |
Lambda (Premium tier, higher specs):
| GPU Model | VRAM | Price/hr |
|---|---|---|
| A100 PCIe | 40GB | $1.48 |
| H100 PCIe | 80GB | $2.86 |
| H100 SXM | 80GB | $3.78 |
| B200 SXM | 192GB | $6.08 |
Cost per training job scenarios:
Small model fine-tuning (7B, 100K examples):
- 1x A100: 12-14 hours × $1.19 = $14-17
- Cost per example: $0.00014
Large model pre-training (70B, 1T tokens):
- 8x H100 cluster: 26.4 hours × $21.52 = $568
- Cost per billion tokens: $0.57
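Turning the price table into per-job and per-unit costs (13 hours is roughly the midpoint of the fine-tuning estimate):

```python
# Mid-range fine-tuning estimate: 13 hours on a $1.19/hr A100.
finetune_cost = 13 * 1.19                 # ≈ $15.47
cost_per_example = finetune_cost / 100_000

# Pre-training: ~$568 for 1T tokens -> dollars per billion tokens.
cost_per_b_tokens = 568 / 1_000           # ≈ $0.57

print(f"${cost_per_example:.5f}/example, ${cost_per_b_tokens:.2f}/B tokens")
```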
Power Consumption and Efficiency
Power Draw (Idle vs Full Load):
- H100 GPU: 350W baseline, 700W peak
- CPU (Xeon 40-core): 100W baseline, 250W peak
GPU uses 2.8x more power at peak, but delivers 50-100x more throughput. Power efficiency (throughput per watt) heavily favors GPU.
Carbon Cost:
Training Llama 2 70B:
- GPU (8x H100, 1.1 days): ~200 kWh
- CPU (128 cores, 267 days): ~64,000 kWh
GPU uses 320x less energy. Environmental case for GPU is strong, despite higher per-hour consumption.
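A rough energy model assuming steady power draw (the ~10 kW figure for the CPU cluster is an assumption chosen to be consistent with the ~64,000 kWh above; the ~200 kWh GPU figure includes cooling overhead on top of the raw ~148 kWh):

```python
def kwh(watts: float, hours: float) -> float:
    """Energy in kilowatt-hours for a steady power draw."""
    return watts * hours / 1_000

gpu_kwh = kwh(8 * 700, 26.4)        # 8x H100 at peak: ≈ 148 kWh before cooling
cpu_kwh = kwh(10_000, 267 * 24)     # assumed ~10 kW cluster, 267 days

print(f"GPU ≈ {gpu_kwh:.0f} kWh, CPU ≈ {cpu_kwh:,.0f} kWh")
```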
Cooling and Infrastructure:
GPUs generate intense localized heat. Require high-performance cooling (liquid cooling for clusters). Data centers housing GPUs need 5-10x power delivery vs CPU-only.
CPUs produce less heat. Fit standard data center infrastructure.
This trades cost (GPU clusters need specialized infrastructure) for efficiency (GPU compute is orders of magnitude more efficient).
When CPUs Are Sufficient
Small Models (< 3B parameters):
BLOOM 560M, Phi-2 2.7B, Gemma 2B can run on CPU for inference at acceptable speeds (5-10 tokens/sec). Fine-tuning is slow, but inference works.
Batch Size 1 with Extreme Latency Tolerance:
If the application allows 1-2 second latency, CPU inference is viable for models under 13B. Example: offline document summarization.
Custom Operations / Control Flow Heavy Workloads:
If the AI workload is mostly preprocessing, feature engineering, and conditional logic (not matrix multiply), CPUs are competitive. Example: rule-based classification with learned embeddings.
Very High Throughput Batch Inference:
If processing 100M examples offline with a generous window (24-48 hours or more), cheap CPU capacity ($0.30/hr vs $1.99/hr) looks tempting. But a ~100x slowdown at only ~1/7 the hourly rate still costs more in total. CPU wins on cost only when its price advantage exceeds its slowdown, which in practice means small models where CPU is only a few times slower.
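The breakeven rule as a sketch (rates are the illustrative ones above; `cpu_cheaper` is a hypothetical helper, not a real API):

```python
def cpu_cheaper(gpu_rate: float, cpu_rate: float, cpu_slowdown: float) -> bool:
    """True if the CPU job costs less in total: cpu_rate * slowdown < gpu_rate."""
    return cpu_rate * cpu_slowdown < gpu_rate

print(cpu_cheaper(1.99, 0.30, 100))  # 100x slower at ~1/7 the rate: never pays
print(cpu_cheaper(1.99, 0.30, 5))    # a small model only ~5x slower can win
```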
Development / Research (Prototyping):
Use CPU for code development, debugging, and quick experiments. Graduate to GPU once model is production-ready.
When GPUs Are Essential
Training Larger Models (> 7B parameters):
Essential. CPU training is not viable. H100 cluster breaks even vs. CPU in hours.
Fine-Tuning at Scale:
Tuning thousands of models or frequent retraining requires GPU. Cost amortizes across scale.
Real-Time Inference (< 500ms latency):
GPU is the only option for serving large models to concurrent users.
Inference Throughput > 100 tokens/second:
Batch inference at scale (documents, logs, daily batch jobs). GPU delivers 50-100x cost advantage.
Model Sizes 13B-405B:
A100 and H100 are standard for inference. Alternatives (CPU, TPU) have niche cases only.
Interactive Applications:
Web chat, code completion, document analysis. GPU latency (< 100ms) enables responsive UX.
Cost Analysis: GPU vs CPU Rental
Small Model Fine-Tuning (7B, 100K examples):
- GPU (1x A100): $14-17 total
- CPU cluster: $400-500 total
- GPU advantage: 25-30x cheaper
Large Model Pretraining (70B, 1T tokens):
- GPU (8x H100): $568 total
- CPU cluster: $12,840 total
- GPU advantage: ~23x cheaper
Batch Inference (500M tokens, no latency SLA):
- GPU (1x H100): $325 total
- CPU cluster: $73,000 total
- GPU advantage: 224x cheaper
High-Frequency Inference (1B tokens monthly):
- GPU (1x H100): $19,900/month ($1.99/hr continuous)
- CPU cluster: ~8x more expensive, and unable to meet real-time latency targets
- GPU advantage: Enables real-time services that CPU cannot
Conclusion: GPU cost is lower across all timescales above toy models. CPUs only win if training models < 3B and accepting > 1-second inference latency. For production AI, GPUs are the only economically viable choice.
FAQ
Can I train models on CPU?
Theoretically yes. Practically no. A 7B model takes months on CPU. Costs $10,000+. Not viable for production. Only for research on toy models (< 100M parameters).
Why are GPU cores simpler than CPU cores?
Simplicity enables parallelism. A CPU die spends its transistor budget on a handful of complex cores, each carrying large caches, branch predictors, and out-of-order machinery; a GPU spends the same budget on thousands of minimal arithmetic units. 16,896 simple cores fit on one die, vs 16 complex CPU cores. The math: massive parallelism beats sequential speed for AI.
Do GPUs have enough memory?
H100 tops out at 80GB. A 70B parameter model quantized to 4-bit needs ~35GB. It fits. 405B models need 8x H100 (640GB aggregate). Memory is the growth bottleneck. Newer GPUs (H200 at 141GB, B200 at 192GB) address this.
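The footprint rule of thumb as code (weights only; serving also needs room for KV cache and activations):

```python
def weight_gb(params_billions: float, bits: int) -> float:
    """Model weight footprint in GB: params * bits / 8, excluding KV cache."""
    return params_billions * 1e9 * bits / 8 / 1e9

print(weight_gb(70, 4))    # 35.0  -> 4-bit 70B fits in one 80 GB H100
print(weight_gb(405, 8))   # 405.0 -> 8-bit 405B fits in 8x H100 (640 GB total)
print(weight_gb(405, 16))  # 810.0 -> FP16 405B does not fit in 640 GB
```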
Is GPU inference cheaper than CPU inference?
Yes, by 100-1000x on cost-per-token once model size > 13B. For smaller models (< 7B), CPU inference is marginally competitive if optimized (vectorization, quantization). But GPU still wins on latency.
What about TPUs?
Google's TPUs are optimized for dense matrix multiply (like GPUs). Tensor Processing Units are highly specialized. Less flexible than GPUs but very efficient for transformer training. Similar conclusions apply: TPU beats CPU by 50-100x. TPU vs GPU is a cost-per-training-job trade-off, not a CPU vs GPU question.
When will CPUs catch up to GPUs for AI?
CPU companies (Intel, AMD) are investing in AI extensions (AVX-512, AMX, VNNI). Throughput is increasing. But architectural constraints (latency-oriented design, small core count) make it unlikely CPUs will match GPU efficiency. CPUs might improve 2-3x, but GPUs will advance too. The gap persists for the next 3-5 years.