Deploy LLM Production: Complete Architecture and Optimization Guide
Deploying language models to production requires careful attention to architecture, monitoring, and cost optimization. This guide covers the production deployment pipeline, from choosing hardware through implementing auto-scaling, with a focus on vLLM infrastructure patterns that handle production traffic reliably and cost-effectively. As of March 2026, vLLM remains the dominant open-source serving framework, with production deployments ranging from small startups to major companies.
Architecture Overview: Request Flow and Scaling Tiers
A production LLM deployment consists of multiple layers working in concert:
- Load balancer: Routes requests across multiple vLLM instances
- vLLM instances: Serve inference requests with efficient batch processing
- Monitoring layer: Collects metrics on throughput, latency, errors
- Auto-scaling controller: Adjusts instance count based on load
For a moderate-scale deployment serving 1,000 requests/minute baseline traffic:
- Load balancer (stateless, replicable): 2x instances
- vLLM inference (GPU-intensive): 4-6x H100 instances
- Monitoring (lightweight): 1x instance
- Auto-scaling controller: 1x instance
This architecture handles typical traffic spikes (2-3x baseline) without dropping requests while avoiding excessive idle GPU cost during low-traffic periods.
Selecting GPU Hardware for Production Deployments
GPU selection involves tradeoffs among throughput per GPU, cost, and availability.
Llama 4 Scout (17B active) inference requirements:
- Model weights: 34GB (FP16)
- KV cache (batch 256): 102GB
- Total: 136GB required memory
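The memory arithmetic above can be checked in a few lines (the KV cache figure is the measured value quoted above, not derived here):

```python
# Memory footprint check for Llama 4 Scout inference with FP16 weights.
ACTIVE_PARAMS = 17e9    # 17B active parameters
BYTES_PER_PARAM = 2     # FP16

weights_gb = ACTIVE_PARAMS * BYTES_PER_PARAM / 1e9   # 34.0 GB
kv_cache_gb = 102                                    # measured at batch size 256
total_gb = weights_gb + kv_cache_gb                  # 136.0 GB
```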
Available GPU options:
- 2x H100 SXM (80GB each, 160GB total): $5.38/hour RunPod, $7.56/hour Lambda
  - Handles Llama 4 with batch size 256 (tensor parallel across 2 GPUs)
  - Achieves 3,000-3,200 tokens/second
  - Cost per 1K tokens: $0.000469 (RunPod)
- A100 80GB (SXM): $1.39/hour RunPod; fits the model weights but leaves limited KV cache headroom
  - Handles Llama 4 with batch size 32-64 (limited by KV cache)
  - Achieves 800-1,200 tokens/second
  - Cost per 1K tokens: $0.000290
- B200: $5.98/hour RunPod, $6.08/hour Lambda
  - Handles Llama 4 with batch size 256-320
  - Achieves 3,800-4,200 tokens/second
  - Cost per 1K tokens: $0.000357
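Each cost figure follows from the hourly price and sustained throughput; note that the quoted values work out to dollars per 1,000 tokens (a true per-token cost would be three orders of magnitude smaller). As a sanity check:

```python
def cost_per_1k_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Dollars per 1,000 generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / (tokens_per_hour / 1000)

# 2x H100 at $5.38/hour sustaining ~3,200 tokens/second:
cost_per_1k_tokens(5.38, 3200)  # ~ $0.000467 per 1K tokens
```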
For cost optimization, pair A100s with H100s: route short requests (under 200 output tokens) to A100 clusters, long-form generation to H100/B200 clusters. This mixed approach typically reduces overall cost-per-token by 25-35% while meeting latency targets.
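A minimal sketch of this routing rule (the pool names and parameters are illustrative, not part of any vLLM API):

```python
def route_pool(expected_output_tokens: int, threshold: int = 200) -> str:
    """Send short completions to the cheaper A100 pool, long-form to H100/B200."""
    return "a100" if expected_output_tokens < threshold else "h100"
```

In practice the expected output length comes from the request's max_tokens parameter or a historical per-endpoint estimate.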
vLLM Configuration for Production Workloads
Standard vLLM production configuration:
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-4-scout \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.85 \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code
Key parameters:
- tensor-parallel-size: Number of GPUs the model is sharded across per vLLM instance (2 matches the 2x H100 pairing above; Llama 4 Scout's 136GB footprint does not fit a single 80GB GPU)
- gpu-memory-utilization: Balance between throughput (higher values) and latency predictability (lower values)
- max-num-batched-tokens: Limits memory per batch; lower values improve latency for long sequences
- max-num-seqs: Maximum concurrent requests per instance
- enable-prefix-caching: Cache prompts across requests (saves compute for repeated prefixes)
- enable-chunked-prefill: Process prefill in chunks (improves latency variance)
For 1,000 requests/minute baseline traffic:
- Each H100 vLLM instance handles approximately 180-200 requests/minute
- Baseline deployment: 5-6 H100 instances
Peak traffic (3x baseline):
- Required: 15-18 H100 instances
- Auto-scaling policy: scale up when average latency exceeds 500ms, scale down when below 250ms
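The instance counts above follow directly from the per-instance throughput; a sketch (190 req/min is the midpoint of the 180-200 req/min per-H100 figure):

```python
import math

def required_instances(requests_per_minute: float,
                       per_instance_rpm: float = 190) -> int:
    """Instances needed to serve a given request rate."""
    return math.ceil(requests_per_minute / per_instance_rpm)

required_instances(1000)   # baseline -> 6
required_instances(3000)   # 3x peak -> 16
```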
Load Balancing and Request Routing Patterns
vLLM instances are stateless; requests route to any available instance. Implementation through reverse proxies (NGINX, HAProxy) or cloud load balancers (AWS ALB, GCP Cloud Load Balancer):
NGINX configuration:
upstream vllm_backend {
    least_conn;
    server vllm1:8000;
    server vllm2:8000;
    server vllm3:8000;
    server vllm4:8000;
}

server {
    listen 80;

    location /v1/completions {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_buffering off;
        proxy_request_buffering off;
    }
}
With least_conn enabled, the load balancer sends each new request to the instance with the fewest active connections. This outperforms NGINX's default round-robin distribution for LLM traffic, where request durations vary widely with output length.
Health checking is critical. vLLM instances become unresponsive if memory fragmentation occurs, even though the process remains running. Health checks should verify:
- Process is responding to API calls
- Average latency is below threshold (500ms for health, 1000ms indicates degradation)
- Memory utilization is stable
Remove instances exceeding latency thresholds from the load balancer immediately. A degraded instance slows the entire cluster more than its absence.
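The health policy above can be expressed as a small decision function (the status names are illustrative; the thresholds are the ones listed above):

```python
def instance_status(responding: bool, avg_latency_ms: float) -> str:
    """Decide what the load balancer should do with one instance."""
    if not responding:
        return "remove"      # dead or hung (e.g. memory fragmentation)
    if avg_latency_ms >= 1000:
        return "remove"      # degraded; slows the whole cluster
    if avg_latency_ms >= 500:
        return "degraded"    # above healthy threshold, hold new traffic
    return "healthy"
```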
Monitoring and Observability Implementation
Key metrics to track:
- Throughput: Tokens generated per second, requests per minute
- Latency: p50, p90, p99 latency percentiles
- GPU utilization: Percentage of GPU compute utilized
- Memory utilization: VRAM used vs. available
- Error rates: Failed requests, timeout counts
- Queue depth: Requests waiting for vLLM processing
Example metrics collection with Prometheus:
import time
from fastapi import FastAPI
from prometheus_client import Counter, Histogram, Gauge

app = FastAPI()
tokens_generated = Counter('vllm_tokens_generated', 'Total tokens generated')
request_latency = Histogram('vllm_request_latency_seconds', 'Request latency',
                            buckets=[0.1, 0.5, 1.0, 2.0])
gpu_memory = Gauge('vllm_gpu_memory_used', 'GPU memory used in bytes')

@app.post("/v1/completions")
def generate(request):
    # vllm_engine is assumed to be an already-initialized vLLM engine wrapper
    start = time.time()
    result = vllm_engine.generate(request.prompt)
    latency = time.time() - start
    tokens_generated.inc(len(result.tokens))
    request_latency.observe(latency)
    gpu_memory.set(vllm_engine.get_gpu_memory_usage())
    return result
Set up alerts:
- P99 latency > 1 second: Indicates overload or degradation
- Error rate > 1%: Something is failing, investigate immediately
- GPU memory utilization > 90%: Risk of OOM errors
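These alert rules translate to a simple evaluation function (in production they would live in Prometheus alerting rules; this is an illustrative sketch):

```python
def evaluate_alerts(p99_latency_s: float, error_rate: float,
                    gpu_mem_util: float) -> list[str]:
    """Return the names of the thresholds breached, per the rules above."""
    alerts = []
    if p99_latency_s > 1.0:
        alerts.append("p99_latency")   # overload or degradation
    if error_rate > 0.01:
        alerts.append("error_rate")    # >1% failures
    if gpu_mem_util > 0.90:
        alerts.append("gpu_memory")    # OOM risk
    return alerts
```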
Dashboards should display current load (requests/minute), latency distribution, instance health status, and cost metrics (GPU hours used, cost-per-1M-tokens).
Auto-Scaling Implementation Strategies
Kubernetes-based auto-scaling using Horizontal Pod Autoscaler (HPA):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 5
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: vllm_request_latency_seconds
      target:
        type: AverageValue
        averageValue: "0.5"
This configuration:
- Maintains 5-20 vLLM instances
- Scales up when average per-pod latency exceeds 500ms or CPU utilization exceeds 70%
- Scales down when both metrics fall back below their targets
- Dampens rapid scaling cycles through the HPA stabilization window (scale-down default is 5 minutes)
For non-Kubernetes deployments (standalone cloud instances), implement scaling through API calls to cloud providers.
Scaling strategy matters. Aggressive scaling (adding instances at smallest load increase) prevents latency spikes but wastes GPU cost during traffic volatility. Conservative scaling (waiting for sustained high load) reduces cost but risks timeout errors. Optimal strategy typically uses 60-90 second smoothing windows: scale only if high latency persists for 60+ seconds.
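The 60-second smoothing rule can be sketched as a small stateful check (the provider API call that actually adds an instance is omitted):

```python
class SmoothedScaler:
    """Signal scale-up only when latency stays above threshold for a full window."""
    def __init__(self, threshold_ms: float = 500, window_s: float = 60):
        self.threshold_ms = threshold_ms
        self.window_s = window_s
        self._breach_start = None

    def observe(self, latency_ms: float, now: float) -> bool:
        """Feed one latency sample; returns True when scale-up should trigger."""
        if latency_ms <= self.threshold_ms:
            self._breach_start = None       # breach ended, reset the window
            return False
        if self._breach_start is None:
            self._breach_start = now        # breach begins
        return (now - self._breach_start) >= self.window_s
```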
Cost Optimization at Production Scale
For production deployments processing billions of tokens monthly, cost optimization becomes critical.
Baseline cost (serving 10B tokens/month):
- H100-only approach: $5,000-6,000/month GPU cost
- Mixed A100/H100: $2,500-3,200/month GPU cost
- Spot instances (cloud provider temporary capacity at 60-70% discount): $1,500-2,100/month
Implementing spot instance strategy:
- 70% of capacity on spot instances (interruption risk acceptable for non-critical workloads)
- 30% on reserved instances (guaranteed capacity for critical customers)
- When spot instance is interrupted, route traffic to reserved instances, scale up additional reserved capacity
This mixed approach reduces cost by 40-50% while maintaining SLA compliance. The downside: infrastructure complexity increases substantially.
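The 40-50% figure is easy to reproduce from the mix above (assuming a 65% spot discount, the midpoint of the 60-70% range):

```python
on_demand_hourly = 5.38        # 2x H100 price from above
spot_discount = 0.65           # assumed midpoint of the 60-70% range
spot_share, reserved_share = 0.70, 0.30

blended_hourly = (spot_share * on_demand_hourly * (1 - spot_discount)
                  + reserved_share * on_demand_hourly)
savings = 1 - blended_hourly / on_demand_hourly   # ~0.455, i.e. ~45% cheaper
```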
Additional optimizations:
- Quantization: Running Llama 4 in int4 format reduces memory by 75%, enabling batch size 512+ on A100, increasing cost-per-token efficiency by 40%
- Model caching: Cache generated outputs for repeated queries (7-12% of typical traffic is repeated)
- Request batching: Accumulate requests for 50-100ms before processing (improves batching efficiency by 15-20%)
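The 50-100ms accumulation window can be sketched as a micro-batcher (timestamps in milliseconds; a production version would run on an async event loop):

```python
class MicroBatcher:
    """Hold requests until a size cap or a wait deadline, then release the batch."""
    def __init__(self, max_wait_ms: int = 100, max_size: int = 32):
        self.max_wait_ms = max_wait_ms
        self.max_size = max_size
        self._pending = []
        self._first_ts = None

    def add(self, request, now_ms: int):
        """Add a request; returns the batch to dispatch, or None to keep waiting."""
        if self._first_ts is None:
            self._first_ts = now_ms
        self._pending.append(request)
        full = len(self._pending) >= self.max_size
        expired = now_ms - self._first_ts >= self.max_wait_ms
        if full or expired:
            batch, self._pending, self._first_ts = self._pending, [], None
            return batch
        return None
```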
Deployment Checklist and Pre-Production Validation
Before deploying to production:
- Load test: Generate 2-3x expected peak traffic, verify all instances remain healthy
- Failover test: Kill instances during peak traffic, verify load balancer reroutes correctly
- Monitoring validation: Confirm all metrics are collected and alerting works
- Cost tracking: Set up budget alerts at $100 overage threshold
- Documentation: Document instance configuration, scaling policies, and rollback procedures
Expected cost for 1,000 requests/minute deployment (16-hour active window daily):
- H100 instances (6 baseline, 3 reserved peak): $350-450/month
- Data transfer: $50-100/month
- Monitoring/observability: $30-50/month
- Total: $430-600/month
For comparison, using managed services (AWS Bedrock, Google Vertex AI):
- Variable pricing based on requests
- Typical cost: $800-1,200/month for equivalent throughput
- Trade-off: Reduced operational overhead vs. 50-100% cost premium
Advanced Configurations for Specific Workloads
Long-Context Inference (4K-8K tokens)
Long context requires more KV cache memory than short-context inference. Adjust configurations:
- Lower max-num-seqs to 64 (was 256)
- Reduce gpu-memory-utilization to 0.75 (was 0.85)
- Expect 30-40% throughput reduction
- Consider A100 clusters for cost-efficiency on long sequences
Multi-turn Conversation (stateful sessions)
Conversation history accumulates. Optimize for repeated context:
- Enable prefix caching aggressively
- Use persistent session pools
- Cache conversation prefixes across requests
- Expect 20-30% reduction in compute per turn after first response
Real-time Streaming Applications
Token-by-token streaming requires chunked processing:
- Enable chunked prefill
- Reduce max-num-batched-tokens to 4096
- Implement streaming response handlers
- Monitor token-level latencies, not just request latency
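Token-level latency monitoring reduces to tracking per-token arrival times; a minimal sketch (timestamps in milliseconds, relative to request start):

```python
def token_latency_stats(arrival_ms: list[int]) -> tuple[int, int]:
    """Return (time-to-first-token, worst inter-token gap) for one streamed response."""
    ttft = arrival_ms[0]
    gaps = [b - a for a, b in zip(arrival_ms, arrival_ms[1:])]
    return ttft, max(gaps, default=0)
```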
Disaster Recovery and Failover Scenarios
Instance Failure Recovery
vLLM instances become unresponsive when GPU memory fragmentation causes OOM. Implement:
- Health checks every 10 seconds
- Automatic removal from load balancer on health failure
- Instance replacement triggers new startup
- Checkpoint recovery from persistent storage
Load Balancer Failure
If load balancer fails, traffic stops immediately. Mitigate:
- Use cloud provider load balancers with automatic failover (AWS ALB, GCP Cloud LB)
- Implement backup load balancer in secondary region
- DNS failover to alternate regions
- Test failover quarterly
Regional Outage
Prepare for entire region going down:
- Maintain read replicas in secondary regions
- Traffic automatically routes to backup region
- Latency impact acceptable for most applications
- Cost is 50% premium for regional redundancy
Benchmarking the Deployment
Before going live, establish baseline metrics:
- Throughput Test: Run 100 concurrent requests, measure tokens/second. Aim for 3K+ tokens/second on H100.
- Latency Test: Measure p50, p90, p99 latencies under load. P99 should be under 800ms for typical requests.
- Failover Test: Kill an instance mid-request, verify graceful handling. Expect <2 second interruption.
- Scaling Test: Simulate 3x traffic growth, confirm auto-scaling works. Scaling should trigger within 30 seconds.
- Cost Test: Run 24 hours, measure actual GPU hours consumed. Compare to projections.
- Memory Test: Monitor GPU memory fragmentation over 8+ hours. Ensure no gradual degradation.
- Error Recovery Test: Inject errors (request timeouts, model load failures), verify graceful degradation.
Expected results for properly configured H100 setup:
- Throughput: 3,000-3,500 tokens/second sustained
- P99 latency: 500-800ms for 1K input tokens
- P99 latency: 1,200-1,500ms for 4K input tokens
- Failover recovery: <2 second request interruption
- Scaling time: 2-4 minutes from trigger to ready
- Cost: $3-4 per million tokens (including scaling overhead)
- Memory utilization: 70-85% sustained (avoid >90%)
- Error rate: <0.1% (aim for near-zero)
Performance below these benchmarks indicates configuration problems. Investigate before going live.
Production Checklist Before Deployment
Pre-Launch Validation:
- Load test at 2x expected peak traffic (don't just assume scaling works)
- Verify monitoring dashboards display all metrics in real-time
- Confirm alerting works (test by triggering alert manually)
- Validate cost tracking and budget alerts (set at 80% of daily budget)
- Document runbooks for common issues (stuck instance, memory leak, etc)
- Create on-call rotation and escalation path (who gets paged when things break)
- Test rollback procedure for new model versions (practice before production)
- Verify data backup procedures work (actually restore from backup to confirm)
- Set up customer status page (communicate outages proactively)
- Document the SLA commitments you're making (write them down before launch)
Day 1 Monitoring:
- P99 latency trending and stable (expect some variance first day)
- GPU utilization between 60-80% (not spiking to 95%+)
- Zero error rate (aim for perfection on day 1; any errors get investigated)
- Cost tracking matches projections (if off by >20%, investigate why)
- No unexpected scaling events (scaling should match traffic patterns)
- Customer-reported issues track closely to metrics (gut check reality vs dashboards)
- Instance health checks passing consistently (all instances responsive)
Week 1 Validation:
- Latency stable across traffic variations (patterns repeat, not anomalies)
- Scaling events correlate with traffic changes (no phantom scaling)
- No provider issues (GPU node failures, network problems)
- Cost trending within 10% of projections (if higher, optimize immediately)
- Customer satisfaction metrics acceptable (support feedback positive)
- Model output quality consistent (sample outputs spot-check)
- No gradual performance degradation (memory leaks, etc)
Cost Optimization at Different Scales
For 1B tokens/month:
- Use 1-2 A100 instances on-demand
- Cost: $350-500/month
- Simple auto-scaling not critical
For 10B tokens/month:
- Use mixed A100/H100 infrastructure
- Cost: $2,500-3,500/month
- Implement basic auto-scaling
- Consider 30% spot capacity
For 100B+ tokens/month:
- Use dedicated infrastructure with reserved instances
- Cost: $15,000-25,000/month
- Optimize through quantization and batching
- Negotiate volume pricing with providers
Recommendation: Starting the Production Deployment
Deploy through RunPod or Lambda for initial testing: reserve 2x H100 instances to establish baseline performance metrics on your actual traffic patterns. Measure achieved throughput, latency distribution, and GPU utilization.
After 1-2 weeks of data collection, make permanent infrastructure decisions:
- If peak latency consistently under 500ms on 2 H100s, that's the baseline sizing
- If GPU memory utilization averages 70%+, consider mixing in A100s for cost optimization
- If error rates exceed 0.1%, investigate vLLM configuration or load balancer health
Most teams find that 50% larger baseline capacity than theoretical minimum provides better economics through reduced scaling frequency and improved latency percentiles. The additional GPU cost is offset by fewer timeouts and better user experience.
Establish feedback loops: measure actual performance against projections, adjust infrastructure based on real data. This empirical approach prevents both over-provisioning (wasting cost) and under-provisioning (creating reliability issues).
FAQ
Q: What's the minimum GPU setup for production? A: Two H100 instances minimum for redundancy. Single GPU deployments lack failover capability. Expect 2-3K tokens/second throughput per H100.
Q: How do I handle model updates without downtime? A: Use load balancer to drain traffic from old instances, update model, spin up new instances, verify health, shift traffic. Takes 5-10 minutes total.
Q: What's a reasonable uptime target for LLM inference? A: Aim for 99.5% minimum. That's 36 minutes downtime monthly. Achieve it through redundancy, monitoring, and auto-recovery.
Q: How do I reduce latency below 500ms? A: Use smaller models, quantization (int4/int8), and geographic load balancing. Also use edge processing where possible rather than central data centers.
Q: Should I deploy on Kubernetes or cloud VMs? A: Kubernetes is better for complex multi-region deployments. Cloud VMs are simpler for single-region setups. Start with cloud VMs, migrate to Kubernetes if scaling requires it.
Q: How much spare capacity should I maintain? A: Maintain 30-50% spare capacity for traffic spikes. This prevents immediate scaling and improves latency. It costs money but prevents timeout errors.
Q: What's the difference between batching and prefill optimization? A: Batching combines multiple requests into single GPU execution. Prefill optimization processes prompt input faster. Both matter. Prefill optimization reduces time-to-first-token latency.
Related Resources
- vLLM Official Documentation (external)
- RunPod GPU Infrastructure
- Lambda Labs Professional Inference
- H100 Pricing Comparison
- LLM API Pricing Comparison
- OpenAI API Pricing Reference
- Anthropic API Pricing Reference
Sources
- vLLM documentation and best practices (March 2026)
- Production deployment case studies from DeployBase
- GPU provider infrastructure documentation (March 2026)
- Real-world latency and throughput measurements
- Cost tracking data from deployed inference systems