FAQ
Should we use Docker, Kubernetes, or bare-metal? Start with Docker for development. Use Kubernetes at 10+ concurrent users or 2+ machines. Bare-metal for cost-sensitive production (100+ concurrent users).
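For the Docker starting point, a minimal Compose sketch of a GPU-backed inference service (the service name is illustrative, and it assumes the NVIDIA Container Toolkit is installed on the host):

```yaml
# docker-compose.yml -- illustrative sketch, not a tuned production config
services:
  llm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1            # reserve one GPU for this container
              capabilities: [gpu]
```

Run with `docker compose up`; the same image can later be reused unchanged in a Kubernetes manifest, which keeps the migration path simple.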
How do we handle GPU failover? Kubernetes automatically restarts failed pods on healthy nodes. Bare-metal requires manual intervention or commercial clustering software.
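The Kubernetes failover behavior comes from running the server under a Deployment: if a pod crashes or its node fails, the controller recreates the pod on a healthy node. A minimal sketch (names are hypothetical; GPU scheduling assumes the NVIDIA device plugin is installed in the cluster):

```yaml
# Illustrative Deployment -- failed pods are automatically rescheduled
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server          # hypothetical name
spec:
  replicas: 2               # a second replica keeps serving during failover
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: 1   # exposed by the NVIDIA device plugin
```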
Can we mix GPU types in Kubernetes? Yes, with node labels. Schedule specific workloads to H100 vs. A100 nodes based on performance needs.
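Mixing GPU types might look like this: label each node with its GPU type, then pin workloads with a `nodeSelector`. The label key/value and node names below are illustrative:

```yaml
# Label nodes first, e.g.:
#   kubectl label node gpu-node-1 gpu.type=h100
#   kubectl label node gpu-node-2 gpu.type=a100
apiVersion: v1
kind: Pod
metadata:
  name: latency-critical-llm    # hypothetical name
spec:
  nodeSelector:
    gpu.type: h100              # only schedules onto H100-labeled nodes
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```

Latency-critical serving can select the H100 pool this way while batch or background workloads select the A100 pool.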
What's the minimum Kubernetes cluster size? 3 nodes recommended for production (fault tolerance). 1-2 nodes work for development.
How do we reduce model loading time? Use quantization (INT8/INT4), enable FlashAttention, and load weights from faster storage (local NVMe SSD rather than NFS).
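Quantization cuts weight memory roughly in proportion to bits per parameter, which also shrinks how much data must be read from storage at load time. A back-of-envelope sketch (weights only; KV cache and activations add more):

```python
# Rough GPU memory needed for model weights alone, by precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params_billions * BYTES_PER_PARAM[precision]

for p in ("fp16", "int8", "int4"):
    print(f"7B model @ {p}: ~{weight_memory_gb(7, p):.1f} GB")
# A 7B model drops from ~14 GB at FP16 to ~3.5 GB at INT4.
```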
Is self-hosting cheaper than APIs long-term? At 100+ requests/second sustained, self-hosting typically saves 50-70% vs. API pricing. Below that threshold, APIs are more cost-effective.
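A rough break-even sketch for the cost question. The API price, server cost, and request size below are illustrative assumptions, and self-host cost scales with server count once one machine can no longer sustain the load:

```python
# All constants are assumed example figures, not quoted prices.
API_PRICE_PER_1K_TOKENS = 0.002   # assumed blended API price, USD
SERVER_COST_PER_HOUR = 8.0        # assumed multi-GPU server cost, USD
TOKENS_PER_REQUEST = 500          # assumed average tokens per request

def monthly_api_cost(requests_per_second: float) -> float:
    """API spend for a sustained request rate over a 30-day month."""
    tokens = requests_per_second * TOKENS_PER_REQUEST * 3600 * 24 * 30
    return tokens / 1000 * API_PRICE_PER_1K_TOKENS

def monthly_selfhost_cost(servers: int = 1) -> float:
    """Fixed server cost over a 30-day month."""
    return SERVER_COST_PER_HOUR * 24 * 30 * servers

for rps in (1, 10, 100):
    print(f"{rps:>3} req/s: API ${monthly_api_cost(rps):,.0f}/mo "
          f"vs self-host ${monthly_selfhost_cost():,.0f}/mo per server")
```

With these assumed numbers the API wins at low, bursty traffic, while sustained high throughput favors self-hosting, which matches the rule of thumb above.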
Related Resources
- GPU Memory Requirements Guide
- Best Model Serving Platforms
- Fine-Tune LLM for Chatbot
- Lambda GPU Pricing
- RunPod GPU Pricing