FAQ
Should we use Docker, Kubernetes, or bare-metal? Start with Docker for development. Use Kubernetes at 10+ concurrent users or 2+ machines. Bare-metal for cost-sensitive production (100+ concurrent users).
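For the Docker starting point, a minimal Compose sketch of a GPU-backed inference service (the service name is illustrative, and it assumes the NVIDIA Container Toolkit is installed on the host):

```yaml
# docker-compose.yml -- illustrative sketch, not a tuned production config
services:
  llm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1            # reserve one GPU for this container
              capabilities: [gpu]
```

Run with `docker compose up`; the same image can later be reused unchanged in a Kubernetes manifest, which keeps the migration path simple.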
How do we handle GPU failover? Kubernetes automatically restarts failed pods on healthy nodes. Bare-metal requires manual intervention or commercial clustering software.
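The Kubernetes failover behavior comes from running the server under a Deployment: if a pod crashes or its node fails, the controller recreates the pod on a healthy node. A minimal sketch (names are hypothetical; GPU scheduling assumes the NVIDIA device plugin is installed in the cluster):

```yaml
# Illustrative Deployment -- failed pods are automatically rescheduled
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-server          # hypothetical name
spec:
  replicas: 2               # a second replica keeps serving during failover
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: 1   # exposed by the NVIDIA device plugin
```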
Can we mix GPU types in Kubernetes? Yes, with node labels. Schedule specific workloads to H100 vs. A100 nodes based on performance needs.
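Mixing GPU types might look like this: label each node with its GPU type, then pin workloads with a `nodeSelector`. The label key/value and node names below are illustrative:

```yaml
# Label nodes first, e.g.:
#   kubectl label node gpu-node-1 gpu.type=h100
#   kubectl label node gpu-node-2 gpu.type=a100
apiVersion: v1
kind: Pod
metadata:
  name: latency-critical-llm    # hypothetical name
spec:
  nodeSelector:
    gpu.type: h100              # only schedules onto H100-labeled nodes
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1
```

Latency-critical serving can select the H100 pool this way while batch or background workloads select the A100 pool.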
What's the minimum Kubernetes cluster size? 3 nodes recommended for production (fault tolerance). 1-2 nodes work for development.
How do we reduce model loading time? Use quantization (INT8/INT4), enable FlashAttention, and load weights from faster storage (local NVMe SSD rather than NFS).
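Quantization cuts weight memory roughly in proportion to bits per parameter, which also shrinks how much data must be read from storage at load time. A back-of-envelope sketch (weights only; KV cache and activations add more):

```python
# Rough GPU memory needed for model weights alone, by precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params_billions * BYTES_PER_PARAM[precision]

for p in ("fp16", "int8", "int4"):
    print(f"7B model @ {p}: ~{weight_memory_gb(7, p):.1f} GB")
# A 7B model drops from ~14 GB at FP16 to ~3.5 GB at INT4.
```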
Is self-hosting cheaper than APIs long-term? At 100+ requests/second sustained, self-hosting typically saves 50-70% vs. API pricing. Below that threshold, APIs are more cost-effective.
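A rough break-even sketch for the cost question. The API price, server cost, and request size below are illustrative assumptions, and self-host cost scales with server count once one machine can no longer sustain the load:

```python
# All constants are assumed example figures, not quoted prices.
API_PRICE_PER_1K_TOKENS = 0.002   # assumed blended API price, USD
SERVER_COST_PER_HOUR = 8.0        # assumed multi-GPU server cost, USD
TOKENS_PER_REQUEST = 500          # assumed average tokens per request

def monthly_api_cost(requests_per_second: float) -> float:
    """API spend for a sustained request rate over a 30-day month."""
    tokens = requests_per_second * TOKENS_PER_REQUEST * 3600 * 24 * 30
    return tokens / 1000 * API_PRICE_PER_1K_TOKENS

def monthly_selfhost_cost(servers: int = 1) -> float:
    """Fixed server cost over a 30-day month."""
    return SERVER_COST_PER_HOUR * 24 * 30 * servers

for rps in (1, 10, 100):
    print(f"{rps:>3} req/s: API ${monthly_api_cost(rps):,.0f}/mo "
          f"vs self-host ${monthly_selfhost_cost():,.0f}/mo per server")
```

With these assumed numbers the API wins at low, bursty traffic, while sustained high throughput favors self-hosting, which matches the rule of thumb above.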
Related Resources
- GPU Memory Requirements Guide
- Best Model Serving Platforms
- Fine-Tune LLM for Chatbot
- Lambda GPU Pricing
- RunPod GPU Pricing