Contents
- What is Llama.cpp?
- What is vLLM?
- Architectural Philosophies
- Hardware Requirements Comparison
- Quantization Support
- Throughput vs Latency Tradeoff
- When to Choose Llama.cpp
- When to Choose vLLM
- Integration Comparison
- Quantization Quality Impact
- Practical Performance Benchmarks
- Memory Requirements Deep Dive
- Production Deployment Considerations
- Hybrid Approach
- Advanced Deployment Patterns
- Quantization Quality Tradeoff Analysis
- Monitoring and Performance Profiling
- Framework Integration and Ecosystem
- Production Readiness Checklist
- Final Thoughts
Run it on a laptop, or run it on a GPU cluster.
llama.cpp: single request, minimal dependencies, zero upfront cost.
vLLM: concurrent users, high throughput, needs a GPU cluster.
What is Llama.cpp?
Llama.cpp is a C++ inference engine originally built to run Llama models on consumer hardware (laptops, desktops, edge devices). It implements a complete inference pipeline optimized for CPU execution, though it also supports GPU acceleration via Metal (Apple), CUDA (NVIDIA), and OpenCL. Created by Georgi Gerganov, it quickly became the most popular tool for running LLMs locally.
Llama.cpp uses quantization extensively. It converts model weights from float32 (4 bytes per weight) to int4 or int8 (0.5-1 byte per weight), reducing memory requirements 4-8x. This makes it possible to run 7B-parameter models on machines with as little as 6-8GB of RAM (with some quality tradeoffs).
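The memory arithmetic is worth sketching directly. This computes the size of the weights alone, ignoring the KV cache and runtime buffers a real deployment also needs:

```python
def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone (no KV cache, no buffers)."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B-parameter model at different precisions
# (prints 28.0, 14.0, 7.0, and 3.5 GB respectively):
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weights_size_gb(7e9, bits):.1f} GB")
```

The int4 figure (3.5GB) is why a 7B model fits comfortably on an ordinary laptop.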
Key characteristics:
- Minimal dependencies (C++ core, no Python runtime)
- CPU-first design (runs well on consumer CPUs)
- Extreme quantization (int4 models run on 6-8GB RAM)
- Single user/single request (not designed for concurrent requests)
- Very low latency (100-500ms for short prompts on decent hardware)
What is vLLM?
vLLM is a Python inference framework built by UC Berkeley researchers, focusing on serving models at high throughput. vLLM implements sophisticated batching and scheduling algorithms to maximize GPU utilization, serving hundreds of concurrent requests from a single server.
vLLM assumes GPU infrastructure (NVIDIA/AMD GPUs) and doesn't optimize for CPU. The trade-off: vLLM requires powerful hardware but extracts maximum performance from that hardware.
Key characteristics:
- Multi-user serving (batches requests to maximize throughput)
- GPU-centric (designed for NVIDIA A100, H100, etc.)
- Advanced scheduling (paged attention algorithm for efficient memory)
- High throughput (thousands of tokens/second per GPU)
- Medium latency (200-1000ms depending on queue)
- Production-ready (handles real serving requirements)
Architectural Philosophies
The tools solve different problems due to different design assumptions.
Llama.cpp assumes: Developers have a single machine (laptop, desktop, or edge device), developers want to run inference locally without external dependencies, and developers need low latency for single requests.
vLLM assumes: Developers have GPU infrastructure, multiple users/clients will send requests, and developers want to maximize resource utilization by batching requests.
These lead to completely different implementation choices. Llama.cpp uses aggressive quantization because memory is scarce. vLLM keeps models in full precision (or lower precision but not extreme quantization) because GPUs have abundant memory and quantization impacts throughput.
Llama.cpp provides simple APIs: load model, run inference, get response. vLLM exposes batching details: clients send requests asynchronously, vLLM batches them for efficiency.
Hardware Requirements Comparison
Llama.cpp Requirements (for running Llama 7B model):
- CPU: Modern CPU (6+ cores) sufficient
- RAM: 8-16GB (with quantization)
- GPU: Optional, improves speed 3-5x
- Network: Not required (fully local)
A laptop with 16GB RAM and no GPU can run Llama 7B reasonably: ~100ms per token.
vLLM Requirements (for running Llama 7B model):
- GPU: NVIDIA A100 40GB or better (required)
- GPU RAM: 20-40GB for 7B model
- CPU: 8+ cores, 64GB+ RAM (much more than needed for single model)
- Network: Required (typically deployed as service)
vLLM won't run efficiently on CPU. The investment in GPU infrastructure is non-negotiable.
Cost implications:
- Llama.cpp: Run on existing hardware (laptop, edge device, or cheap server)
- vLLM: Requires GPU (A100 $1.19/hour on RunPod)
For a simple use case (serving 10 requests daily), llama.cpp on existing hardware costs $0. A dedicated A100 at $1.19/hour runs about $28.56/day, or roughly $860/month, even with minimal traffic.
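The cost comparison is simple arithmetic (the hourly rate is the example figure above; actual cloud prices vary by provider and over time):

```python
HOURLY_RATE = 1.19  # $/hour for a dedicated A100, per the example above

daily = HOURLY_RATE * 24
monthly = daily * 30

print(f"${daily:.2f}/day, ${monthly:.2f}/month")
# A GPU billed around the clock costs the same whether it serves
# 10 requests or 10 million — utilization drives cost per request.
```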
Real-world performance expectations:
- Llama.cpp on MacBook M1: ~80ms per token (16GB unified memory)
- Llama.cpp on RTX 3090: ~25ms per token (24GB VRAM)
- Llama.cpp on RTX 4090: ~15ms per token (24GB VRAM)
- vLLM on A100 40GB: ~5ms per token (batching multiple requests)
Per-token latency improves dramatically with hardware investment, but llama.cpp on consumer hardware is remarkably usable.
Ecosystem and tooling:
- Llama.cpp: Simple, minimal dependencies, works everywhere (good for portability)
- vLLM: Complex, many dependencies, requires specific hardware (not portable)
For teams building production systems, vLLM's complexity is acceptable given its power. For teams building hobby projects or privacy-sensitive applications, llama.cpp's simplicity is a huge advantage.
Quantization Support
Both tools support quantization but differently.
Llama.cpp quantization:
- Int4 (4-bit): 7B model = 3.5GB, reasonable quality
- Int5 (5-bit): 7B model = 4.4GB, slightly better quality
- Int8 (8-bit): 7B model = 7GB, excellent quality
- Float16/float32: supported but rarely used (too large for memory-constrained targets)
The extreme quantization in llama.cpp (int4) is necessary for fitting models in CPU RAM. A 7B int4 model fits in 4GB, making it practical for resource-constrained devices.
vLLM quantization:
- Float16 (preferred, 2x memory saving vs float32)
- Int8 quantization (optional, 4x memory saving but impacts quality)
- No extreme quantization (float16 sufficient with GPUs)
vLLM generally avoids extreme quantization because GPUs have abundant memory. A 7B model in float16 uses 14GB. An A100 40GB can run two copies simultaneously, or mix one copy with LoRA adapters.
Throughput vs Latency Tradeoff
The tools optimize different metrics.
Llama.cpp optimizes latency:
- Single-user, single-request focus
- ~100-500ms time-to-first-token (latency)
- ~50-100ms per subsequent token (throughput)
- Throughput is secondary
Answering "What is the capital of France?" takes a few hundred milliseconds total (prompt processing plus a short response). This is excellent latency for interactive applications.
vLLM optimizes throughput:
- Batches multiple requests
- ~500ms-2s time-to-first-token per request (latency increases when batching)
- ~10,000-50,000 tokens/second total throughput (many requests in parallel)
- Latency is secondary
If 100 requests arrive simultaneously, vLLM batches them and processes them together. Compared to serving them one at a time (where the last request waits behind the other 99), batching keeps every request's latency close to single-request latency while achieving roughly 10x higher total throughput.
Implication: Use llama.cpp for interactive applications prioritizing responsiveness. Use vLLM for high-volume serving where aggregate throughput matters more than individual latency.
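A toy model makes the tradeoff concrete (illustrative numbers only; it assumes batch processing time grows only mildly with batch size, which is roughly true while the GPU is underutilized):

```python
# Toy model: serial processing vs. one big batch.
SERIAL_TIME_S = 0.3   # one request served alone
BATCH_TIME_S = 0.5    # 100 requests processed together (illustrative)
N_REQUESTS = 100

# Serial: request i waits for every request before it.
serial_worst_case = N_REQUESTS * SERIAL_TIME_S      # last user waits 30 s
serial_throughput = N_REQUESTS / serial_worst_case  # ~3.3 req/s

# Batched: every request finishes when the batch does.
batched_worst_case = BATCH_TIME_S                   # every user waits ~0.5 s
batched_throughput = N_REQUESTS / BATCH_TIME_S      # 200 req/s

print(f"serial:  worst-case {serial_worst_case:.1f}s, {serial_throughput:.1f} req/s")
print(f"batched: worst-case {batched_worst_case:.1f}s, {batched_throughput:.1f} req/s")
```

Each batched request is slightly slower than a lone request, but aggregate throughput is dramatically higher.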
When to Choose Llama.cpp
Choose llama.cpp when:
- Device constraints: Running on edge devices (phones, embedded systems, IoT)
- Privacy required: Processing sensitive data locally, avoiding cloud transfer
- No GPU available: Hardware lacks GPUs, or GPUs expensive
- Single-user application: Personal assistant, local tool, offline use
- Development environment: Testing locally before deploying to server
- Simple deployment: One command to run, no server infrastructure
Examples:
- Privacy-focused note-taking app with on-device summarization
- Offline chatbot running on laptop
- Local code completion tool for developers
- Edge device running inference without cloud connectivity
- Rapid prototyping before committing to server infrastructure
When to Choose vLLM
Choose vLLM when:
- High concurrency: Many users/requests simultaneously
- Cost optimization: GPUs available, need to maximize utilization
- Production serving: Running business-critical inference
- Throughput critical: Batch processing, not real-time interaction
- API service: Building LLM API for multiple downstream clients
- Complex workloads: LoRA adapters, multiple models, sophisticated routing
Examples:
- Commercial chatbot handling thousands of daily users
- Content generation service processing customer requests
- Document summarization platform for production
- LLM API service competing with OpenAI/Anthropic
- Batch inference job processing millions of documents overnight
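The two checklists above condense into a rough decision function. This is a heuristic sketch, not an official tool; the criteria and the concurrency threshold mirror the lists above:

```python
def choose_engine(concurrent_users: int, has_gpu: bool,
                  privacy_required: bool, edge_device: bool) -> str:
    """Rough heuristic mirroring the when-to-choose checklists."""
    # Hard constraints first: no GPU, on-device, or local-only privacy
    # requirements all rule out vLLM.
    if edge_device or privacy_required or not has_gpu:
        return "llama.cpp"
    # With GPU infrastructure and real concurrency, vLLM pays off.
    if concurrent_users > 10:
        return "vLLM"
    # Low traffic: avoid paying for an always-on GPU.
    return "llama.cpp"

print(choose_engine(concurrent_users=1, has_gpu=False,
                    privacy_required=True, edge_device=False))   # llama.cpp
print(choose_engine(concurrent_users=500, has_gpu=True,
                    privacy_required=False, edge_device=False))  # vLLM
```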
Integration Comparison
Llama.cpp integration:
- Python library: llama-cpp-python
- Simple API: llm = Llama(model_path="..."); response = llm("prompt")
- Works with LangChain, LlamaIndex
- No network calls (everything local)
vLLM integration:
- Python library for running a server: from vllm import LLM
- OpenAI API-compatible endpoint: http://localhost:8000/v1/chat/completions
- Works with any OpenAI client library
- Network-based (requests sent to server)
Llama.cpp integration is simpler for single-script applications. vLLM requires server deployment, adding complexity but enabling multi-service architectures.
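Because vLLM exposes an OpenAI-compatible endpoint, any HTTP client works. A sketch of the request body for the chat endpoint (the model name is a placeholder; the actual value is whatever model your server loaded):

```python
import json

# Payload for vLLM's OpenAI-compatible chat endpoint.
# POST this to http://localhost:8000/v1/chat/completions (default port).
payload = {
    "model": "meta-llama/Llama-2-7b-chat-hf",  # placeholder model name
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "max_tokens": 50,
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body)
```

Any OpenAI client library builds this same payload for you, which is why switching a service from a hosted API to self-hosted vLLM often requires only a base-URL change.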
Quantization Quality Impact
Quantization degrades quality, and the degradation grows as precision drops.
Quality at different quantization levels (subjective, 1-10 scale):
- Float32: 10.0 (baseline)
- Float16: 9.8 (negligible difference)
- Int8: 9.2 (minor loss, mostly unnoticeable)
- Int5: 8.7 (noticeable but acceptable)
- Int4: 8.0 (noticeable, some tasks affected)
For code generation, int4 quantization can reduce correctness 10-20%. For creative writing or summarization, impact is minimal.
Llama.cpp's int4 quantization is aggressive. If maximum quality matters (code generation, reasoning), consider using higher precision models (13B int5) or accepting quality loss.
vLLM uses float16 by default, losing almost no quality compared to float32 while saving memory.
Practical Performance Benchmarks
Scenario: 7B model, single request, "Generate a 100-word story"
Llama.cpp (i7 16-core, 32GB RAM):
- Time to first token: 150ms
- Tokens per second: 15
- Total time: ~150ms + 7s = 7.15s
Llama.cpp (RTX 4090 GPU):
- Time to first token: 50ms
- Tokens per second: 200
- Total time: ~50ms + 0.5s = 0.55s
vLLM (A100 40GB, batch size 1):
- Time to first token: 200ms
- Tokens per second: 1000
- Total time: ~200ms + 0.1s = 0.3s
vLLM wins on speed but requires expensive GPU. Llama.cpp on RTX 4090 offers reasonable speed with lower cost.
Scenario: 100 concurrent users, each requesting 50-token response
Llama.cpp (RTX 4090):
- Processes one request at a time
- 100 requests × 0.3s = 30 seconds total
- Only one user gets fast response, others wait
vLLM (A100 40GB, batching):
- Batches requests, processes ~100 requests in parallel
- All 100 requests complete in ~0.3 seconds total (one batch)
- All users get ~0.3s response time
For concurrent workloads, vLLM is vastly superior.
Memory Requirements Deep Dive
Llama.cpp memory usage (7B model):
- Model weights (int4): 3.5GB
- KV cache for 2048 token context: 1.8GB
- Computation buffers: 500MB
- Total: ~5.8GB
A laptop with 16GB RAM can comfortably run llama.cpp.
vLLM memory usage (7B model, batch size 256):
- Model weights (float16): 14GB
- KV cache for 2048 token context × 256 batch: 14GB
- Computation buffers: 2GB
- Total: ~30GB
Requires GPU with 40GB+ VRAM, limiting to premium hardware.
This explains the cost difference: vLLM requires expensive hardware, llama.cpp uses existing machines.
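The two memory budgets above are just sums of the same three components (figures from the breakdowns above):

```python
def memory_budget_gb(weights_gb: float, kv_cache_gb: float,
                     buffers_gb: float) -> float:
    """Total runtime memory: weights + KV cache + computation buffers."""
    return weights_gb + kv_cache_gb + buffers_gb

llama_cpp = memory_budget_gb(3.5, 1.8, 0.5)   # 7B int4, single request
vllm = memory_budget_gb(14.0, 14.0, 2.0)      # 7B float16, batch of 256

print(f"llama.cpp: {llama_cpp:.1f} GB, vLLM: {vllm:.1f} GB")
```

Note that for vLLM the KV cache rivals the weights themselves: batch size, not model size, dominates the memory bill.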
Production Deployment Considerations
Llama.cpp for production:
- Limited concurrency (handle one request at a time)
- Queue requests externally (use task queue like Celery)
- Suitable for low-traffic applications
- Simple scaling (add more machines)
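Since llama.cpp handles one request at a time, an external queue serializes concurrent callers. A minimal in-process sketch using a single worker thread (the `run_inference` stub stands in for an actual llama.cpp call; production systems would use a real task queue like Celery):

```python
import queue
import threading

def run_inference(prompt: str) -> str:
    # Stub: a real deployment would call llama.cpp here
    # (e.g. via llama-cpp-python) and block until generation finishes.
    return f"response to: {prompt}"

request_q: queue.Queue = queue.Queue()

def worker() -> None:
    # Single worker thread: requests are processed strictly one at a
    # time, matching llama.cpp's single-request execution model.
    while True:
        prompt, reply_q = request_q.get()
        reply_q.put(run_inference(prompt))
        request_q.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(prompt: str) -> str:
    reply_q: queue.Queue = queue.Queue(maxsize=1)
    request_q.put((prompt, reply_q))
    return reply_q.get()  # blocks until the worker reaches this request

print(submit("What is the capital of France?"))
```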
vLLM for production:
- Built for production (proper error handling, logging, metrics)
- Horizontal scaling (add more GPU instances behind load balancer)
- Monitoring integration (Prometheus metrics exported)
- API standards (OpenAI API compatibility)
If developers anticipate high traffic, vLLM's production-ready design saves engineering effort later.
Hybrid Approach
Many teams use both:
- Development: Llama.cpp for local testing on laptops
- Production: vLLM for high-traffic serving
- Edge: Llama.cpp for on-device inference
This maximizes flexibility: developers can test locally, production handles scale, edge devices have offline capability.
Advanced Deployment Patterns
Sophisticated teams use both tools in layered architectures.
Edge + cloud hybrid: Deploy llama.cpp on edge devices (phones, local machines) for low-latency, privacy-preserving inference. Route complex requests to cloud vLLM for higher quality. This provides responsive experience locally while maintaining quality where needed.
Cascade inference: Route simple requests to llama.cpp (fast, local), complex requests to vLLM (slower, more capable). Measure request complexity and route accordingly. This optimizes cost and latency.
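A cascade router can start as a simple complexity heuristic. The token-count threshold below is an arbitrary illustration; real routers might use a trained classifier or per-task rules:

```python
def route(prompt: str, max_local_tokens: int = 50) -> str:
    """Route short/simple prompts to local llama.cpp, the rest to vLLM."""
    # Crude proxy for complexity: prompt length in whitespace tokens.
    n_tokens = len(prompt.split())
    if n_tokens <= max_local_tokens:
        return "llama.cpp (local)"
    return "vLLM (cloud)"

print(route("Summarize this sentence."))   # llama.cpp (local)
print(route("word " * 200))                # vLLM (cloud)
```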
Local augmentation: Run llama.cpp for embedding generation (fast, local), send embeddings to vLLM for semantic search. Combines local efficiency with cloud capability.
Development to production progression: Develop and test locally using llama.cpp (free, instant iterations), deploy to production using vLLM (mature, scalable). Same code path, different execution environment.
Quantization Quality Tradeoff Analysis
Quality loss from quantization varies by application.
High-quality-critical applications (code generation, reasoning):
- Use higher precision (llama.cpp int5 or int8; vLLM float16)
- Quality loss under 5% acceptable
- Extreme quantization (int4) causes 15-30% accuracy loss
Quality-tolerant applications (summarization, classification):
- int4 quantization acceptable
- Quality loss under 20% usually unnoticed
- Extreme compression justified
Quality-agnostic applications (embedding, simple classification):
- Extreme quantization fine
- int2-int4 usable
- Minimal quality loss observable
Know the application's quality sensitivity. Profile quality at different quantization levels. Select the most aggressive quantization maintaining acceptable quality.
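That selection rule — the most aggressive quantization that still clears the quality bar — is easy to express. The scores reuse the subjective 1-10 scale from the quality table earlier:

```python
# Subjective quality scores from the earlier table (1-10 scale).
QUALITY = {"float32": 10.0, "float16": 9.8, "int8": 9.2,
           "int5": 8.7, "int4": 8.0}

# Ordered from most to least aggressive compression.
AGGRESSIVENESS = ["int4", "int5", "int8", "float16", "float32"]

def pick_quantization(min_quality: float) -> str:
    """Most aggressive quantization whose measured quality clears the bar."""
    for level in AGGRESSIVENESS:
        if QUALITY[level] >= min_quality:
            return level
    return "float32"  # nothing cleared the bar: fall back to full precision

print(pick_quantization(9.0))  # int8: first level scoring >= 9.0
print(pick_quantization(8.0))  # int4: even the most aggressive level qualifies
```

In practice the QUALITY table should come from profiling your own tasks, not a generic scale.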
Monitoring and Performance Profiling
For production deployments, understand actual performance characteristics.
Latency SLI tracking: Measure p50, p95, p99 latency percentiles. Single-request average latency doesn't tell the story. p99 latency (99% of requests complete faster than this) determines SLA compliance.
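Percentile tracking needs nothing exotic. A sketch using the standard library (a real system would feed these from a metrics pipeline such as Prometheus):

```python
import statistics

# Simulated request latencies in milliseconds; one slow outlier.
latencies_ms = [120, 130, 125, 140, 135, 128, 122, 131, 900, 127]

# quantiles(n=100) yields 99 cut points; index k-1 is the k-th percentile.
pcts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = pcts[49], pcts[94], pcts[98]

print(f"p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
# The single 900ms outlier barely moves p50 but dominates p99 —
# which is why averages hide SLA problems.
```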
Throughput measurement: Requests per second metric varies based on request size, GPU memory pressure, quantization. Profile with realistic request loads.
Memory profiling: Monitor GPU/CPU memory usage under load. Memory leaks accumulate over hours/days, degrading performance. Restart processes periodically if leaks detected.
Cost per request: Divide infrastructure cost by request volume. Optimization targets might shift (if vLLM costs $10/month for 1M requests, improving throughput 2x reduces cost per request 50%).
Framework Integration and Ecosystem
Llama.cpp and vLLM integrate with broader frameworks differently.
LangChain integration:
- Llama.cpp: Ollama integration (runs local llama.cpp)
- vLLM: REST API compatible with OpenAI format
Both work with LangChain, but differently. Llama.cpp integrates tightly (managed lifecycle), vLLM is API-based (independent server).
LlamaIndex integration: Similar pattern. Llama.cpp for local, vLLM for remote.
Deployment orchestration: vLLM works well with Kubernetes (stateless, horizontally scalable). Llama.cpp works in containers but scales less naturally.
Production Readiness Checklist
Before production deployment, verify:
Llama.cpp production readiness:
- Model quantization quality verified
- Latency SLA confirmed
- Memory requirements profiled
- Error handling implemented
- Logging and monitoring configured
- Fallback mechanism for failures
- Process restart on crash configured
- Rate limiting implemented
vLLM production readiness:
- Kubernetes manifests tested
- Health checks configured
- Autoscaling tested under load
- Database for conversation history
- Load balancer configured
- Monitoring and alerts enabled
- Disaster recovery procedures documented
- Cost tracking implemented
Final Thoughts
Llama.cpp and vLLM solve different problems. Llama.cpp excels at resource-constrained, local, privacy-first inference. vLLM excels at high-concurrency, high-throughput serving with GPUs.
For hobby projects, research, or privacy-critical applications, llama.cpp is the clear choice. For production serving with multiple users, vLLM is required. Most mature ML teams use both, selecting based on context.
Start with llama.cpp locally to understand the workload. When traffic demands scale or concurrency requirements grow, migrate to vLLM on GPU infrastructure. This progression minimizes investment until proven necessary.
Many sophisticated deployments use both simultaneously: llama.cpp for edge devices and development, vLLM for production serving. This maximizes flexibility while managing cost.
Build layered inference into the architecture. Local inference first (fast, cheap, private). Cloud inference for complex requests. Cascade between layers based on query complexity. This architecture provides the best user experience with optimal cost structure.