Contents
- Overview
- Top LLM Serving Frameworks Ranked
- Throughput and Latency Benchmarks
- GPU Support & Hardware Optimization Matrix
- Feature Comparison
- Selection Guide
- Quantization Support & Performance Impact
- Advanced Concepts: Tensor Parallelism vs Pipeline Parallelism
- Deployment Scenarios
- Deployment Complexity & Operational Requirements
- FAQ
- Related Resources
Overview
LLM serving frameworks optimize inference speed and throughput. vLLM leads on throughput (50-100% higher than naive PyTorch inference). SGLang specializes in structured (constrained) generation. TGI balances compatibility and speed. TensorRT-LLM reaches peak performance on NVIDIA hardware but requires model compilation. llama.cpp targets CPU inference and consumer GPUs. Triton provides production-grade multi-framework orchestration. For specific model deployment, see the best Ollama models guide.
Top LLM Serving Frameworks Ranked
1. vLLM
Throughput leader. Achieves 50-100% higher throughput than baseline PyTorch inference through paged attention and KV cache management. Open-source, widely adopted, multi-GPU support via tensor parallelism and pipeline parallelism.
Throughput: 50-200 tokens/sec on single H100, scales to 500+ tokens/sec on 8-GPU cluster.
Latency: 10-50ms per token (P50), depends on batch size.
Supported models: Llama, Mistral, Phi, Gemma, Qwen, GPT-2, Falcon, Bloom. Essentially all HuggingFace transformers.
Pros:
- Massive throughput gains
- Easy to deploy (Docker, Kubernetes)
- Active community
- Multi-GPU scaling
Cons:
- Memory overhead (paged attention uses more VRAM for caching)
- Requires NVIDIA CUDA or AMD ROCm
Best for: Production inference APIs, high throughput requirements, cost optimization (process more tokens per hour).
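The mechanism behind those throughput gains can be illustrated with a toy allocator. This is a sketch of the paged-attention idea only, not vLLM's actual implementation: the KV cache is carved into fixed-size blocks handed out on demand, instead of one contiguous worst-case reservation per request.

```python
# Toy block-based KV-cache allocator illustrating the paged-attention idea:
# sequences get fixed-size blocks on demand, so memory waste is capped at
# one partially filled block per sequence.
class BlockKVCache:
    def __init__(self, total_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(total_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical blocks
        self.seq_lens: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        """Account for one generated token; grab a new block only on a boundary."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # existing blocks are full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must queue")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = BlockKVCache(total_blocks=4, block_size=16)
for _ in range(20):
    cache.append_token("req-A")  # 20 tokens occupy 2 of the 4 blocks
```

A contiguous allocator would reserve the full context length up front for every request; block granularity is what lets vLLM pack far more concurrent sequences into the same VRAM.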
2. SGLang
Structured generation specialist. Enables constrained decoding (JSON, regex, grammar) and reports 3-5x speedups over vLLM on structured-output workloads, driven by compressed finite-state-machine decoding and RadixAttention prefix caching.
Uses paged attention like vLLM, and adds a runtime for stateful, multi-step generation programs with token-level control.
Throughput: 100-250 tokens/sec on single H100 (similar to vLLM but with structured constraints).
Latency: 15-80ms per token, varies by structure complexity.
Supported models: Largely the same as vLLM.
Pros:
- Structured generation (JSON, regex) 3-5x faster
- Function calling optimized
- Built on proven vLLM foundation
Cons:
- Smaller community than vLLM
- Structured output feature still maturing
Best for: APIs returning structured JSON, function calling, form filling, constrained generation tasks.
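The core trick in constrained decoding is simple: mask candidate tokens so only those that keep the output valid survive, then sample from the survivors. A toy sketch (SGLang and similar engines do this over the full vocabulary with a compiled grammar/FSM; here a plain validity predicate stands in for the grammar):

```python
# Toy constrained decoding step: filter candidates by a validity predicate,
# then pick the best-scoring survivor. Real engines mask logits over the
# whole vocabulary using a compiled finite-state machine.
def constrained_step(candidates: dict[str, float], prefix: str, is_valid) -> str:
    allowed = {tok: score for tok, score in candidates.items()
               if is_valid(prefix + tok)}
    if not allowed:
        raise ValueError("no candidate keeps the output valid")
    return max(allowed, key=allowed.get)

# Constrain output to digits (e.g. a JSON integer field).
digits_only = lambda s: s.isdigit()
tok = constrained_step({"4": 1.2, "x": 3.0, "7": 0.5}, "12", digits_only)
# "x" scores highest but is masked out; "4" wins among the valid tokens.
```

Because invalid continuations are never sampled, the engine also skips whole spans of forced tokens (fixed JSON punctuation, key names), which is where much of the structured-output speedup comes from.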
3. Text Generation Inference (TGI)
Hugging Face's production-grade framework. Optimized for serving open-source models at scale. Supports quantization, token streaming, and multi-GPU inference.
Throughput: 40-150 tokens/sec on single H100 (slower than vLLM but well-optimized).
Latency: 20-60ms per token.
Supported models: Llama, Mistral, Qwen, Bloom, Falcon, and 100+ others from HuggingFace.
Pros:
- Battle-tested at scale (HuggingFace Inference API uses TGI internally)
- Excellent for model serving in production
- Built-in quantization support
- Token streaming
Cons:
- Slower than vLLM or SGLang
- Less active community than vLLM
Best for: Production model serving, hosted inference APIs, teams already using HuggingFace ecosystem.
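TGI's HTTP API is intentionally simple; a minimal request body for its /generate endpoint looks like this (a sketch — the "inputs" + "parameters" shape follows TGI's documented schema, but verify parameter names against your deployed version):

```python
import json

# Build the JSON body for TGI's POST /generate endpoint.
def tgi_generate_payload(prompt: str, max_new_tokens: int = 64,
                         temperature: float = 0.7) -> str:
    body = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }
    return json.dumps(body)

payload = tgi_generate_payload("Summarize the trade-offs between vLLM and TGI.")
```

For token streaming, the same body goes to the /generate_stream endpoint, which responds with server-sent events instead of a single JSON document.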
4. TensorRT-LLM
NVIDIA's framework for maximum performance on NVIDIA hardware, built on the proprietary TensorRT engine compiler. Up to 2x faster than vLLM on the same hardware thanks to kernel fusion and graph-level optimization.
Throughput: 150-400 tokens/sec on single H100, 1000+ tokens/sec on 8-GPU cluster.
Latency: 5-20ms per token (lowest latency option).
Supported models: Llama, Mistral, GPT, Qwen, Phi. Requires explicit support (not all models supported).
Pros:
- Fastest absolute throughput
- Lowest latency (critical for real-time applications)
- NVIDIA-optimized kernels
- Scales to massive clusters
Cons:
- Model support limited (requires TensorRT plugin)
- NVIDIA-only (no AMD/CPU)
- Model compilation required (slow, requires NVIDIA tools)
- Steeper learning curve
Best for: Latency-critical applications, large-scale inference (100K+ QPS), firms with NVIDIA expertise.
5. llama.cpp
CPU-first C++ inference. Quantization specialist. Runs on CPU, consumer GPUs (Metal on Apple, CUDA on NVIDIA, ROCm on AMD).
Throughput: 5-30 tokens/sec on CPU (slow), 30-50 tokens/sec on RTX 4090 (single GPU).
Latency: 20-100ms per token on CPU.
Supported models: Llama, Mistral, Phi, Gemma, Qwen, and most other open models converted to GGUF (llama.cpp's own format, which holds both quantized and full-precision weights).
Pros:
- Runs on anything (CPU, laptop, old GPUs)
- Excellent quantization (1-bit, 2-bit, 3-bit, 4-bit)
- No dependencies (single binary)
- Low memory usage
Cons:
- Slow (CPU inference)
- Limited model support
- Less throughput than GPU options
Best for: On-device inference, edge deployment, consumer applications (Ollama wrapper around llama.cpp), cost-free inference (CPU).
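The memory math behind those quantization levels is back-of-envelope: weight footprint is roughly parameters × bits/8, plus some overhead for quantization scales, higher-precision embedding/output layers, and metadata. The ~10% overhead factor below is an illustrative assumption, not an exact GGUF accounting:

```python
# Rough weight-file size for a quantized model: params x bits/8, ~10% overhead.
def model_size_gb(params_billion: float, bits_per_weight: float,
                  overhead: float = 1.1) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8 * overhead
    return round(bytes_total / 1e9, 1)

for bits in (16, 8, 4, 2):
    print(f"7B model at {bits}-bit: ~{model_size_gb(7, bits)} GB")
```

This is why a 4-bit 7B model (~4 GB) fits on a laptop or phone while the FP16 original (~15 GB) does not.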
6. Triton Inference Server
NVIDIA Triton: multi-framework orchestration platform. Supports vLLM, TensorRT, custom backends. Production-grade solution for managing multiple models and frameworks at scale.
Throughput: Depends on backend (vLLM, TensorRT, etc.). Triton adds <5% overhead.
Latency: Similar to underlying framework.
Supported models: Any model supported by backend framework.
Pros:
- Multi-framework support
- Model versioning and A/B testing
- Dynamic batching out-of-the-box
- Production-grade monitoring
Cons:
- Significant operational overhead
- Overkill for single-model APIs
- Steep learning curve
Best for: Multi-model inference clusters, teams managing dozens of models, shops needing A/B testing and versioning.
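Triton is driven by a model repository: each served model gets a directory containing a config.pbtxt. A minimal illustrative config enabling dynamic batching (field names follow Triton's model-configuration schema; the model name, backend, and values here are placeholders to adapt):

```
name: "my_llm"
backend: "python"
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}
```

The dynamic_batching block is what gives Triton its out-of-the-box request coalescing: incoming requests wait up to the queue delay so the server can form larger, more GPU-efficient batches.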
Throughput and Latency Benchmarks
Single H100 Throughput (tokens/second)
| Framework | Small (7B) | Medium (13B) | Large (70B) |
|---|---|---|---|
| vLLM | 180 | 120 | 50 |
| SGLang | 160 | 110 | 45 |
| TGI | 140 | 90 | 40 |
| TensorRT-LLM | 280 | 200 | 100 |
| llama.cpp (CPU) | 8 | 6 | 2 |
Measured on NVIDIA H100 PCIe (80GB), batch size 32, FP16 precision.
TensorRT-LLM leads by roughly 1.5-2x. vLLM and SGLang are competitive. TGI is reasonable. llama.cpp is CPU-bound, included for reference.
Latency (P50, milliseconds)
| Framework | Batch=1 | Batch=8 | Batch=32 |
|---|---|---|---|
| vLLM | 25 | 15 | 12 |
| SGLang | 30 | 18 | 14 |
| TGI | 35 | 20 | 16 |
| TensorRT-LLM | 10 | 8 | 7 |
| llama.cpp | 45 | 40 | 38 |
Lower is better. TensorRT-LLM wins latency. At batch=1 (single request), vLLM is good enough for most interactive uses. At batch=32 (high load), all frameworks except llama.cpp are <20ms.
Cost per Million Tokens (H100 at $1.99/hr)
| Framework | Throughput | Cost/1M Tokens |
|---|---|---|
| vLLM | 50 tokens/sec | $11.16 |
| SGLang | 45 tokens/sec | $12.40 |
| TGI | 40 tokens/sec | $13.95 |
| TensorRT-LLM | 100 tokens/sec | $5.58 |
| llama.cpp (CPU) | 5 tokens/sec | $111.60 (priced at the same hourly rate, for comparison only) |
TensorRT-LLM is 2x cheaper per token due to higher throughput. vLLM is about 25% cheaper than TGI on the same hardware.
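The table's entries follow from a one-line formula; the published figures run a few percent higher than this raw calculation, presumably from rounding:

```python
# Cost to generate 1M tokens at a sustained per-GPU generation rate.
def cost_per_million_tokens(tokens_per_sec: float,
                            gpu_hourly_usd: float = 1.99) -> float:
    hours = 1_000_000 / tokens_per_sec / 3600
    return round(hours * gpu_hourly_usd, 2)

print(cost_per_million_tokens(50))    # vLLM row (~$11)
print(cost_per_million_tokens(100))   # TensorRT-LLM row (~$5.5)
```

The formula makes the lever obvious: cost per token is inversely proportional to sustained throughput, so a 2x throughput gain is a 2x cost cut on identical hardware.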
GPU Support & Hardware Optimization Matrix
Framework Hardware Support
| Framework | NVIDIA (CUDA) | AMD (ROCm) | Intel (XPU) | Apple (Metal) | CPU | Max Context |
|---|---|---|---|---|---|---|
| vLLM | ✓ | ✓ | ✗ | ✗ | ✗ | 200K+ |
| SGLang | ✓ | ✓ | ✗ | ✗ | ✗ | 200K+ |
| TGI | ✓ | ✓ | ✗ | ✗ | ✓ | 100K |
| TensorRT-LLM | ✓ | ✗ | ✗ | ✗ | ✗ | 256K |
| llama.cpp | ✓ | ✓ | ✗ | ✓ | ✓ | 100K |
| Triton | ✓ | ✓ | ✗ | ✗ | ✗ | 200K+ |
NVIDIA dominance is clear: all frameworks support CUDA. AMD (ROCm) gets support in vLLM, SGLang, TGI, Triton. Apple Metal is llama.cpp only. CPU support limited to TGI and llama.cpp (both slow).
GPU Memory Optimization by Framework
vLLM Memory Management:
- Paged attention: reduces KV cache memory by 30-50%
- Prefix caching: reuses embeddings for repeated prompts
- Memory allocation: dynamic (scales with batch size)
- Typical consumption: Llama 70B requires ~48-52GB on H100 (80GB available), assuming 4-bit quantized weights plus KV cache; FP16 weights alone are ~140GB and need multiple GPUs
TensorRT-LLM Memory Management:
- Kernel fusion: combines operations to reduce intermediate tensor storage
- Weight quantization: 8-bit weights reduce memory by 50%
- Memory pooling: pre-allocates and reuses buffers
- Typical consumption: Llama 70B requires ~35-40GB on H100 with INT4 quantization (INT8 weights alone would be ~70GB)
SGLang Memory Management:
- Inherits vLLM's paged attention
- Token-level state machine adds minimal overhead (<5%)
- Typical consumption: similar to vLLM
TGI Memory Management:
- Flash attention kernels (20% memory savings)
- Dynamic batching (memory scales with request count)
- Typical consumption: Llama 70B requires ~45-50GB on H100
GPU Scaling: Single-GPU to Multi-GPU
Single GPU (H100, 80GB):
- Models up to 70B parameters
- Batch size: 8-32 (latency-optimized)
- Throughput: 50-100 tokens/sec
Multi-GPU Tensor Parallelism (2x H100 with NVLink):
- Models up to 405B parameters
- Batch size: 16-64
- Throughput: 80-150 tokens/sec (not 2x due to sync overhead)
- Latency per token: slight increase (sync across GPUs)
Multi-GPU Pipeline Parallelism (4x H100):
- Models up to 405B parameters
- Batch size: 64-256 (high throughput)
- Throughput: 150-300 tokens/sec
- Latency per token: higher (sequential pipeline stages)
Recommendation: Tensor parallelism for latency-sensitive (interactive). Pipeline parallelism for throughput-optimized (batch processing).
Feature Comparison
| Feature | vLLM | SGLang | TGI | TensorRT-LLM | llama.cpp |
|---|---|---|---|---|---|
| Paged Attention | ✓ | ✓ | ✓ | ✓ | ✗ |
| Structured Output | ✓ (via guided decoding) | ✓ (native) | ✗ | ✗ | ✓ (GBNF grammars) |
| Token Streaming | ✓ | ✓ | ✓ | ✓ | ✓ |
| Quantization | ✓ (GPTQ, AWQ) | ✓ (GPTQ, AWQ) | ✓ (GPTQ, AWQ) | ✓ | ✓ (GGUF) |
| Multi-GPU | ✓ | ✓ | ✓ | ✓ | ✗ |
| Tensor Parallelism | ✓ | ✓ | ✓ | ✓ | ✗ |
| Pipeline Parallelism | ✓ | ✓ | ✓ | ✓ | ✗ |
| KV Cache Optimization | ✓ | ✓ | ✓ | ✓ | ✗ |
| Dynamic Batching | ✓ | ✓ | ✓ | ✓ | ✗ |
| OpenAI-Compatible API | ✓ | ✓ | ✓ | ✗ | ✗ |
vLLM, SGLang, and TGI expose OpenAI-compatible APIs (drop-in replacements for the OpenAI SDK); SGLang additionally offers native structured output. TensorRT-LLM sacrifices convenience for speed.
Selection Guide
Use vLLM if:
- Broad GPU support required (NVIDIA + AMD)
- Throughput matters more than latency
- Easy deployment and community support valued
- Cost-conscious ($11/1M tokens on H100)
Use SGLang if:
- Structured outputs (JSON, function calls) critical
- Building retrieval-augmented generation (RAG)
- Throughput and capability balance needed
Use TGI if:
- HuggingFace ecosystem integration important
- Model versioning and A/B testing required
- Already running Inference API elsewhere
Use TensorRT-LLM if:
- Lowest latency critical (<10ms per token)
- Large-scale inference (100K+ QPS)
- NVIDIA-only acceptable
- Willing to invest in compilation and tuning
Use llama.cpp if:
- On-device or CPU inference required
- Consumer GPU deployment (Ollama)
- Quantization-first approach
- Single-binary simplicity
Quantization Support & Performance Impact
| Framework | INT8 | INT4 | FP8 | GPTQ | AWQ | GGUF | Speed Gain | Memory Savings |
|---|---|---|---|---|---|---|---|---|
| vLLM | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | 20-40% | 50-75% |
| SGLang | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | 20-40% | 50-75% |
| TGI | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | 15-35% | 50-75% |
| TensorRT-LLM | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | 40-60% | 60-80% |
| llama.cpp | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | 10-20% | 80-90% |
Quantization trades accuracy for speed and memory. INT4 models are nearly identical in quality to FP16 (< 1% perplexity increase on benchmarks).
Speed gains are real: Quantized H100 inference often matches FP16 H100 on throughput but uses 50% less memory (allows larger batch sizes). Net effect: 20-40% throughput improvement from quantization alone.
Example: Llama 70B quantization impact
- FP16 on H100: 50 tokens/sec
- INT4 on H100: 75 tokens/sec (50% more throughput)
- Batch size increases: FP16 batch=8, INT4 batch=16 (double the concurrency)
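The batch-size doubling comes from the VRAM freed by smaller weights becoming KV-cache headroom. Back-of-envelope arithmetic (a sketch assuming full multi-head attention; grouped-query-attention models like Llama 70B shrink the per-token cache by their kv-head ratio, and the budget figures are illustrative):

```python
# KV cache per token: a K and a V vector for every layer, at bytes_per_val each.
def kv_bytes_per_token(n_layers: int, hidden: int, bytes_per_val: int = 2) -> int:
    return 2 * n_layers * hidden * bytes_per_val

# How many full-context sequences a given KV-cache budget can hold.
def max_concurrent_seqs(budget_gb: float, context_len: int, per_token: int) -> int:
    return int(budget_gb * 1e9 // (context_len * per_token))

per_tok = kv_bytes_per_token(n_layers=80, hidden=8192)  # ~2.6 MB/token at FP16
print(max_concurrent_seqs(40, 4096, per_tok))           # seqs a 40 GB budget fits
```

Every gigabyte not spent on weights becomes room for more concurrent sequences, which is why quantization shows up as a throughput gain rather than just a memory saving.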
Advanced Concepts: Tensor Parallelism vs Pipeline Parallelism
Tensor Parallelism: Split weights across GPUs. Each GPU holds a portion of the model. All GPUs compute in parallel. Lower latency, requires high-speed interconnect (NVLink). Best for latency-sensitive APIs.
Example: Llama 70B split across 2 H100s with NVLink. Each H100 computes portion of each transformer layer in parallel. Latency: ~1.1x single-GPU (synchronization overhead), but essential for models exceeding single-GPU memory.
Optimal for: Interactive chat, real-time RAG, APIs with <500ms SLA.
vLLM and TensorRT-LLM support tensor parallelism natively. Scales to 4-8 GPUs before synchronization overhead dominates.
Pipeline Parallelism: Split layers across GPUs. GPU1 runs layers 1-20, GPU2 runs layers 21-40. Sequential. Higher latency per token, but simpler implementation and less synchronization.
Example: Llama 70B split across 4 H100s. GPU1 computes the first quarter of the layers, then passes activations to GPU2, and so on. Latency per token is up to 4x single-GPU for any one request (stages run sequentially), but at batch=64 multiple micro-batches keep every stage busy, hiding that latency.
Optimal for: Batch processing, overnight jobs, throughput optimization where latency doesn't matter.
All frameworks support pipeline parallelism. Simple to understand and debug.
Hybrid Parallelism: Combine both. Large models split via tensor parallelism (2-4 GPUs), then replicate across 4-8 nodes with pipeline parallelism. State-of-the-art for massive models (405B Llama).
Example: Llama 405B on 8 H100s: Tensor parallelism (2x H100 per layer) + Pipeline parallelism (4 pipeline stages). Each stage is 2x tensor-parallel. Most complex but achieves highest throughput on massive models.
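The stage/shard bookkeeping can be sketched as a device map (a toy stage-major layout, not any framework's actual placement API; Llama 3.1 405B has 126 transformer layers):

```python
# Map each transformer layer to the GPU ids holding its tensor-parallel shards.
# GPUs are numbered stage-major: pipeline stage s owns ids
# [s * tp_size, (s + 1) * tp_size).
def hybrid_device_map(n_layers: int, pp_stages: int,
                      tp_size: int) -> dict[int, list[int]]:
    per_stage = n_layers // pp_stages  # leftover layers land on the last stage
    mapping = {}
    for layer in range(n_layers):
        stage = min(layer // per_stage, pp_stages - 1)
        mapping[layer] = [stage * tp_size + rank for rank in range(tp_size)]
    return mapping

# Llama-405B-like: 126 layers, 4 pipeline stages x 2-way tensor parallel = 8 GPUs
placement = hybrid_device_map(n_layers=126, pp_stages=4, tp_size=2)
```

Each layer lives on exactly one tensor-parallel group of 2 GPUs; the 4 groups form the pipeline. Real placement also balances uneven layer costs and pins embedding/output layers, which this sketch ignores.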
Use Triton if:
- Multi-model inference cluster
- Multi-model orchestration
- A/B testing and versioning critical
- NVIDIA infrastructure in place
Deployment Scenarios
Scenario 1: Startup MVP (Single H100 RunPod)
Use vLLM. Reason: simple deployment (Docker), OpenAI-compatible API allows easy SDK integration, community support strong, throughput sufficient. See RunPod pricing guide for cost estimates.
Cost: H100 at $1.99/hr. Llama 70B achieves 50 tokens/sec = 4.3M tokens/day. $1.99/hr × 24 = $47.76/day for unrestricted throughput.
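Because the server speaks the OpenAI API, the MVP's application code needs nothing vLLM-specific; a request is an ordinary chat-completions call. A sketch (the host and model name are placeholders):

```python
import json

# Build the path and JSON body for an OpenAI-compatible chat completion.
def chat_request(model: str, user_msg: str, stream: bool = True) -> tuple[str, str]:
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "stream": stream,       # token streaming for interactive UIs
        "max_tokens": 256,
    }
    return "/v1/chat/completions", json.dumps(body)

path, body = chat_request("meta-llama/Llama-3.1-70B-Instruct", "Hello!")
# POST this to http://<vllm-host>:8000 + path, or point the OpenAI SDK's
# base_url at the server and skip the manual request entirely.
```

Swapping between a managed OpenAI endpoint and the self-hosted cluster then becomes a base-URL change, which is the main operational payoff of API compatibility.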
Scenario 2: Production Chat API (8x H100 cluster)
Use vLLM for simplicity or TensorRT-LLM for 2x throughput gain.
With vLLM: 8 × 50 tok/sec = 400 tok/sec ≈ 34.6M tokens/day. Cost: 8 × $1.99/hr × 730 hrs ≈ $11,622/month.
With TensorRT-LLM: 8 × 100 tok/sec = 800 tok/sec ≈ 69M tokens/day. Same hardware cost, double the throughput. Cost per 1M tokens: ~$11.16 (vLLM) vs ~$5.58 (TensorRT-LLM).
Scenario 3: High-Throughput Batch Processing
Use TGI or vLLM with large batch sizes (64-128). Batch processing tolerates latency, prioritizes throughput.
vLLM at batch=128: 150-200 tokens/sec on H100 (higher than batch=32 due to better utilization). Overnight job processing 1B tokens: 1B / 175 tok/sec ≈ 5.7M seconds ≈ 1,587 hours ≈ $3,160 on H100.
Scenario 4: Ultra-Low Latency API
Use TensorRT-LLM. Example: real-time chatbot with <100ms SLA.
TensorRT-LLM on H100: 7ms per token (P50) + 20ms network + 10ms app overhead = 37ms total. Meets <100ms SLA with headroom.
Scenario 5: On-Device Mobile App
Use llama.cpp with a quantized Mistral 7B (4-bit, ~4GB of memory). Deploy via Ollama or embed directly.
Throughput: 5-10 tokens/sec on an iPhone 15 Pro (A17 Pro). Acceptable for real-time chat.
Deployment Complexity & Operational Requirements
Framework Setup Time
| Framework | Docker Setup | Kubernetes | Multi-GPU Config | Learning Curve |
|---|---|---|---|---|
| vLLM | 5 min | 30 min | 15 min | Easy |
| SGLang | 10 min | 45 min | 20 min | Medium |
| TGI | 5 min | 25 min | 10 min | Easy |
| TensorRT-LLM | 30 min* | 90 min | 60 min | Hard |
| llama.cpp | 2 min | N/A (CPU) | N/A | Very easy |
| Triton | 20 min | 60 min | 45 min | Hard |
*TensorRT-LLM requires model compilation (30+ minutes for large models).
Production Deployment Considerations
vLLM for fast deployment:
- Docker image pulls in 2 minutes
- Kubernetes manifests provided
- Auto-scaling works out-of-the-box
- Time to production: 1-2 hours
TensorRT-LLM for best performance:
- Model compilation is bottleneck (30+ minutes)
- Kubernetes requires custom resource definitions
- Auto-scaling requires careful tuning (peak utilization forecasting)
- Time to production: 4-8 hours
llama.cpp for simplicity:
- Single binary, no dependency management
- Deploy to edge devices or serverless (Cloudflare Workers)
- CPU-only (no GPU driver issues)
- Time to production: 15-30 minutes
Monitoring and Observability
vLLM metrics:
- Request latency (P50, P95, P99)
- Throughput (tokens/sec)
- GPU memory usage
- Queue depth (pending requests)
Dashboard tools: Prometheus + Grafana (DIY), New Relic, Datadog.
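vLLM's OpenAI-compatible server exposes these as Prometheus metrics on its HTTP port under /metrics, so scraping is one config stanza (the job name and target host/port below are placeholders):

```yaml
scrape_configs:
  - job_name: "vllm"
    metrics_path: /metrics
    static_configs:
      - targets: ["vllm-server:8000"]   # vLLM's OpenAI-compatible server port
```

With that in place, the latency, throughput, and queue-depth series feed directly into Grafana dashboards and alerting rules.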
TensorRT-LLM metrics:
- Same as vLLM + kernel execution profiles
- Compilation time tracking
- Hardware utilization per GPU
- Requires NVIDIA Nsys profiling tools
TGI metrics:
- Request-level latency
- Model-level throughput
- Error rates and error types
- Built-in dashboard (limited)
Cost of Operations (per month, 8 GPU cluster)
vLLM (self-hosted):
- Infrastructure: 8x H100 = $2,000/month
- Operations: 1 engineer = $8,000/month (50% allocated)
- Monitoring/logging: $500/month
- Total: ~$10,500/month
TensorRT-LLM (self-hosted):
- Infrastructure: 8x H100 = $2,000/month
- Operations: 2 engineers = $16,000/month (optimization work)
- Monitoring/profiling: $1,000/month
- Total: ~$19,000/month
Managed API (Fireworks, RunPod):
- Inference cost: $0.60/1M input, $0.80/1M output
- For 10B tokens/month: ~$8,000/month
- No operational overhead
- Auto-scaling included
Decision: Managed APIs are cost-competitive unless:
- Volume is high enough to amortize the fixed cluster cost (tens of billions of tokens/month at the rates above)
- You run specialized models (fine-tuned, proprietary)
- You need sub-100ms end-to-end latency (TensorRT-LLM territory)
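The break-even reduces to one division: fixed monthly self-hosting cost over the managed per-token rate. A sketch that treats self-hosted cost as flat, which only holds until the cluster saturates and you must add GPUs:

```python
# Monthly token volume above which self-hosting beats a managed per-token rate.
def breakeven_tokens_per_month(self_hosted_monthly_usd: float,
                               managed_usd_per_million: float) -> float:
    return self_hosted_monthly_usd / managed_usd_per_million * 1_000_000

# e.g. a ~$20k/month cluster vs a blended $0.70 per 1M managed tokens
volume = breakeven_tokens_per_month(20_000, 0.70)
print(f"break-even: {volume / 1e9:.1f}B tokens/month")
```

At those example rates the break-even sits in the tens of billions of tokens per month, which is why managed APIs win for most teams below that scale.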
FAQ
Which framework is fastest?
TensorRT-LLM. 150-400 tokens/sec on H100, up to 2x faster than vLLM. Caveat: requires NVIDIA hardware, model compilation, and expertise.
Which framework is easiest to deploy?
vLLM or TGI. Both have Docker images, Kubernetes manifests, and OpenAI-compatible APIs. One docker run command and it's live.
vLLM vs TGI, which should I choose?
vLLM if throughput matters and community support valued. TGI if HuggingFace ecosystem integration or production maturity required. Both are reasonable; vLLM is faster, TGI is more battle-tested at scale.
Can I use these frameworks with fine-tuned models?
Yes. vLLM, SGLang, and TGI load any HuggingFace-format checkpoint, including fine-tunes; custom architectures may need adapter or custom-model support (vLLM and TGI have the broadest coverage). TensorRT-LLM requires explicit model support and a compile step.
What about quantization? Does it slow down inference?
No. Quantized models (4-bit, 8-bit) are often faster because they use less memory bandwidth. Throughput increase: 20-50% on same hardware. Quality loss: <1% on most benchmarks.
Is token streaming important?
For real-time chat, yes. Streaming allows users to see token-by-token output instead of waiting for entire response. All major frameworks support streaming.
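On the wire, streaming is server-sent events: each chunk arrives as a `data:` line carrying a JSON delta, with `data: [DONE]` as the end sentinel. A minimal parser for OpenAI-style stream lines (a sketch; real clients also buffer partial lines across network reads):

```python
import json

# Extract the text delta from one OpenAI-style server-sent-event line.
def parse_sse_line(line: str):
    if not line.startswith("data: "):
        return None                      # comments, blank keep-alives, etc.
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None                      # end-of-stream sentinel
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content", "")

piece = parse_sse_line('data: {"choices": [{"delta": {"content": "Hel"}}]}')
```

Accumulating the deltas as they arrive is what lets the UI render the response token by token instead of waiting for the full completion.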
Do I need multi-GPU support?
If a single GPU processes requests fast enough (50+ tokens/sec), no. At 50 tok/sec, one GPU tops out around 4M tokens/day; beyond that, multi-GPU via tensor parallelism (vLLM, SGLang, TGI, TensorRT-LLM) scales past the single-GPU ceiling.
Related Resources
- LLM Serving Frameworks Comparison
- vLLM vs SGLang Detailed Analysis
- Best LLM Inference Engine Guide
- GPU Pricing for Inference