Contents
- How to Run LLM Locally: Complete Guide
- Hardware Requirements
- Software Stack Selection
- Step-by-Step Setup
- Inference Engine Optimization
- Performance Tuning
- FAQ
- Sources
How to Run LLM Locally: Complete Guide
Running LLMs locally eliminates API costs and per-request network latency. Phi-3 Mini (3.8B) runs on laptops, Mistral 7B on budget GPUs; Llama 70B needs high-end cards. As of March 2026, local inference is production-ready. Setup is harder than calling an API, but the community is large, hardware is the main constraint, and the savings are real at scale.
Hardware Requirements
Minimum specs for 7B parameter models
- GPU: RTX 4060 (8GB VRAM) or equivalent AMD
- RAM: 16GB system memory
- Storage: 50GB SSD
- Inference speed: 20-40 tokens/second
Recommended specs for 13B models
- GPU: RTX 4070 (12GB) or RTX 4070 Ti (12GB)
- RAM: 32GB
- Storage: 100GB SSD
- Inference speed: 40-80 tokens/second
Optimal specs for 70B models
- GPU: RTX 4090 (24GB; requires 4-bit quantization and often multi-GPU) or A100 (40GB+)
- RAM: 64GB
- Storage: 200GB SSD
- Inference speed: 10-30 tokens/second (larger models generate fewer tokens per second than smaller ones)
CPU-only inference is viable for 7B models at 5-10 tokens/second: acceptable for development, unacceptable for production.
Memory math: a 7B model in FP16 needs 14GB; 8-bit, 7GB; 4-bit, 3.5GB.
Formula: params (in billions) × bytes per parameter = GB. FP16 uses 2 bytes per parameter, so Llama 70B in FP16: 140GB; 8-bit: 70GB; 4-bit: 35GB.
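The formula above can be sketched as a small helper. It estimates weight memory only; real deployments need extra headroom for activations and the KV cache, so treat the result as a lower bound:

```python
def model_memory_gb(params_billions: float, bits: int) -> float:
    """Rough weight-memory estimate: parameter count x bytes per parameter.

    Ignores activation and KV-cache overhead, so the result is a
    lower bound on required VRAM/RAM.
    """
    bytes_per_param = bits / 8
    return params_billions * bytes_per_param

# 7B in FP16 -> 14.0 GB; 70B at 4-bit -> 35.0 GB
print(model_memory_gb(7, 16))   # 14.0
print(model_memory_gb(70, 4))   # 35.0
```

The same function covers every quantization level discussed later: pass `bits=8` for INT8, `bits=4` for GPTQ/AWQ, `bits=2` for extreme quantization.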
Software Stack Selection
Ollama: simplest option. Installation: one command. Models auto-download. Run locally with zero configuration. Recommended for beginners.
LM Studio: GUI wrapper. User-friendly. Model management visual. Performance acceptable. Good for experimentation.
vLLM: production inference engine. Highest throughput. Batch processing optimized. Complex setup. Steep learning curve.
Text Generation WebUI: feature-rich interface. Multiple backends supported. Good for fine-tuning experiments. Learning curve moderate.
Llama.cpp: minimal dependencies. Runs on weak hardware. Straightforward C++ backend. Popular for edge devices.
GPT4All: desktop GUI. Simple inference. Limited model selection. No fine-tuning support.
Docker containers: reproducible environment. Isolation between projects. Overhead negligible. Recommended for production.
Step-by-Step Setup
Option 1: Ollama (recommended for beginners)
Installation (macOS/Linux; Windows uses the installer from ollama.com):
curl -fsSL https://ollama.com/install.sh | sh
Run model:
ollama run llama3
Specify variant:
ollama run mistral:7b-instruct
Access via REST API:
curl -X POST http://localhost:11434/api/generate -d '{"model":"mistral:7b-instruct", "prompt":"Hello, world", "stream":false}'
That's it. No CUDA setup. No dependency management.
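The same REST call works from any language. A minimal Python sketch mirroring the curl example above, using only the standard library (error handling omitted; assumes Ollama is running on its default port 11434):

```python
import json
import urllib.request

def ollama_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "mistral:7b-instruct",
             host: str = "http://localhost:11434") -> str:
    """POST to a locally running Ollama server; return the completion text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(ollama_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["response"]
```

Call it as `generate("Hello, world")` once the server is up.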
Option 2: vLLM (recommended for production)
Installation:
pip install vllm
Download model from Hugging Face:
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-v0.1
Run server:
python -m vllm.entrypoints.openai.api_server --model ./Mistral-7B-v0.1 --dtype float16
Client code:
import requests

response = requests.post(
    'http://localhost:8000/v1/completions',
    # The model name must match the path passed to --model
    # (or the value of --served-model-name, if set).
    json={'model': './Mistral-7B-v0.1', 'prompt': 'Hello', 'max_tokens': 100},
)
print(response.json()['choices'][0]['text'])
vLLM handles quantization, batching, caching automatically.
Option 3: Llama.cpp (minimal hardware)
Installation:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release
Download quantized model:
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-GGUF/resolve/main/Mistral-7B-Instruct-Q4_K_M.gguf
Run inference:
./build/bin/llama-cli -m ./Mistral-7B-Instruct-Q4_K_M.gguf -n 256 -p "Hello"
Minimal dependencies. Works on CPUs. Perfect for development.
Inference Engine Optimization
Quantization trade-offs
FP16 (16-bit float): highest quality, largest model. 14GB for 7B model. Preferred baseline.
INT8 (8-bit integer): 50% memory reduction. 1-2% quality loss. Worth the savings.
4-bit quantization (GPTQ, AWQ, QLoRA): 75% memory reduction. 5-10% quality loss. Acceptable for most tasks.
2-bit quantization: 87% reduction. 15%+ quality loss. Rarely justified.
Quantization method matters. GPTQ: accurate, slow to quantize. AWQ: faster quantization, comparable quality. QLoRA: only for fine-tuning.
Batch processing optimization
Single request: 100ms latency. Ten requests processed sequentially: 1000ms wall time. Batched together, the same ten requests share roughly one forward pass: per-request latency rises slightly, but wall time stays near 100ms. Trade a little latency for roughly 10x throughput.
vLLM handles batching automatically. Requests queued. Processed in batches. Throughput 10x improvement.
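The trade-off reduces to back-of-envelope arithmetic. This sketch assumes, per the figures above, that a batch completes in roughly one request's latency:

```python
def throughput_rps(batch_size: int, batch_latency_ms: float) -> float:
    """Requests per second when batch_size requests complete together."""
    return batch_size / (batch_latency_ms / 1000)

# Sequential: each 100 ms request finishes alone -> 10 req/s.
sequential = throughput_rps(1, 100)    # 10.0
# Batched: 10 requests share one ~100 ms forward pass -> 100 req/s.
batched = throughput_rps(10, 100)      # 100.0
print(batched / sequential)            # 10.0x throughput improvement
```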
KV cache optimization
Without caching, transformers recompute keys and values for every previous token at each generation step. Wasteful. The KV cache stores these intermediate results, cutting computation roughly 95% for about a 5% memory increase. Always enable it.
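The savings follow from attention's cost structure. A toy count of key/value computations, purely illustrative (one K/V projection per token per step):

```python
def kv_computations(n_tokens: int, use_cache: bool) -> int:
    """Count key/value projections needed to generate n_tokens.

    Without a cache, step t recomputes K/V for all t prior tokens
    (quadratic total); with a cache, each token's K/V is computed once.
    """
    if use_cache:
        return n_tokens
    return sum(range(1, n_tokens + 1))

print(kv_computations(100, use_cache=False))  # 5050
print(kv_computations(100, use_cache=True))   # 100
```

At 100 tokens the cache already eliminates ~98% of K/V work, and the gap widens quadratically with sequence length.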
Memory pooling
vLLM allocates memory blocks. Reuses between requests. Reduces memory fragmentation. Allocation overhead minimal.
Prefix caching
Common prompt prefixes cached. System messages, instructions. Prefix identical across requests. vLLM caches automatically.
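The mechanism can be sketched in plain Python. Here a dict keyed by the shared prefix stands in for the cached KV state, and the `encode` callable is a hypothetical stand-in for the expensive prefill pass:

```python
from typing import Callable

class PrefixCache:
    """Memoize the expensive prefill of a shared prompt prefix."""

    def __init__(self, encode: Callable[[str], object]):
        self._encode = encode   # stand-in for the real prefill computation
        self._cache: dict = {}
        self.misses = 0         # how many times prefill actually ran

    def prefill(self, prefix: str):
        if prefix not in self._cache:
            self.misses += 1
            self._cache[prefix] = self._encode(prefix)
        return self._cache[prefix]

system = "You are a helpful assistant."
cache = PrefixCache(encode=len)   # dummy "KV state": just the length
for _ in range(100):              # 100 requests, identical system message
    cache.prefill(system)
print(cache.misses)               # 1 -- the shared prefix was computed once
```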
Performance Tuning
Temperature tuning. Default 1.0. Range 0.1-2.0. Lower: deterministic, repetitive. Higher: creative, inconsistent. Task-dependent.
Top-p sampling. Default 0.95. Range 0.0-1.0. Lower: more focused. Higher: more diverse.
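Both knobs can be illustrated in pure Python over a toy four-token vocabulary (real engines apply the same operations to the model's logits at every step):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of tokens whose cumulative probability >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept

logits = [3.0, 1.0, 0.5, 0.1]
print(softmax(logits, temperature=0.1)[0])      # ~1.0: near-deterministic
print(top_p_filter(softmax(logits), top_p=0.9)) # [0, 1, 2]: tail token dropped
```

Low temperature concentrates almost all probability on the top token (deterministic, repetitive); a lower top-p trims the long tail before sampling (more focused).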
Max tokens. Inference stops after limit. Prevents runaway generation. Set appropriately for use case.
Timeout settings. Inference timeout: 60-300 seconds depending on max tokens. Prevent hanging requests.
GPU memory fraction. Reserve a portion of VRAM for other tasks; in vLLM, lower --gpu-memory-utilization below its default of 0.9 if you hit out-of-memory errors.
Dtype precision. Use bfloat16 (fast, slightly lower quality). Use float16 (standard quality). Avoid float32 (memory hog).
Benchmarking. Test locally with production data. Measure latency, throughput, quality. Optimize trade-offs.
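A minimal harness for those measurements. The `run_inference` argument is a placeholder for a real client call (e.g. a POST to your local server); here a string-transform stub stands in so the harness itself is runnable:

```python
import statistics
import time

def benchmark(run_inference, prompts):
    """Measure per-request latency and overall throughput for a callable."""
    latencies = []
    start = time.perf_counter()
    for p in prompts:
        t0 = time.perf_counter()
        run_inference(p)
        latencies.append(time.perf_counter() - t0)
    wall = time.perf_counter() - start
    return {
        "p50_latency_s": statistics.median(latencies),
        "throughput_rps": len(prompts) / wall,
    }

# Stub standing in for a real model call; swap in your endpoint.
stats = benchmark(lambda p: p.upper(), ["hello"] * 20)
print(stats)
```

Quality measurement still needs a task-specific evaluation set; this harness covers only the latency and throughput halves of the trade-off.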
FAQ
What's the cheapest way to start running LLMs locally?
CPU inference on an existing laptop. Phi-3 Mini or Phi-4 Mini (both 3.8B) run at 5-10 tokens/second. Zero hardware cost. Download Ollama and run.
Should we buy a GPU for local LLM inference?
If running >100 requests daily: yes. RTX 4070 cost: $600. Amortized over 3 years of 8-hour days: about $0.07/hour. Faster than API latency. Worth the cost.
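The amortization works out as follows (assuming 8 hours of use per day over 3 years; a different duty cycle changes the figure proportionally):

```python
gpu_cost = 600             # RTX 4070, USD
hours = 3 * 365 * 8        # 3 years of 8-hour days = 8760 hours
cost_per_hour = gpu_cost / hours
print(round(cost_per_hour, 3))  # 0.068 -- about $0.07/hour
```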
Which model for local deployment?
Mistral 7B for balance. Llama 3 8B for quality. Phi-3 Mini (3.8B) for resource efficiency. Depends on hardware and use case.
Can we quantize already quantized models?
No. Quantization is one-way: it discards precision that requantization cannot recover, and stacking quantizations compounds the loss. Always quantize from full-precision weights.
Is local inference production-ready?
Yes. vLLM is proven at scale, but careful monitoring is necessary: out-of-memory errors kill requests, so plan for graceful degradation.
Sources
- Ollama documentation (https://ollama.ai/)
- vLLM documentation (https://vllm.ai/)
- Llama.cpp repository (https://github.com/ggml-org/llama.cpp)
- LM Studio (https://lmstudio.ai/)
- GPTQ quantization paper (https://arxiv.org/abs/2210.17323)
- AWQ quantization paper (https://arxiv.org/abs/2306.00978)
- Mistral 7B model card (https://huggingface.co/mistralai/Mistral-7B-v0.1)
- Phi model documentation (https://huggingface.co/microsoft/phi-2)