Contents
- Overview
- Why Run AI Locally
- Hardware Requirements
- Tool Comparison
- Installation Guide
- Model Selection
- Basic Usage
- Performance Optimization
- Troubleshooting
- FAQ
- Related Resources
- Sources
Overview
Running AI models locally means executing large language models on your own computers with practical performance. Quantized models (4-bit, 8-bit) fit in 8GB-24GB of RAM. Modern tools (Ollama, LM Studio, llama.cpp) abstract away the complexity. Inference is fast enough for interactive use: a 7B model generates 5-15 tokens per second on a laptop.
This guide covers three tools: Ollama (easiest to install), LM Studio (friendliest GUI), llama.cpp (most flexible). Pick one based on your OS and comfort level. No GPU required (though it helps).
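As a rough sizing rule, weight memory is parameter count times bits per weight divided by 8, plus runtime overhead. A minimal sketch of that arithmetic (the 25% overhead factor for KV cache and runtime buffers is an assumption, not a measured constant):

# Back-of-envelope RAM estimate for a quantized model.
# Assumption: ~25% overhead for KV cache and runtime buffers.
def estimate_ram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.25) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ~ 1GB
    return weights_gb * overhead

print(estimate_ram_gb(7, 4))   # ~4.4GB for a 4-bit 7B model
print(estimate_ram_gb(13, 4))  # ~8.1GB for a 4-bit 13B model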
Why Run AI Locally
Privacy
Models run on the machine. Prompts never leave the device. No cloud API, no logs, no third-party access. Critical for sensitive work: legal documents, medical notes, internal code review.
Cost
Zero recurring costs. Download a model once, run it forever. Compare to the OpenAI API at $1.25-$25 per million tokens: at 100M tokens/month, that is $125-$2,500 per month, so heavy local inference can save thousands per year.
Latency
No network round-trip. Latency is bounded by CPU/GPU speed, not the network. Interactive use cases (debugging with AI, real-time code suggestions) benefit from sub-second responses.
Control
Full model control. Modify prompts, combine models, integrate into proprietary systems without API restrictions. Run offline (no internet required).
Downsides
- Limited by local hardware (smaller models, slower inference than cloud clusters)
- No access to frontier models (GPT-4, Claude, Grok)
- Setup complexity (first-time install, dependency management)
- Maintenance burden (updates, troubleshooting)
Hardware Requirements
Minimum (Laptop/Desktop)
For 7B model (4-bit quantized):
- 8GB RAM (6GB for model, 2GB overhead)
- CPU: Intel i7 or equivalent (2020+)
- No GPU required
- SSD: 10GB free space
Performance:
- Token generation: 2-5 tokens/second (CPU only)
- Latency: 200-500ms per token (interactive but slow)
Cost: Free (use existing hardware).
Best for: Experimentation, privacy-sensitive work, offline use.
Recommended (Desktop)
For 13B model (4-bit quantized) or faster 7B:
- 16GB RAM
- GPU: NVIDIA RTX 3060 Ti or better (8GB VRAM)
- CPU: Any modern CPU (2021+)
- SSD: 50GB free space
Performance:
- Token generation: 15-30 tokens/second (GPU-accelerated)
- Latency: 30-60ms per token (smooth, real-time)
Cost: Depends on GPU (used: $200-500, new: $500-1000).
Best for: Regular use, professional applications, multiple concurrent users.
Power User (Workstation)
For 70B model (4-bit) or multiple concurrent models:
- 64GB+ RAM
- GPU: 2x RTX 4090 or NVIDIA H100
- SSD: 500GB+ free space
Performance:
- Token generation: 50-100 tokens/second (for 70B model)
- Latency: 10-20ms per token
Cost: $3000-10000.
Best for: Production inference, large teams, multiple model serving.
Note on GPUs
NVIDIA GPUs are best supported (CUDA toolkit). AMD GPUs are supported but slower (ROCm). Apple Silicon (M1/M2/M3) is well-supported (Metal acceleration). Intel GPUs are emerging (oneAPI).
For maximum compatibility, test on CPU first. GPU acceleration is optional but dramatically improves speed.
Tool Comparison
Ollama
What it is: Lightweight Docker-like wrapper for running LLMs locally.
Strengths:
- Easiest to install and use (single binary)
- Works on macOS, Linux, Windows
- Built-in model manager (download, update, switch)
- REST API included
- Excellent documentation
Weaknesses:
- Less control over quantization/parameters
- Slower than llama.cpp on some hardware
- Limited to pre-built models
Installation (macOS):
# Download the macOS app from ollama.com, then:
ollama pull llama3
ollama run llama3
Typical Speed:
- 7B model on CPU: 5-10 tokens/sec
- 7B model on GPU (RTX 3060): 20-30 tokens/sec
Best For: Beginners, macOS users, fast prototyping.
LM Studio
What it is: GUI application for running LLMs with minimal setup.
Strengths:
- Beautiful user interface
- Works on Windows, macOS, and Linux
- Download and run models with clicks
- No terminal knowledge required
- Built-in chat interface
- Good memory management
Weaknesses:
- Less flexible than llama.cpp
- Smaller model library than Ollama
- Higher memory overhead
Installation (Windows):
- Download LM Studio from lmstudio.ai
- Install
- Open app, select a model, click "Download"
- Click "Start Server"
- Chat in built-in interface
Typical Speed:
- 7B model on CPU: 3-8 tokens/sec
- 7B model on GPU (RTX 3060): 15-25 tokens/sec
Best For: Windows users, non-technical users, GUI preference.
llama.cpp
What it is: Lightweight C++ implementation of LLM inference. Most control, most optimization options.
Strengths:
- Fastest performance on CPU and GPU
- Highly configurable (quantization, batch size, threads)
- Cross-platform (macOS, Linux, Windows)
- Minimal dependencies
- Largest model compatibility
Weaknesses:
- Steepest learning curve (command-line only)
- Requires compilation or binary download
- No GUI
Installation (Linux):
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
wget https://huggingface.co/...gguf
./build/bin/llama-cli -m model.gguf -n 128 -p "Explain quantum computing:"
Typical Speed:
- 7B model on CPU: 8-15 tokens/sec
- 7B model on GPU (RTX 3060): 30-50 tokens/sec
Best For: Performance optimization, integration into pipelines, experienced users.
Quick Recommendation
| Skill Level | OS | Recommendation |
|---|---|---|
| Beginner | macOS | Ollama |
| Beginner | Windows | LM Studio |
| Beginner | Linux | Ollama |
| Intermediate | Any | llama.cpp |
| Advanced | Any | llama.cpp |
Installation Guide
Ollama on macOS
# Download and install the macOS app from ollama.com, then:
ollama pull llama3
ollama run llama3
Verification:
$ ollama run llama3
>>> What is 2+2?
4
>>> /bye
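You can also verify Ollama's bundled REST API from code. A minimal sketch, assuming the default port 11434 (the /api/tags endpoint lists installed models):

import requests

# List the models Ollama has downloaded locally.
models = requests.get("http://localhost:11434/api/tags").json()
print([m["name"] for m in models["models"]])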
LM Studio on Windows
- Go to lmstudio.ai
- Download Windows installer
- Run installer
- Open LM Studio
- Click "Search Models" (left sidebar)
- Search "Mistral 7B"
- Click download on first result
- Wait for download (5-10 minutes)
- Click "Start Server"
- Type prompt in chat window, press Enter
Verification: Chat box displays model response in real-time.
llama.cpp on Linux
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j4
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
./build/bin/llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf \
-n 128 \
-p "Explain machine learning in 3 sentences:" \
-t 4 \
--gpu-layers 50
Verification: Terminal displays generated text.
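llama.cpp also builds a llama-server binary that exposes an HTTP API, including OpenAI-compatible routes under /v1. A minimal sketch of calling it, assuming the default port 8080 and the model downloaded above:

import requests

# Start the server first, in another terminal:
#   ./build/bin/llama-server -m mistral-7b-instruct-v0.1.Q4_K_M.gguf --port 8080
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Explain machine learning in 3 sentences."}],
        "max_tokens": 150,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])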
Model Selection
Model Tiers by Size
| Size | Parameters | VRAM (4-bit) | Speed (CPU) | Quality | Best For |
|---|---|---|---|---|---|
| Tiny | 3B | 2GB | 15 tok/s | Basic | Learning, prototyping |
| Small | 7B | 4GB | 8 tok/s | Good | Most use cases |
| Medium | 13B | 8GB | 4 tok/s | Very Good | Production |
| Large | 34B | 20GB | 1 tok/s | Excellent | Complex reasoning |
| XL | 70B | 40GB | 0.5 tok/s | State-of-art | Research, accuracy |
Recommended Models (as of March 2026)
General Purpose:
- Llama 3 8B: Meta's flagship small model. Fast and accurate; 8K context (the Llama 3.1 revision extends this to 128K).
- Mistral 7B Instruct: Lightweight workhorse. Fast, smart, good defaults.
Coding:
- DeepSeek-Coder 7B: Code generation, completion, explanation.
- CodeLlama 34B: Better at complex functions, debugging (Llama 2-based).
Chat/Dialogue:
- Llama 3 8B Instruct: Natural conversation, follows instructions well.
- Mistral 7B Instruct: Lightweight alternative with solid instruction-following.
Specialized (Domain):
- Meditron 7B: Medical question-answering (fine-tuned for healthcare)
- WizardLM 13B: Better reasoning and problem-solving
Download and Setup
Ollama:
ollama pull mistral:7b-instruct
ollama pull llama3
ollama pull neural-chat
LM Studio:
Click "Download" in UI. Models auto-download to ~/LM Studio/models/.
llama.cpp:
wget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
Model Format Note: Downloads use GGUF format (quantized, optimized for CPU). Not raw weights. GGUF files are 3-5GB for 7B models (4-bit quantization).
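To sanity-check a download, you can inspect the file header: every GGUF file begins with the 4-byte magic GGUF. A minimal sketch (the filename is a placeholder):

# A valid GGUF file starts with the ASCII bytes "GGUF".
with open("model.gguf", "rb") as f:
    magic = f.read(4)
print("valid GGUF header" if magic == b"GGUF" else "not a GGUF file")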
Basic Usage
Ollama Chat
ollama run mistral:7b-instruct
>>> What is the capital of Japan?
Tokyo is the capital of Japan. It's located on the eastern coast of Honshu island and is the largest metropolitan area in the country.
>>> Summarize in one word.
Tokyo.
>>> /bye
Programmatic Use (Ollama API)
import requests

# Ollama's REST API listens on localhost:11434 by default.
url = "http://localhost:11434/api/generate"
payload = {
    "model": "mistral:7b-instruct",
    "prompt": "Explain machine learning in 3 sentences:",
    "stream": False,  # return one complete JSON object instead of a stream
}
response = requests.post(url, json=payload)
result = response.json()
print(result['response'])
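With "stream": True (Ollama's default), the endpoint instead returns one JSON object per line as tokens are generated. A minimal sketch of consuming that stream:

import json
import requests

payload = {
    "model": "mistral:7b-instruct",
    "prompt": "Explain machine learning in 3 sentences:",
    "stream": True,  # newline-delimited JSON chunks
}
with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as r:
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break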
llama.cpp Command-Line
./build/bin/llama-cli -m mistral-7b-instruct-v0.1.Q4_K_M.gguf \
-p "Q: What is photosynthesis? A:" \
-n 150 \
-t 4 \
--top-k 40 \
--top-p 0.9 \
--temp 0.7
Parameters:
- -m: Model file path
- -p: Prompt
- -n: Number of tokens to generate
- -t: Number of threads (match your CPU core count)
- --gpu-layers: Layers to run on GPU (if available)
- --temp: Temperature (higher = more creative)
LM Studio Server Mode
curl -X POST http://localhost:1234/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral-7b-instruct",
"prompt": "Explain quantum computing:",
"max_tokens": 150
}'
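Because the server is OpenAI-compatible, the official openai Python package works too. A minimal sketch, assuming openai >= 1.0 (the api_key value is a placeholder; the local server does not check it):

from openai import OpenAI

# Point the OpenAI client at LM Studio's local server.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

completion = client.chat.completions.create(
    model="mistral-7b-instruct",  # identifier of the loaded model in LM Studio
    messages=[{"role": "user", "content": "Explain quantum computing:"}],
    max_tokens=150,
)
print(completion.choices[0].message.content)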
Performance Optimization
CPU-Only Tips
1. Increase Thread Count
./build/bin/llama-cli -m model.gguf -t 8 # use 8 CPU threads (match physical cores)
2. Reduce Batch Size
./build/bin/llama-cli -m model.gguf -b 32
3. Lower Quantization (Smaller Model)
Use Q4 instead of Q6. 4-bit quantization: ~4GB for a 7B model. 6-bit: ~6GB.
4. Use a Smaller Model
Trade quality for speed. A 7B model is 2-3x faster than 13B and still capable for most tasks.
5. Keep the Model in RAM
./build/bin/llama-cli -m model.gguf --mlock # lock weights in memory to prevent swapping
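To compare these settings empirically, measure tokens per second rather than guessing. One option is a sketch against Ollama's API, whose responses report eval_count (generated tokens) and eval_duration (nanoseconds):

import requests

payload = {
    "model": "mistral:7b-instruct",
    "prompt": "Explain machine learning in 3 sentences:",
    "stream": False,
}
result = requests.post("http://localhost:11434/api/generate", json=payload).json()

# eval_count = tokens generated; eval_duration = generation time in nanoseconds.
tokens_per_sec = result["eval_count"] / (result["eval_duration"] / 1e9)
print(f"{tokens_per_sec:.1f} tokens/sec")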
GPU Acceleration
NVIDIA (CUDA):
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j4
./build/bin/llama-cli -m model.gguf --gpu-layers 50
macOS (Metal):
No flags needed: Metal acceleration is enabled by default on Apple Silicon builds.
AMD (ROCm):
cmake -B build -DGGML_HIP=ON
cmake --build build --config Release -j4
Model Quantization
Use Q4_K_M quantization (4-bit, medium). Best balance of size and quality.
- Q2: Smallest (~3GB for 7B), lowest quality
- Q4: Small (~4GB for 7B), good quality
- Q5: Medium (~5GB), very good
- Q6: Larger (~6GB), high quality
- Q8: Largest (~8GB), near-lossless (unquantized 16-bit weights would be ~14GB)
For most use cases, Q4_K_M is sufficient.
Troubleshooting
Out of Memory (OOM)
Symptom: Model crashes or becomes extremely slow.
Solution:
- Use a smaller model (7B instead of 13B)
- Lower quantization (Q4 instead of Q6)
- Reduce batch size (-b 16)
- Increase system RAM (upgrade hardware)
./build/bin/llama-cli -m tiny_model.gguf -b 8 -t 2 -n 50
Very Slow Inference
Symptom: Tokens generate at <1 token/second.
Likely Cause: Running on CPU, model too large.
Solution:
- Enable GPU acceleration (if available)
- Reduce number of threads (counterintuitively helps on some systems)
- Use smaller model
Model Downloads Fail
Symptom: Download stops mid-way or times out.
Solution:
wget --continue https://huggingface.co/.../model.gguf
Segmentation Fault (llama.cpp)
Symptom: Binary crashes on startup.
Solution:
- Update to the latest llama.cpp (git pull, then rebuild: cmake -B build && cmake --build build --config Release)
- Redownload the model (it may be corrupted)
- Check that the model version matches the binary (some GGUF revisions are incompatible with older builds)
FAQ
Do I need a GPU to run models locally?
No. CPU-only inference works on any modern computer (2020+). GPU is optional, speeds up inference by 5-20x. For 7B models on laptop CPU, expect 3-8 tokens/second. With GPU, expect 20-50 tokens/second.
Which tool is fastest?
llama.cpp is fastest (most optimized). Ollama and LM Studio add abstraction layers. On the same hardware, llama.cpp typically generates ~20% more tokens/sec. For casual use, the difference is imperceptible.
Can I run 70B models locally?
Yes, but requires 40GB+ VRAM or significant CPU RAM. Not practical without a workstation GPU (RTX 4090, H100) or a cluster. For most users, 7B-13B is the practical limit.
How much does inference cost locally?
Electricity only. An hour of continuous inference on a modern laptop costs roughly $0.01; a desktop GPU draws more power, but the cost is still negligible for hobby use. For comparison, the OpenAI API starts around $1.25 per million tokens.
Can I combine local models with cloud APIs?
Yes. Use local model for privacy-sensitive work, fall back to cloud API for heavier loads. Many frameworks support this hybrid approach (e.g., Python client that checks local server, falls back to OpenAI).
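A minimal sketch of that hybrid pattern, assuming a local Ollama server and the openai package (model names are placeholders):

import requests

def generate(prompt: str) -> str:
    # Try the local Ollama server first; fall back to a cloud API on failure.
    try:
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "mistral:7b-instruct", "prompt": prompt, "stream": False},
            timeout=60,
        )
        r.raise_for_status()
        return r.json()["response"]
    except requests.RequestException:
        from openai import OpenAI  # requires OPENAI_API_KEY in the environment
        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder cloud model
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

print(generate("Summarize the benefits of local inference in one sentence."))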
Is the quality as good as ChatGPT?
No. Frontier models like GPT-4 are better at reasoning, math, and complex multi-step tasks. Local models (Llama 3 70B, Mistral 7B) are good at text generation, summarization, classification, and coding, which is sufficient for research assistance, content creation, and customer support. For math tutoring and complex reasoning, cloud frontier models remain better.
How long does a model take to download?
7B model: 2-5 minutes (depends on internet speed). 13B: 5-10 minutes. 70B: 30-60 minutes.
Can I fine-tune a local model?
Yes. Techniques like LoRA allow fine-tuning on a single GPU. This requires programming knowledge, so most beginners skip it.
What about privacy with local models?
Data stays on device. No cloud upload. But be aware: if you connect to cloud services (integrate with cloud APIs, log analytics), some data may leak. For true privacy, run completely offline.