Contents
- GPU Hardware Requirements
- Top Models for Consumer Hardware
- Quantization Techniques
- Installation and Runtime
- Benchmarks and Performance
- FAQ
- Related Resources
- Sources
GPU Hardware Requirements
12GB VRAM: 7B models, 4-bit quantization. 16GB: 7B comfortable. 24GB: 13B models. 48GB+: unquantized 70B or multi-model.
NVIDIA wins on ecosystem. RTX 4060 Ti (16GB): $399-449, best value. RTX 4090 (24GB): more headroom. AMD ROCm works but tooling is rough. Intel Arc A770 (16GB): budget option.
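The VRAM tiers above follow from simple arithmetic: parameter count times bytes per weight, plus headroom for the KV cache and activations. A minimal sketch (the 25% overhead factor is an illustrative assumption, not a measured constant):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 1.25) -> float:
    """Rough VRAM estimate: weights plus ~25% for KV cache and activations."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return round(weight_gb * overhead, 1)

# 7B model, FP16 vs. 4-bit quantization
print(estimate_vram_gb(7, 16))  # 17.5 -> needs a 24GB card unquantized
print(estimate_vram_gb(7, 4))   # 4.4  -> fits comfortably on a 12GB card
```

This matches the tiers above: unquantized 7B overflows 16GB cards, while 4-bit 7B leaves room to spare on 12GB.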
Top Models for Consumer Hardware
Mistral 7B: 14GB unquantized, ~4.4GB quantized 4-bit. 25-35 tok/s. 32K context. Apache 2.0. Beats Llama 2 13B.
Llama 2 7B: 14GB unquantized. 28-38 tok/s. 4K context. Meta license (commercial restricted). Multilingual.
Phi-2 2.7B: 5.5GB unquantized. 85-120 tok/s (fastest). 2K context. MIT. Best reasoning per param.
Openchat 3.5 7B: Mistral fine-tune. 26-36 tok/s. 8K context. Apache 2.0. Better dialogue.
Orca 2 7B: Instruction-following focus. Competes with bigger models.
Performance characteristics:
- Throughput: 24-34 tokens/second on RTX 4060 Ti
- Context window: 4K tokens
- License: Microsoft Research License (research-only)
- Notable: Superior instruction adherence compared to base Llama 2
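The throughput figures above translate directly into response latency. A quick sanity check (pure arithmetic, no model required):

```python
def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to generate n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_second

# A 500-token answer from a 7B model at ~30 tok/s:
print(f"{generation_seconds(500, 30):.1f}s")   # 16.7s
# The same answer from Phi-2 at ~110 tok/s:
print(f"{generation_seconds(500, 110):.1f}s")  # 4.5s
```

This is why Phi-2 is attractive for interactive use despite its smaller size: decode speed dominates perceived responsiveness.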
Quantization Techniques
4-Bit Quantization (BitsAndBytes)
Reduces model size by ~75% with minimal quality degradation. Using the BitsAndBytes library:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# NF4 quantization with FP16 compute: ~4.4GB for a 7B model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
Quality loss remains imperceptible for most tasks. Typical reduction from 14GB to ~4.4GB on 7B models.
8-Bit Quantization
Less aggressive than 4-bit, preserving slightly more quality. Reduces 7B models from 14GB to 7GB. Recommended when 4-bit produces insufficient quality.
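The 8-bit path uses the same BitsAndBytesConfig with a single flag changed. A sketch (the model loading line is shown commented out because it downloads ~14GB of FP16 weights, which are quantized at load time):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit: ~7GB for a 7B model, slightly higher quality than 4-bit
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

# Downloads the full FP16 checkpoint, quantized to 8-bit at load time:
# model = AutoModelForCausalLM.from_pretrained(
#     "mistralai/Mistral-7B-v0.1",
#     quantization_config=bnb_config,
#     device_map="auto",
# )
```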
GGUF Format (llama.cpp)
CPU-optimized format enabling inference without GPU acceleration, useful for CPU-only systems or batch processing. The trade-off is roughly 5-10x lower throughput than GPU execution.
Installation and Runtime
Ollama Setup (Easiest Method)
Ollama handles quantization selection and hardware detection automatically:
curl https://ollama.ai/install.sh | sh
ollama run mistral
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Why is the sky blue?",
"stream": false
}'
Ollama automatically selects quantization level based on available VRAM. Mistral defaults to 4-bit quantization on systems with 12-16GB VRAM.
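The same /api/generate endpoint shown in the curl example can be called from Python with nothing but the standard library (a running `ollama` server is assumed, so the actual request is left commented out):

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(payload: dict, host: str = "http://localhost:11434") -> str:
    """POST the payload and return the model's response text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires `ollama run mistral` to be active:
# print(generate(build_generate_request("mistral", "Why is the sky blue?")))
```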
Manual Setup with Transformers Library
Offers greater control but requires manual dependency management:
pip install transformers torch bitsandbytes accelerate
python -c "
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'mistralai/Mistral-7B-v0.1'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    load_in_4bit=True
)

# Move inputs to the model device before generating
inputs = tokenizer('Explain quantum computing', return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
"
vLLM for Production Inference
vLLM optimizes batched inference for API deployments:
pip install vllm
# --quantization awq requires an AWQ-quantized checkpoint, not the FP16 base model
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Mistral-7B-v0.1-AWQ \
    --quantization awq \
    --tensor-parallel-size 1
Serves OpenAI-compatible API endpoints with automatic batching and memory optimization.
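Because the server speaks the OpenAI wire format, any OpenAI-compatible client works. A stdlib-only sketch against vLLM's default port 8000 (the request itself is commented out since it assumes the server above is running, and the model name must match whatever was passed to --model):

```python
import json
import urllib.request

def completion_payload(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Request body for the OpenAI-compatible /v1/completions endpoint."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(payload: dict, base: str = "http://localhost:8000") -> str:
    """POST a completion request and return the first choice's text."""
    req = urllib.request.Request(
        f"{base}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

# Requires the vLLM server to be running; model name must match --model:
# print(complete(completion_payload("TheBloke/Mistral-7B-v0.1-AWQ", "Hello")))
```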
Benchmarks and Performance
Real-world inference metrics on RTX 4060 Ti (16GB VRAM):
| Model | Precision | VRAM | Throughput | Quality Score |
|---|---|---|---|---|
| Phi-2 2.7B | 4-bit | 2.8GB | 110 tok/s | 72/100 |
| Mistral 7B | 4-bit | 4.4GB | 32 tok/s | 85/100 |
| Llama 2 7B | 4-bit | 3.6GB | 34 tok/s | 84/100 |
| Openchat 3.5 7B | 4-bit | 3.5GB | 31 tok/s | 86/100 |
| Orca 2 7B | 8-bit | 7.1GB | 28 tok/s | 87/100 |
Quality scores reflect performance on MMLU, HellaSwag, and GSM8K benchmarks. Differences between 7B models remain marginal.
Model selection should prioritize task-specific evaluation over aggregate scores; a quantized Mistral may outperform Llama 2 on one use case and trail it on another.
For API-based alternatives requiring no local hardware, see the LLM API pricing guide comparing inference costs across providers.
FAQ
Q: What GPU should I buy for running 7B models? RTX 4060 Ti (16GB) at $399-449 offers the best value for consumer hardware. For experimentation, RTX 4090 (24GB) provides comfortable headroom. Skip the RTX 4060 12GB due to memory constraints.
Q: How much quality is lost with 4-bit quantization? Imperceptible quality loss for most tasks. Benchmarks show 1-3% performance reduction. Only use 8-bit or higher if 4-bit produces unacceptable output for specific use cases.
Q: Can I run multiple models simultaneously?
Yes. Two quantized 7B models consume ~8GB VRAM total on an RTX 4060 Ti (e.g., 4-bit Mistral at 4.4GB plus 4-bit Llama 2 at 3.6GB), leaving room for other workloads. Monitor memory usage with nvidia-smi during execution.
Q: What's faster: local inference or API calls? Local RTX 4060 Ti inference (32 tok/s) often matches cloud API latency while avoiding API costs. Trade-off involves upfront hardware investment.
Q: Does CPU-only inference work? Yes, via llama.cpp with GGUF-format models. Expect roughly 3-6 tok/s for a 4-bit 7B model on modern CPUs, consistent with the 5-10x slowdown versus GPU execution. Suitable only for non-latency-sensitive batch workloads.
Q: How do I serve a local model as an API? Use vLLM or text-generation-webui for OpenAI-compatible API endpoints. Deploy behind nginx for load balancing across multiple GPU instances.
Related Resources
- Complete LLM API Pricing Comparison
- RunPod GPU Pricing for LLM Deployment
- NVIDIA H100 Pricing
- Anthropic API Pricing for Comparison
- GPU Cloud Pricing Guide
Sources
- Mistral AI Official Documentation
- Meta Llama 2 Research Paper
- Microsoft Phi-2 Technical Report
- Hugging Face Model Leaderboards (March 2026)
- vLLM Official Documentation
- BitsAndBytes GitHub Repository