Contents
- GPU Hardware Requirements
- Top Models for Consumer Hardware
- Quantization Techniques
- Installation and Runtime
- Benchmarks and Performance
- FAQ
- Related Resources
- Sources
GPU Hardware Requirements
12GB VRAM: 7B models, 4-bit quantization. 16GB: 7B comfortable. 24GB: 13B models. 48GB+: unquantized 70B or multi-model.
NVIDIA wins on ecosystem. RTX 4060 Ti (16GB): $399-449, best value. RTX 4090 (24GB): more headroom. AMD ROCm works but tooling is rough. Intel Arc A770 (16GB): budget option.
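The VRAM tiers above follow from simple arithmetic: parameter count times bytes per weight, plus headroom for the KV cache and activations. A minimal sketch (the 25% overhead factor is an illustrative assumption, not a measured constant):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 1.25) -> float:
    """Rough VRAM estimate: weights plus ~25% for KV cache and activations."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return round(weight_gb * overhead, 1)

# 7B model, FP16 vs. 4-bit quantization
print(estimate_vram_gb(7, 16))  # 17.5 -> needs a 24GB card unquantized
print(estimate_vram_gb(7, 4))   # 4.4  -> fits comfortably on a 12GB card
```

This matches the tiers above: unquantized 7B overflows 16GB cards, while 4-bit 7B leaves room to spare on 12GB.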
Top Models for Consumer Hardware
Mistral 7B: 14GB unquantized, ~4.4GB quantized 4-bit. 25-35 tok/s. 32K context. Apache 2.0. Beats Llama 2 13B.
Llama 2 7B: 14GB unquantized. 28-38 tok/s. 4K context. Meta license (commercial restricted). Multilingual.
Phi-2 2.7B: 5.5GB unquantized. 85-120 tok/s (fastest). 2K context. MIT. Best reasoning per param.
Openchat 3.5 7B: Mistral fine-tune. 26-36 tok/s. 8K context. Apache 2.0. Better dialogue.
Orca 2 7B: Instruction-following focus. Competes with bigger models.
Performance characteristics:
- Throughput: 24-34 tokens/second on RTX 4060 Ti
- Context window: 4K tokens
- License: Microsoft Research License (research-only)
- Notable: Superior instruction adherence compared to base Llama 2
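The throughput figures above translate directly into response latency. A quick sanity check (pure arithmetic, no model required):

```python
def generation_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to generate n_tokens at a steady decode rate."""
    return n_tokens / tokens_per_second

# A 500-token answer from a 7B model at ~30 tok/s:
print(f"{generation_seconds(500, 30):.1f}s")   # 16.7s
# The same answer from Phi-2 at ~110 tok/s:
print(f"{generation_seconds(500, 110):.1f}s")  # 4.5s
```

This is why Phi-2 is attractive for interactive use despite its smaller size: decode speed dominates perceived responsiveness.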
Quantization Techniques
4-Bit Quantization (BitsAndBytes)
Reduces model size by ~75% with minimal quality degradation. Using the BitsAndBytes library:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# NF4 quantization with FP16 compute: ~4.4GB for a 7B model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
Quality loss remains imperceptible for most tasks. Typical reduction from 14GB to ~4.4GB on 7B models.
8-Bit Quantization
Less aggressive than 4-bit, preserving slightly more quality. Reduces 7B models from 14GB to 7GB. Recommended when 4-bit produces insufficient quality.
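The 8-bit path uses the same BitsAndBytesConfig with a single flag changed. A sketch (the model loading line is shown commented out because it downloads ~14GB of FP16 weights, which are quantized at load time):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit: ~7GB for a 7B model, slightly higher quality than 4-bit
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

# Downloads the full FP16 checkpoint, quantized to 8-bit at load time:
# model = AutoModelForCausalLM.from_pretrained(
#     "mistralai/Mistral-7B-v0.1",
#     quantization_config=bnb_config,
#     device_map="auto",
# )
```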
GGUF Format (llama.cpp)
CPU-optimized format enabling inference without GPU acceleration, useful for CPU-only systems or batch processing. The trade-off is roughly 5-10x lower throughput than GPU execution.
Installation and Runtime
Ollama Setup (Easiest Method)
Ollama handles quantization selection and hardware detection automatically:
curl https://ollama.ai/install.sh | sh
ollama run mistral
curl http://localhost:11434/api/generate -d '{
"model": "mistral",
"prompt": "Why is the sky blue?",
"stream": false
}'
Ollama automatically selects quantization level based on available VRAM. Mistral defaults to 4-bit quantization on systems with 12-16GB VRAM.
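The same /api/generate endpoint shown in the curl example can be called from Python with nothing but the standard library (a running `ollama` server is assumed, so the actual request is left commented out):

```python
import json
import urllib.request

def build_generate_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(payload: dict, host: str = "http://localhost:11434") -> str:
    """POST the payload and return the model's response text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires `ollama run mistral` to be active:
# print(generate(build_generate_request("mistral", "Why is the sky blue?")))
```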
Manual Setup with Transformers Library
Offers greater control but requires manual dependency management:
pip install transformers torch bitsandbytes accelerate
python -c "
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'mistralai/Mistral-7B-v0.1'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    load_in_4bit=True
)

# Move inputs to the model device before generating
inputs = tokenizer('Explain quantum computing', return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
"
vLLM for Production Inference
vLLM optimizes batched inference for API deployments:
pip install vllm
# --quantization awq requires an AWQ-quantized checkpoint, not the FP16 base model
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Mistral-7B-v0.1-AWQ \
    --quantization awq \
    --tensor-parallel-size 1
Serves OpenAI-compatible API endpoints with automatic batching and memory optimization.
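Because the server speaks the OpenAI wire format, any OpenAI-compatible client works. A stdlib-only sketch against vLLM's default port 8000 (the request itself is commented out since it assumes the server above is running, and the model name must match whatever was passed to --model):

```python
import json
import urllib.request

def completion_payload(model: str, prompt: str, max_tokens: int = 100) -> dict:
    """Request body for the OpenAI-compatible /v1/completions endpoint."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(payload: dict, base: str = "http://localhost:8000") -> str:
    """POST a completion request and return the first choice's text."""
    req = urllib.request.Request(
        f"{base}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

# Requires the vLLM server to be running; model name must match --model:
# print(complete(completion_payload("TheBloke/Mistral-7B-v0.1-AWQ", "Hello")))
```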
Benchmarks and Performance
Real-world inference metrics on RTX 4060 Ti (16GB VRAM):
| Model | Precision | VRAM | Throughput | Quality Score |
|---|---|---|---|---|
| Phi-2 2.7B | 4-bit | 2.8GB | 110 tok/s | 72/100 |
| Mistral 7B | 4-bit | 4.4GB | 32 tok/s | 85/100 |
| Llama 2 7B | 4-bit | 3.6GB | 34 tok/s | 84/100 |
| Openchat 3.5 7B | 4-bit | 3.5GB | 31 tok/s | 86/100 |
| Orca 2 7B | 8-bit | 7.1GB | 28 tok/s | 87/100 |
Quality scores reflect performance on MMLU, HellaSwag, and GSM8K benchmarks. Differences between 7B models remain marginal.
Model selection should prioritize task-specific evaluation over aggregate scores; a quantized Mistral may outperform Llama 2 on one use case and trail it on another.
For API-based alternatives requiring no local hardware, see the LLM API pricing guide comparing inference costs across providers.
FAQ
Q: What GPU should I buy for running 7B models? RTX 4060 Ti (16GB) at $399-449 offers the best value for consumer hardware. For experimentation, RTX 4090 (24GB) provides comfortable headroom. Skip the RTX 4060 12GB due to memory constraints.
Q: How much quality is lost with 4-bit quantization? Imperceptible quality loss for most tasks. Benchmarks show 1-3% performance reduction. Only use 8-bit or higher if 4-bit produces unacceptable output for specific use cases.
Q: Can I run multiple models simultaneously?
Yes. Two quantized 7B models consume ~8GB VRAM total on an RTX 4060 Ti (e.g., 4-bit Mistral at 4.4GB plus 4-bit Llama 2 at 3.6GB), leaving room for other workloads. Monitor memory usage with nvidia-smi during execution.
Q: What's faster: local inference or API calls? Local RTX 4060 Ti inference (32 tok/s) often matches cloud API latency while avoiding API costs. Trade-off involves upfront hardware investment.
Q: Does CPU-only inference work? Yes, via llama.cpp with GGUF-format models. Expect roughly 3-6 tok/s for a 4-bit 7B model on modern CPUs, consistent with the 5-10x slowdown versus GPU execution. Suitable only for non-latency-sensitive batch workloads.
Q: How do I serve a local model as an API? Use vLLM or text-generation-webui for OpenAI-compatible API endpoints. Deploy behind nginx for load balancing across multiple GPU instances.
Related Resources
- Complete LLM API Pricing Comparison
- RunPod GPU Pricing for LLM Deployment
- NVIDIA H100 Pricing
- Anthropic API Pricing for Comparison
- GPU Cloud Pricing Guide
Sources
- Mistral AI Official Documentation
- Meta Llama 2 Research Paper
- Microsoft Phi-2 Technical Report
- Hugging Face Model Leaderboards (March 2026)
- vLLM Official Documentation
- BitsAndBytes GitHub Repository