Small Open Source LLMs That Run on Consumer GPUs

Deploybase · April 14, 2025 · LLM Guides

GPU Hardware Requirements

12GB VRAM: 7B models, 4-bit quantization. 16GB: 7B comfortable. 24GB: 13B models. 48GB+: unquantized 70B or multi-model.
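These tiers follow from simple arithmetic: weight memory ≈ parameter count × bytes per parameter, plus headroom for the KV cache and activations. A rough sketch — the 1.2 overhead factor is a ballpark assumption, not a measured constant:

```python
def vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Estimate VRAM in GB: weight bytes plus ~20% for KV cache/activations."""
    return params_billion * (bits / 8) * overhead

print(round(vram_gb(7, 16), 1))  # fp16 7B: 16.8 GB -> needs a 24GB card
print(round(vram_gb(7, 4), 1))   # 4-bit 7B: 4.2 GB -> fits in 12GB
print(round(vram_gb(70, 4), 1))  # 4-bit 70B: 42.0 GB -> needs 48GB
```

Actual usage varies with context length and batch size, but the estimate lands close to the measured figures later in this guide (e.g. ~4.4GB for 4-bit Mistral 7B).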

NVIDIA wins on ecosystem. RTX 4060 Ti (16GB): $399-449, best value. RTX 4090 (24GB): more headroom. AMD ROCm works but tooling is rough. Intel Arc A770 (16GB): budget option.

Top Models for Consumer Hardware

Mistral 7B: 14GB unquantized, ~4.4GB quantized 4-bit. 25-35 tok/s. 32K context. Apache 2.0. Beats Llama 2 13B.

Llama 2 7B: 14GB unquantized. 28-38 tok/s. 4K context. Llama 2 Community License (commercial use permitted below 700M monthly active users). Primarily English training data.

Phi-2 2.7B: 5.5GB unquantized. 85-120 tok/s (fastest). 2K context. MIT. Best reasoning per param.

OpenChat 3.5 7B: Mistral fine-tune. 26-36 tok/s. 8K context. Apache 2.0. Better dialogue quality.

Orca 2 7B: Instruction-following focus. 24-34 tok/s. 4K context. Microsoft Research License (research-only). Superior instruction adherence vs. base Llama 2; competes with bigger models.

Quantization Techniques

4-Bit Quantization (BitsAndBytes)

Reduces model size by roughly 75% with minimal quality degradation, using the BitsAndBytes library through transformers:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit storage with fp16 compute, per the QLoRA recipe
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto"
)

Quality loss remains imperceptible for most tasks. Typical reduction from 14GB to ~4.4GB on 7B models.

8-Bit Quantization

Less aggressive than 4-bit, preserving slightly more quality. Reduces 7B models from 14GB to 7GB. Recommended when 4-bit produces insufficient quality.

GGUF Format (llama.cpp)

llama.cpp's quantized model format. Runs on CPU alone or with partial GPU offload, making it useful for CPU-only systems, machines with limited VRAM, or batch processing. CPU-only throughput is roughly 5-10x lower than GPU execution.
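A typical llama.cpp session, assuming a pre-quantized GGUF checkpoint has already been downloaded — the file name below is illustrative, and in older llama.cpp builds the binary is called `main` rather than `llama-cli`:

```shell
# CPU-only generation with a 4-bit GGUF model (-n caps new tokens)
./llama-cli -m mistral-7b-v0.1.Q4_K_M.gguf -p "Why is the sky blue?" -n 128

# If a GPU is present, offload part of the model with -ngl (n-gpu-layers)
./llama-cli -m mistral-7b-v0.1.Q4_K_M.gguf -p "Why is the sky blue?" -n 128 -ngl 20
```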

Installation and Runtime

Ollama Setup (Easiest Method)

Ollama abstracts quantization and hardware detection:

# Install Ollama (Linux/macOS)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the model and start an interactive chat
ollama run mistral

# Or query the local REST API
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Ollama automatically selects quantization level based on available VRAM. Mistral defaults to 4-bit quantization on systems with 12-16GB VRAM.

Manual Setup with Transformers Library

Greater control but requires dependency management:

pip install transformers torch bitsandbytes accelerate
python -c "
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = 'mistralai/Mistral-7B-v0.1'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    quantization_config=BitsAndBytesConfig(load_in_4bit=True)
)

inputs = tokenizer('Explain quantum computing', return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
"

vLLM for Production Inference

vLLM optimizes batched inference for API deployments:

pip install vllm
# AWQ requires pre-quantized weights, not the fp16 checkpoint
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-v0.1-AWQ \
  --quantization awq \
  --tensor-parallel-size 1

Serves OpenAI-compatible API endpoints with automatic batching and memory optimization.
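Once the server is up (default port 8000), it speaks the OpenAI completions protocol; the `model` field must match whatever was passed to `--model`:

```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B",
    "prompt": "Why is the sky blue?",
    "max_tokens": 64
  }'
```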

Benchmarks and Performance

Real-world inference metrics on RTX 4060 Ti (16GB VRAM):

| Model           | Precision | VRAM  | Throughput | Quality Score |
|-----------------|-----------|-------|------------|---------------|
| Phi-2 2.7B      | 4-bit     | 2.8GB | 110 tok/s  | 72/100        |
| Mistral 7B      | 4-bit     | 4.4GB | 32 tok/s   | 85/100        |
| Llama 2 7B      | 4-bit     | 3.6GB | 34 tok/s   | 84/100        |
| OpenChat 3.5 7B | 4-bit     | 3.5GB | 31 tok/s   | 86/100        |
| Orca 2 7B       | 8-bit     | 7.1GB | 28 tok/s   | 87/100        |

Quality scores reflect performance on MMLU, HellaSwag, and GSM8K benchmarks. Differences between 7B models remain marginal.

Model selection should prioritize task-specific evaluation over aggregate scores. A quantized Mistral may outperform Llama 2 on your particular workload, or vice versa.
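"Task-specific evaluation" can be as simple as a handful of your own prompts with expected answers. A minimal sketch — `generate` stands in for whatever backend you use (an Ollama HTTP call, a transformers pipeline, etc.) and is a hypothetical callable, not a library API:

```python
def exact_match_rate(generate, eval_set):
    """Fraction of prompts whose generated text contains the expected answer.

    generate: any prompt -> text callable.
    eval_set: list of (prompt, expected_substring) pairs.
    """
    hits = sum(
        expected.lower() in generate(prompt).lower()
        for prompt, expected in eval_set
    )
    return hits / len(eval_set)

# Toy backend standing in for a real model, just to show the call shape
fake_model = lambda prompt: "The capital of France is Paris."
score = exact_match_rate(fake_model, [("Capital of France?", "Paris"),
                                      ("Capital of Spain?", "Madrid")])
print(score)  # 0.5
```

Swap in each candidate model as the `generate` callable and compare scores on the prompts that actually matter to you.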

For API-based alternatives requiring no local hardware, see the LLM API pricing guide comparing inference costs across providers.

FAQ

Q: What GPU should I buy for running 7B models? RTX 4060 Ti (16GB) at $399 offers best value for consumer hardware. For experimentation, RTX 4090 (24GB) provides comfortable headroom. Skip RTX 4060 12GB due to memory constraints.

Q: How much quality is lost with 4-bit quantization? Imperceptible quality loss for most tasks. Benchmarks show 1-3% performance reduction. Only use 8-bit or higher if 4-bit produces unacceptable output for specific use cases.

Q: Can I run multiple models simultaneously? Yes. Two 4-bit 7B models consume roughly 8GB VRAM total (e.g. 4.4GB Mistral plus 3.6GB Llama 2), leaving headroom on a 16GB RTX 4060 Ti. Monitor memory usage with nvidia-smi during execution.

Q: What's faster: local inference or API calls? Local RTX 4060 Ti inference (32 tok/s) often matches cloud API latency while avoiding API costs. Trade-off involves upfront hardware investment.

Q: Does CPU-only inference work? Yes, via llama.cpp with GGUF-format models. Expect roughly 3-6 tok/s for a 4-bit 7B model on a modern CPU, about 5-10x slower than GPU inference. Suitable only for non-latency-sensitive batch workloads.

Q: How do I serve a local model as an API? Use vLLM or text-generation-webui for OpenAI-compatible API endpoints. Deploy behind nginx for load balancing across multiple GPU instances.

Sources

  • Mistral AI Official Documentation
  • Meta Llama 2 Research Paper
  • Microsoft Phi-2 Technical Report
  • Hugging Face Model Leaderboards (March 2025)
  • vLLM Official Documentation
  • BitsAndBytes GitHub Repository