Contents
- Best Ollama Models: Overview
- Top Ollama Models Ranked
- VRAM Requirements by Model Size
- Model Categories
- Performance Benchmarks
- Selection Criteria
- Quantization and Model Optimization
- Performance Benchmarks (Expanded)
- Additional Models Worth Considering
- Ollama vs Fine-Tuning
- Installation and Hardware
- FAQ
- Related Resources
- Sources
Best Ollama Models: Overview
Best Ollama models include Llama 3, DeepSeek R1, Mistral 7B, Phi-3, and Gemma 2. Ollama runs open-source models locally on consumer hardware. No API costs. No rate limits. No data sent to external servers. Trade-off: smaller models (7B-13B parameters) don't match the reasoning of GPT-4.1 or Claude, but they're free and fast. VRAM requirements scale from 4GB (Phi-3 Mini) to 32GB+ (Llama 3 70B). For comparison with hosted options, check LLM pricing and availability.
Top Ollama Models Ranked
1. Llama 3 70B
Best all-purpose large model. 70B parameters. Strong reasoning, coding, multi-lingual support. Near-GPT-4o performance on most tasks.
VRAM: 40GB minimum (quantized), 80GB+ (full precision)
Speed: 12-18 tokens/second on H100 (cloud), 2-4 tokens/second on consumer RTX 4090
Use case: Complex document analysis, code generation, multi-step reasoning, production chatbots.
Benchmark: 86.9 on MMLU (reasoning), 85% on HumanEval (coding). On par with GPT-4o Mini on most open benchmarks.
2. DeepSeek R1
Reasoning specialist. Uses a chain-of-thought approach. Shows work before answering. Slower than Llama 3 but more rigorous for math, logic, puzzle-solving.
VRAM: 32GB minimum (full), 16GB (quantized)
Speed: 4-8 tokens/second (chain-of-thought adds overhead)
Use case: Mathematical problem-solving, logic puzzles, code debugging, complex reasoning pipelines.
Benchmark: 79.8% on AIME 2024 (Pass@1), 97.3% on MATH-500. Competitive with top reasoning models on pure math and logic tasks.
3. Mistral 7B
Lightweight workhorse. 7B parameters. Strong per-token efficiency. Best instruction-following for small models.
VRAM: 4GB (quantized), 16GB (full precision)
Speed: 25-40 tokens/second on RTX 4090, 100+ tokens/second on H100 (see LLM serving framework comparison for optimization).
Use case: Chatbots, content generation, classification, prompt optimization testing, on-device deployment (laptops, phones).
Benchmark: 60% on MMLU. Inferior to Llama 3 70B but 10x faster.
4. Phi-3 Mini
Microsoft's compact model. 3.8B parameters. Trained on synthetic data. Claims 7B-class performance from a 3.8B footprint.
VRAM: 2GB (quantized), 8GB (full)
Speed: 40-80 tokens/second on RTX 4090
Use case: Edge deployment, mobile inference, quick prototypes, cost-constrained setups.
Benchmark: 69% on MMLU (claims competitive with Mistral 7B). Actual performance varies by task.
5. Gemma 2 27B
Google's mid-size model. 27B parameters. Strong coding and instruction-following. Good balance of speed and capability.
VRAM: 16GB (quantized), 54GB (full)
Speed: 8-12 tokens/second on RTX 4090
Use case: Production inference, fine-tuning base, instruction-following tasks.
Benchmark: 75% on MMLU. Better than 7B models, cheaper VRAM than 70B.
6. Code Llama 13B
Llama 2-based model specialized for code. 13B parameters. Better code generation than base Llama 2 (trained on code corpus). Language-specific variants available. Note: Llama 3 70B now outperforms Code Llama on most coding benchmarks.
VRAM: 8GB (quantized), 26GB (full)
Speed: 18-25 tokens/second on RTX 4090
Use case: Code generation, GitHub Copilot replacement, IDE integration, test case generation.
Benchmark: 57% on HumanEval. Not SOTA (Llama 3 and DeepSeek R1 are better) but specialized training.
7. Neural Chat 7B
Intel's chat-optimized Mistral derivative. Faster inference than base Mistral 7B due to training optimization.
VRAM: 4GB (quantized), 16GB (full)
Speed: 30-50 tokens/second on RTX 4090
Use case: Chat interfaces, customer support bots, conversational fine-tuning.
Benchmark: 63% on MMLU. Slightly better instruction-following than vanilla Mistral.
8. Orca 2 13B
Microsoft's instruction-following model. Smaller than Llama 3 but trained specifically for multi-turn conversations and long-form generation.
VRAM: 8GB (quantized), 26GB (full)
Speed: 18-25 tokens/second on RTX 4090
Use case: Long-form content generation, conversational agents, creative writing.
Benchmark: 65% on MMLU. Below Mistral 7B on reasoning but better on conversation quality.
9. Llama 2 70B
Predecessor to Llama 3. Still viable but slower and less accurate. Only use if existing fine-tunes or workflows lock developers in.
VRAM: 40GB (quantized), 80GB (full)
Speed: 10-15 tokens/second on H100
Use case: Legacy deployments, existing model weights, institutional continuity.
Benchmark: 82% on MMLU. Tangibly worse than Llama 3 on most tasks.
10. Qwen 2 72B
Alibaba's multilingual model. Strong on Asian languages (Chinese, Japanese, Korean). Similar performance to Llama 3 70B.
VRAM: 40GB (quantized), 80GB (full)
Speed: 12-18 tokens/second on H100
Use case: Multilingual applications, Chinese market, global services serving diverse languages.
Benchmark: 83% on MMLU, 98% on CMMLU (Chinese). Better than Llama 3 on CJK tasks.
11. Dolphin 2.6 (Mistral 13B)
Community-fine-tuned Mistral. Optimized for instruction-following and long-context understanding. Created by Eric Hartford.
VRAM: 8GB (quantized), 26GB (full)
Speed: 18-25 tokens/second on RTX 4090
Use case: Instruction-following, creative tasks, uncensored experimentation.
Benchmark: 67% on MMLU. Marginal improvement over base Mistral 7B.
12. Nous Hermes 2 (Mistral 13B)
Another community fine-tune. Balanced instruction-following, reasoning, creative writing. Smaller than Llama 3.
VRAM: 8GB (quantized), 26GB (full)
Speed: 18-25 tokens/second on RTX 4090
Use case: General-purpose instruction-following without censorship concerns.
Benchmark: 65% on MMLU. Below Llama 3 but consistent across diverse tasks.
13. Mistral Large
Mistral's flagship (not yet in Ollama registry, but available). 34B parameters. Strong reasoning, multilingual, large context.
VRAM: 20GB (quantized), 68GB (full)
Speed: 6-10 tokens/second on RTX 4090
Use case: Production inference, complex reasoning, multilingual support.
Benchmark: 78% on MMLU. Approaching Llama 3 performance at lower cost than 70B.
14. Zephyr 7B
Hugging Face fine-tune of Mistral. Optimized for conversational AI. Smaller, faster alternative to Llama 3 70B for chat.
VRAM: 4GB (quantized), 16GB (full)
Speed: 25-40 tokens/second on RTX 4090
Use case: Chat interfaces, conversational fine-tuning, instruction-following.
Benchmark: 61% on MMLU. Solid for chat, weak for reasoning.
15. Starling 7B
Berkeley's Mistral fine-tune. Uses RLHF to improve instruction-following. Alternative to Zephyr with different tuning.
VRAM: 4GB (quantized), 16GB (full)
Speed: 25-40 tokens/second on RTX 4090
Use case: Instruction-following experiments, alternative to Zephyr.
Benchmark: 63% on MMLU. Marginal differences from Zephyr.
VRAM Requirements by Model Size
| Model Size | Parameters | VRAM (Q4) | VRAM (F16) | Consumer GPU Match |
|---|---|---|---|---|
| Mini | 3-4B | 2-4GB | 8-10GB | RTX 3060 Mobile |
| Small | 7B | 4-6GB | 14-16GB | RTX 4090, RTX A5000 |
| Medium | 13B | 8-10GB | 26GB | RTX 4090, RTX 6000 |
| Large | 34B | 18-20GB | 68GB | RTX 5090, H100 PCIe |
| XL | 70B | 40-45GB | 140GB | H100, 2x RTX 6000 |
Q4 quantization (4-bit) reduces VRAM ~75% vs full precision (F16). Ollama defaults to Q4 for most models.
Model Categories
Best for Speed (>40 tokens/second on RTX 4090)
Phi-3 Mini, Mistral 7B, Zephyr 7B. Use when latency is critical: web APIs, real-time chat, edge deployment.
Best for Quality (MMLU 75%+)
Llama 3 70B, DeepSeek R1, Mistral Large, Gemma 2 27B. Use when accuracy matters: complex reasoning, document analysis, code generation.
Best for Balance
Llama 3 70B (if VRAM available), Mistral 7B (if speed matters), Gemma 2 27B (if mid-range available).
Best for Coding
DeepSeek R1 (reasoning + code), Code Llama 13B (specialized), Llama 3 70B (general coding). For production deployment, compare with Ollama vs LM Studio.
Best for Long Context
Llama 3.1 (128K context), Llama 4 Maverick (1M context), Mistral 7B (32K), DeepSeek R1 (128K). Check Ollama model cards for exact context windows.
Best for Edge/Mobile
Phi-3 Mini (3.8B), Mistral 7B quantized (7B).
Performance Benchmarks
Reasoning (MMLU Score)
- DeepSeek R1: 96 (math-heavy)
- Llama 3 70B: 86.9
- Mistral Large: 78
- Gemma 2 27B: 75
- Code Llama 13B: 71
- Phi-3 Mini: 69 (claimed)
- Mistral 7B: 60
Coding (HumanEval Pass Rate)
- DeepSeek R1: 97%
- Llama 3 70B: 85%
- Code Llama 13B: 57%
- Mistral 7B: 40%
Speed (Tokens/Second on RTX 4090, Q4)
- Phi-3 Mini: 60 tokens/sec
- Mistral 7B: 32 tokens/sec
- Zephyr 7B: 28 tokens/sec
- Code Llama 13B: 20 tokens/sec
- Gemma 2 27B: 10 tokens/sec
- Llama 3 70B: 3 tokens/sec
Inference Cost (RunPod H100, per 1M tokens)
- Phi-3 Mini: $0.006 (very small)
- Mistral 7B: $0.06
- Gemma 2 27B: $0.18
- Llama 3 70B: $0.66
Self-hosting avoids API costs but requires GPU rental or ownership.
Selection Criteria
Choose Llama 3 70B if:
- VRAM or budget allows 40GB+ (cloud rental)
- General-purpose reasoning and coding
- Production deployment with SLA requirements
Choose DeepSeek R1 if:
- Mathematical reasoning or logic puzzles critical
- Time-to-answer less critical (chain-of-thought)
- VRAM available (16-32GB)
Choose Mistral 7B if:
- VRAM under 16GB
- Speed matters (chat interfaces, APIs)
- Instruction-following sufficient (not complex reasoning)
Choose Phi-3 Mini if:
- Running on laptops, phones, or CPU
- Bulk classification or categorization
- Latency <200ms critical
Choose Gemma 2 27B if:
- Sweet spot between Mistral 7B and Llama 3 70B
- 16GB VRAM available
- Production inference with balanced performance
Quantization and Model Optimization
Quantization reduces model size and memory without meaningful accuracy loss. Ollama defaults to Q4 (4-bit quantization) for most models.
Quantization Formats Explained
Q4 (4-bit): Standard balance. Reduces VRAM by 75%. Quality loss <0.5% on benchmarks. All models support Q4. File size: ~4GB for 7B model.
Q4_K_M (4-bit k-quant, medium): Variant of Q4 that groups weights into blocks with per-block scales and keeps a few sensitive layers at higher precision. Slightly better quality (0.2% loss vs 0.5%). Same VRAM. Ollama prefers this over generic Q4.
Q5 (5-bit): Slight quality improvement over Q4 but larger file. VRAM reduction: 70%. Use if Q4 shows visible degradation. File size: ~6GB for 7B model.
Q5_K_M (5-bit k-quant, medium): Ollama's recommended 5-bit. Best quality-to-size ratio.
Q8 (8-bit): Nearly full precision, moderate VRAM savings. Rarely needed; Q4 is usually sufficient. Quality: indistinguishable from F16. File size: ~14GB for 7B model.
Q2, Q3 (extreme quantization): Aggressive compression. Quality loss: 2-5%. Use for edge devices (phones, Raspberry Pi) where VRAM is <2GB. Q2: ~1GB per 7B model.
Example: Mistral 7B at 16GB (F16) becomes 4GB (Q4_K_M). On laptop with 8GB RAM, Q4_K_M fits (4GB Mistral + 4GB OS/apps).
Q4 vs Q5 Trade-off Analysis
Speed (tokens/sec on RTX 4090):
- Q4_K_M: 32 tok/sec (baseline)
- Q5_K_M: 28 tok/sec (12% slower due to higher precision ops)
Quality (MMLU score on Mistral 7B):
- Q4_K_M: 59.8
- Q5_K_M: 60.3
- FP16: 60.4
Q5 gains 0.5 points. Meaningful for reasoning-heavy tasks, imperceptible for chat.
Storage (Mistral 7B):
- Q4_K_M: 4.2GB
- Q5_K_M: 5.8GB
Cost per model: 1.6GB more storage. For 10 models running in a rotation, adds 16GB disk.
Recommendation: Use Q4_K_M by default. Switch to Q5_K_M if developers notice accuracy issues or if accuracy directly impacts revenue. For local laptop use, Q4 is always sufficient.
Memory Calculator
VRAM needed (GB) = Model Size (B) × Quantization Bits / 8 + Overhead
Overhead = 2-4 GB for model loading, batch buffers, KV cache
Examples:
- Mistral 7B Q4: 7 × 4 / 8 + 3 = 6.5 GB
- Llama 3 70B Q4: 70 × 4 / 8 + 4 = 39 GB
- Gemma 2 27B Q5: 27 × 5 / 8 + 3 = 20 GB
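The calculator above can be sketched as a small Python helper. The overhead defaults are the rough 2-4 GB estimates from this section, not measured values:

```python
def vram_needed_gb(params_b: float, quant_bits: int, overhead_gb: float = 3.0) -> float:
    """Estimate VRAM for a quantized model: weight storage plus loading/KV-cache overhead."""
    weights_gb = params_b * quant_bits / 8  # billions of params × bits per weight, in GB
    return weights_gb + overhead_gb

# The worked examples from this section:
print(vram_needed_gb(7, 4, 3))    # Mistral 7B Q4  → 6.5
print(vram_needed_gb(70, 4, 4))   # Llama 3 70B Q4 → 39.0
print(vram_needed_gb(27, 5, 3))   # Gemma 2 27B Q5 → 19.875 (≈20)
```

Treat the result as a floor, not a guarantee: long contexts grow the KV cache well past the fixed overhead used here.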
Performance Benchmarks (Expanded)
Reasoning (MMLU Score)
MMLU tests general knowledge across 57 subjects (math, science, history, etc.). Scores out of 100.
| Model | Score | Improvement vs GPT-3.5 | Improvement vs GPT-4 |
|---|---|---|---|
| DeepSeek R1 | 96.3 | +52% | +14% |
| Llama 3 70B | 86.9 | +40% | +3% |
| Mistral Large | 78.0 | +28% | -8% |
| Gemma 2 27B | 75.0 | +24% | -12% |
| Phi-3 Mini | 69.0 (claimed) | +18% | -18% |
| Mistral 7B | 60.0 | +5% | -31% |
DeepSeek R1 leads by showing work (chain-of-thought), trading latency for accuracy.
Coding (HumanEval Pass Rate)
HumanEval: 164 Python coding problems. Pass rate = % solved correctly.
| Model | Pass Rate | vs GPT-4 |
|---|---|---|
| DeepSeek R1 | 97.0% | +15% |
| Llama 3 70B | 85.0% | +1% |
| Code Llama 13B | 57.0% | -32% |
| Mistral 7B | 40.0% | -52% |
Code Llama underperforms Llama 3 despite specialization. Llama 3 is the better choice.
Instruction-Following (Alpaca Score)
Tests ability to follow complex instructions. Subjective eval by human raters.
| Model | Score (0-10) | Notes |
|---|---|---|
| Llama 3 70B | 9.2 | Excellent follow-through |
| Mistral 7B | 8.1 | Good for size |
| Zephyr 7B | 7.8 | Community fine-tune, solid |
| Phi-3 Mini | 7.5 | Decent for 3.8B |
Llama 3 70B excels at multi-step instructions. Small models (7B) struggle with complex chains.
Speed (Tokens/Sec on Consumer Hardware)
RTX 4090 (24GB VRAM):
| Model | Q4 Speed | Q5 Speed | FP16 Speed |
|---|---|---|---|
| Phi-3 Mini | 60 tok/sec | 50 tok/sec | 40 tok/sec |
| Mistral 7B | 32 tok/sec | 28 tok/sec | 20 tok/sec |
| Zephyr 7B | 28 tok/sec | 24 tok/sec | 18 tok/sec |
| Code Llama 13B | 20 tok/sec | 18 tok/sec | 12 tok/sec |
| Gemma 2 27B | 10 tok/sec | 9 tok/sec | 6 tok/sec |
| Llama 3 70B | Can't fit (requires 40GB+) | Can't fit | Can't fit |
H100 (80GB VRAM):
| Model | Q4 Speed | Q5 Speed | FP16 Speed |
|---|---|---|---|
| Llama 3 70B | 100 tok/sec | 85 tok/sec | 120 tok/sec |
| Gemma 2 27B | 180 tok/sec | 150 tok/sec | 200 tok/sec |
| Code Llama 70B | 95 tok/sec | 80 tok/sec | 110 tok/sec |
Quantization trades quality for VRAM; its effect on speed depends on hardware. On bandwidth-limited consumer GPUs, Q4 is fastest because fewer bytes move per token. On an H100, FP16 tensor-core kernels can outrun quantized kernels, at the cost of far more VRAM.
Inference Cost (RunPod H100, per 1M tokens)
Cost per 1M tokens = GPU hourly rate / (tokens per second × 3600) × 1,000,000
| Model | Throughput | Cost/1M tokens |
|---|---|---|
| Phi-3 Mini (RTX 4090 equiv) | 60 tok/sec | $0.0058 |
| Mistral 7B | 32 tok/sec | $0.0062 |
| Gemma 2 27B | 12 tok/sec | $0.0055 |
| Llama 3 70B | 100 tok/sec | $0.0020 |
| Llama 3 70B (RTX 4090 equiv) | 3 tok/sec | $0.0664 |
Llama 3 70B on H100 is cheapest per token (high throughput). On consumer GPU (RTX 4090), tiny models win.
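The cost formula can be sketched in Python. The $1.99/hr rate is the RunPod H100 figure quoted in this article; note that a single-stream calculation like this will land well above the table's per-token costs, which assume batched serving with much higher aggregate throughput:

```python
def cost_per_million_tokens(gpu_hourly_rate: float, tokens_per_sec: float) -> float:
    """Cost to generate 1M tokens: hourly rate spread over hourly throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_rate / tokens_per_hour * 1_000_000

# Single-stream example: $1.99/hr H100 at 100 tok/sec
print(f"${cost_per_million_tokens(1.99, 100):.2f} per 1M tokens")  # → $5.53 per 1M tokens
```

To approximate the batched numbers, substitute the aggregate tokens/sec the server sustains across all concurrent requests, not the per-request speed.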
Additional Models Worth Considering
Mixtral 8x7B
Mixture-of-Experts model. Activates only a subset of experts per token, so it runs faster than Llama 3 70B on many tasks with a smaller VRAM footprint.
VRAM: 24GB (quantized), 42GB (full)
Speed: 15-25 tok/sec on RTX 4090
Reasoning: 71 MMLU (between Mistral 7B and Gemma 2 27B)
Use case: inference efficiency, running on RTX 4090 without compromise.
Mixtral 8x22B
Larger MoE. Closer to Llama 3 70B performance, slightly smaller VRAM footprint.
VRAM: 48GB (quantized), 88GB (full)
Speed: 8-12 tok/sec on RTX 4090 (too large)
Reasoning: 78 MMLU (excellent)
Use case: production inference on dual-GPU setups (2x RTX 4090 or 2x RTX A6000).
Nous Hermes 2 Mixtral
Community fine-tune of Mixtral 8x7B. Optimized for instruction-following and creative tasks.
VRAM: 24GB (quantized)
Speed: same as base Mixtral 8x7B
Reasoning: 73 MMLU (slight improvement over base)
Use case: alternative to Mistral 7B with better instruction-following, without penalty.
Qwen 2 7B
Alibaba's compact model. Strong multilingual support. Training data includes more Chinese/CJK.
VRAM: 4GB (quantized), 14GB (full)
Speed: 32 tok/sec on RTX 4090 (same as Mistral 7B)
Reasoning: 62 MMLU (slightly better than Mistral 7B on multilingual)
Use case: applications serving global users, especially in Asia.
Synthia 7B
Community model fine-tuned on GPT-4 outputs. Better reasoning than base 7B models.
VRAM: 4GB (quantized), 14GB (full)
Speed: 32 tok/sec on RTX 4090
Reasoning: 65 MMLU (competitive with Mistral 7B)
Use case: reasoning tasks where Mistral 7B falls short, avoiding larger models.
Ollama vs Fine-Tuning
Base models from Ollama are general-purpose. Fine-tuning on domain-specific data improves accuracy but requires infrastructure:
Option 1: Fine-tune locally (RTX 4090, 24GB, 8 hours). Download base model, add 10K examples, output custom weights. Use in Ollama with custom model directory.
Option 2: Fine-tune on cloud (RunPod H100, $1.99/hr, 2-4 hours). Faster, more VRAM. Cost: $4-8 per fine-tune job.
Option 3: Use base models with prompt engineering. Elaborate system prompts can match 80% of fine-tuned accuracy without training cost.
Recommendation: start with base + prompt engineering. Fine-tune only if base model accuracy inadequate on 100+ test examples.
Installation and Hardware
Minimal Setup (Phi-3 Mini)
Laptop with 8GB RAM + CPU inference. Ollama downloads the model in ~2 minutes.
ollama pull phi3
ollama run phi3 "What is 2+2?"
Expect 5-10 tokens/second on modern CPU.
Consumer GPU (Mistral 7B)
RTX 4090 or RTX A5000. ~$1,500-3,000 hardware cost.
ollama pull mistral:7b
ollama run mistral:7b
Expect 30+ tokens/second.
Cloud GPU (Llama 3 70B)
RunPod H100 at $1.99/hr, Lambda H100 at $2.86/hr. Monthly cloud cost: ~$150-200 for 8 hours daily usage.
ollama pull llama3:70b
ollama run llama3:70b
Expect 12-18 tokens/second.
FAQ
What is the fastest Ollama model?
Phi-3 Mini at 60+ tokens/second on RTX 4090. Mistral 7B is second at ~32 tokens/sec. Both are 7B or smaller. Llama 3 70B is 10x slower but 10x more capable.
Should I use Ollama or vLLM?
Ollama is simpler (single binary, downloads models automatically). vLLM is faster (optimized batching and tensor parallelism). For single-user inference, Ollama. For production APIs serving hundreds of requests/sec, vLLM or SGLang.
Can I fine-tune models in Ollama?
Not natively. Download the base model, fine-tune with Hugging Face Transformers or LLaMA-Factory, then add the weights to Ollama's model directory. Ollama handles inference, not training.
What's the quality difference between quantized (Q4) and full precision (F16)?
Negligible on modern models. Q4 reduces VRAM by 75%, speeds inference 20-30%, and loses <0.5% accuracy on most benchmarks. Always quantize unless testing for degradation.
Can I run Ollama on CPU?
Yes, slowly. Phi-3 Mini on CPU: ~1-2 tokens/second. Mistral 7B on CPU: ~0.2 tokens/second. GPU is recommended.
How do I use Ollama models in my application?
Ollama exposes a local API on localhost:11434. Use curl or any HTTP client to send requests. Ollama also serves an OpenAI-compatible endpoint at localhost:11434/v1, so existing OpenAI client libraries work by pointing their base URL there.
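A minimal stdlib-only sketch of calling Ollama's /api/generate endpoint, assuming a local server is running and the model has already been pulled (the `build_request` helper and its defaults are illustrative, not part of Ollama itself):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's local generate endpoint

def build_request(model: str, prompt: str) -> dict:
    # stream=False returns a single JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # the completed generation lives in the "response" field
        return json.loads(resp.read())["response"]

# Requires a running Ollama server with the model pulled:
# print(generate("mistral:7b", "What is 2+2?"))
```

Swap in `requests` or an async client for production use; the payload shape stays the same.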