Contents
- Best Ollama Models: Overview
- Top Ollama Models Ranked
- VRAM Requirements by Model Size
- Model Categories
- Performance Benchmarks
- Selection Criteria
- Quantization and Model Optimization
- Performance Benchmarks (Expanded)
- Additional Models Worth Considering
- Ollama vs Fine-Tuning
- Installation and Hardware
- FAQ
- Related Resources
- Sources
Best Ollama Models: Overview
Best Ollama models include Llama 3, DeepSeek R1, Mistral 7B, Phi-3, and Gemma 2. Ollama runs open-source models locally on consumer hardware. No API costs. No rate limits. No data sent to external servers. Trade-off: smaller models (7B-13B parameters) don't match the reasoning of GPT-4.1 or Claude, but they're free and fast. VRAM requirements scale from 4GB (Phi-3 Mini) to 32GB+ (Llama 3 70B). For comparison with hosted options, check LLM pricing and availability.
Top Ollama Models Ranked
1. Llama 3 70B
Best all-purpose large model. 70B parameters. Strong reasoning, coding, multi-lingual support. Near-GPT-4o performance on most tasks.
VRAM: 40GB minimum (quantized), 80GB+ (full precision)
Speed: 12-18 tokens/second on H100 (cloud), 2-4 tokens/second on consumer RTX 4090
Use case: Complex document analysis, code generation, multi-step reasoning, production chatbots.
Benchmark: 86.9 on MMLU (reasoning), 85% on HumanEval (coding). On par with GPT-4o Mini on most open benchmarks.
2. DeepSeek R1
Reasoning specialist. Uses a chain-of-thought approach. Shows work before answering. Slower than Llama 3 but more rigorous for math, logic, puzzle-solving.
VRAM: 32GB minimum (full), 16GB (quantized)
Speed: 4-8 tokens/second (chain-of-thought adds overhead)
Use case: Mathematical problem-solving, logic puzzles, code debugging, complex reasoning pipelines.
Benchmark: 79.8% on AIME 2024 (Pass@1), 97.3% on MATH-500. Competitive with top reasoning models on pure math and logic tasks.
3. Mistral 7B
Lightweight workhorse. 7B parameters. Strong per-token efficiency. Best instruction-following for small models.
VRAM: 4GB (quantized), 16GB (full precision)
Speed: 25-40 tokens/second on RTX 4090, 100+ tokens/second on H100 (see LLM serving framework comparison for optimization).
Use case: Chatbots, content generation, classification, prompt optimization testing, on-device deployment (laptops, phones).
Benchmark: 60% on MMLU. Inferior to Llama 3 70B but 10x faster.
4. Phi-3 Mini
Microsoft's compact model. 3.8B parameters. Trained on synthetic data. Claims 7B-class performance from a 3.8B footprint.
VRAM: 2GB (quantized), 8GB (full)
Speed: 40-80 tokens/second on RTX 4090
Use case: Edge deployment, mobile inference, quick prototypes, cost-constrained setups.
Benchmark: 69% on MMLU (claims competitive with Mistral 7B). Actual performance varies by task.
5. Gemma 2 27B
Google's mid-size model. 27B parameters. Strong coding and instruction-following. Good balance of speed and capability.
VRAM: 16GB (quantized), 54GB (full)
Speed: 8-12 tokens/second on RTX 4090
Use case: Production inference, fine-tuning base, instruction-following tasks.
Benchmark: 75% on MMLU. Better than 7B models, cheaper VRAM than 70B.
6. Code Llama 13B
Llama 2-based model specialized for code. 13B parameters. Better code generation than base Llama 2 (trained on code corpus). Language-specific variants available. Note: Llama 3 70B now outperforms Code Llama on most coding benchmarks.
VRAM: 8GB (quantized), 26GB (full)
Speed: 18-25 tokens/second on RTX 4090
Use case: Code generation, GitHub Copilot replacement, IDE integration, test case generation.
Benchmark: 57% on HumanEval. Not SOTA (Llama 3 and DeepSeek R1 are better) but specialized training.
7. Neural Chat 7B
Intel's chat-optimized Mistral derivative. Faster inference than base Mistral 7B due to training optimization.
VRAM: 4GB (quantized), 16GB (full)
Speed: 30-50 tokens/second on RTX 4090
Use case: Chat interfaces, customer support bots, conversational fine-tuning.
Benchmark: 63% on MMLU. Slightly better instruction-following than vanilla Mistral.
8. Orca 2 13B
Microsoft's instruction-following model. Smaller than Llama 3 but trained specifically for multi-turn conversations and long-form generation.
VRAM: 8GB (quantized), 26GB (full)
Speed: 18-25 tokens/second on RTX 4090
Use case: Long-form content generation, conversational agents, creative writing.
Benchmark: 65% on MMLU. Below Mistral 7B on reasoning but better on conversation quality.
9. Llama 2 70B
Predecessor to Llama 3. Still viable but slower and less accurate. Only use if existing fine-tunes or workflows lock developers in.
VRAM: 40GB (quantized), 80GB (full)
Speed: 10-15 tokens/second on H100
Use case: Legacy deployments, existing model weights, institutional continuity.
Benchmark: 82% on MMLU. Tangibly worse than Llama 3 on most tasks.
10. Qwen 2 72B
Alibaba's multilingual model. Strong on Asian languages (Chinese, Japanese, Korean). Similar performance to Llama 3 70B.
VRAM: 40GB (quantized), 80GB (full)
Speed: 12-18 tokens/second on H100
Use case: Multilingual applications, Chinese market, global services serving diverse languages.
Benchmark: 83% on MMLU, 98% on CMMLU (Chinese). Better than Llama 3 on CJK tasks.
11. Dolphin 2.6 (Mistral 13B)
Community-fine-tuned Mistral. Optimized for instruction-following and long-context understanding. Created by Eric Hartford.
VRAM: 8GB (quantized), 26GB (full)
Speed: 18-25 tokens/second on RTX 4090
Use case: Instruction-following, creative tasks, uncensored experimentation.
Benchmark: 67% on MMLU. Marginal improvement over base Mistral 7B.
12. Nous Hermes 2 (Mistral 13B)
Another community fine-tune. Balanced instruction-following, reasoning, creative writing. Smaller than Llama 3.
VRAM: 8GB (quantized), 26GB (full)
Speed: 18-25 tokens/second on RTX 4090
Use case: General-purpose instruction-following without censorship concerns.
Benchmark: 65% on MMLU. Below Llama 3 but consistent across diverse tasks.
13. Mistral Large
Mistral's flagship (not yet in Ollama registry, but available). 34B parameters. Strong reasoning, multilingual, large context.
VRAM: 20GB (quantized), 68GB (full)
Speed: 6-10 tokens/second on RTX 4090
Use case: Production inference, complex reasoning, multilingual support.
Benchmark: 78% on MMLU. Approaching Llama 3 performance at lower cost than 70B.
14. Zephyr 7B
Hugging Face fine-tune of Mistral. Optimized for conversational AI. Smaller, faster alternative to Llama 3 70B for chat.
VRAM: 4GB (quantized), 16GB (full)
Speed: 25-40 tokens/second on RTX 4090
Use case: Chat interfaces, conversational fine-tuning, instruction-following.
Benchmark: 61% on MMLU. Solid for chat, weak for reasoning.
15. Starling 7B
Berkeley's Mistral fine-tune. Uses RLHF to improve instruction-following. Alternative to Zephyr with different tuning.
VRAM: 4GB (quantized), 16GB (full)
Speed: 25-40 tokens/second on RTX 4090
Use case: Instruction-following experiments, alternative to Zephyr.
Benchmark: 63% on MMLU. Marginal differences from Zephyr.
VRAM Requirements by Model Size
| Model Size | Parameters | VRAM (Q4) | VRAM (F16) | Consumer GPU Match |
|---|---|---|---|---|
| Mini | 3-4B | 2-4GB | 8-10GB | RTX 3060 Mobile |
| Small | 7B | 4-6GB | 14-16GB | RTX 4090, RTX A5000 |
| Medium | 13B | 8-10GB | 26GB | RTX 4090, RTX 6000 |
| Large | 34B | 18-20GB | 68GB | RTX 5090, H100 PCIe |
| XL | 70B | 40-45GB | 140GB | H100, 2x RTX 6000 |
Q4 quantization (4-bit) reduces VRAM ~75% vs full precision (F16). Ollama defaults to Q4 for most models.
Model Categories
Best for Speed (>40 tokens/second on RTX 4090)
Phi-3 Mini, Mistral 7B, Zephyr 7B. Use when latency is critical: web APIs, real-time chat, edge deployment.
Best for Quality (MMLU 75%+)
Llama 3 70B, DeepSeek R1, Mistral Large, Gemma 2 27B. Use when accuracy matters: complex reasoning, document analysis, code generation.
Best for Balance
Llama 3 70B (if VRAM available), Mistral 7B (if speed matters), Gemma 2 27B (if mid-range available).
Best for Coding
DeepSeek R1 (reasoning + code), Code Llama 13B (specialized), Llama 3 70B (general coding). For production deployment, compare with Ollama vs LM Studio.
Best for Long Context
Llama 3.1 (128K context), Llama 4 Maverick (1M context), Mistral 7B (32K), DeepSeek R1 (128K). Check Ollama model cards for exact context windows.
Best for Edge/Mobile
Phi-3 Mini (3.8B), Mistral 7B quantized (7B).
Performance Benchmarks
Reasoning (MMLU Score)
- DeepSeek R1: 96 (math-heavy)
- Llama 3 70B: 86.9
- Mistral Large: 78
- Gemma 2 27B: 75
- Code Llama 13B: 71
- Phi-3 Mini: 69 (claimed)
- Mistral 7B: 60
Coding (HumanEval Pass Rate)
- DeepSeek R1: 97%
- Llama 3 70B: 85%
- Code Llama 13B: 57%
- Mistral 7B: 40%
Speed (Tokens/Second on RTX 4090, Q4)
- Phi-3 Mini: 60 tokens/sec
- Mistral 7B: 32 tokens/sec
- Zephyr 7B: 28 tokens/sec
- Code Llama 13B: 20 tokens/sec
- Gemma 2 27B: 10 tokens/sec
- Llama 3 70B: 3 tokens/sec
Inference Cost (RunPod H100, per 1M tokens)
- Phi-3 Mini: $0.006 (very small)
- Mistral 7B: $0.06
- Gemma 2 27B: $0.18
- Llama 3 70B: $0.66
Self-hosting avoids API costs but requires GPU rental or ownership.
Selection Criteria
Choose Llama 3 70B if:
- VRAM or budget allows 40GB+ (cloud rental)
- General-purpose reasoning and coding
- Production deployment with SLA requirements
Choose DeepSeek R1 if:
- Mathematical reasoning or logic puzzles critical
- Time-to-answer less critical (chain-of-thought)
- VRAM available (16-32GB)
Choose Mistral 7B if:
- VRAM under 16GB
- Speed matters (chat interfaces, APIs)
- Instruction-following sufficient (not complex reasoning)
Choose Phi-3 Mini if:
- Running on laptops, phones, or CPU
- Bulk classification or categorization
- Latency <200ms critical
Choose Gemma 2 27B if:
- Sweet spot between Mistral 7B and Llama 3 70B
- 16GB VRAM available
- Production inference with balanced performance
Quantization and Model Optimization
Quantization reduces model size and memory without meaningful accuracy loss. Ollama defaults to Q4 (4-bit quantization) for most models.
Quantization Formats Explained
Q4 (4-bit): Standard balance. Reduces VRAM by 75%. Quality loss <0.5% on benchmarks. All models support Q4. File size: ~4GB for 7B model.
Q4_K_M (4-bit k-quant, medium): Variant of Q4 that groups weights into blocks with per-block scales and keeps a few sensitive layers at higher precision. Slightly better quality (0.2% loss vs 0.5%). Same VRAM. Ollama prefers this over generic Q4.
Q5 (5-bit): Slight quality improvement over Q4 but larger file. VRAM reduction: 70%. Use if Q4 shows visible degradation. File size: ~6GB for 7B model.
Q5_K_M (5-bit k-quant, medium): Ollama's recommended 5-bit. Best quality-to-size ratio.
Q8 (8-bit): Nearly full precision, moderate VRAM savings. Rarely needed; Q4 is usually sufficient. Quality: indistinguishable from F16. File size: ~14GB for 7B model.
Q2, Q3 (extreme quantization): Aggressive compression. Quality loss: 2-5%. Use for edge devices (phones, Raspberry Pi) where VRAM is <2GB. Q2: ~1GB per 7B model.
Example: Mistral 7B at 16GB (F16) becomes 4GB (Q4_K_M). On laptop with 8GB RAM, Q4_K_M fits (4GB Mistral + 4GB OS/apps).
Q4 vs Q5 Trade-off Analysis
Speed (tokens/sec on RTX 4090):
- Q4_K_M: 32 tok/sec (baseline)
- Q5_K_M: 28 tok/sec (12% slower due to higher precision ops)
Quality (MMLU score on Mistral 7B):
- Q4_K_M: 59.8
- Q5_K_M: 60.3
- FP16: 60.4
Q5 gains 0.5 points. Meaningful for reasoning-heavy tasks, imperceptible for chat.
Storage (Mistral 7B):
- Q4_K_M: 4.2GB
- Q5_K_M: 5.8GB
Cost per model: 1.6GB more storage. For 10 models running in a rotation, adds 16GB disk.
Recommendation: Use Q4_K_M by default. Switch to Q5_K_M if developers notice accuracy issues or if accuracy directly impacts revenue. For local laptop use, Q4 is always sufficient.
Memory Calculator
VRAM needed (GB) = Model Size (B) × Quantization Bits / 8 + Overhead
Overhead = 2-4 GB for model loading, batch buffers, KV cache
Examples:
- Mistral 7B Q4: 7 × 4 / 8 + 3 = 6.5 GB
- Llama 3 70B Q4: 70 × 4 / 8 + 4 = 39 GB
- Gemma 2 27B Q5: 27 × 5 / 8 + 3 = 20 GB
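The calculator above can be sketched as a small Python helper. The overhead defaults are the rough 2-4 GB estimates from this section, not measured values:

```python
def vram_needed_gb(params_b: float, quant_bits: int, overhead_gb: float = 3.0) -> float:
    """Estimate VRAM for a quantized model: weight storage plus loading/KV-cache overhead."""
    weights_gb = params_b * quant_bits / 8  # billions of params × bits per weight, in GB
    return weights_gb + overhead_gb

# The worked examples from this section:
print(vram_needed_gb(7, 4, 3))    # Mistral 7B Q4  → 6.5
print(vram_needed_gb(70, 4, 4))   # Llama 3 70B Q4 → 39.0
print(vram_needed_gb(27, 5, 3))   # Gemma 2 27B Q5 → 19.875 (≈20)
```

Treat the result as a floor, not a guarantee: long contexts grow the KV cache well past the fixed overhead used here.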
Performance Benchmarks (Expanded)
Reasoning (MMLU Score)
MMLU tests general knowledge across 57 subjects (math, science, history, etc.). Scores out of 100.
| Model | Score | Improvement vs GPT-3.5 | Improvement vs GPT-4 |
|---|---|---|---|
| DeepSeek R1 | 96.3 | +52% | +14% |
| Llama 3 70B | 86.9 | +40% | +3% |
| Mistral Large | 78.0 | +28% | -8% |
| Gemma 2 27B | 75.0 | +24% | -12% |
| Phi-3 Mini | 69.0 (claimed) | +18% | -18% |
| Mistral 7B | 60.0 | +5% | -31% |
DeepSeek R1 leads by showing work (chain-of-thought), trading latency for accuracy.
Coding (HumanEval Pass Rate)
HumanEval: 164 Python coding problems. Pass rate = % solved correctly.
| Model | Pass Rate | vs GPT-4 |
|---|---|---|
| DeepSeek R1 | 97.0% | +15% |
| Llama 3 70B | 85.0% | +1% |
| Code Llama 13B | 57.0% | -32% |
| Mistral 7B | 40.0% | -52% |
Code Llama underperforms Llama 3 despite specialization. Llama 3 is the better choice.
Instruction-Following (Alpaca Score)
Tests ability to follow complex instructions. Subjective eval by human raters.
| Model | Score (0-10) | Notes |
|---|---|---|
| Llama 3 70B | 9.2 | Excellent follow-through |
| Mistral 7B | 8.1 | Good for size |
| Zephyr 7B | 7.8 | Community fine-tune, solid |
| Phi-3 Mini | 7.5 | Decent for 3.8B |
Llama 3 70B excels at multi-step instructions. Small models (7B) struggle with complex chains.
Speed (Tokens/Sec on Consumer Hardware)
RTX 4090 (24GB VRAM):
| Model | Q4 Speed | Q5 Speed | FP16 Speed |
|---|---|---|---|
| Phi-3 Mini | 60 tok/sec | 50 tok/sec | 40 tok/sec |
| Mistral 7B | 32 tok/sec | 28 tok/sec | 20 tok/sec |
| Zephyr 7B | 28 tok/sec | 24 tok/sec | 18 tok/sec |
| Code Llama 13B | 20 tok/sec | 18 tok/sec | 12 tok/sec |
| Gemma 2 27B | 10 tok/sec | 9 tok/sec | 6 tok/sec |
| Llama 3 70B | Can't fit (requires 40GB+) | Can't fit | Can't fit |
H100 (80GB VRAM):
| Model | Q4 Speed | Q5 Speed | FP16 Speed |
|---|---|---|---|
| Llama 3 70B | 100 tok/sec | 85 tok/sec | 120 tok/sec |
| Gemma 2 27B | 180 tok/sec | 150 tok/sec | 200 tok/sec |
| Code Llama 70B | 95 tok/sec | 80 tok/sec | 110 tok/sec |
Quantization trades quality for VRAM; its effect on speed depends on hardware. On bandwidth-limited consumer GPUs, Q4 is fastest because fewer bytes move per token. On an H100, FP16 tensor-core kernels can outrun quantized kernels, at the cost of far more VRAM.
Inference Cost (RunPod H100, per 1M tokens)
Cost per 1M tokens = GPU hourly rate / (tokens per second × 3600) × 1,000,000
| Model | Throughput | Cost/1M tokens |
|---|---|---|
| Phi-3 Mini (RTX 4090 equiv) | 60 tok/sec | $0.0058 |
| Mistral 7B | 32 tok/sec | $0.0062 |
| Gemma 2 27B | 12 tok/sec | $0.0055 |
| Llama 3 70B | 100 tok/sec | $0.0020 |
| Llama 3 70B (RTX 4090 equiv) | 3 tok/sec | $0.0664 |
Llama 3 70B on H100 is cheapest per token (high throughput). On consumer GPU (RTX 4090), tiny models win.
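The cost formula can be sketched in Python. The $1.99/hr rate is the RunPod H100 figure quoted in this article; note that a single-stream calculation like this will land well above the table's per-token costs, which assume batched serving with much higher aggregate throughput:

```python
def cost_per_million_tokens(gpu_hourly_rate: float, tokens_per_sec: float) -> float:
    """Cost to generate 1M tokens: hourly rate spread over hourly throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_rate / tokens_per_hour * 1_000_000

# Single-stream example: $1.99/hr H100 at 100 tok/sec
print(f"${cost_per_million_tokens(1.99, 100):.2f} per 1M tokens")  # → $5.53 per 1M tokens
```

To approximate the batched numbers, substitute the aggregate tokens/sec the server sustains across all concurrent requests, not the per-request speed.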
Additional Models Worth Considering
Mixtral 8x7B
Mixture-of-Experts model. Activates only a subset of experts per token, so it runs faster than Llama 3 70B on many tasks with a smaller VRAM footprint.
VRAM: 24GB (quantized), 42GB (full)
Speed: 15-25 tok/sec on RTX 4090
Reasoning: 71 MMLU (between Mistral 7B and Gemma 2 27B)
Use case: inference efficiency, running on RTX 4090 without compromise.
Mixtral 8x22B
Larger MoE. Closer to Llama 3 70B performance, slightly smaller VRAM footprint.
VRAM: 48GB (quantized), 88GB (full)
Speed: 8-12 tok/sec on RTX 4090 (too large)
Reasoning: 78 MMLU (excellent)
Use case: production inference on dual-GPU setups (2x RTX 4090 or 2x RTX A6000).
Nous Hermes 2 Mixtral
Community fine-tune of Mixtral 8x7B. Optimized for instruction-following and creative tasks.
VRAM: 24GB (quantized)
Speed: same as base Mixtral 8x7B
Reasoning: 73 MMLU (slight improvement over base)
Use case: alternative to Mistral 7B with better instruction-following, without penalty.
Qwen 2 7B
Alibaba's compact model. Strong multilingual support. Training data includes more Chinese/CJK.
VRAM: 4GB (quantized), 14GB (full)
Speed: 32 tok/sec on RTX 4090 (same as Mistral 7B)
Reasoning: 62 MMLU (slightly better than Mistral 7B on multilingual)
Use case: applications serving global users, especially in Asia.
Synthia 7B
Community model fine-tuned on GPT-4 outputs. Better reasoning than base 7B models.
VRAM: 4GB (quantized), 14GB (full)
Speed: 32 tok/sec on RTX 4090
Reasoning: 65 MMLU (competitive with Mistral 7B)
Use case: reasoning tasks where Mistral 7B falls short, avoiding larger models.
Ollama vs Fine-Tuning
Base models from Ollama are general-purpose. Fine-tuning on domain-specific data improves accuracy but requires infrastructure:
Option 1: Fine-tune locally (RTX 4090, 24GB, 8 hours). Download base model, add 10K examples, output custom weights. Use in Ollama with custom model directory.
Option 2: Fine-tune on cloud (RunPod H100, $1.99/hr, 2-4 hours). Faster, more VRAM. Cost: $4-8 per fine-tune job.
Option 3: Use base models with prompt engineering. Elaborate system prompts can match 80% of fine-tuned accuracy without training cost.
Recommendation: start with base + prompt engineering. Fine-tune only if base model accuracy inadequate on 100+ test examples.
Installation and Hardware
Minimal Setup (Phi-3 Mini)
Laptop with 8GB RAM + CPU inference. Ollama downloads the model in ~2 minutes.
ollama pull phi3
ollama run phi3 "What is 2+2?"
Expect 5-10 tokens/second on modern CPU.
Consumer GPU (Mistral 7B)
RTX 4090 or RTX A5000. ~$1,500-3,000 hardware cost.
ollama pull mistral:7b
ollama run mistral:7b
Expect 30+ tokens/second.
Cloud GPU (Llama 3 70B)
RunPod H100 at $1.99/hr, Lambda H100 at $2.86/hr. Monthly cloud cost: ~$150-200 for 8 hours daily usage.
ollama pull llama3:70b
ollama run llama3:70b
Expect 12-18 tokens/second.
FAQ
What is the fastest Ollama model?
Phi-3 Mini at 60+ tokens/second on RTX 4090. Mistral 7B is second at ~32 tokens/sec. Both are 7B or smaller. Llama 3 70B is 10x slower but 10x more capable.
Should I use Ollama or vLLM?
Ollama is simpler (single binary, downloads models automatically). vLLM is faster (optimized batching and tensor parallelism). For single-user inference, Ollama. For production APIs serving hundreds of requests/sec, vLLM or SGLang.
Can I fine-tune models in Ollama?
Not natively. Download the base model, fine-tune with Hugging Face Transformers or LLaMA-Factory, then add the weights to Ollama's model directory. Ollama handles inference, not training.
What's the quality difference between quantized (Q4) and full precision (F16)?
Negligible on modern models. Q4 reduces VRAM by 75%, speeds inference 20-30%, and loses <0.5% accuracy on most benchmarks. Always quantize unless testing for degradation.
Can I run Ollama on CPU?
Yes, slowly. Phi-3 Mini on CPU: ~1-2 tokens/second. Mistral 7B on CPU: ~0.2 tokens/second. GPU is recommended.
How do I use Ollama models in my application?
Ollama exposes a local API on localhost:11434. Use curl or any HTTP client to send requests. Ollama also serves an OpenAI-compatible endpoint at localhost:11434/v1, so existing OpenAI client libraries work by pointing their base URL there.
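A minimal stdlib-only sketch of calling Ollama's /api/generate endpoint, assuming a local server is running and the model has already been pulled (the `build_request` helper and its defaults are illustrative, not part of Ollama itself):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's local generate endpoint

def build_request(model: str, prompt: str) -> dict:
    # stream=False returns a single JSON object instead of a token stream
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # the completed generation lives in the "response" field
        return json.loads(resp.read())["response"]

# Requires a running Ollama server with the model pulled:
# print(generate("mistral:7b", "What is 2+2?"))
```

Swap in `requests` or an async client for production use; the payload shape stays the same.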