Best Ollama Models 2026: Top 15 Open-Source LLMs Ranked

Deploybase · February 16, 2026 · LLM Guides

Best Ollama Models: Overview

Best Ollama models include Llama 3, DeepSeek R1, Mistral 7B, Phi-3, and Gemma 2. Ollama runs open-source models locally on consumer hardware. No API costs. No rate limits. No data sent to external servers. Trade-off: smaller models (7B-13B parameters) don't match GPT-4.1 or Claude reasoning, but they're free and fast. VRAM requirements scale from 4GB (Phi-3 Mini) to 32GB+ (Llama 3 70B). For comparison with hosted options, check LLM pricing and availability.


Top Ollama Models Ranked

1. Llama 3 70B

Best all-purpose large model. 70B parameters. Strong reasoning, coding, and multilingual support. Near-GPT-4o performance on most tasks.

VRAM: 40GB minimum (quantized), 80GB+ (full precision)

Speed: 12-18 tokens/second on H100 (cloud), 2-4 tokens/second on consumer RTX 4090

Use case: Complex document analysis, code generation, multi-step reasoning, production chatbots.

Benchmark: 86.9 on MMLU (reasoning), 85% on HumanEval (coding). On par with GPT-4o Mini on most open benchmarks.

2. DeepSeek R1

Reasoning specialist. Uses a chain-of-thought approach. Shows work before answering. Slower than Llama 3 but more rigorous for math, logic, puzzle-solving.

VRAM: 32GB minimum (full), 16GB (quantized)

Speed: 4-8 tokens/second (chain-of-thought adds overhead)

Use case: Mathematical problem-solving, logic puzzles, code debugging, complex reasoning pipelines.

Benchmark: 79.8% on AIME 2024 (Pass@1), 97.3% on MATH-500. Competitive with top reasoning models on pure math and logic tasks.

3. Mistral 7B

Lightweight workhorse. 7B parameters. Strong per-token efficiency. Best instruction-following for small models.

VRAM: 4GB (quantized), 16GB (full precision)

Speed: 25-40 tokens/second on RTX 4090, 100+ tokens/second on H100 (see LLM serving framework comparison for optimization).

Use case: Chatbots, content generation, classification, prompt optimization testing, on-device deployment (laptops, phones).

Benchmark: 60% on MMLU. Inferior to Llama 3 70B but 10x faster.

4. Phi-3 Mini

Microsoft's compact model. 3.8B parameters. Trained on synthetic data. Claims 7B-class performance from a 3.8B-parameter model.

VRAM: 2GB (quantized), 8GB (full)

Speed: 40-80 tokens/second on RTX 4090

Use case: Edge deployment, mobile inference, quick prototypes, cost-constrained setups.

Benchmark: 69% on MMLU (vendor-reported; above Mistral 7B's 60%). Actual performance varies by task.

5. Gemma 2 27B

Google's mid-size model. 27B parameters. Strong coding and instruction-following. Good balance of speed and capability.

VRAM: 16GB (quantized), 54GB (full)

Speed: 8-12 tokens/second on RTX 4090

Use case: Production inference, fine-tuning base, instruction-following tasks.

Benchmark: 75% on MMLU. Better than 7B models, cheaper VRAM than 70B.

6. Code Llama 13B

Llama 2-based model specialized for code. 13B parameters. Better code generation than base Llama 2 (trained on code corpus). Language-specific variants available. Note: Llama 3 70B now outperforms Code Llama on most coding benchmarks.

VRAM: 8GB (quantized), 26GB (full)

Speed: 18-25 tokens/second on RTX 4090

Use case: Code generation, GitHub Copilot replacement, IDE integration, test case generation.

Benchmark: 57% on HumanEval. Not SOTA (Llama 3 and DeepSeek R1 are better) but specialized training.

7. Neural Chat 7B

Intel's chat-optimized Mistral derivative, fine-tuned for conversational quality on top of Mistral 7B.

VRAM: 4GB (quantized), 16GB (full)

Speed: 30-50 tokens/second on RTX 4090

Use case: Chat interfaces, customer support bots, conversational fine-tuning.

Benchmark: 63% on MMLU. Slightly better instruction-following than vanilla Mistral.

8. Orca 2 13B

Microsoft's instruction-following model. Smaller than Llama 3 but trained specifically for multi-turn conversations and long-form generation.

VRAM: 8GB (quantized), 26GB (full)

Speed: 18-25 tokens/second on RTX 4090

Use case: Long-form content generation, conversational agents, creative writing.

Benchmark: 65% on MMLU. Below Mistral 7B on reasoning but better on conversation quality.

9. Llama 2 70B

Predecessor to Llama 3. Still viable but slower and less accurate. Use it only when existing fine-tunes or entrenched workflows require it.

VRAM: 40GB (quantized), 80GB (full)

Speed: 10-15 tokens/second on H100

Use case: Legacy deployments, existing model weights, institutional continuity.

Benchmark: 82% on MMLU. Tangibly worse than Llama 3 on most tasks.

10. Qwen 2 72B

Alibaba's multilingual model. Strong on Asian languages (Chinese, Japanese, Korean). Similar performance to Llama 3 70B.

VRAM: 40GB (quantized), 80GB (full)

Speed: 12-18 tokens/second on H100

Use case: Multilingual applications, Chinese market, global services serving diverse languages.

Benchmark: 83% on MMLU, 98% on CMMLU (Chinese). Better than Llama 3 on CJK tasks.

11. Dolphin 2.6 (Mistral 13B)

Community-fine-tuned Mistral. Optimized for instruction-following and long-context understanding. Created by Eric Hartford.

VRAM: 8GB (quantized), 26GB (full)

Speed: 18-25 tokens/second on RTX 4090

Use case: Instruction-following, creative tasks, uncensored experimentation.

Benchmark: 67% on MMLU. Marginal improvement over base Mistral 7B.

12. Nous Hermes 2 (Mistral 13B)

Another community fine-tune. Balanced instruction-following, reasoning, creative writing. Smaller than Llama 3.

VRAM: 8GB (quantized), 26GB (full)

Speed: 18-25 tokens/second on RTX 4090

Use case: General-purpose instruction-following without censorship concerns.

Benchmark: 65% on MMLU. Below Llama 3 but consistent across diverse tasks.

13. Mistral Large

Mistral's flagship (not yet in the Ollama registry, but importable manually). 34B parameters. Strong reasoning, multilingual, large context.

VRAM: 20GB (quantized), 68GB (full)

Speed: 6-10 tokens/second on RTX 4090

Use case: Production inference, complex reasoning, multilingual support.

Benchmark: 78% on MMLU. Approaching Llama 3 performance at lower cost than 70B.

14. Zephyr 7B

Hugging Face fine-tune of Mistral. Optimized for conversational AI. Smaller, faster alternative to Llama 3 70B for chat.

VRAM: 4GB (quantized), 16GB (full)

Speed: 25-40 tokens/second on RTX 4090

Use case: Chat interfaces, conversational fine-tuning, instruction-following.

Benchmark: 61% on MMLU. Solid for chat, weak for reasoning.

15. Starling 7B

Berkeley's Mistral fine-tune. Uses RLHF to improve instruction-following. Alternative to Zephyr with different tuning.

VRAM: 4GB (quantized), 16GB (full)

Speed: 25-40 tokens/second on RTX 4090

Use case: Instruction-following experiments, alternative to Zephyr.

Benchmark: 63% on MMLU. Marginal differences from Zephyr.


VRAM Requirements by Model Size

Model Size  Parameters  VRAM (Q4)  VRAM (F16)  Consumer GPU Match
Mini        3-4B        2-4GB      8-10GB      RTX 3060 Mobile
Small       7B          4-6GB      14-16GB     RTX 4090, RTX A5000
Medium      13B         8-10GB     26GB        RTX 4090, RTX 6000
Large       34B         18-20GB    68GB        RTX 5090, H100 PCIe
XL          70B         40-45GB    140GB       H100, 2x RTX 6000

Q4 quantization (4-bit) reduces VRAM ~75% vs full precision (F16). Ollama defaults to Q4 for most models.


Model Categories

Best for Speed (>40 tokens/second on RTX 4090)

Phi-3 Mini, Mistral 7B, Zephyr 7B. Use when latency is critical: web APIs, real-time chat, edge deployment.

Best for Quality (MMLU 75%+)

Llama 3 70B, DeepSeek R1, Mistral Large, Gemma 2 27B. Use when accuracy matters: complex reasoning, document analysis, code generation.

Best for Balance

Llama 3 70B (if VRAM available), Mistral 7B (if speed matters), Gemma 2 27B (if mid-range available).

Best for Coding

DeepSeek R1 (reasoning + code), Code Llama 13B (specialized), Llama 3 70B (general coding). For production deployment, compare with Ollama vs LM Studio.

Best for Long Context

Llama 3.1 (128K context), Llama 4 Maverick (1M context), Mistral 7B (32K), DeepSeek R1 (128K). Check Ollama model cards for exact context windows.

Best for Edge/Mobile

Phi-3 Mini (3.8B), Mistral 7B quantized (7B).


Performance Benchmarks

Reasoning (MMLU Score)

  • DeepSeek R1: 96 (math-heavy)
  • Llama 3 70B: 86.9
  • Mistral Large: 78
  • Gemma 2 27B: 75
  • Code Llama 13B: 71
  • Phi-3 Mini: 69 (claimed)
  • Mistral 7B: 60

Coding (HumanEval Pass Rate)

  • DeepSeek R1: 97%
  • Llama 3 70B: 85%
  • Code Llama 13B: 57%
  • Mistral 7B: 40%

Speed (Tokens/Second on RTX 4090, Q4)

  • Phi-3 Mini: 60 tokens/sec
  • Mistral 7B: 32 tokens/sec
  • Zephyr 7B: 28 tokens/sec
  • Code Llama 13B: 20 tokens/sec
  • Gemma 2 27B: 10 tokens/sec
  • Llama 3 70B: 3 tokens/sec

Inference Cost (RunPod H100, per 1M tokens)

  • Phi-3 Mini: $0.006 (very small)
  • Mistral 7B: $0.06
  • Gemma 2 27B: $0.18
  • Llama 3 70B: $0.66

Self-hosting avoids API costs but requires GPU rental or ownership.


Selection Criteria

Choose Llama 3 70B if:

  • VRAM or budget allows 40GB+ (cloud rental)
  • General-purpose reasoning and coding
  • Production deployment with SLA requirements

Choose DeepSeek R1 if:

  • Mathematical reasoning or logic puzzles critical
  • Time-to-answer less critical (chain-of-thought)
  • VRAM available (16-32GB)

Choose Mistral 7B if:

  • VRAM under 16GB
  • Speed matters (chat interfaces, APIs)
  • Instruction-following sufficient (not complex reasoning)

Choose Phi-3 Mini if:

  • Running on laptops, phones, or CPU
  • Bulk classification or categorization
  • Latency <200ms critical

Choose Gemma 2 27B if:

  • Sweet spot between Mistral 7B and Llama 3 70B
  • 16GB VRAM available
  • Production inference with balanced performance

Quantization and Model Optimization

Quantization reduces model size and memory without meaningful accuracy loss. Ollama defaults to Q4 (4-bit quantization) for most models.

Quantization Formats Explained

Q4 (4-bit): Standard balance. Reduces VRAM by 75%. Quality loss <0.5% on benchmarks. All models support Q4. File size: ~4GB for 7B model.

Q4_K_M (4-bit K-quant, medium): Variant of Q4 using block-wise "K-quant" scaling, which reserves higher precision for the most quality-sensitive weight tensors. Slightly better quality (0.2% loss vs 0.5%). Same VRAM. Ollama prefers this over generic Q4.

Q5 (5-bit): Slight quality improvement over Q4 but larger file. VRAM reduction: 70%. Use if Q4 shows visible degradation. File size: ~6GB for 7B model.

Q5_K_M (5-bit K-quant, medium): Ollama's recommended 5-bit. Best quality-to-size ratio.

Q8 (8-bit): Nearly full precision, moderate VRAM savings. Rarely needed; Q4 is usually sufficient. Quality: indistinguishable from F16. File size: ~7GB for a 7B model (half of F16).

Q2, Q3 (extreme quantization): Aggressive compression. Quality loss: 2-5%. Use for edge devices (phones, Raspberry Pi) with minimal memory. Q2: ~2-3GB per 7B model.

Example: Mistral 7B at 16GB (F16) becomes 4GB (Q4_K_M). On a laptop with 8GB RAM, Q4_K_M fits (4GB for the model, 4GB for OS and apps).
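These size figures follow directly from parameters × bits per weight ÷ 8. A quick sketch in Python; the effective bit widths below are assumptions, since K-quants carry per-block scale factors and so sit slightly above their nominal bit count:

```python
def quant_file_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough model file size: parameters x bits per weight / 8 bits per byte."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Mistral 7B is ~7.2B parameters; effective bits per weight are estimates.
for name, bits in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5), ("F16", 16.0)]:
    print(f"{name}: ~{quant_file_size_gb(7.2, bits):.1f} GB")
```

For Q4_K_M this lands around 4.3GB and for F16 around 14.4GB, close to the figures quoted above.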

Q4 vs Q5 Trade-off Analysis

Speed (tokens/sec on RTX 4090):

  • Q4_K_M: 32 tok/sec (baseline)
  • Q5_K_M: 28 tok/sec (12% slower due to higher precision ops)

Quality (MMLU score on Mistral 7B):

  • Q4_K_M: 59.8
  • Q5_K_M: 60.3
  • FP16: 60.4

Q5 gains 0.5 points. Meaningful for reasoning-heavy tasks, imperceptible for chat.

Storage (Mistral 7B):

  • Q4_K_M: 4.2GB
  • Q5_K_M: 5.8GB

Cost per model: 1.6GB more storage. For 10 models running in a rotation, adds 16GB disk.

Recommendation: Use Q4_K_M by default. Switch to Q5_K_M if developers notice accuracy issues or if accuracy directly impacts revenue. For local laptop use, Q4 is always sufficient.

Memory Calculator

VRAM needed (GB) = Model Size (B) × Quantization Bits / 8 + Overhead

Overhead = 2-4 GB for model loading, batch buffers, KV cache

Examples:

  • Mistral 7B Q4: 7 × 4 / 8 + 3 = 6.5 GB
  • Llama 3 70B Q4: 70 × 4 / 8 + 4 = 39 GB
  • Gemma 2 27B Q5: 27 × 5 / 8 + 3 = 20 GB
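The calculator translates directly into a few lines of Python, using the same formula and overhead figures as above:

```python
def vram_needed_gb(params_billions: float, quant_bits: float, overhead_gb: float) -> float:
    """VRAM estimate: weight memory (params x bits / 8) plus loading/KV-cache overhead."""
    return params_billions * quant_bits / 8 + overhead_gb

print(vram_needed_gb(7, 4, 3))    # Mistral 7B Q4   -> 6.5
print(vram_needed_gb(70, 4, 4))   # Llama 3 70B Q4  -> 39.0
print(vram_needed_gb(27, 5, 3))   # Gemma 2 27B Q5  -> 19.875
```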

Performance Benchmarks (Expanded)

Reasoning (MMLU Score)

MMLU tests general knowledge across 57 subjects (math, science, history, etc.). Scores out of 100.

Model          Score           Improvement vs GPT-3.5  Improvement vs GPT-4
DeepSeek R1    96.3            +52%                    +14%
Llama 3 70B    86.9            +40%                    +3%
Mistral Large  78.0            +28%                    -8%
Gemma 2 27B    75.0            +24%                    -12%
Phi-3 Mini     69.0 (claimed)  +18%                    -18%
Mistral 7B     60.0            +5%                     -31%

DeepSeek R1 leads by showing work (chain-of-thought), trading latency for accuracy.

Coding (HumanEval Pass Rate)

HumanEval: 164 Python coding problems. Pass rate = % solved correctly.

Model           Pass Rate  vs GPT-4
DeepSeek R1     97.0%      +15%
Llama 3 70B     85.0%      +1%
Code Llama 13B  57.0%      -32%
Mistral 7B      40.0%      -52%

Code Llama underperforms Llama 3 despite specialization. Llama 3 is the better choice.

Instruction-Following (Alpaca Score)

Tests ability to follow complex instructions. Subjective eval by human raters.

Model        Score (0-10)  Notes
Llama 3 70B  9.2           Excellent follow-through
Mistral 7B   8.1           Good for size
Zephyr 7B    7.8           Community fine-tune, solid
Phi-3 Mini   7.5           Decent for 3.8B

Llama 3 70B excels at multi-step instructions. Small models (7B) struggle with complex chains.

Speed (Tokens/Sec on Consumer Hardware)

RTX 4090 (24GB VRAM):

Model           Q4 Speed    Q5 Speed    FP16 Speed
Phi-3 Mini      60 tok/sec  50 tok/sec  40 tok/sec
Mistral 7B      32 tok/sec  28 tok/sec  20 tok/sec
Zephyr 7B       28 tok/sec  24 tok/sec  18 tok/sec
Code Llama 13B  20 tok/sec  18 tok/sec  12 tok/sec
Gemma 2 27B     10 tok/sec  9 tok/sec   6 tok/sec
Llama 3 70B     Doesn't fit at any precision (requires 40GB+)

H100 (80GB VRAM):

Model           Q4 Speed     Q5 Speed     FP16 Speed
Llama 3 70B     100 tok/sec  85 tok/sec   120 tok/sec
Gemma 2 27B     180 tok/sec  150 tok/sec  200 tok/sec
Code Llama 70B  95 tok/sec   80 tok/sec   110 tok/sec

Quantization trades a little precision for large VRAM savings. On memory-bandwidth-bound consumer GPUs, Q4 is usually fastest; on an H100 with VRAM to spare, FP16 avoids dequantization overhead and pulls ahead.

Inference Cost (RunPod H100, per 1M tokens)

Cost per 1M tokens = GPU hourly rate ÷ (throughput in tokens/second × 3600) × 1,000,000

Model                         Throughput   Cost/1M tokens
Phi-3 Mini (RTX 4090 equiv)   60 tok/sec   $0.0058
Mistral 7B                    32 tok/sec   $0.0062
Gemma 2 27B                   12 tok/sec   $0.0055
Llama 3 70B                   100 tok/sec  $0.0020
Llama 3 70B (RTX 4090 equiv)  3 tok/sec    $0.0664

Llama 3 70B on H100 is cheapest per token (high throughput). On consumer GPU (RTX 4090), tiny models win.
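The formula is easy to sanity-check in code. This sketch uses the H100 rate quoted later in this guide ($1.99/hr) and a hypothetical 100 tokens/second of sustained single-stream throughput; batched serving multiplies effective throughput, and per-token cost falls proportionally:

```python
def cost_per_million_tokens(gpu_hourly_rate: float, tokens_per_sec: float) -> float:
    """Rental cost of generating 1M tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_rate / tokens_per_hour * 1_000_000

# $1.99/hr H100, hypothetical 100 tok/sec single-stream throughput.
print(f"${cost_per_million_tokens(1.99, 100):.2f} per 1M tokens")
```

At those inputs the formula yields about $5.53 per 1M tokens for one unbatched stream; production servers amortize the hourly rate across many concurrent requests.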


Additional Models Worth Considering

Mixtral 8x7B

Mixture-of-Experts model. Activates only a subset of experts per token, so it runs faster than Llama 3 70B for many tasks with a much smaller memory footprint.

VRAM: 24GB (quantized), 42GB (full)

Speed: 15-25 tok/sec on RTX 4090

Reasoning: 71 MMLU (between Mistral 7B and Gemma 2 27B)

Use case: inference efficiency, running on RTX 4090 without compromise.

Mixtral 8x22B

Larger MoE. Closer to Llama 3 70B performance at a comparable VRAM footprint.

VRAM: 48GB (quantized), 88GB (full)

Speed: 8-12 tok/sec (too large for a single RTX 4090)

Reasoning: 78 MMLU (excellent)

Use case: production inference on dual-GPU setups (2x RTX 4090 or 2x RTX A6000).

Nous Hermes 2 Mixtral

Community fine-tune of Mixtral 8x7B. Optimized for instruction-following and creative tasks.

VRAM: 24GB (quantized)

Speed: same as base Mixtral 8x7B

Reasoning: 73 MMLU (slight improvement over base)

Use case: stronger instruction-following than Mistral 7B at Mixtral's VRAM cost, with no speed penalty over base Mixtral.

Qwen 2 7B

Alibaba's compact model. Strong multilingual support. Training data includes more Chinese/CJK.

VRAM: 4GB (quantized), 14GB (full)

Speed: 32 tok/sec on RTX 4090 (same as Mistral 7B)

Reasoning: 62 MMLU (slightly better than Mistral 7B on multilingual)

Use case: applications serving global users, especially in Asia.

Synthia 7B

Community model fine-tuned on GPT-4 outputs. Better reasoning than base 7B models.

VRAM: 4GB (quantized), 14GB (full)

Speed: 32 tok/sec on RTX 4090

Reasoning: 65 MMLU (competitive with Mistral 7B)

Use case: reasoning tasks where Mistral 7B falls short, avoiding larger models.


Ollama vs Fine-Tuning

Base models from Ollama are general-purpose. Fine-tuning on domain-specific data improves accuracy but requires infrastructure:

Option 1: Fine-tune locally (RTX 4090, 24GB, 8 hours). Download base model, add 10K examples, output custom weights. Use in Ollama with custom model directory.

Option 2: Fine-tune on cloud (RunPod H100, $1.99/hr, 2-4 hours). Faster, more VRAM. Cost: $4-8 per fine-tune job.

Option 3: Use base models with prompt engineering. Elaborate system prompts can match 80% of fine-tuned accuracy without training cost.

Recommendation: start with base + prompt engineering. Fine-tune only if base-model accuracy is inadequate across 100+ test examples.


Installation and Hardware

Minimal Setup (Phi-3 Mini)

Laptop with 8GB RAM + CPU inference. Ollama downloads the model in ~2 minutes.

ollama pull phi3
ollama run phi3 "What is 2+2?"

Expect 5-10 tokens/second on modern CPU.

Consumer GPU (Mistral 7B)

RTX 4090 or RTX A5000. ~$1,500-3,000 hardware cost.

ollama pull mistral:7b
ollama run mistral:7b

Expect 30+ tokens/second.

Cloud GPU (Llama 3 70B)

RunPod H100 at $1.99/hr, Lambda H100 at $2.86/hr. Monthly cloud cost: ~$150-200 for 8 hours daily usage.

ollama pull llama3:70b
ollama run llama3:70b

Expect 12-18 tokens/second.


FAQ

What is the fastest Ollama model?

Phi-3 Mini at 60+ tokens/second on RTX 4090. Mistral 7B is second at ~32 tokens/sec. Both are 7B or smaller. Llama 3 70B is an order of magnitude slower but far more capable.

Should I use Ollama or vLLM?

Ollama is simpler (single binary, downloads models automatically). vLLM is faster (optimized batching and tensor parallelism). For single-user inference, Ollama. For production APIs serving hundreds of requests/sec, vLLM or SGLang.

Can I fine-tune models in Ollama?

Not natively. Download the base model, fine-tune with Hugging Face Transformers or LLaMA-Factory, then add the weights to Ollama's model directory. Ollama handles inference, not training.
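A minimal sketch of that import step, assuming the fine-tuned weights were exported to GGUF (the filename, parameter value, and system prompt here are placeholders):

```
# Modelfile — points Ollama at locally fine-tuned GGUF weights
FROM ./my-finetuned-mistral.gguf
PARAMETER temperature 0.7
SYSTEM You are a concise assistant for internal documentation questions.
```

Register and run it with `ollama create my-mistral -f Modelfile`, then `ollama run my-mistral`.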

What's the quality difference between quantized (Q4) and full precision (F16)?

Negligible on modern models. Q4 reduces VRAM by 75%, speeds inference 20-30%, and loses <0.5% accuracy on most benchmarks. Always quantize unless testing for degradation.

Can I run Ollama on CPU?

Yes, slowly. Phi-3 Mini on CPU: ~1-2 tokens/second. Mistral 7B on CPU: ~0.2 tokens/second. GPU is recommended.

How do I use Ollama models in my application?

Ollama exposes a local API on localhost:11434. Use curl or any HTTP client to send requests. An OpenAI-compatible endpoint is also served at localhost:11434/v1, so OpenAI client libraries work by pointing their base URL there.
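For example, a minimal request against the native /api/generate route. This sketch only builds the request; actually sending it assumes a running Ollama server with mistral:7b pulled:

```python
import json
import urllib.request

# Ollama's native completion route; stream=False returns one JSON object
# instead of a stream of partial chunks.
payload = {"model": "mistral:7b", "prompt": "What is 2+2?", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Requires a running Ollama server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["response"])
```

With stream set to false, the reply is a single JSON object whose `response` field holds the generated text.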


