DAPO: Open-Source RL Training for Reasoning LLMs

Deploybase · December 15, 2025 · LLM Guides

DAPO Open Source LLM Reinforcement Learning: Overview

DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) is an open-source reinforcement learning system for training reasoning LLMs. It addresses a key gap: while companies like OpenAI and DeepSeek have created sophisticated reasoning models, their training methods are not fully disclosed. DAPO provides a reproducible, open-source alternative that achieved 50 points on AIME 2024 using a Qwen2.5-32B base model. Training code, built on the verl framework, and a curated dataset are publicly released.

DAPO was published in early 2025 and introduces several technical improvements over standard GRPO-based RL training: decoupled clipping (separate lower and upper clip thresholds for the importance ratio, with the upper bound raised), dynamic sampling (filtering out prompts where all samples are correct or all wrong), token-level policy gradient loss, and overlong reward shaping (a soft penalty on over-length responses). The result is more stable training and better reasoning performance than naive RL approaches.


DAPO vs RLHF vs DPO: Key Differences

RLHF (Reinforcement Learning from Human Feedback)

Workflow:

  1. Train a reward model on preference pairs (model A vs model B).
  2. Use the reward model to score generated text.
  3. Train the LLM using PPO (policy gradient updates).
  4. Repeat.

Pros: High-quality alignment, empirically proven (used for GPT-4, Claude).

Cons: Requires reward model training (complex), PPO is notoriously unstable (hyperparameter tuning hell), computationally expensive (3-4 stages), takes weeks.

Typical cost: 100+ GPU-hours for a 13B model.

DPO (Direct Preference Optimization)

Workflow:

  1. Collect preference pairs.
  2. Train the LLM directly on preference pairs using a binary classification loss (no reward model, no PPO).
  3. Done.
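Step 2's loss can be written in a few lines (a minimal sketch; production implementations such as TRL's DPOTrainer add masking, batching, and reference-model handling):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Binary classification loss over preference pairs.

    Each argument is the summed log-probability of a response under the
    policy or the frozen reference model; beta controls how far the
    policy may drift from the reference.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy check: when the policy already prefers the chosen response more
# than the reference does, the loss drops below -log(0.5) ~= 0.693.
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                torch.tensor([-6.0]), torch.tensor([-8.0]))
```

The numbers above are toy log-probabilities, not real model outputs.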

Pros: Simple, fast (single training loop), stable, low compute.

Cons: Requires a frozen reference model (complicates distributed training), doesn't decouple alignment from model updates, limited flexibility.

Typical cost: 10-20 GPU-hours for a 13B model.

DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization)

Workflow:

  1. Start with a base language model.
  2. Use rule-based reward functions (e.g., correctness checking for math/code).
  3. Apply GRPO-style policy gradient updates with four key improvements: decoupled clip thresholds, dynamic sampling (filter prompts where all responses are correct/wrong), token-level policy gradient loss, and overlong reward shaping.
  4. Iterate until reasoning performance stabilizes.

Pros: More stable RL training than naive GRPO, does not require a trained reward model (uses rule-based rewards), open-source (code + dataset released), achieves strong AIME performance (50 points with Qwen2.5-32B).

Cons: Still requires GPU-heavy RL training, more complex than DPO, primarily designed for reasoning/math tasks.

Typical cost: GPU cluster training (similar to RLHF scale for the RL phase).

Side-by-Side Comparison

Aspect              | RLHF                                            | DPO                                         | DAPO
Complexity          | High (4 stages: SFT, reward model, RL, iterate) | Low (1 stage: direct preference loss)       | High (RL training with 4 algorithmic improvements)
Compute Cost        | 100+ GPU-hrs                                    | 10-20 GPU-hrs                               | High (GPU-cluster RL training, similar to RLHF)
Training Time       | 2-4 weeks                                       | 1-2 days                                    | Days to weeks (depending on model size)
Stability           | Moderate (PPO tuning)                           | High                                        | High (designed specifically for training stability)
Reward Model        | Required (trained reward model)                 | Not needed (preference pairs used directly) | Not needed (rule-based rewards, e.g., math correctness)
Best For            | General alignment, chat quality                 | Style and preference alignment              | Reasoning tasks (math, code, logic)
Open-Source Support | Moderate (TRL, but PPO is hard)                 | Excellent (mature DPO implementations)      | Available via the verl framework (released with the paper)
Production Adoption | Yes (proprietary teams)                         | Yes (open-source teams)                     | Early (paper released March 2025)

How DAPO Works

Conceptual Framework

DAPO is an RL training algorithm that builds on GRPO (Group Relative Policy Optimization) and addresses four key instability problems in large-scale LLM reinforcement learning:

Technique 1: Decoupled Clip

Standard GRPO clips the importance ratio r = π_θ/π_old to a symmetric range [1 − ε, 1 + ε]. DAPO decouples the two bounds into ε_low and ε_high and raises ε_high, so tokens the old policy considered unlikely can gain probability faster when their advantage is positive. This counteracts entropy collapse (the policy narrowing onto a few safe patterns) while the unchanged lower bound still limits destructive downward updates.
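In code, the decoupled clip is a one-line change to the PPO-style surrogate (a schematic sketch on single token values; verl's actual implementation handles batching and masking):

```python
import torch

def dapo_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with a decoupled clip range.

    ratio = pi_theta(token) / pi_old(token). Standard GRPO clips the
    ratio to [1 - eps, 1 + eps]; DAPO widens only the upper bound
    (eps_high > eps_low), so low-probability tokens with positive
    advantage can grow faster.
    """
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.minimum(ratio * advantage, clipped * advantage)

# A token whose probability is rising (ratio 1.5) with positive
# advantage: a symmetric 0.2 clip would cap the surrogate at 1.2,
# the decoupled clip allows 1.28.
obj = dapo_clip_objective(torch.tensor(1.5), torch.tensor(1.0))
```

The 0.2/0.28 defaults match the values quoted in the configuration section below.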

Technique 2: Dynamic Sampling

Standard GRPO can waste compute on prompts where all generated samples are correct (trivial) or all are wrong (too hard): with identical rewards, the group-relative advantage is zero and the gradient carries no signal. DAPO filters these prompts out dynamically during training, keeping only prompts with at least one correct and at least one incorrect sample, and keeps sampling until the batch is filled with informative prompts.
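Assuming binary 0/1 correctness rewards, the filter is one comparison per prompt group (a sketch):

```python
def keep_prompt(rewards):
    """Dynamic sampling: keep a prompt's group only if its sampled
    responses are neither all correct nor all wrong, so the group-
    relative advantage is non-zero and carries a learning signal."""
    return 0 < sum(rewards) < len(rewards)

groups = {
    "too_easy": [1, 1, 1, 1],  # all correct -> zero advantage, filtered
    "too_hard": [0, 0, 0, 0],  # all wrong   -> zero advantage, filtered
    "useful":   [1, 0, 0, 1],  # mixed       -> kept
}
kept = [name for name, r in groups.items() if keep_prompt(r)]
```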

Technique 3: Token-Level Policy Gradient Loss

Instead of averaging the loss within each response and then across responses (which makes every token in a long response count less than a token in a short one), DAPO averages the loss over all tokens in the batch. Each token then contributes equally to the gradient, so long responses are properly penalized for low-quality patterns (e.g., repetition) and properly rewarded for good reasoning, stabilizing training when response lengths vary widely.
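The difference between the two aggregations is easiest to see on toy numbers (a sketch with made-up per-token losses):

```python
def sample_level_loss(per_token_losses):
    # GRPO-style: average within each response, then across responses.
    # A token in a long response counts less than one in a short response.
    return sum(sum(t) / len(t) for t in per_token_losses) / len(per_token_losses)

def token_level_loss(per_token_losses):
    # DAPO-style: average over all tokens in the batch, so every token
    # carries equal weight regardless of response length.
    flat = [x for t in per_token_losses for x in t]
    return sum(flat) / len(flat)

# One short response (2 tokens, high loss) and one long one (8 tokens, low loss):
losses = [[1.0, 1.0], [0.0] * 8]
seq = sample_level_loss(losses)  # 0.5 -- the 2 short-response tokens dominate
tok = token_level_loss(losses)   # 0.2 -- each of the 10 tokens weighs equally
```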

Technique 4: Overlong Reward Shaping

Very long outputs (exceeding a max length threshold) receive a soft penalty that gradually reduces rewards rather than hard truncation. This discourages the model from generating excessively long reasoning chains without abruptly cutting off training signal.
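A minimal version of this shaping term (the thresholds below are illustrative, not the paper's exact settings):

```python
def overlong_penalty(length, max_len=16384, buffer=4096):
    """Soft overlong punishment: 0 up to max_len - buffer, then a
    linear ramp down to -1 at max_len, and -1 beyond. The value is
    added to the rule-based correctness reward."""
    soft_start = max_len - buffer
    if length <= soft_start:
        return 0.0
    if length >= max_len:
        return -1.0
    return (soft_start - length) / buffer

p_short = overlong_penalty(8000)   # well under the buffer -> 0.0
p_mid = overlong_penalty(14336)    # halfway through the buffer -> -0.5
p_long = overlong_penalty(20000)   # past the cap -> -1.0
```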

Training Loop

DAPO uses rule-based reward functions (not a trained reward model):

  1. Sample N completions from the policy model for each prompt.
  2. Score each completion using a rule-based reward (e.g., math answer correctness check, code execution pass/fail).
  3. Apply dynamic sampling to filter uninformative prompts.
  4. Compute token-level policy gradient loss with decoupled clip.
  5. Apply gradient update; repeat.

This approach does not require training a separate reward model — rewards come directly from objective correctness checks.
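A reward of this kind is just a string check. The `Answer:` convention below is an assumption for illustration; real graders (e.g., math-verify-style checkers used in open RL stacks) also normalize LaTeX and mathematically equivalent forms:

```python
def math_reward(model_output, ground_truth):
    """Rule-based reward: 1.0 if the model's final answer matches the
    ground truth exactly, else 0.0. Deliberately minimal: assumes the
    model ends its response with an 'Answer: <value>' line."""
    marker = "Answer:"
    if marker not in model_output:
        return 0.0
    answer = model_output.rsplit(marker, 1)[1].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

r1 = math_reward("... so the result is 12.\nAnswer: 12", "12")  # 1.0
r2 = math_reward("... therefore\nAnswer: 15", "12")             # 0.0
```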


Training Workflow

DAPO training follows an RL loop using the verl framework (released by the DAPO authors).

Step 1: Prepare Reasoning Prompts

Collect math or coding problems with verifiable ground-truth answers. The DAPO paper used AIME and competition math problems. The reward is simply: does the model's final answer match the ground truth?

Format: JSON lines with a prompt and the expected answer.

{
  "prompt": "Solve: Find all positive integers n such that n^2 + 2 divides n^3 + 4.",
  "answer": "2"
}

Data Sources:

  • AIME problems (publicly available)
  • AMC competition problems
  • Code execution tasks (pass/fail reward from running tests)

Step 2: Run DAPO RL Training

DAPO uses the verl framework. Key configuration parameters:

  • clip_ratio_low: lower clip offset for the importance ratio (e.g., 0.2)
  • clip_ratio_high: upper clip offset, set higher to encourage exploration (e.g., 0.28)
  • max_response_length: cap for overlong reward shaping
  • filter_groups: enable dynamic sampling (filter all-correct/all-wrong groups)
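A launch command in the verl style might look like the following. The hydra-style override names here are illustrative; check the verl DAPO recipe for the current schema before running:

```shell
# Sketch of a verl launch with DAPO-style settings (parameter names
# and values illustrative; consult the verl DAPO recipe).
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=dapo_math_prompts.parquet \
    data.max_response_length=16384 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B \
    actor_rollout_ref.actor.clip_ratio_low=0.2 \
    actor_rollout_ref.actor.clip_ratio_high=0.28 \
    actor_rollout_ref.actor.loss_agg_mode=token-mean \
    algorithm.filter_groups.enable=True
```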

Hardware: Multi-GPU setup (the DAPO paper used a GPU cluster for 32B training). Time: Days to weeks depending on model size and dataset.

Step 3: Evaluate on Reasoning Benchmarks

After training, evaluate on AIME 2024, MATH-500, or similar reasoning benchmarks to measure improvement. The DAPO paper reports 50 points on AIME 2024 using Qwen2.5-32B as the base model.
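Benchmark scores of this kind reduce to a simple aggregation over rule-based grades; a sketch assuming binary per-sample grades and k samples per problem (the "avg@k" protocol):

```python
def avg_at_k(grades_per_problem):
    """Mean per-sample accuracy over problems (avg@k): each problem
    contributes the fraction of its k sampled answers graded correct."""
    return sum(sum(g) / len(g) for g in grades_per_problem) / len(grades_per_problem)

# 3 toy problems, 4 samples each:
grades = [[1, 1, 1, 1], [1, 0, 0, 1], [0, 0, 0, 0]]
score = avg_at_k(grades)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```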


Hardware Requirements

DAPO is RL training on large models, which requires substantial GPU resources.

Small-Scale Experiments (7B–13B models)

  • 4–8x 80GB GPUs (H100 or A100)
  • 64GB+ system RAM
  • 1TB disk (checkpoints)

Time: 1–3 days for a full training run.

Cost (cloud): $200–800 depending on GPU hours.

Production Scale (32B model, as in the DAPO paper)

  • GPU cluster (multiple nodes, 32–64 GPUs)
  • High-speed interconnect (NVLink or InfiniBand recommended)
  • Multi-TB disk for checkpoints

Time: Days to weeks.

Cost (cloud): $5,000–$50,000+ depending on scale and duration.

Note on Compute Requirements

Unlike DPO (which trains on preference pairs in a single pass), DAPO is RL training and requires many iterations of generation + policy updates. The compute is comparable to RLHF — DAPO's advantage is training stability and not needing a separate reward model, not reduced compute.


Implementation Examples

Example 1: DPO Preference Baseline for Summarization (Mistral 7B)

The hands-on examples in this section use TRL's DPOTrainer, the lightweight preference-based alternative compared earlier; reproducing DAPO's RL loop itself goes through the verl framework described in the Training Workflow above.

Goal: Fine-tune a 7B model to prefer high-quality summaries.

Dataset: 5K summary pairs (human-written vs AI-generated), labeled with preference.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Expects JSONL records with "prompt", "chosen", and "rejected" fields.
dataset = load_dataset("json", data_files="summary_preferences.jsonl")

training_args = DPOConfig(
    output_dir="./alignment_model",
    learning_rate=5e-6,  # DPO typically uses small learning rates
    per_device_train_batch_size=4,
    num_train_epochs=3,
    beta=0.1,  # KL-style penalty against the reference model
)

trainer = DPOTrainer(
    model=model,  # the frozen reference model is created automatically
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)

trainer.train()

model.save_pretrained("./alignment_model_final")
tokenizer.save_pretrained("./alignment_model_final")

Result: A 7B model fine-tuned to prefer high-quality summaries, usable directly or as an inference-time ranker.

Example 2: Preference Training on Code Correctness Pairs (CodeLlama 7B)

Goal: Tune a CodeLlama 7B model toward generating correct Python, using correct/incorrect implementation pairs as preferences. (With executable tests, the same labels could instead drive a DAPO-style rule-based reward.)

Dataset: 10K code snippets, each with a correct and an incorrect implementation.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/CodeLlama-7b-Instruct-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/CodeLlama-7b-Instruct-hf")

def format_dataset(batch):
    # A batched map receives and returns a dict of column lists.
    return {
        "prompt": batch["prompt"],
        "chosen": batch["correct_code"],
        "rejected": batch["incorrect_code"],
    }

dataset = load_dataset("json", data_files="code_pairs.jsonl")
dataset = dataset.map(
    format_dataset,
    batched=True,
    remove_columns=["correct_code", "incorrect_code"],
)

training_args = DPOConfig(
    output_dir="./code_alignment",
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    num_train_epochs=2,
    save_steps=500,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)

trainer.train()

import torch

def rank_code(prompt, candidates):
    scores = []
    for code in candidates:
        inputs = tokenizer(prompt + code, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # Score by average token log-likelihood under the tuned model
        # (higher = the model considers the completion more natural).
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        token_ids = inputs["input_ids"][:, 1:]
        token_logps = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
        scores.append(token_logps.mean().item())
    best_idx = scores.index(max(scores))
    return candidates[best_idx]

prompt = "Write a function to sort a list:"
candidates = [
    "def sort_list(x):\n    return sorted(x)\n",
    "def sort_list(x):\n    for i in range(len(x)):\n        for j in range(i+1, len(x)):\n            if x[j] < x[i]:\n                x[i], x[j] = x[j], x[i]\n    return x\n"
]

best_code = rank_code(prompt, candidates)
print(best_code)

Result: Code completions ranked by likelihood under the preference-tuned model.

Example 3: Safety-Focused Alignment

Goal: Train a preference-tuned model whose likelihood scores can be used to filter unsafe completions (jailbreaks, harmful content).

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Expects JSONL records with "prompt", "chosen" (safe), "rejected" (unsafe).
dataset = load_dataset("json", data_files="safety_pairs.jsonl")

training_args = DPOConfig(
    output_dir="./safety_alignment",
    learning_rate=5e-6,
    per_device_train_batch_size=16,
    num_train_epochs=5,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)

trainer.train()

import torch

def filter_completions(prompt, candidates, threshold=-2.0):
    # Threshold is on average token log-likelihood; -2.0 is a placeholder
    # that must be calibrated on held-out safe/unsafe examples.
    safe_scores = []
    for text in candidates:
        inputs = tokenizer(prompt + text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            logits = model(**inputs).logits
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        token_ids = inputs["input_ids"][:, 1:]
        token_logps = log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
        safe_scores.append(token_logps.mean().item())

    # Keep completions above the safety threshold; fall back to the first
    # candidate if nothing passes (consider refusing instead in production).
    safe_completions = [c for c, s in zip(candidates, safe_scores) if s > threshold]
    return safe_completions if safe_completions else [candidates[0]]

Result: A safety filter that ranks completions by harmlessness.


Use Cases

Use Case 1: Fine-Tuning Open-Source Models for Domain-Specific Tasks

(The use cases below follow the preference-tuning and reranking patterns compared earlier; DAPO-style RL is the better fit when rewards are objectively verifiable, as in math and code.)

A healthcare startup wants to fine-tune Llama 2 7B for medical report summarization. Quality matters (liability risk), but training from scratch isn't feasible.

Workflow:

  1. Collect 5K medical reports + human summaries (preference pairs: good vs mediocre).
  2. Train alignment model on preferences (8 hours, 1 GPU).
  3. Use alignment model to score LLM completions during inference.
  4. Distill back into Llama 2 (optional, adds 12 hours).

Result: Domain-aligned model without RLHF complexity.

Cost: $20-50 in cloud GPU time.

Use Case 2: Content Moderation and Safety Alignment

A platform needs to filter harmful content in user-generated text.

Workflow:

  1. Collect 10K (harmful vs benign) text pairs.
  2. Train alignment model to detect harmful content.
  3. Use as a post-processing filter for all model outputs.

Result: Safety filter that can be updated weekly as new risks emerge.

Cost: $30-60 per update cycle.

Use Case 3: Multi-Model Consistency

A company runs multiple LLMs (Mistral, Llama, CodeLlama) and wants consistent alignment across all.

Workflow:

  1. Train a single alignment model on company-specific preferences.
  2. Use the same alignment model to rank outputs from all LLM variants.
  3. Deploy as inference-time ranker, no LLM retraining required.

Result: Unified alignment across heterogeneous models without retraining each one.

Cost: Single alignment model training, no per-LLM cost.


FAQ

How does DAPO compare to RLHF in terms of final model quality?

They target different settings. RLHF optimizes broad human preferences through a trained reward model and remains the standard for general chat alignment (used by OpenAI and Anthropic). DAPO targets verifiable reasoning tasks with rule-based rewards; on those benchmarks it is highly competitive, reaching 50 points on AIME 2024 with Qwen2.5-32B, which the paper reports surpasses a prior R1-style result on the same base model with about half the training steps. Compute is comparable to RLHF's RL phase; the savings are in stability and in not needing a reward model, not in cost.

Can I use DAPO on proprietary models (GPT-4, Claude)?

No. DAPO requires access to model weights and the ability to train or fine-tune. You cannot align models you don't control. For proprietary models, you're limited to prompting techniques (system prompts, few-shot examples).

Does DAPO work for safety alignment or just performance alignment?

DAPO optimizes whatever reward function you define. Rule-based correctness rewards make it a natural fit for math and code. Safety alignment is possible in principle if you can express safety as a programmatic or classifier-based reward, but that signal is far noisier than exact-match correctness, and the paper focuses on reasoning tasks.

What's the minimum dataset size to train a useful alignment model?

For DAPO, the training unit is a prompt with a verifiable answer rather than a preference pair; the paper released a curated math set of roughly 17K prompts (DAPO-Math-17K). For the DPO-style workflows shown above, 1K-2K preference pairs can train a usable model for a narrow task, 10K-50K pairs are recommended for production, and returns diminish beyond 100K.

How often should I retrain the alignment model?

For DAPO's rule-based rewards (e.g., math correctness), there is no preference drift to chase: retrain or continue RL when you add new problem types. For preference-trained models, it depends on drift: retrain quarterly for static tasks (summarization, classification) or when user feedback indicates degradation, and monthly for dynamic tasks (content moderation, safety) as new risks emerge.

Can DAPO replace traditional LoRA fine-tuning?

They are orthogonal. LoRA is a parameter-efficient way to apply weight updates; DAPO is a training objective and RL recipe that updates the policy model itself (the paper trains full parameters). DAPO does not train a separate scoring model. In principle the two can combine, with LoRA adapters reducing the memory cost of RL, though that is not the paper's setup.

What's the inference latency hit from using an alignment model?

None from DAPO itself: it produces a directly fine-tuned policy, so inference is unchanged. The reranking pattern shown in the examples above is different; scoring N candidates with a separate model adds latency proportional to N and the scorer's size, which is why some teams distill preference signals back into the LLM instead.

Can I combine DAPO with quantization (e.g., 4-bit, 8-bit)?

Yes, for deployment. Run the RL training itself in full or mixed precision, then quantize the resulting policy (4-bit or 8-bit) for serving. Quantizing before or during RL training is riskier, since rollouts and gradient updates are sensitive to precision.

How does DAPO handle distributional shift (preferences change over time)?

DAPO's rule-based rewards (answer correctness) do not drift the way human preferences do, so the trained model stays valid as long as the task does. What can shift is the problem distribution: if new problem types appear, continue RL or retrain on fresh prompts. Preference-based components (DPO models, reward models) do degrade under preference drift and need periodic retraining.


