Contents
- RLHF Fine Tune LLM Single H100: Overview
- Hardware and Cost
- RLHF Architecture
- Setup and Dependencies
- Step 1: Prepare Training Data
- Step 2: Build Reward Model
- Step 3: Implement PPO Training
- Step 4: DPO Alternative (Recommended)
- Step 5: Validate and Deploy
- VRAM Optimization Strategies
- Training Tips and Tricks
- Troubleshooting Guide
- FAQ
- Related Resources
- Sources
RLHF Fine Tune LLM Single H100: Overview
This guide covers RLHF fine-tuning of an LLM on a single H100, which is practical now thanks to LoRA and TRL. RLHF (Reinforcement Learning from Human Feedback) is how OpenAI, Anthropic, and Mistral align models to what humans prefer. The workflow: train a reward model (which learns which outputs humans like), then use PPO (Proximal Policy Optimization) to push the LLM toward higher rewards. The alternative, DPO (Direct Preference Optimization), skips the reward model and learns directly from preference pairs.
A single H100 fits 7B-13B models with LoRA. 70B models need distributed training or aggressive quantization. This guide trains Mistral 7B with both PPO and DPO on one H100, with costs and VRAM budgets.
Cost: H100 at $1.99/hr × 24 hours = $47.76 per full RLHF run. Buying an H100 (roughly $25K-$30K) only breaks even after about 1.5-2 years of continuous use at that rate. Cloud rental or shared providers (RunPod, Lambda) make more sense for occasional runs.
Hardware and Cost
Single H100 Configuration
- GPU: NVIDIA H100 PCIe, 80GB HBM2e memory
- System RAM: 128GB minimum (preferably 256GB for data pipeline)
- Storage: 100GB free (model weights, datasets, checkpoints)
- Network: 10Gbps+ for downloading models from Hugging Face
- Cloud Cost: RunPod H100 PCIe at $1.99/hr, Lambda H100 PCIe at $2.86/hr
VRAM Breakdown (Mistral 7B, RLHF Workflow)
| Stage | Memory Usage | Notes |
|---|---|---|
| Base model (Mistral 7B, FP16) | 14GB | Weights + activations |
| Optimizer states (Adam) | 28GB | 2x model size for momentum + variance |
| Batch size 16, seq length 512 | 18GB | Gradient computation |
| LoRA adapters | 1GB | Rank 32, 64 alpha |
| Total active during training | 61GB | Fits on H100 with margin |
H100's 80GB handles this configuration with reasonable headroom. Peak usage is approximately 61GB during gradient updates.
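The table's total can be reproduced with back-of-envelope arithmetic. A sketch (the numbers mirror the table above and are rough estimates, not measurements; the helper name is illustrative):

```python
def vram_gb(params_b, bytes_per_param=2, optimizer_multiplier=2,
            activation_gb=18, adapter_gb=1):
    """Back-of-envelope VRAM estimate mirroring the table above.

    params_b: parameter count in billions (7 for Mistral 7B).
    bytes_per_param: 2 for FP16 weights.
    optimizer_multiplier: Adam keeps momentum + variance, ~2x model size.
    """
    model_gb = params_b * bytes_per_param          # 7B * 2 bytes = 14 GB
    optimizer_gb = model_gb * optimizer_multiplier  # 28 GB
    return model_gb + optimizer_gb + activation_gb + adapter_gb

print(vram_gb(7))  # 61, matching the table's total
```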
Alternative: Quantized Setup
Use bitsandbytes 4-bit quantization to cut total training VRAM from ~61GB to ~40GB:
| Stage | Quantized VRAM |
|---|---|
| Base model (Mistral 7B, 4-bit) | 3.5GB |
| Optimizer states | 21GB |
| Batch, adapters | 15GB |
| Total | ~40GB |
Quantized training trades some quality for memory but fits comfortably. 4-bit quantization typically costs 1-3% accuracy, which is usually acceptable for alignment work.
RLHF Architecture
Three-Stage Workflow
Stage 1: Supervised Fine-Tuning (SFT) Start with base model (Mistral 7B). Train on instruction-response pairs (1-10K examples). Objective: make model follow instructions well. VRAM: 45GB. Time: 2-8 hours on 1K-10K examples.
Stage 2: Reward Model Training Train separate classifier to predict human preference. Input: (prompt, response_A, response_B). Output: which response is better? This is binary classification. Use preference pairs from dataset (1K-10K pairs). VRAM: 30GB. Time: 1-4 hours.
Stage 3: PPO Optimization Use reward model to score model generations. Run inference on base model, collect samples, score them, compute PPO loss, update base model. Most VRAM-intensive (maintains base model + reward model simultaneously). VRAM: 70GB. Time: 8-24 hours.
DPO as Faster Alternative
Direct Preference Optimization (DPO) skips the reward model entirely. Directly train base model on preference pairs: (prompt, preferred_response, rejected_response). Loss function directly optimizes for preference without intermediate reward model.
Advantages: single model in memory, faster convergence, simpler pipeline. Disadvantages: direct optimization can be noisier if preference data is poor quality.
VRAM: 40GB (one model). Time: 2-6 hours. Quality: similar to PPO, sometimes better on subjective tasks.
For resource-constrained setups (single H100), DPO is preferred over PPO due to lower VRAM and faster convergence.
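The DPO loss itself is compact enough to sketch. A minimal illustration, assuming summed per-token log-probabilities are already computed (names are illustrative, not TRL's internals):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss on a batch of preference pairs.

    Each argument is a tensor of summed token log-probs, shape (batch,).
    beta scales how strongly the policy is pulled away from the reference.
    """
    # Log-ratios of the trained policy vs. the frozen reference model
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Loss is low when the chosen log-ratio exceeds the rejected one
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Sanity check with dummy log-probs: preferring chosen lowers the loss
good = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                torch.tensor([-6.0]), torch.tensor([-6.0]))
bad = dpo_loss(torch.tensor([-9.0]), torch.tensor([-5.0]),
               torch.tensor([-6.0]), torch.tensor([-6.0]))
print(good.item() < bad.item())  # True
```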
Setup and Dependencies
Python Environment
python -m venv rlhf-env
source rlhf-env/bin/activate
pip install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/cu121
pip install transformers datasets peft trl \
bitsandbytes wandb numpy scikit-learn
Key packages:
- torch: Deep learning framework (CUDA 12.1 for H100)
- transformers: Hugging Face models and tokenizers
- peft: LoRA and quantization (LoRA config, get_peft_model)
- trl: RLHF training loops (PPOTrainer, DPOTrainer)
- bitsandbytes: 4-bit quantization
- wandb: Experiment tracking and visualization
Model Imports
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import PPOTrainer, PPOConfig, DPOTrainer, DPOConfig
from datasets import load_dataset
import torch
Configuration Template
MODEL_ID = "mistralai/Mistral-7B-v0.1"
LEARNING_RATE = 1e-4
BATCH_SIZE = 16
EPOCHS = 3
LORA_RANK = 32
LORA_ALPHA = 64
lora_config = LoraConfig(
    r=LORA_RANK,
    lora_alpha=LORA_ALPHA,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
Step 1: Prepare Training Data
Instruction-Response Format (SFT Stage)
Create dataset with instruction, input, output fields:
{
    "instruction": "Summarize the following article in 3 sentences.",
    "input": "Article text here...",
    "output": "Summary here..."
}
Save as JSONL (one JSON object per line). Load and tokenize:
from datasets import load_dataset
dataset = load_dataset("json", data_files="training_data.jsonl")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token; needed for padding="max_length"
def tokenize_function(examples):
    # With batched=True, `examples` is a dict of lists, not a list of dicts
    texts = [
        f"{instruction}\n{inp}\n{output}"
        for instruction, inp, output in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    tokenized = tokenizer(
        texts,
        max_length=512,
        truncation=True,
        padding="max_length",
        return_tensors="pt"
    )
    return tokenized
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset["train"].column_names  # load_dataset("json") returns a DatasetDict
)
Preference Pairs Format (PPO/DPO)
Format for preference pairs:
{
    "prompt": "Summarize this article",
    "chosen": "Long, detailed summary with key points",
    "rejected": "Short, vague summary"
}
Expectation: >1K preference pairs for meaningful training. 10K+ pairs for production quality. Each pair should represent genuine preference (better/worse based on quality, safety, relevance).
Preference data sources:
- Your own annotation (expensive, ~$0.50 per pair via Mechanical Turk)
- Open datasets (OpenOrca, Anthropic HH, UltraFeedback on Hugging Face)
- Synthetic pairs (use GPT-4o to rank model outputs, create weak labels)
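A quick pre-training pass over the pairs can catch degenerate data before it wastes a run. A minimal sketch with illustrative thresholds (`filter_preference_pairs` is a hypothetical helper, not part of any library):

```python
def filter_preference_pairs(pairs, min_length=10):
    """Drop preference pairs with no usable training signal.

    `pairs` is a list of dicts with 'prompt', 'chosen', 'rejected' keys,
    matching the JSONL format above. Thresholds are illustrative.
    """
    clean = []
    for pair in pairs:
        chosen, rejected = pair["chosen"].strip(), pair["rejected"].strip()
        if chosen == rejected:        # identical responses: no preference signal
            continue
        if len(chosen) < min_length:  # degenerate chosen response
            continue
        clean.append(pair)
    return clean

pairs = [
    {"prompt": "p", "chosen": "A detailed, grounded answer.", "rejected": "Vague."},
    {"prompt": "p", "chosen": "Same text.", "rejected": "Same text."},  # dropped
]
print(len(filter_preference_pairs(pairs)))  # 1
```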
Step 2: Build Reward Model
Skip this section for DPO (uses no reward model). Include for PPO.
Reward Model Architecture
Classification head on language model. Input: prompt + response. Output: scalar score (1-5) or binary (preferred/rejected).
from transformers import AutoModelForSequenceClassification
import torch
class RewardModel(torch.nn.Module):
    def __init__(self, model_id="mistralai/Mistral-7B-v0.1"):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_id,
            num_labels=1,  # Single scalar reward head
            torch_dtype=torch.float16
        ).to("cuda")  # Inputs below are moved to cuda, so the model must be too
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def score_response(self, prompt, response):
        text = f"{prompt}\n{response}"
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=512
        ).to("cuda")
        with torch.no_grad():
            outputs = self.model(**inputs)
        score = torch.sigmoid(outputs.logits[0][0]).item()  # Sigmoid maps to [0, 1]
        return score

    def score_pair(self, prompt, response_a, response_b):
        score_a = self.score_response(prompt, response_a)
        score_b = self.score_response(prompt, response_b)
        return score_a, score_b
Training Reward Model
from torch.optim import AdamW
from torch.nn import MarginRankingLoss
def train_reward_model(model, train_loader, epochs=3, learning_rate=1e-4):
    optimizer = AdamW(model.parameters(), lr=learning_rate)
    loss_fn = MarginRankingLoss(margin=0.5)  # Margin loss: chosen > rejected + margin
    if model.tokenizer.pad_token is None:
        model.tokenizer.pad_token = model.tokenizer.eos_token  # Mistral has no pad token
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for batch in train_loader:
            # Forward passes must run WITH gradients so the loss reaches the model;
            # score_pair uses torch.no_grad and would break backpropagation here
            chosen_inputs = model.tokenizer(
                [f"{p}\n{c}" for p, c in zip(batch['prompt'], batch['chosen'])],
                return_tensors="pt", truncation=True, max_length=512, padding=True
            ).to("cuda")
            rejected_inputs = model.tokenizer(
                [f"{p}\n{r}" for p, r in zip(batch['prompt'], batch['rejected'])],
                return_tensors="pt", truncation=True, max_length=512, padding=True
            ).to("cuda")
            scores_chosen = model.model(**chosen_inputs).logits.squeeze(-1)
            scores_rejected = model.model(**rejected_inputs).logits.squeeze(-1)
            # Loss: we want chosen > rejected (target of 1 ranks the first input higher)
            loss = loss_fn(
                scores_chosen,
                scores_rejected,
                torch.ones_like(scores_chosen)
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch}: Avg Loss {avg_loss:.4f}")
    return model
VRAM usage during reward training: ~30GB. Time: 1-4 hours on 5K-10K pairs.
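Before plugging the reward model into PPO, check pairwise accuracy on held-out pairs; a usable reward model should land well above 0.5. A sketch using a stub scorer in place of the real `score_response` (the stub and data are illustrative):

```python
def pairwise_accuracy(score_fn, eval_pairs):
    """Fraction of held-out pairs where chosen scores above rejected.

    `score_fn(prompt, response) -> float` would wrap the reward model's
    score_response; here a stub that prefers longer responses stands in.
    """
    correct = 0
    for pair in eval_pairs:
        s_chosen = score_fn(pair["prompt"], pair["chosen"])
        s_rejected = score_fn(pair["prompt"], pair["rejected"])
        correct += int(s_chosen > s_rejected)
    return correct / len(eval_pairs)

# Stub scorer standing in for RewardModel.score_response
stub = lambda prompt, response: len(response)
pairs = [
    {"prompt": "q", "chosen": "long detailed answer", "rejected": "short"},
    {"prompt": "q", "chosen": "thorough reply", "rejected": "no"},
]
print(pairwise_accuracy(stub, pairs))  # 1.0
```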
Step 3: Implement PPO Training
PPO Trainer Setup
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
# Note: TRL's PPO API has changed across versions; this targets the classic trl 0.x interface
ppo_config = PPOConfig(
    model_name=MODEL_ID,
    learning_rate=1e-5,
    batch_size=16,
    mini_batch_size=4,
    ppo_epochs=4,
    gradient_accumulation_steps=1,
)
# PPO needs a value head on the policy; the reward model is called manually during training
model = AutoModelForCausalLMWithValueHead.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    tokenizer=tokenizer,
    dataset=train_dataset,
    data_collator=data_collator,
)
PPO Training Loop
def train_with_ppo(trainer, reward_model, num_steps=1000):
    data_iter = iter(trainer.dataloader)  # PPOTrainer builds this from its dataset
    for step in range(num_steps):
        # Sample the next batch (restart the loader when exhausted)
        try:
            batch = next(data_iter)
        except StopIteration:
            data_iter = iter(trainer.dataloader)
            batch = next(data_iter)
        query_tensors = [q.to("cuda") for q in batch['input_ids']]
        # Generate responses from the policy (generation length goes here, not in PPOConfig)
        response_tensors = trainer.generate(
            query_tensors,
            max_new_tokens=128,
            temperature=0.7,
            do_sample=True,
            return_prompt=False,
        )
        responses = [
            tokenizer.decode(r, skip_special_tokens=True)
            for r in response_tensors
        ]
        # Score responses with the reward model; step() expects lists of tensors
        rewards = []
        for query, response in zip(query_tensors, responses):
            prompt = tokenizer.decode(query, skip_special_tokens=True)
            score = reward_model.score_response(prompt, response)
            rewards.append(torch.tensor(score))
        # PPO update step
        stats = trainer.step(query_tensors, response_tensors, rewards)
        if step % 100 == 0:
            print(f"Step {step}: Reward {stats['env/reward_mean']:.4f}, "
                  f"Loss {stats['ppo/loss/total']:.4f}")
    return trainer.model
VRAM peak: 75GB during gradient computation and reward scoring. Training time: 8-24 hours for 1,000-5,000 PPO steps (until convergence).
Step 4: DPO Alternative (Recommended)
DPO is faster and simpler. Directly trains on preference pairs without reward model.
from trl import DPOTrainer, DPOConfig
dpo_config = DPOConfig(
    output_dir="./dpo_output",
    learning_rate=5e-4,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    max_prompt_length=256,
    max_completion_length=256,
    beta=0.1,  # KL penalty strength (higher = stays closer to the reference model)
)
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # Trainer creates frozen reference internally
    args=dpo_config,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    peft_config=lora_config,
)
dpo_trainer.train()
DPO training time: 2-6 hours (3 epochs). VRAM: 40GB. No separate reward model. Convergence: often faster than PPO (fewer moving parts).
Step 5: Validate and Deploy
Generate and Evaluate Samples
def evaluate_model(model, test_prompts, num_samples=2):
    model.eval()
    for prompt in test_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        print(f"\nPrompt: {prompt}")
        for i in range(num_samples):
            outputs = model.generate(
                **inputs,
                max_new_tokens=256,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                pad_token_id=tokenizer.eos_token_id,
            )
            response = tokenizer.decode(outputs[0], skip_special_tokens=True)
            response_only = response[len(prompt):].strip()
            print(f"Sample {i+1}: {response_only}\n")
Save LoRA Adapter
model.save_pretrained("./lora_adapter")
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, "./lora_adapter")
merged_model = model.merge_and_unload() # Merges LoRA into base
merged_model.save_pretrained("./merged_model")
File sizes:
- Base model (Mistral 7B FP16): 14GB
- LoRA adapter: 50MB
- Merged model: 14GB
VRAM Optimization Strategies
1. Gradient Checkpointing
Recompute intermediate activations during the backward pass instead of storing them. Reduces VRAM by 20-30%, adds 10-20% training time.
model.gradient_checkpointing_enable()
2. 4-Bit Quantization (bitsandbytes)
Reduce model VRAM by 75%. Training speed: 5-10% slower. Quality: negligible loss.
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)
3. Flash Attention 2
Faster and more memory-efficient attention. Requires FA2 support (H100 supports it).
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # must be set at load time, not on config afterward
)
4. Reduce Batch Size
Trade throughput for VRAM. Dropping batch size from 16 to 8 saves ~10GB but roughly doubles the number of optimizer steps for the same data.
5. Reduce Context Length
Truncate sequences to 256-512 tokens instead of 2K. Saves ~30% VRAM per batch.
6. Distributed Training (if 2+ GPUs available)
Use DeepSpeed or Fully Sharded Data Parallel (FSDP). Reduces per-GPU VRAM.
Combined Optimization
Quantized (4-bit) + gradient checkpointing + flash attention + batch 4 + context 256 = 20GB VRAM (works on RTX 4090). Trade-off: 3-4x slower training.
Training Tips and Tricks
Data Quality
Preference data quality is critical. Bad pairs (where the chosen response is not actually better than the rejected one) lead to misalignment. Spend time curating.
Signs of bad preference data:
- Chosen and rejected responses are nearly identical
- Rejected is objectively better (rare, but happens with synthetic data)
- No clear preference signal
Learning Rate Scheduling
PPO is sensitive to learning rate. Start with 1e-5, monitor loss. If diverging (loss increasing), reduce to 5e-6. If not converging (loss stagnant), increase to 2e-5.
DPO is less sensitive; 5e-4 is standard.
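If a fixed learning rate stalls, cosine decay toward a floor is a common choice. A sketch of the schedule's shape using a dummy parameter (the LoRA weights would take its place in a real run; the step count is illustrative):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Dummy parameter stands in for the trainable weights; the schedule shape is the point
param = torch.nn.Parameter(torch.zeros(1))
optimizer = AdamW([param], lr=1e-5)
scheduler = CosineAnnealingLR(optimizer, T_max=1000, eta_min=1e-6)

for _ in range(1000):
    optimizer.step()   # would apply gradients in a real loop
    scheduler.step()   # decay LR along the cosine curve

# After T_max steps the LR has decayed from 1e-5 to the eta_min floor
print(optimizer.param_groups[0]["lr"])
```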
Monitoring and Logging
Use WandB (Weights & Biases) for experiment tracking:
import wandb
wandb.login()  # Authenticate with WandB API key
Watch for:
- Reward increasing (good)
- KL divergence from reference model (should stay < 5 bits/token)
- Loss decreasing (good)
Checkpoint Frequency
Save checkpoints every 100-200 steps. Don't overwrite; keep all checkpoints. Evaluate on validation set at each checkpoint. Pick best checkpoint (highest reward + quality).
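Picking the best checkpoint then reduces to tracking validation reward per saved step. A minimal sketch (paths and rewards are illustrative; `best_checkpoint` is a hypothetical helper):

```python
def best_checkpoint(checkpoints):
    """Return the checkpoint path with the highest validation reward.

    `checkpoints` maps checkpoint path -> validation reward.
    """
    return max(checkpoints, key=checkpoints.get)

history = {
    "ckpt-100": 0.42,
    "ckpt-200": 0.57,
    "ckpt-300": 0.51,  # reward regressed: this is why earlier checkpoints are kept
}
print(best_checkpoint(history))  # ckpt-200
```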
Troubleshooting Guide
OOM (Out of Memory) Error
- Verify VRAM with nvidia-smi. Peak should be <80GB.
- Enable gradient checkpointing: model.gradient_checkpointing_enable()
- Reduce batch size (16 to 8 or 4).
- Enable 4-bit quantization.
- Reduce max sequence length (512 to 256).
Training Loss Not Decreasing
- Check learning rate. PPO: 1e-5 typical. DPO: 5e-4 typical.
- Verify reward model is working (test manually on known good/bad pairs).
- Check preference data quality (sample and review pairs).
- Increase training steps/epochs.
- Check for NaNs in loss (indicator of numerical instability).
Reward Model Scores All Same
- Reward model may be randomly initialized. Test on known pairs first.
- Verify loss function (should be margin loss, not MSE).
- Check tokenizer compatibility between base and reward models.
- Train reward model for more epochs.
Divergence During PPO
- Reduce PPO learning rate by 2-4x (1e-5 to 5e-6).
- Reduce PPO epochs (ppo_epochs=2 instead of 4).
- Increase mini_batch_size (gradient accumulation).
- Verify reward model is not over-confident (check score ranges: should be 0.3-0.7 typically, not 0-1 extremes).
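The over-confidence check can be automated: score a validation batch and flag a saturated sigmoid. A sketch with illustrative thresholds (`score_range_ok` is a hypothetical helper):

```python
def score_range_ok(scores, lo=0.05, hi=0.95, max_extreme_frac=0.2):
    """Flag an over-confident reward model: too many scores pinned near 0 or 1."""
    extreme = sum(1 for s in scores if s < lo or s > hi)
    return extreme / len(scores) <= max_extreme_frac

print(score_range_ok([0.35, 0.6, 0.48, 0.7]))    # True: healthy spread
print(score_range_ok([0.99, 0.01, 0.98, 0.02]))  # False: saturated sigmoid
```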
FAQ
How long does full RLHF take on H100?
SFT: 2-8 hours (1K-10K examples). Reward model: 1-4 hours (5K-10K preference pairs). PPO: 8-24 hours (1K-5K steps). DPO: 2-6 hours (3 epochs). Total: 12-36 hours (PPO) or 5-18 hours (DPO).
Can I train 13B or 70B models on single H100?
13B: Yes, with quantization and LoRA. 70B: Not practical on single H100. Use distributed training (8xH100) or quantization + offloading (very slow).
Is RLHF worth it for small datasets?
Yes if dataset is high-quality (100+ preference pairs). Marginal gains with <50 pairs. SFT (supervised fine-tuning) is more efficient for small data.
Where do I get preference data?
- OpenOrca (Hugging Face): 1M preference pairs
- Anthropic HH: 160K preference pairs
- UltraFeedback: 64K preference pairs
- Annotate yourself (expensive, $0.50 per pair)
- Synthetic (use GPT-4o to rank outputs, create weak labels)
How do I evaluate if RLHF improved the model?
A/B test: generate samples from original and RLHF-trained models, have humans rate. Or use automatic metrics: BLEU, ROUGE, BERTScore (check if preferred outputs score higher).
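The A/B comparison reduces to a preference rate over rater verdicts. A minimal sketch (the verdict data is illustrative):

```python
def preference_rate(ratings):
    """Fraction of A/B comparisons where raters preferred the RLHF model.

    `ratings` is a list of 'rlhf' / 'base' verdicts, one per comparison.
    """
    wins = sum(1 for r in ratings if r == "rlhf")
    return wins / len(ratings)

print(preference_rate(["rlhf", "rlhf", "base", "rlhf"]))  # 0.75
```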
Should I use PPO or DPO?
DPO is simpler and faster. Use DPO for 90% of cases. PPO is slower but may have slight quality advantage if reward model is well-trained. Start with DPO; switch to PPO only if quality plateaus.
How do I merge LoRA adapter with base model?
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model = PeftModel.from_pretrained(base, "./lora_adapter")
merged = model.merge_and_unload()
merged.save_pretrained("./final_model")
Can I deploy LoRA separately (without merging)?
Yes. Load base model + LoRA adapter at inference time: PeftModel.from_pretrained(base, adapter_path). Saves storage but requires larger inference VRAM.
Related Resources
- NVIDIA H100 Specifications and Models
- Best GPU for Stable Diffusion
- Fine-Tune Llama 3
- Fine-Tuning vs RAG