Last updated: April 5, 2026 · Safety & Alignment · by Daniel Ashford
What is RLHF (Reinforcement Learning from Human Feedback)?
The training technique that makes LLMs helpful and safe by learning from human preferences.
Definition
RLHF is a training methodology for aligning language model behavior with human preferences. After pre-training, RLHF uses human ratings of model outputs to train a reward model, which then guides further training toward helpful, harmless, and honest responses.
How It Works
RLHF proceeds in three stages: (1) Supervised fine-tuning on high-quality human-written examples. (2) Reward model training on human rankings of candidate outputs. (3) Reinforcement learning to maximize the reward model's score. Variants include RLAIF (reinforcement learning from AI feedback), DPO (Direct Preference Optimization), and Constitutional AI.
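Stage 2 above is commonly trained with a pairwise (Bradley-Terry) objective: the reward model should score the human-preferred output higher than the rejected one. A minimal sketch of that loss, in plain Python with illustrative scores (the function name and example values are ours, not from a specific library):

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss for reward model training (stage 2).

    loss = -log(sigmoid(r_chosen - r_rejected))
    Small when the chosen output already scores higher; large otherwise.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model agrees with the human ranking: low loss.
agree = pairwise_reward_loss(2.0, 0.0)
# Reward model disagrees: high loss, pushing scores to flip.
disagree = pairwise_reward_loss(0.0, 2.0)
print(round(agree, 4), round(disagree, 4))
```

Minimizing this loss over many human-ranked pairs teaches the reward model to mirror human preferences; stage 3 then optimizes the policy against those learned scores.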
Example
Without RLHF, a model asked "How do I pick a lock?" might give detailed instructions. After RLHF, it recognizes the risk and suggests contacting a locksmith.