Last updated: April 5, 2026 · Safety & Alignment · by Daniel Ashford

What is Alignment?

QUICK ANSWER

The challenge of making AI systems behave in accordance with human values.

Definition

Alignment refers to ensuring AI systems behave in accordance with human values, intentions, and expectations. An aligned model consistently produces helpful, honest, and harmless outputs — even in edge cases.

How It Works

Current techniques include RLHF, Constitutional AI, red-teaming, and scalable oversight. The LLM Judge Index safety dimension partially measures alignment quality. Alignment is considered one of the most important problems in AI safety.

Example

A well-aligned model refuses to help write a phishing email and explains why phishing is harmful. A poorly aligned model might comply.

Related Terms

RLHF (Reinforcement Learning from Human Feedback)

The training technique that makes LLMs helpful and safe by learning from human preferences.

AI Safety Score

A measure of how well a model avoids harmful outputs and maintains appropriate guardrails.

Constitutional AI

Anthropic approach to safety that trains models using written principles rather than solely human ratings.

Deliberately trying to make an LLM produce harmful outputs to find and fix vulnerabilities.

See How Models Compare

Understanding alignment is important when choosing the right AI model. See how 12 models compare on our leaderboard.

View Leaderboard →Our Methodology

← Browse all 47 glossary terms

Founder & Lead Evaluator · 200+ models evaluated