Last updated: April 5, 2026 · Safety & Alignment · by Daniel Ashford

What is Constitutional AI?

QUICK ANSWER

Anthropic's approach to safety that trains models using written principles rather than relying solely on human ratings.

Definition

Constitutional AI (CAI) is an alignment methodology where a model is trained to follow written principles (a "constitution") rather than relying solely on human feedback for every decision.

How It Works

Two phases: (1) Self-critique — the model generates a response, critiques it against the principles, and revises it. (2) RLAIF (Reinforcement Learning from AI Feedback) — a model trained on the constitution scores outputs, replacing human reward models. This scales better than RLHF and makes the values being trained toward more transparent. Claude models are trained with Constitutional AI.
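The two phases above can be sketched in code. This is a toy illustration only: the functions `generate`, `critique`, `revise`, and `rlaif_score` are stand-ins for real model calls, not part of any actual API.

```python
# Toy sketch of the two CAI phases. All functions are hypothetical
# stand-ins for model calls; a real pipeline uses an LLM at each step.

CONSTITUTION = [
    "Choose the response that is most helpful while being "
    "least likely to cause harm.",
]

def generate(prompt):
    # Stand-in for an initial model completion.
    return f"Draft answer to: {prompt}"

def critique(response, principle):
    # Stand-in: in real CAI, the model is prompted to critique
    # its own response against the principle.
    return f"Critique of '{response}' under: {principle}"

def revise(response, critique_text):
    # Stand-in: the model rewrites its response given the critique.
    return response + " [revised]"

def self_critique_phase(prompt, constitution):
    """Phase 1: generate, critique against each principle, revise."""
    response = generate(prompt)
    for principle in constitution:
        response = revise(response, critique(response, principle))
    return response

def rlaif_score(response, constitution):
    """Phase 2 stand-in: an AI feedback model scores the output
    against the constitution, replacing a human reward model."""
    return 1.0  # placeholder; a real scorer returns a learned reward

final = self_critique_phase("How do I stay safe online?", CONSTITUTION)
score = rlaif_score(final, CONSTITUTION)
```

The revised responses from phase 1 become supervised fine-tuning data, and the phase 2 scores train the preference model used for reinforcement learning.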

Example

A CAI principle: "Choose the response that is most helpful while being least likely to cause harm. If a request could be interpreted as harmful or benign, assume the benign interpretation."
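In the RLAIF phase, a principle like this is embedded in a comparison prompt so an AI feedback model can choose between two candidate responses. A minimal sketch, with an assumed helper name (`preference_prompt`) and illustrative strings:

```python
# Illustrative only: how a constitutional principle might be turned
# into a preference-labeling prompt. Helper name is hypothetical.

PRINCIPLE = ("Choose the response that is most helpful while being "
             "least likely to cause harm.")

def preference_prompt(question, resp_a, resp_b, principle):
    """Format the A/B comparison the AI feedback model judges."""
    return (f"Principle: {principle}\n"
            f"Question: {question}\n"
            f"(A) {resp_a}\n"
            f"(B) {resp_b}\n"
            "Which response better follows the principle?")

prompt = preference_prompt(
    "How do household locks work?",
    "Pin-tumbler locks use spring-loaded pins set by the key...",
    "Here is how to pick your neighbor's lock without permission...",
    PRINCIPLE,
)
# The feedback model's choice (A or B) becomes a training label,
# replacing a human preference rating.
```

The resulting labels train the preference model that supplies the reward signal during reinforcement learning.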

Related Terms

RLHF (Reinforcement Learning from Human Feedback)
The training technique that makes LLMs helpful and safe by learning from human preferences.
Alignment
The challenge of making AI systems behave in accordance with human values.
AI Safety Score
A measure of how well a model avoids harmful outputs and maintains appropriate guardrails.

See How Models Compare

Understanding Constitutional AI is important when choosing the right AI model. See how 12 models compare on our leaderboard.
