How Claude Learns to Be "Helpful to Humans": RLHF and Constitutional AI Explained

30-Second Version · For the impatient

RLHF teaches Claude which responses humans prefer. Constitutional AI teaches it to judge responses against an explicit set of written principles. Together, they make Claude both helpful and honest.

Sophie Marlowe · June 03, 2026

01 · Why did this happen?

Claude's training involves two main stages: pre-training (learning language patterns) and alignment training (learning to be "helpful to humans"). The main alignment training methods are RLHF (guiding the model with human preference feedback) and Constitutional AI (self-evaluation based on a set of explicit behavioral principles). Their combination enables Claude to generate useful responses while honestly acknowledging uncertainty and providing meaningful explanations when declining requests.

02 · What is the mechanism?

RLHF was systematized by OpenAI between 2017 and the early 2020s and applied at scale in InstructGPT training, later becoming the core method for training ChatGPT. Anthropic's Constitutional AI builds on RLHF and addresses its dependence on human preference annotation: annotator biases and inconsistencies directly affect the trained model's behavior. Constitutional AI attempts to replace subjective preference judgments with explicit principles.

03 · How does it affect me?

Understanding RLHF and Constitutional AI helps explain behavioral differences between Claude and other AI tools. Purely RLHF-trained systems are prone to "sycophancy": they tend to tell users what they want to hear rather than what is true. Adding Constitutional AI makes Claude notably different on this point. It is trained to remain honest even when users don't like the answer, which explains why Claude sometimes gives responses that differ from your expectations rather than simply echoing your viewpoint.

04 · What should I do?

Translate understanding of RLHF and Constitutional AI into practical usage techniques: if you want honest feedback rather than flattery, explicitly tell Claude "I don't need you to agree with my viewpoint — I need you to tell me where the problems are"; if Claude declines your request, asking "why?" typically gets a meaningful explanation rather than a formulaic "I can't help with that"; if you're unsure whether Claude's answer is accurate, directly ask "how confident are you in this answer? Are there parts you're uncertain about?" — it's trained to honestly express uncertainty in these situations.

A freshly pre-trained language model is like a scholar who has read a vast number of books but has no idea what humans want. It can generate text, but not necessarily helpful, safe, or honest text. So how did Anthropic turn it into Claude?

Pre-training Is Just the Starting Point

All large language models begin with "pre-training": the model is exposed to massive amounts of text and learns the statistical patterns of language, such as which words typically follow which, which sentence structures are grammatically valid, and how text on different topics tends to differ.
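To make "learning which words follow which" concrete, here is a minimal sketch of a bigram model in Python. It is only an illustration of the idea of learning next-word statistics from text. Real pre-training uses neural networks over far larger data, and the function names and the three-sentence corpus below are assumptions made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            counts[current_word][next_word] += 1
    return counts


def next_word_probabilities(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the word was never seen with a follower
    return {w: c / total for w, c in followers.items()}


# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "the model reads text",
    "the model predicts text",
    "the user reads text",
]
counts = train_bigram_model(corpus)
probs = next_word_probabilities(counts, "the")
print(probs)  # "model" follows "the" twice, "user" once
```

The model here has no notion of whether a continuation is helpful or true. It only reproduces what is frequent in its data, which is the gap that alignment training is meant to close.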

After pre-training, the model's capabilities are already impressive — it can generate fluent text and understand complex semantics. But "generating fluent text" doesn't equal "being helpful to users." A freshly pre-trained model given a question might output any text related to that question, including incorrect information, biased viewpoints, or even harmful content — because all of these appear in its training data.

This is why a second stage is needed: Alignment Training.

RLHF: Guiding the Model With Human Preferences

RLHF (Reinforcement Learning from Human Feedback) is currently the most widely used alignment training method: OpenAI used it to train ChatGPT, and Anthropic uses it too.

The RLHF process works roughly as follows:

Step 1: Supervised Fine-Tuning (SFT). Starting from the pre-trained model, human annotators demonstrate what ideal responses look like: for various types of questions, they write what they consider the best responses, and this "demonstration data" is then used to fine-tune the model.

Step 2: Train a Reward Model. Human annotators evaluate and rank multiple responses to the same question. This preference data trains a "reward model" whose job is to predict how good humans would consider a given response and to assign it a score.

Step 3: Reinforcement Learning Optimization. Using the reward model as a "teacher," the language model tries many different responses, and its parameters are adjusted toward responses that score highly and away from those that score poorly. Through large numbers of "try-reward-adjust" cycles, the model gradually learns to generate responses humans consider better.
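Steps 2 and 3 can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not how any production system is implemented: each response is reduced to two hypothetical features (accuracy and politeness), the reward model is a simple linear function trained with a pairwise logistic (Bradley-Terry style) loss, and the "policy" is a softmax over three fixed candidate responses updated with an exact policy gradient. Real RLHF trains neural networks, typically with algorithms such as PPO.

```python
import math


def reward(weights, features):
    """Linear reward model: a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_reward_model(preference_pairs, num_features, learning_rate=0.1, epochs=200):
    """Step 2: fit weights so chosen responses score above rejected ones."""
    weights = [0.0] * num_features
    for _ in range(epochs):
        for chosen, rejected in preference_pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient descent on the loss -log(sigmoid(margin)).
            grad_scale = 1.0 - sigmoid(margin)
            for i in range(num_features):
                weights[i] += learning_rate * grad_scale * (chosen[i] - rejected[i])
    return weights


def softmax(logits):
    highest = max(logits)
    exps = [math.exp(l - highest) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def optimize_policy(candidates, weights, learning_rate=0.5, steps=100):
    """Step 3: shift probability toward candidates the reward model scores highly."""
    logits = [0.0] * len(candidates)
    rewards = [reward(weights, features) for features in candidates]
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of the expected reward for a softmax policy.
        for i in range(len(logits)):
            logits[i] += learning_rate * probs[i] * (rewards[i] - expected)
    return softmax(logits)


# Hypothetical features: [accuracy, politeness]. Each pair is (chosen, rejected).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 1.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 0.0]),
]
weights = train_reward_model(pairs, num_features=2)

candidates = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
policy = optimize_policy(candidates, weights)
print(weights, policy)
```

In this toy data the annotators always choose the accurate response, so the reward model learns a positive weight for accuracy and the policy concentrates on the first candidate. If the annotations had favored politeness instead, the same code would have pushed the policy in that direction.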

RLHF's Limitation: Human Annotators Are Biased

RLHF's core problem is that it uses "human preference" as a proxy for "correct answers," but what humans prefer is not always correct or safe. Annotators may prefer responses that sound more confident (even if inaccurate), prefer longer, more detailed responses (even if a shorter one is better), or have systematic biases on cultural and political topics.
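One way to see such a bias in preference data is to check how often annotators picked the longer of two responses. The sketch below is a hypothetical audit, assuming preference data stored as (chosen, rejected) text pairs. The function name and the example annotations are invented for illustration.

```python
def longer_response_win_rate(preference_pairs):
    """Return the fraction of pairs where the chosen response is the longer one.

    Pairs whose two responses have the same word count are skipped.
    Returns None when no pair can be compared.
    """
    wins = 0
    compared = 0
    for chosen, rejected in preference_pairs:
        chosen_len = len(chosen.split())
        rejected_len = len(rejected.split())
        if chosen_len == rejected_len:
            continue
        compared += 1
        if chosen_len > rejected_len:
            wins += 1
    return wins / compared if compared else None


# Hypothetical annotations: (response the annotator chose, response they rejected).
example_pairs = [
    ("Yes, and here is a long list of reasons why that is the case.", "Yes."),
    ("It depends on many factors, which I will now describe in detail.", "It depends."),
    ("No.", "No, although I could go on about this for quite a while."),
    ("Probably not.", "Probably yes."),
]
rate = longer_response_win_rate(example_pairs)
print(rate)
```

A rate well above 0.5 on real data would hint at a length preference, though it would not prove bias on its own, since longer responses can also be genuinely better.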

This is why Anthropic developed Constitutional AI in addition to RLHF.

Constitutional AI: Giving Claude a Clear Set of Value Principles

Constitutional AI is an alignment training approach Anthropic proposed in 2022. Its core idea: rather than relying solely on human annotators to judge which responses are better, give the model an explicit set of behavioral principles and let it evaluate its own responses against those principles.

This "constitution" is an explicitly written list of principles, such as: "should not help humans do things that could cause large-scale harm," "should be honest and not deceive users," "should respect human autonomy and not push particular viewpoints."

Training under Constitutional AI works roughly as follows:

Step 1: AI Critique. The model critiques its own responses, identifying problems according to the constitutional principles.

Step 2: AI Revision. The model revises its responses based on its own critique.

Step 3: AI Preference Labeling. This is Anthropic's key Constitutional AI innovation: the model itself (not human annotators) compares and ranks responses, based on constitutional principles rather than purely subjective preferences. These AI-generated preferences then play the role that human rankings play in RLHF's reward model training.
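The three steps above can be sketched as a small loop. This is a minimal illustration, not Anthropic's actual pipeline: the prompt wording, the function names, and the two-item principle list (paraphrased from the examples earlier in this article) are all assumptions, and `toy_model` is a canned stand-in so the code runs without a real language model.

```python
from typing import Callable, List

# Illustrative principles only, paraphrased from the examples in this article.
CONSTITUTION = [
    "Be honest and do not deceive the user.",
    "Do not help with things that could cause large-scale harm.",
]


def critique_and_revise(model: Callable[[str], str], question: str,
                        principles: List[str]) -> str:
    """Steps 1 and 2: critique a draft against each principle, then revise it."""
    response = model(f"Question: {question}\nAnswer:")
    for principle in principles:
        critique = model(
            f"Principle: {principle}\nResponse: {response}\n"
            "Critique the response according to the principle:"
        )
        response = model(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique:"
        )
    return response


def label_preference(model: Callable[[str], str], question: str,
                     response_a: str, response_b: str, principle: str) -> str:
    """Step 3: ask the model which response better follows a principle."""
    verdict = model(
        f"Principle: {principle}\nQuestion: {question}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B:"
    )
    return "A" if verdict.strip().upper().startswith("A") else "B"


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a language model, returning canned text."""
    if prompt.endswith("Answer A or B:"):
        return "B"
    if prompt.endswith("Critique the response according to the principle:"):
        return "The response overstates its certainty."
    if prompt.endswith("Rewrite the response to address the critique:"):
        return "I am not certain, but here is my best understanding."
    return "Definitely yes."


question = "Is this approach guaranteed to work?"
draft = toy_model(f"Question: {question}\nAnswer:")
revised = critique_and_revise(toy_model, question, CONSTITUTION)
label = label_preference(toy_model, question, draft, revised, CONSTITUTION[0])
print(revised, label)
```

In a real setup, `model` would be a call to an actual language model, the revised responses would be used as fine-tuning data, and the A/B labels would be collected into a preference dataset for training a reward model.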

The Combination of These Methods Shapes Claude's Personality

Claude is trained through a combination of RLHF and Constitutional AI, which explains several of its distinctive characteristics:

Honesty over flattery: Claude is trained to remain honest even when responses make users uncomfortable, rather than saying what they want to hear. Pure RLHF systems easily learn that "humans like being told what they want to hear" and trend toward sycophancy.

Proactively acknowledging limitations: Claude is trained to express uncertainty when uncertain, and acknowledge when things are beyond its capabilities — rather than pretending omniscience.

Refusals with reasoning: Claude doesn't decline requests because of a "prohibited items list" — it learned a set of principles and judges based on them. This lets it provide meaningful explanations for why it declines.
