Can Constitutional AI and RLHF be used together, or does one replace the other?
They are not substitutes; they are combined. Anthropic's approach integrates Constitutional AI and RLHF-style training into the same pipeline: the SL-CAI stage uses supervised learning, and the RL-CAI stage uses reinforcement learning with AI-generated preference labels (an approach often called RLAIF, reinforcement learning from AI feedback). Think of Constitutional AI as 'a framework that gives RLHF's preference labeling a principled basis,' not a fully independent alternative.
In practice, Anthropic still uses human feedback as a supplement, especially in contexts requiring subjective judgment. Constitutional AI's main contribution is making the process more principled and scalable — not eliminating the human role entirely.
Where do the principles in the 'constitution' come from, and why does Anthropic get to decide?
This is a legitimate question that Anthropic itself acknowledges in its public writing. The sources of the constitutional principles include: the UN Universal Declaration of Human Rights (chosen because it has broad international consensus); Anthropic researchers' own judgments about what AI behavior is beneficial to people; and perspectives distilled from user and societal feedback.
Why Anthropic? Because they are the company training the model, and there is currently no recognized external institution capable of doing this. Anthropic's position: making the constitution public so anyone can read, critique, and debate its contents is itself an accountability mechanism. They also acknowledge that who writes the constitution is an open question, and future alignment research may find more participatory and representative ways to determine these principles.
Is it reliable to have the AI critique itself using principles? Won't it have blind spots?
Yes, it will have blind spots, and Anthropic acknowledges this. A few systematic problems can arise when the AI critiques its own responses using principles. First, if the principles are worded ambiguously, the AI's interpretation of them may already be skewed, and its critique inherits that skew. Second, the AI's training data may carry biases that make it oversensitive to some types of harmful content and undersensitive to others. Third, the AI is likely poor at critiquing things it 'doesn't know it doesn't know': a blind spot it isn't aware of is one it cannot critique.
This is also why Anthropic combines Constitutional AI with other methods including human red-teaming rather than leaving everything to AI self-critique. Self-critique is an effective tool, but not a complete solution.
Advanced: how does Constitutional AI relate to the Claude character document (soul document) Anthropic mentions?
They operate at different levels but influence each other. Constitutional AI is a training methodology — it determines how the training process runs and what principles guide model behavior. Claude's character document (called the soul document or character spec internally at Anthropic) describes what kind of personality, values, and communication style Claude as an AI assistant should have.
The relationship: the constitutional principles influence the tendencies the model develops during training; the character document further shapes how those tendencies are expressed in actual interactions. Think of the constitution as the tool that 'shapes the model's skeleton,' and the character document as the description of 'how that skeleton moves in daily life.' Together they explain why Claude responds to the same question in a particular way, with a particular tone.
Have you ever wondered how Anthropic teaches Claude to know what it should and shouldn't say? The answer isn't a list of rules or a human-curated filter. It's called Constitutional AI — a training method that teaches an AI to judge its own behavior using principles rather than a checklist of examples.
The goal of this article is to help you genuinely understand the core logic of this approach — not just recognize the term, but be able to explain how it differs from other alignment methods and why it has a real effect on how Claude behaves.
Before Constitutional AI, the dominant AI alignment method was RLHF (Reinforcement Learning from Human Feedback). The process: human evaluators compare two responses and pick the better one; those choices train a preference model; the preference model then guides the AI toward better outputs.
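To make the "choices train a preference model" step concrete, here is a minimal sketch of the pairwise objective such a model is commonly trained on. The article does not say which loss Anthropic uses; the Bradley-Terry style formulation below is an assumption, and the function names are illustrative only.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Probability that response A is preferred over B, given scalar rewards.

    Assumed Bradley-Terry form: sigmoid(reward_a - reward_b).
    """
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss for one human comparison: -log(sigmoid(chosen - rejected)).

    The loss is small when the model scores the human-chosen response higher
    than the rejected one, and large when it gets the order wrong.
    """
    margin = reward_chosen - reward_rejected
    # log1p(exp(-margin)) is algebraically equal to -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


# A model that agrees with the evaluator is penalized less than one that disagrees.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees < disagrees)
```

Minimizing this loss over many human comparisons is what pushes the preference model's scores to match evaluator choices; the trained scores then serve as the reward signal for reinforcement learning.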
The problem is scale. You need large volumes of human annotation, which costs money, takes time, and produces inconsistent judgments across evaluators. More fundamentally, evaluators are voting by feel, not applying a defined set of principles. This makes it difficult for the AI to learn a stable, explainable standard of judgment from the process.
Anthropic's solution: instead of having humans do the scoring, give the AI a set of explicitly written principles — the 'constitution' — and have the AI use those principles to evaluate and improve its own responses.
The constitution is a document of several dozen principles covering harmlessness, honesty, and benefit to people. The sources are varied: the spirit of the UN Universal Declaration of Human Rights, Anthropic's own research judgments, and insights distilled from user feedback. The key is that these principles are readable, explicit, and open to challenge, not buried in a black box of human annotations.
Stage one is SL-CAI (supervised): the model generates responses to harmful or problematic prompts, then critiques its own responses using the constitutional principles ('where does this response violate the honesty principle?'), then rewrites an improved version. The revised responses, paired with the prompts that produced them, are used for supervised fine-tuning.
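The generate, critique, and rewrite loop of stage one can be sketched as below. This is an illustration under stated assumptions, not Anthropic's implementation: `Model` stands in for any text-in, text-out callable, the prompt templates are invented for the example, and `stub_model` is a toy placeholder so the sketch runs without a real language model.

```python
from typing import Callable, Dict, List

# Hypothetical interface: takes prompt text, returns the model's completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> Dict[str, str]:
    """Run one critique-and-rewrite pass per principle and keep the final revision."""
    initial = model(prompt)
    response = initial
    for principle in principles:
        # Ask the model to critique its own response against one principle.
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return {"prompt": prompt, "initial": initial, "revised": response}


def stub_model(text: str) -> str:
    """Toy stand-in for a language model, used only to make the sketch runnable."""
    if "Rewrite the response" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique the response" in text:
        return "The response gives instructions that could cause harm."
    return "Sure, here is how to do it."


principles = ["Choose the response that is least harmful."]
record = critique_and_revise(stub_model, "How do I pick my neighbor's lock?", principles)
print(record["revised"])
```

Collecting one such record per prompt yields the supervised fine-tuning data for this stage; the quality of that data depends entirely on how well the real model critiques itself, which is where the blind spots discussed elsewhere in this article come in.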
Stage two is RL-CAI (reinforcement learning): two responses are generated for the same prompt, and the AI itself uses the constitutional principles to judge which is better. Those AI-generated preference labels become training data for a preference model, which is then used as the reward signal for reinforcement learning. In this loop, the AI is both student and evaluator, but the evaluation is grounded in explicit principles, not intuition.
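The labeling step of stage two can be sketched in the same spirit. Everything here is a hedged illustration: `Judge` is a hypothetical text-in, text-out callable that answers "A" or "B", the question template is invented, sampling one principle at random per comparison is an assumption of this sketch, and `stub_judge` is a placeholder that always answers "B".

```python
import random
from typing import Callable, Dict, List

# Hypothetical interface: takes a comparison question, returns "A" or "B".
Judge = Callable[[str], str]


def label_pair(
    judge: Judge,
    prompt: str,
    response_a: str,
    response_b: str,
    principles: List[str],
    rng: random.Random,
) -> Dict[str, str]:
    """Ask the judge which response better follows one sampled principle."""
    principle = rng.choice(principles)  # assumption: one principle per comparison
    question = (
        f"Principle: {principle}\n"
        f"Prompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    answer = judge(question).strip().upper()
    if answer not in ("A", "B"):
        raise ValueError(f"Judge returned an unusable label: {answer!r}")
    if answer == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected, "principle": principle}


def stub_judge(question: str) -> str:
    """Toy stand-in for the AI evaluator; a real judge would read the question."""
    return "B"


example = label_pair(
    stub_judge,
    "How do I pick my neighbor's lock?",
    "Here are step-by-step instructions.",
    "I can explain how locks work, but not how to break into a home.",
    ["Choose the response that is less harmful."],
    random.Random(0),
)
print(example["chosen"])
```

Each returned record has the same chosen/rejected shape that human comparisons would have, which is why the resulting dataset can train a preference model in the usual way.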
Constitutional AI's two most important contributions: first, explainability — you can read the constitution, understand the principles Claude uses to make a judgment, and have a basis for questioning and discussing its behavior rather than just treating it as a black box. Second, scale efficiency — by reducing dependence on large-scale human annotation, the training process can iterate faster and at larger scale.
The limits are real too. The content of the constitution is decided by Anthropic, so 'who writes the constitution' is itself a question of power. How principles are worded affects AI behavior — vague principles produce vague judgments. And when the AI critiques its own responses using those principles, it may have systematic blind spots and won't always accurately identify its own problems.
The most direct practical effect of understanding Constitutional AI: Claude's refusals and limits are not random — they trace back to principles in the constitution. If Claude declines a request, you can usually find a constitutional principle behind that decision. This also means that if you can reframe your request so that it looks more reasonable under those principles, Claude's response may change. Understanding this logic shifts your interaction with Claude from 'guessing' to 'knowing what you're actually talking to.'