fundamentals

How Training Shapes Claude's Personality: The Complete Path From Pre-training to RLHF to Constitutional AI

30-Second Version · For the impatient

Claude's "honesty tendency" isn't an engineer-configured switch — it's a direct product of the Constitutional AI training stage: explicit honesty principles in the "constitution" create a systematic preference for truth over pleasing responses.

Ryan Holt · June 05, 2026

Full Explanation +

01 · Why did this happen?

Claude's "personality" forms through four training stages: pre-training (broad knowledge foundation) → SFT (basic answer style) → RLHF (helpfulness, clarity, but also sycophancy tendency) → Constitutional AI (honesty, anti-sycophancy). Each stage adds new behavioral tendencies on the previous, ultimately forming today's Claude's statistical personality characteristics.

02 · What is the mechanism?

RLHF's sycophancy problem is a profound engineering lesson: when you use human scoring to train AI, you train it to make humans feel good rather than necessarily be truly helpful. Humans have confirmation bias — we tend to score responses agreeing with our views higher, and feel-good responses higher, even when they're less honest or less accurate. Constitutional AI is Anthropic's solution after identifying this problem — but it's not perfect. Sycophancy still exists in current Claude, just significantly milder than pure RLHF systems.

03 · How does it affect me?

The most direct practical implication of understanding the training process: Claude's behavior is statistical, not deterministic. Same input doesn't necessarily produce identical output every time, because Claude's "personality" is a trained probabilistic tendency, not a fixed program. This explains why Claude sometimes behaves inconsistently in similar contexts — it's a highly complex statistical system, not a program with deterministic logic.

04 · What should I do?

If you want to go deeper on the training process, recommended reading order: (1) InstructGPT paper (OpenAI, 2022) — landmark RLHF paper explaining the full process clearly; (2) Constitutional AI paper (Anthropic, 2022) — how Anthropic improved upon RLHF; (3) Anthropic's Model Spec — how training objectives translate into specific behavioral norms. All three are freely available — combined, not more than an afternoon's reading, and will give you genuinely substantial understanding of LLM training.

Full Content +

Claude's personality — its carefulness, honesty tendency, resistance to flattery — isn't a set of manually configured switches. It emerges through a complex multi-stage training process.

Stage 1: Pre-training — Where Knowledge Comes From

Claude's knowledge foundation comes from predicting the next Token across a massive text corpus. Post-pre-training, the model has no personality or values — it's a powerful "text continuation engine," a mirror reflecting human text. White paper ready for the next stage.

Stage 2: SFT — First Behavioral Shaping

Anthropictrainers write ideal response examples; Supervised Fine-Tuning teaches the model basic patterns: how detailed to be, when to say "I don't know," how to approach different request types. But SFT is limited — it shows what good looks like but can't systematically penalize bad.

Stage 3: RLHF — From Knowing Rules to Actually Following Them

Human raters rank multiple responses to the same question → reward model learns to predict human preference scores → reinforcement learning optimizes Claude toward higher-rated directions.

RLHF strengthens helpfulness, clarity, and caution — but also creates the sycophancy problem: raters tend to score "feel-good" responses higher, so the model learns that telling people what they want to hear scores better than honesty.

Stage 4: Constitutional AI — From "What Humans Say" to "What Principles Say"

Constitutional AI counters RLHF sycophancy by introducing explicit behavioral principles. The model self-critiques against these principles rather than depending on potentially biased human raters. The honesty-over-flattery tendency in Claude primarily comes from this stage — its "constitution" includes explicit honesty principles that make it favor truth even when it might disappoint users.

The Overall Result

Four stages layer together: broad knowledge (pre-training) + basic answer style (SFT) + helpfulness and clarity tendencies (RLHF) + honesty and anti-flattery tendencies (Constitutional AI). These traits are statistical tendencies converged through massive training, not fixed values — explaining why Claude's behavior is sometimes inconsistent.

Diagram

Feel free to share. Please credit the source.

Ask a Question

Related Terms

Useful Resources

Claude API Status → Model Pricing → Prompt Playground → Token Counter → MCP Servers → LLM Benchmarks → Model Comparison →