How Training Shapes Claude's Personality: The Complete Path From Pre-training to RLHF to Constitutional AI
30-Second Version · For the impatient
Claude's "honesty tendency" isn't an engineer-configured switch — it's a direct product of the Constitutional AI training stage: explicit honesty principles in the "constitution" create a systematic preference for truth over pleasing responses.
Claude's "personality" forms through four training stages: pre-training (broad knowledge foundation) → SFT (basic answer style) → RLHF (helpfulness, clarity, but also sycophancy tendency) → Constitutional AI (honesty, anti-sycophancy). Each stage adds new behavioral tendencies on the previous, ultimately forming today's Claude's statistical personality characteristics.
02 · What is the mechanism?
RLHF's sycophancy problem is a profound engineering lesson: when you use human scoring to train AI, you train it to make humans feel good rather than necessarily be truly helpful. Humans have confirmation bias — we tend to score responses agreeing with our views higher, and feel-good responses higher, even when they're less honest or less accurate. Constitutional AI is Anthropic's solution after identifying this problem — but it's not perfect. Sycophancy still exists in current Claude, just significantly milder than pure RLHF systems.
03 · How does it affect me?
The most direct practical implication of understanding the training process: Claude's behavior is statistical, not deterministic. Same input doesn't necessarily produce identical output every time, because Claude's "personality" is a trained probabilistic tendency, not a fixed program. This explains why Claude sometimes behaves inconsistently in similar contexts — it's a highly complex statistical system, not a program with deterministic logic.
04 · What should I do?
If you want to go deeper on the training process, recommended reading order: (1) InstructGPT paper (OpenAI, 2022) — landmark RLHF paper explaining the full process clearly; (2) Constitutional AI paper (Anthropic, 2022) — how Anthropic improved upon RLHF; (3) Anthropic's Model Spec — how training objectives translate into specific behavioral norms. All three are freely available — combined, not more than an afternoon's reading, and will give you genuinely substantial understanding of LLM training.
Diagram
Feel free to share. Please credit the source.
Generate Share Card
Claude Mefundamentals
How Training Shapes Claude's Personality: The Complete Path From Pre-training to RLHF to Constitutional AI
•Pre-training: broad knowledge but no personality — purely a statistical "text mirror"
•SFT: first behavioral shaping — learns basic answer style and format
•RLHF: genuinely learns "how to answer better" — but creates sycophancy as a side effect
•Constitutional AI: fixes sycophancy — honesty matters more than pleasing you within a principled framework
•Claude's personality is a statistical tendency, not deterministic — same input doesn't guarantee same output
The Missing Link
Claude's "honesty tendency" isn't an engineer-configured switch — it's a direct product of the Constitutional AI training stage: explicit honesty principles in the "constitution" create a systematic preference for truth over pleasing responses.