Model Distillation

30-Second Version · For the impatient

Model distillation uses the outputs of a large "teacher" model to train a smaller "student" model, so the student retains much of the teacher's core capability at a fraction of the computational cost. Think of a senior expert intensively mentoring a junior colleague and compressing years of tacit knowledge: the junior might reach roughly 80% of the expert's capability at about 10% of the "size" (illustrative figures, not measurements).

Full Explanation

01 · What is this?

Model distillation is a training technique in which a small model learns from a large model's outputs. The core idea: rather than learning from scratch on human-labeled training data, the student learns the probability distribution that the teacher outputs for each input.

Why is this more effective? Consider a classification task with a cat as the input. In direct supervised learning, the training signal is just "answer = cat" (a hard label). In distillation, the training signal is the teacher's full output, for example "cat: 85%, kitten: 8%, dog: 4%...". That signal does more than say "the answer is cat": it implicitly conveys that cat and kitten are conceptually close and that cat and dog are somewhat similar. These relationships let the student acquire richer knowledge from far less training data.
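To make the hard-label versus soft-target contrast concrete, here is a minimal sketch in plain Python. The class list, the fourth class ("car"), and both student distributions are assumptions added for illustration; only the 85/8/4 split comes from the example above. Two hypothetical students are equally confident in "cat", so the hard-label loss cannot tell them apart, while the soft-target loss penalizes the one whose runner-up classes disagree with the teacher.

```python
import math

def hard_label_loss(student_probs, true_index):
    """Cross-entropy against a one-hot label: only the correct class matters."""
    return -math.log(student_probs[true_index])

def soft_target_loss(teacher_probs, student_probs):
    """KL(teacher || student): every class the teacher gives mass to matters."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

classes = ["cat", "kitten", "dog", "car"]   # "car" is an assumed extra class
teacher_probs = [0.85, 0.08, 0.04, 0.03]    # remaining 3% assigned to "car"

# Two hypothetical students, both 85% confident in "cat".
student_a = [0.85, 0.08, 0.04, 0.03]        # agrees with the teacher everywhere
student_b = [0.85, 0.01, 0.01, 0.13]        # thinks "car" is the closest alternative

cat = classes.index("cat")
hard_a = hard_label_loss(student_a, cat)
hard_b = hard_label_loss(student_b, cat)
soft_a = soft_target_loss(teacher_probs, student_a)
soft_b = soft_target_loss(teacher_probs, student_b)

print(f"hard-label loss: A={hard_a:.4f}  B={hard_b:.4f}")
print(f"soft-target loss: A={soft_a:.4f}  B={soft_b:.4f}")
```

The hard-label losses come out identical, whereas the soft-target loss is zero for student A and positive for student B. That gap is the extra training signal soft targets provide.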

In the LLM domain, distillation is a common way to build lightweight models. Models such as Claude Haiku are plausibly trained in part this way, with Claude Sonnet or Opus as the teacher, though vendors rarely publish the details. Conceptually, the student "observes" the teacher's responses across many tasks and learns to produce similar outputs in a lightweight form.

02 · Why does it exist?

Distillation's core advantage lies in the information density of "soft targets." A teacher model's probability distribution for an input contains far richer knowledge than any single correct answer. Much as word embeddings capture relationships like "King - Man + Woman ≈ Queen" in vector arithmetic, soft targets implicitly encode which outputs the teacher considers similar to one another.
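For readers who want the idea in symbols, one commonly used formulation of the soft-target objective is sketched below. It is not stated in the text above, and the symbols (logits z, temperature T, mixing weight α, hard label y) are assumptions introduced here for illustration.

```latex
% Temperature-softened probabilities from logits z (T > 1 flattens the distribution)
p_i^{(T)} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}

% Student loss: hard-label term plus soft-target term
\mathcal{L} = \alpha \,\mathrm{CE}\!\left(y,\; p_{\text{student}}^{(1)}\right)
            + (1 - \alpha)\, T^2 \,\mathrm{KL}\!\left(p_{\text{teacher}}^{(T)} \,\middle\|\, p_{\text{student}}^{(T)}\right)
```

A higher T spreads probability onto the non-top classes, which exposes more of the similarity structure to the student. The T² factor keeps the gradient scale of the soft term comparable as T changes.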

Task-specific distillation is the most common industrial application: you don't need to distill all the teacher's capabilities — only capabilities for specific tasks. For example, if your application only needs sentiment analysis, you can use only teacher model outputs on sentiment analysis tasks to train a student model, producing an extremely lean model that excels at sentiment analysis with millisecond-level inference latency.

Distillation data quality directly determines the student model's ceiling: using Claude Opus as teacher typically produces better students than using Claude Haiku — because stronger teachers provide richer "soft knowledge."

03 · How does it affect your decisions?

For Claude users, distillation mainly helps explain why capability varies across model tiers. Haiku excels on certain tasks but is noticeably weaker than Sonnet on others. To the extent distillation is involved, this is not simply "insufficient capability": knowledge transfers with different efficiency across task types. General tasks (translation, summarization, simple Q&A) tend to distill well, while complex reasoning tasks (multi-step logic, difficult code) tend to lose more in distillation.

For developers: if your application has a narrow, well-defined task and tight cost or latency constraints, consider using high-quality outputs from Claude Opus or Sonnet to train a lightweight model specialized for that task. Many production AI applications take this approach.

Check the API Terms of Service first: using Claude outputs to train models for your own specific business tasks is generally permitted, but using them to train models that directly compete with Anthropic is not. Confirm the latest terms before proceeding.

04 · What should you do?

If you want to try distillation training using Claude outputs, practical recommendations:

Generate high-quality distillation data: ensure teacher model (Claude Opus/Sonnet) output quality is sufficiently high — distillation data quality directly determines the student model's ceiling. Include diverse inputs (don't use just one question type) to help the student model develop better generalization.
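The data-generation step can be sketched as follows. This is a minimal illustration under stated assumptions, not a production pipeline: `build_distillation_records`, `to_jsonl`, and `fake_teacher` are hypothetical names, and `fake_teacher` stands in for a real call to the teacher model. The sketch deduplicates prompts, drops empty teacher responses, and emits prompt/completion pairs as JSONL, a format many fine-tuning tools accept.

```python
import json
from typing import Callable, Dict, Iterable, List

def build_distillation_records(
    prompts: Iterable[str],
    teacher: Callable[[str], str],
    min_chars: int = 1,
) -> List[Dict[str, str]]:
    """Query the teacher once per unique prompt and keep non-trivial responses."""
    records = []
    seen = set()
    for prompt in prompts:
        key = " ".join(prompt.lower().split())  # normalize case and whitespace
        if key in seen:
            continue  # skip near-verbatim duplicates
        seen.add(key)
        response = teacher(prompt).strip()
        if len(response) < min_chars:
            continue  # drop empty or too-short teacher outputs
        records.append({"prompt": prompt, "completion": response})
    return records

def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def fake_teacher(prompt: str) -> str:
    """Hypothetical stand-in for the teacher model; replace with a real client."""
    return "" if "empty" in prompt else f"label for: {prompt}"

prompts = ["Great product!", "great   product!", "Terrible support.", "empty case"]
records = build_distillation_records(prompts, fake_teacher)
print(to_jsonl(records))
```

In practice the quality filter would be stricter than a length check (for example, validating the output format or spot-checking samples by hand), and the prompt list should deliberately cover the range of inputs the student will see.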

Choose the right student architecture: common choices are BERT-based (suited for classification, NER) or GPT-based small models (suited for generation tasks). For edge device deployment, consider lightweight base models like DistilBERT or Phi-3-mini.

Use established tooling: Hugging Face's TRL library supports SFT (supervised fine-tuning) and also includes knowledge-distillation (KD) trainers, making it a mature open-source choice. OpenAI also offers a distillation workflow for fine-tuning a smaller model such as GPT-4o-mini on outputs from a larger model such as GPT-4o. The principle is the same.

Real-World Example

A legal tech company wants to build a lightweight "contract clause risk identification" model that flags high-risk clauses the moment a user uploads a contract, so low latency is the key requirement. Calling Claude Opus directly is too slow (3-5 seconds per clause) and too expensive.

Their distillation approach: Step 1, collect 5,000 contracts and have Claude Opus perform a detailed risk analysis of each clause, generating high-quality "teacher outputs" (risk level + reasoning + relevant regulations). Step 2, use these Claude Opus outputs to train a small BERT-based model. Step 3, deploy the distilled model: latency under 200 ms (roughly 15-25× faster than Claude Opus, given the 3-5 second baseline), a 98% cost reduction, and 91% of Claude Opus accuracy on standard contract clauses. For high-complexity clauses where the small model's confidence is low, the system calls Claude Opus for deep analysis. This two-layer architecture balances speed, cost, and accuracy.
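The two-layer fallback described above can be sketched as a simple confidence-threshold router. Everything named here is an assumption for illustration: `classify_clause`, `RiskResult`, the 0.8 threshold, and the `fake_student` / `fake_teacher` stubs are hypothetical, and the example does not describe the company's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class RiskResult:
    label: str
    source: str                # "student" or "teacher"
    student_confidence: float  # confidence reported by the distilled model

def classify_clause(
    clause: str,
    student: Callable[[str], Tuple[str, float]],
    teacher: Callable[[str], str],
    threshold: float = 0.8,    # assumed cutoff; tune on held-out data
) -> RiskResult:
    """Answer with the fast student when it is confident, else escalate."""
    label, confidence = student(clause)
    if confidence >= threshold:
        return RiskResult(label, "student", confidence)
    # Low confidence: pay the latency and cost of the large model.
    return RiskResult(teacher(clause), "teacher", confidence)

def fake_student(clause: str) -> Tuple[str, float]:
    """Hypothetical distilled model: confident only on a familiar pattern."""
    if "unlimited liability" in clause.lower():
        return ("high", 0.95)
    return ("low", 0.55)

def fake_teacher(clause: str) -> str:
    """Hypothetical stand-in for a call to the large teacher model."""
    return "medium"

fast = classify_clause("The supplier accepts unlimited liability.", fake_student, fake_teacher)
slow = classify_clause("Termination is subject to Schedule 3.", fake_student, fake_teacher)
print(fast.source, fast.label)
print(slow.source, slow.label)
```

The threshold controls the trade-off: raising it sends more clauses to the teacher (slower and costlier, but more accurate), while lowering it keeps more traffic on the student.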

Common Misconceptions

× Misconception 1: Distilled student models can surpass teacher models. In practice, the teacher's capability is the ceiling for plain distillation. On the distilled tasks, a student typically retains only part of the teacher's performance (the 70-90% range is a rough rule of thumb, not a guarantee) and at best approaches the teacher's level. Distillation can get a small model most of the way to a large model's capability, but not beyond it. If you need a stronger model, you need a stronger teacher or better training methods, not better distillation techniques.

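× Misconception 2: Distillation is just copying teacher model answers without genuine learning. The fundamental difference between distillation and simple copying is "soft targets": the teacher's full probability distribution contains richer information than any single answer, because it implicitly encodes conceptual similarities and relationships. By learning soft targets, the student gains an "understanding of conceptual structure" rather than just "answer memorization." This is why soft-target distillation can outperform SFT (supervised fine-tuning) on teacher outputs alone. Note that soft targets require access to the teacher's output probabilities. When only generated text is available, as is typical with a hosted API, distillation in practice means fine-tuning on the teacher's text outputs, which still transfers the teacher's behavior but without the extra probability signal.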

The Missing Link

Direct Impact

Model distillation's core trade-off is capability loss versus efficiency gain. Distillation typically costs some capability (student < teacher) but delivers a smaller model, lower inference latency, and cheaper operation. On specific tasks, the trade can be very worthwhile, for example a 5-10% accuracy loss in exchange for a 90%+ cost reduction.

Poorly suited to distillation: tasks that require maximum accuracy and tolerate no capability loss, and very broad task scopes (such as a general assistant), where a small model struggles to maintain wide-ranging capabilities.

Best suited to distillation: specific tasks, high-frequency calls, and production applications sensitive to latency and cost.
