Glossary · core-concepts

Token

core-concepts 新手

30-Second Version · For the impatient

The smallest unit AI language models use to process text. Tokens don't equal words — English averages roughly 0.75 words per token; Chinese roughly 1-2 characters per token. API costs are billed per token: how many you input plus how many Claude outputs equals the cost of this conversation. Understanding tokens helps you more effectively control AI costs and Context Window usage efficiency.

Full Explanation +

01 · What is this?

Token is the basic unit AI language models use to process text, but it doesn't exactly equal the "words" or "characters" we use in daily speech. Modern LLMs typically use an algorithm like BPE (Byte Pair Encoding) to split text into "subword" units — finer than words but coarser than letters.

Token efficiency by language: English averages roughly 3/4 of a word per token. "Hello, how are you?" is about 6 tokens. Chinese characters are typically 1-2 tokens each. Chinese information density is higher; the same meaning usually uses fewer tokens than English (though this depends on model vocabulary design). Code token efficiency depends on the language.

Intuitive scale: Claude's 200,000 token Context Window equals roughly a 400-page novel (English), or 8 continuous hours of verbatim conversation transcript. Most everyday tasks use far less than this limit.

Tokenizer tool: Anthropic provides an online Tokenizer tool where you can paste text and see how many tokens it becomes — useful for estimating API costs and Context Window usage.

02 · Why does it exist?

How is API token cost calculated? What are practical token-saving methods?

Cost calculation (Claude Sonnet 4.5 example): input tokens ~$3/1M; output tokens ~$15/1M. Output tokens cost 5× more than input, meaning you should carefully set max_tokens (don't set too large) and explicitly tell Claude "brief replies are fine" when the task allows — this significantly reduces output costs.

Main token sources in each API call: System Prompt (transmitted every call, fixed cost); conversation history (multi-turn accumulation, grows over time); current user input.

Effective token-saving methods: streamline System Prompt (remove unnecessary explanations; a 500-token System Prompt vs 2,000-token one with similar effect saves 1,500 tokens per call); manage conversation history (don't accumulate unboundedly; use sliding window or periodic summary compression); enable Prompt Caching (add cache_control when System Prompt exceeds 1,024 tokens — cache hits reduce input cost 90%); specify output length explicitly ("answer in 3 bullet points, max 20 words each" dramatically reduces unnecessary output tokens).

03 · How does it affect your decisions?

Why do the same content sometimes have very different token counts? Are there rules for estimation?

Token counts are affected by several factors:

Language differences: English token efficiency is usually lower than Chinese — the same concept may need more tokens in English. But Chinese comments in code usually use more tokens than English comments (due to Chinese character token splitting).

Punctuation and whitespace: punctuation and whitespace also consume tokens. Many blank lines and extra spaces are part of token costs.

Practical estimation rule: English text ~750 tokens per 1,000 words; Chinese text ~1,000-1,500 tokens per 1,000 characters (high character density but may use more tokens per character); code ~250-500 tokens per 1,000 characters (code has many special characters and keywords with higher token efficiency).

Most accurate method: use Anthropic's Tokenizer tool to calculate directly.

04 · What should you do?

Token, Context Window, and Memory are often mixed together — what's their relationship?

Three related but different-level concepts:

Token: the basic unit AI uses to process text — a measurement unit. Like 'centimeter' is a unit of length, token is a unit of text length.

Context Window: the maximum amount of text (measured in tokens) AI can see in one conversation. Claude's 200,000 token Context Window means all inputs in one conversation (System Prompt + conversation history + current question) can't exceed 200,000 tokens total.

Memory: personalized information AI saves and uses across conversations (your name, preferences, background). Completely different from Context Window: Context Window is "short-term memory within this conversation"; Memory is "long-term memory across conversations." When a conversation ends, Context Window content disappears; but Memory content remains available in the next conversation.

Relationship: at the start of each conversation, Memory loads relevant memory summaries into the Context Window, consuming some tokens. Remaining Context Window space holds System Prompt, conversation history, and current input. Understanding this relationship explains why Claude starts "forgetting" earlier things in long conversations — the Context Window fills up and earliest content gets pushed out.

Real-World Example +

A marketing director built a customer email reply tool with the Claude API for her company. She found monthly API bills 40% higher than expected and wanted to understand why.

Analyzing the bill through the lens of tokens:

Problem 1: System Prompt too long. Her System Prompt was 3,200 tokens (detailed brand guidelines, tone instructions, various exception handling). Every API call transmits these 3,200 tokens — 500 calls/day = 1.6M tokens daily just from System Prompt. Fix: streamline to 800 core tokens; move edge case descriptions to reference documents. Daily savings: 1.2M tokens = ~$180/month cost reduction.

Problem 2: Prompt Caching not enabled. Adding fixed customer background data pushed total System Prompt above 1,024 tokens, enabling Prompt Caching — 85% cache hit rate reduces input costs ~30%.

Problem 3: max_tokens set too large. Set to 2,000, but actual email replies averaged 300 tokens. Adding "keep replies under 150 words" instruction reduced average output from 300 to 180 tokens.

Combined: monthly cost from $420 to $195, 54% reduction with almost no output quality impact.

Common Misconceptions +

✕ Misconception 1

× Misconception 1: Token equals 'character' — 1,000 tokens = 1,000 characters. Tokens don't equal characters. English averages 1 token ≈ 0.75 words (so 1,000 English words ≈ 1,333 tokens); Chinese is roughly 1-2 tokens per character. Different models may use different tokenizers, so the same text may have different token counts in different models. Most accurate calculation: use Anthropic's Tokenizer tool, or check usage.input_tokens and usage.output_tokens in API responses.

✕ Misconception 2

× Misconception 2: Setting larger max_tokens makes Claude output more and more complete. max_tokens is an output ceiling, not a target length. Claude stops when it considers the answer complete regardless of max_tokens. Setting max_tokens = 4,000 won't make Claude automatically generate 4,000 tokens — if it can answer well in 300 tokens, it outputs only 300. max_tokens' main purpose: prevent Claude from generating output beyond your expected length, and control cost ceilings.

The Missing Link +

Direct Impact

Token billing model's core trade-off: flexibility vs predictability. Per-token billing makes costs precisely correspond to actual usage — short Q&A costs almost nothing; complex long document analysis costs more. But this also means costs are hard to fully predict, especially for applications with heavy user input (you can't control how long users' questions are). For applications requiring cost predictability, additional control mechanisms are needed: limiting maximum user input length, setting per-user daily token quotas, monitoring abnormally high-usage requests. Design cost management into system architecture rather than discovering problems when bills arrive.

← Previous Term

Temperature

Ask a Question

Related Terms

Useful Resources

Claude API Status → Model Pricing → Prompt Playground → Token Counter → MCP Servers → LLM Benchmarks → Model Comparison →