Prompt compression is a collective term for techniques aimed at reducing input token count per API call while maintaining output quality — reducing costs and improving response speed.
Why prompt compression is needed: Claude API charges per token — input plus output equals cost. Input token costs are lower than output (Sonnet 4.6 input $3/M, output $15/M), but in high-frequency applications, accumulated input costs are substantial. A 2,000-token System Prompt at 10,000 daily API calls = 20M tokens daily from System Prompt alone, ~$180/month. Streamlining to 800 tokens with same effect = ~$72/month, saving $108.
Four main prompt compression directions: System Prompt streamlining (remove redundant explanations, keep only necessary rules); conversation history compression (compress old turns with summaries); document truncation and filtering (only include most relevant parts); structured input (replace verbose natural language with structured formats — same information in fewer tokens).
How do you effectively streamline System Prompts while ensuring output quality doesn't degrade?
Principle 1: Use examples instead of rule explanations. Verbose version (~80 tokens): long explanation of tone requirements. Streamlined version (~30 tokens): 'Tone: friendly, avoid technical jargon, suitable for medium-technical readers. Example: [short example showing preferred phrasing].' Uses ~60% fewer tokens for same effect.
Principle 2: Delete background information that doesn't affect behavior. Much System Prompt contains company introductions and context explanations Claude doesn't need to complete tasks. Keep only information that genuinely affects Claude's behavior (how to speak, what to do, what not to do).
Principle 3: Use structured formats instead of paragraphs. A 100-token paragraph explanation can usually achieve the same or better effect with a 40-token bullet list — structured formats are easier for Claude to 'scan and follow.'
Verification method: after streamlining, compare output quality before and after using 10-20 most common test cases. If gap is acceptable, it's effective compression. If quality degrades, identify which deleted rules caused it and selectively restore.
How should multi-turn conversation history compression work? When to compress, and how?
Conversation history is the fastest-growing Prompt component — each turn accumulates, and without control, long conversations eventually fill the entire Context Window.
When to compress: set a trigger condition — 'conversation history exceeds 15 turns' or 'current Context usage exceeds 50%.' Don't compress before triggers (compression requires extra API calls); execute compression once triggered.
How to compress:
Method 1 — Rolling summary: have Claude generate a summary of the earliest N turns ('In previous conversation, you told me your requirement is X, we decided on approach Y, main design decisions are A, B, C...'). Replace those turns' originals with this summary; new turns continue accumulating after.
Method 2 — Important information retention: importance assessment per turn. Turns containing key decisions, explicit user preferences, important error corrections: retain originals. Turns that are just process discussion or repeated confirmation: compress to one-line summary.
Compression prompt example: 'Here are the first 10 turns of the conversation. Generate a summary under 150 words including: 1. User's core requirements; 2. Confirmed important decisions; 3. Any important commitments I (Claude) made. Process discussion not needed.'
Note: compressed summary should be placed at the very beginning of conversation history (after System Prompt) so Claude reads context summary before continuing conversation.
Beyond System Prompt and conversation history, how do you control document input tokens?
In applications analyzing long documents, documents themselves are often the largest token consumption source. Several effective control methods:
Method 1 — Layered filtering: don't input entire documents. First do a 'relevance filter' — use Haiku (very low cost) for a quick paragraph-level relevance scoring ('which paragraphs are most relevant to this question? Score 1-5'). Input only paragraphs scored 4+ into the main API call. This extracts the most relevant 2,000-3,000 tokens from a 10,000-token document for Sonnet or Opus analysis.
Method 2 — Structured extraction: for fixed-format documents (contracts, financial reports, resumes), pre-process by extracting key fields into structured format rather than including full original text. A 5,000-token contract may only need 500 tokens after structured extraction of key information.
Method 3 — Sliding window: for very long documents, design a sliding window — each call includes only the most relevant 'window' (e.g., ±2,000 tokens around the relevant section) rather than the entire document. This maintains relevance while keeping per-call input tokens within a fixed range.
A legal tech company's AI contract review application processing 8,000 contracts monthly — illustrating prompt compression's real cost impact:
Before compression: System Prompt 3,200 tokens, each contract ~8,000 tokens (full text), Sonnet 4.6: ~$0.056/contract, 8,000/month: $448.
Three compression measures: Streamline System Prompt (3,200 → 900 tokens, effect testing confirms acceptable gap); Structured contract key info extraction (8,000 full text → Haiku extracts 2,500-token structured key info, Haiku cost ~$0.0025/contract, negligible); Enable Prompt Caching (add 400-token legal terminology dictionary to reach 1,300 tokens, enable caching).
After compression: System Prompt 1,300 tokens (but cached, 10% cost on cache hits); contract input 2,500 tokens. Effective input cost: ~$0.0079/contract (85% cache hit rate) + $0.0025 Haiku preprocessing + $0.0225 output cost = ~$0.033/contract total. 8,000/month: $264, saving 41% from $448.
Prompt compression's core trade-off: cost optimization vs engineering investment and maintenance cost. Effective prompt compression takes time: analyzing what's redundant, testing post-compression effectiveness, designing conversation history compression logic, maintaining structured document extraction pipelines. This work has cost. Simple criterion for whether prompt compression is worthwhile: if expected monthly savings × 12 (annual savings) > engineering time cost of compression, it's worth doing. For low-cost small applications, high-priority prompt compression is low; for high-cost, long-running applications, prompt compression is a habit that should be established early — not optimization added late.