Prompt Compression

Prompt Techniques Intermediate

30-Second Version · For the impatient

Prompt Compression is the practice of reducing the length of what you send to an AI model without losing critical information. The goal is to prevent long or multi-turn conversations from overflowing the model's <a href="/en/glossary/core-concepts/context-window/">Context Window</a> or generating unnecessary cost and latency. Common techniques include summarizing earlier exchanges into key points, keeping only the most recent turns in full, removing resolved side threads, and replacing long documents with distilled summaries.

Full Explanation

01 · What is this?

When do I need prompt compression? Is it irrelevant for normal conversations?

For a single-turn or short conversation, yes, compression is usually irrelevant. The need comes from accumulation. Once a conversation has run many rounds, or you have loaded a lot of background documents up front, or you are running a multi-step agentic task, keeping every prior message in context creates two problems. You may hit the model's context window limit, and even if you do not, you pay token costs and added latency for old messages you no longer need.

A simple rule of thumb: if your prompt is already over 30-50K tokens, or your agent task is expected to run many rounds, it is worth planning a compression strategy. For a single question or a two-turn exchange, the marginal gain from compression is near zero.
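The rule of thumb above can be sketched as a small check. This is a minimal illustration, not a definitive implementation: the 4-characters-per-token ratio is a rough assumption for English text, `turn_threshold=20` is a hypothetical stand-in for "many rounds," and the function names are made up for this example. A real tokenizer or the provider's token-counting facility would give more accurate numbers.

```python
from typing import Dict, List

# Assumption: roughly 4 characters per token for English text.
# This is a heuristic, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def should_compress(
    messages: List[Dict[str, str]],
    token_threshold: int = 30_000,  # lower end of the 30-50K range in the text
    expected_turns: int = 0,
    turn_threshold: int = 20,       # hypothetical value for "many rounds"
) -> bool:
    """Return True when either trigger applies: a large prompt or a long task."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    return total >= token_threshold or expected_turns >= turn_threshold
```

A single short question returns `False` here, while a prompt carrying a large preloaded document, or a task planned for dozens of rounds, returns `True`.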

02 · Why does it exist?

When writing compression summaries, how do I make sure I am not losing important information?

A few practical principles. First, keep decisions and conclusions, drop the derivation process. If you and Claude debated at length and settled on PostgreSQL over MongoDB, keep 'decided on PostgreSQL because X,' not the whole comparison thread. Second, keep constraints and limits. Things you told Claude earlier — 'must be under 1000 words,' 'output must be JSON' — must stay verbatim or be fully carried into the summary, because they shape all subsequent outputs.

Third, when in doubt, keep it. Before compressing a section, ask 'might I need the original detail later?' If the answer is 'not sure,' keep the original rather than cutting for compression's sake. The goal is to remove what you are certain you don't need, not to make everything as short as possible.
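One way to make these principles concrete is to give the summary a fixed shape, so decisions, verbatim constraints, and "not sure" material each have their own slot. The sketch below is illustrative only: `CompressedSummary` and `missing_constraints` are hypothetical names, and the verbatim check is a plain substring test.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CompressedSummary:
    decisions: List[str] = field(default_factory=list)      # conclusion plus reason, no derivation
    constraints: List[str] = field(default_factory=list)    # copied verbatim, never paraphrased
    kept_verbatim: List[str] = field(default_factory=list)  # "not sure" material kept in full

    def render(self) -> str:
        """Render non-empty sections as plain text for the next prompt."""
        sections = [
            ("Decisions", self.decisions),
            ("Constraints (verbatim)", self.constraints),
            ("Kept in full", self.kept_verbatim),
        ]
        lines: List[str] = []
        for title, items in sections:
            if items:
                lines.append(f"{title}:")
                lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


def missing_constraints(summary_text: str, constraints: List[str]) -> List[str]:
    """Return the constraints that do not appear verbatim in the summary text."""
    return [c for c in constraints if c not in summary_text]
```

Running `missing_constraints` against a model-written summary is a cheap guard: if it returns anything, the summary paraphrased or dropped a constraint and should not replace the original yet.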

03 · How does it affect your decisions?

Are there tools or automated methods for prompt compression so I don't have to do it manually?

Several common automated approaches exist. The simplest is a sliding window: keep only the last N turns in full and truncate everything earlier. This is the bluntest method but sufficient for tasks where contextual continuity is not critical.
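A sliding window can be sketched in a few lines. The message shape (`role` and `content` keys) and the choice to always keep system messages are assumptions made for this example, as is the definition of a turn as one user message plus the reply that follows it.

```python
from typing import Dict, List

Message = Dict[str, str]


def sliding_window(messages: List[Message], keep_last_turns: int) -> List[Message]:
    """Keep system messages plus the last N user/assistant turns.

    A "turn" here is one user message and the assistant reply that follows,
    so N turns correspond to at most 2 * N non-system messages.
    """
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    if keep_last_turns <= 0:
        # Guard: dialogue[-0:] would return the whole list, not an empty one.
        return system
    return system + dialogue[-2 * keep_last_turns:]
```

Everything outside the window is simply gone, which is why this approach suits tasks where earlier context rarely matters.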

A step up is AI-assisted summarization: once the context reaches a set length, automatically send the earlier portion to Claude or a lighter model, ask it to condense that portion into key points, and replace the original with the summary. This works well, but each cycle costs an extra API call.

More complex systems add retrieval backed by a vector database (a RAG architecture): conversation history and documents are embedded as vectors, and only the relevant chunks are retrieved when needed instead of loading everything into context. This is the most common architecture for long-running agent systems, but also the most complex to implement.
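The retrieval idea can be illustrated with a toy in-memory store. Everything here is a stand-in: `embed` uses word counts instead of a real embedding model, `ChunkStore` is a hypothetical class rather than a real vector database, and ranking is plain cosine similarity. The point is only the shape of the flow: store chunks once, then pull the top few into context per query.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ChunkStore:
    """In-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Counter]] = []

    def add(self, chunk: str) -> None:
        self._items.append((chunk, embed(chunk)))

    def top_k(self, query: str, k: int = 2) -> List[str]:
        """Return the k stored chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]
```

Only the chunks returned by `top_k` would be placed in the prompt, so context size depends on `k` rather than on how much history has accumulated. Word-count matching misses paraphrases, which is the gap a real embedding model is meant to close.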

04 · What should you do?

Advanced: how is prompt compression strategy different in agent systems compared to normal conversations?

Compression in agent systems is considerably more complex than in regular conversations, because an agent accumulates large volumes of tool call logs, intermediate results, errors, and retry records during execution. Some of these are useful for subsequent steps; others are completely unnecessary.

A few agent-specific considerations. First, retain tool output selectively: if an agent queried a database, received 500 rows, and ultimately used 10 of them for a decision, the compressed representation should be 'the 10 rows used for the decision plus the decision conclusion,' not 500 rows of raw output. Second, distinguish resolved from unresolved errors: resolved errors can be compressed to 'tried X, failed, used Y, succeeded,' while unresolved errors must stay verbatim because they influence the planning of subsequent steps. Third, long-running agents almost certainly need a periodic compression mechanism built in by design, rather than one that only reacts to hitting the context limit.
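The first two considerations can be sketched as follows. The record types and field names (`ToolResult`, `ErrorRecord`, `used_row_indexes`, `resolved_by`) are hypothetical, and the sketch assumes the agent already tracks which rows it relied on. The `rows_returned` count is an extra added here so the original volume stays visible.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ToolResult:
    tool: str
    rows: List[dict]
    used_row_indexes: Sequence[int]  # rows the agent actually relied on
    conclusion: str


@dataclass
class ErrorRecord:
    attempted: str
    message: str                       # raw error text
    resolved_by: Optional[str] = None  # None means still unresolved


def compress_tool_result(result: ToolResult) -> dict:
    """Keep only the rows used for the decision, plus the conclusion."""
    return {
        "tool": result.tool,
        "rows_returned": len(result.rows),
        "rows_kept": [result.rows[i] for i in result.used_row_indexes],
        "conclusion": result.conclusion,
    }


def compress_error(err: ErrorRecord) -> str:
    """Resolved errors become a one-liner; unresolved ones keep the raw message."""
    if err.resolved_by is not None:
        return f"tried {err.attempted}, failed, used {err.resolved_by}, succeeded"
    return f"UNRESOLVED while trying {err.attempted}: {err.message}"
```

For the 500-row query in the text, `compress_tool_result` would carry forward 10 rows and the conclusion, while an unresolved error keeps its full message for later planning.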

Real-World Example

Scenario: you are 40 rounds into a conversation with Claude, collaborating on a technical article. The first 30 rounds explored several directions; you ultimately settled on 'security design for MCP servers' as the angle and confirmed a 1200-word length and an audience of intermediate developers.

Problem: keeping all 40 turns pushes the context over 60K tokens — expensive, and the rejected early directions are irrelevant.

Compressed context:

Summary (3 lines): 'We explored several angles and settled on MCP server security design. Main topics are permission control and transport encryption.'

Kept verbatim: the confirmed angle, the 1200-word limit, the audience spec, and the last three turns in full.

Cut: the full discussion of every rejected direction.

Result: context shrinks from 60K+ to around 8K. Claude still has everything it needs to continue writing.

Common Misconceptions

✕ Misconception 1

Myth: prompt compression means making the prompt as short as possible. In fact, shortness is not the goal; fitting all necessary information into the fewest tokens is. Over-compression, meaning cutting important details, leaves the model without the context it needs for good decisions. Compression seeks the best balance between information completeness and token efficiency, not pure minimization.

✕ Misconception 2

Myth: summarizing is always safe compression for any content. Summarization is lossy: some details will be lost, and you may not know in advance which ones will matter later. For critical constraints, confirmed decisions, or precise formatting requirements, a summary may omit exactly what the next step needs most. Keep these verbatim.

✕ Misconception 3

Myth: compression is emergency surgery you do only when you hit the context limit. Best practice is to design a compression strategy into your agent or application architecture so that it runs automatically on a schedule, rather than being improvised when the limit is hit. Proactive compression preserves context quality better than reactive truncation and makes cost control easier.

The Missing Link

Direct Impact

Prompt Compression's core trade-off is token efficiency vs information completeness.

More aggressive compression means lower cost and faster responses but higher risk of losing important details. More complete retention reduces information loss risk but raises cost and latency, and makes context overflow more likely.

There is no universal strategy, because what information will be needed later isn't always knowable at the start of a task. The most common practical compromise: keep the most recent turns verbatim, auto-summarize earlier history, and explicitly mark critical decisions and constraints in a way that prevents them from being compressed. This maintains reasonable quality and efficiency across most tasks while ensuring the most important information survives the summarization process.
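The compromise described above (recent turns verbatim, older history summarized, critical items protected) might look like the sketch below. The `pinned` flag, the `Msg` type, and the `user` role used for the summary message are assumptions for illustration. `summarize` is injected so that a call to Claude or a lighter model can be plugged in; `naive_summarize` is only a placeholder that truncates each message.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Msg:
    role: str
    content: str
    pinned: bool = False  # True for critical decisions and constraints


def compress_history(
    messages: List[Msg],
    keep_recent: int,
    summarize: Callable[[List[Msg]], str],
) -> List[Msg]:
    """Return pinned messages, one summary of older history, then recent messages."""
    if keep_recent <= 0:
        older, recent = messages, []
    else:
        older, recent = messages[:-keep_recent], messages[-keep_recent:]
    pinned = [m for m in older if m.pinned]            # survive untouched
    to_summarize = [m for m in older if not m.pinned]  # eligible for lossy summary
    result = list(pinned)
    if to_summarize:
        result.append(
            Msg("user", "Summary of earlier conversation: " + summarize(to_summarize))
        )
    return result + recent


def naive_summarize(msgs: List[Msg]) -> str:
    """Placeholder summarizer: first 40 characters of each message."""
    return " | ".join(m.content[:40] for m in msgs)
```

Pinned messages are moved ahead of the summary, so their order relative to the summarized material is not preserved. That is a deliberate simplification in this sketch and may need revisiting where ordering matters.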
