Prompt Caching

prompt-techniques Advanced

30-Second Version · For the impatient

An Anthropic API feature that caches a repeated System Prompt or other static context after it is first processed. Subsequent requests that reuse the cached segment pay only 10% of the normal input price for those tokens instead of recomputing the entire token sequence. For applications whose System Prompt meets the minimum cacheable length (1,024 tokens on many models), this can reduce API costs by roughly 20-40%, depending on call frequency and on how much of each request is cacheable.

Full Explanation

01 · What is this?

Prompt Caching is an Anthropic API feature that lets you mark portions of your API request as "this text is static, please cache it." Once a segment is cached, subsequent API calls that reuse it pay only 10% of the normal input price for those tokens. The same System Prompt is billed at the cache-write rate the first time (the base input price plus a 25% surcharge for the default 5-minute cache, per Anthropic's pricing at the time of writing), then at 10% for every subsequent read.

The most intuitive way to understand it: think of your System Prompt as a work manual. Without caching, every API call has Claude read the complete manual from start to finish (full price). With caching, after Claude reads it the first time, subsequent calls just "verify the manual is the same" (10% cost), which dramatically reduces per-call processing overhead.

Activation requirements: the segment marked for caching must reach a model-specific minimum length. Anthropic's documentation has listed 1,024 tokens for Claude 3.5 Sonnet and Claude 3 Opus and 2,048 tokens for Claude 3 Haiku; check the current documentation for newer models. The default cache lifetime is 5 minutes and resets each time the cache is used.
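The "5 minutes, resetting on each use" rule can be made concrete with a small simulation. This is a minimal sketch, not Anthropic's implementation: the function name `count_cache_hits` is hypothetical, timestamps are plain seconds, and a call landing exactly at the expiry instant is assumed to be a miss.

```python
CACHE_TTL_SECONDS = 5 * 60  # lifetime described above; restarted on every use


def count_cache_hits(call_times, ttl=CACHE_TTL_SECONDS):
    """Return (hits, misses) for call timestamps in seconds, sorted ascending."""
    hits = misses = 0
    expires_at = None  # no cache entry exists before the first call
    for t in call_times:
        if expires_at is not None and t < expires_at:
            hits += 1
        else:
            misses += 1  # first call, or the entry expired: the cache is rewritten
        expires_at = t + ttl  # both a write and a read restart the lifetime
    return hits, misses


# Calls at 0s, 60s and 200s, then a 10-minute gap, then two more calls.
print(count_cache_hits([0, 60, 200, 800, 850]))  # (3, 2)
```

The 10-minute gap before the call at 800s is what produces the second miss; the calls before it keep extending the lifetime simply by arriving in time.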

02 · Why does it exist?

Prompt Caching use cases by benefit, from high to low:

**Highest benefit**: large fixed System Prompts (over 2,000 tokens) with high daily call volumes. Per-call savings multiplied by daily call count creates substantial cumulative savings.

**Medium benefit**: applications injecting reference documents into Context (e.g., RAG systems with background documents in the System Prompt). These documents are often long, and caching significantly reduces costs.

**Low benefit or not applicable**: very short System Prompts (below threshold); System Prompt content differs per call (low cache hit rate); very low call frequency (5-minute cache lifetime means low hit rates for infrequent calls).
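How strongly the benefit depends on hit rate can be estimated with one formula. The sketch below is illustrative only: `relative_input_cost` is a hypothetical helper, the read multiplier of 0.10 comes from the text, and the `write_multiplier` default of 1.25 is an assumption reflecting the surcharge Anthropic's pricing has listed for 5-minute cache writes (pass 1.0 to bill a miss at the plain input price).

```python
def relative_input_cost(hit_rate, read_multiplier=0.10, write_multiplier=1.25):
    """Cost of the cached segment relative to sending it uncached (1.0 = no change)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * read_multiplier + (1.0 - hit_rate) * write_multiplier


for rate in (0.98, 0.50, 0.10):
    print(f"hit rate {rate:.0%}: {relative_input_cost(rate):.3f}x the uncached cost")
```

Under these assumptions a 98% hit rate brings the cached segment down to roughly an eighth of its uncached cost, while a 10% hit rate pushes it above 1.0, which is the "low benefit" case in numeric form.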

03 · How does it affect your decisions?

Technical implementation details for Prompt Caching:

```python
import anthropic

# The client reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

user_message = "[The user's question]"

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": "[Your static System Prompt text, over 1,024 tokens]",
            "cache_control": {"type": "ephemeral"},  # mark for caching
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)
```

Important: only the static portions of your System Prompt (parts identical across every call) are suitable for caching. If your System Prompt has dynamic content (current date, user name), place the dynamic content after the static portion and enable cache_control only for the static portion.

Monitoring cache hit rate: in the API response's `usage` field, `cache_creation_input_tokens` (first computation) and `cache_read_input_tokens` (cache reads) let you measure caching efficiency.
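One way to turn those two `usage` counters into a single hit-rate figure is sketched below. The helper `cache_hit_rate` is hypothetical and works on plain dictionaries so it can run without network access; in a real application the values would be copied from `response.usage` after each call, and the sample numbers here are invented for illustration.

```python
def cache_hit_rate(usages):
    """Share of cacheable prompt tokens that were served from cache.

    Each item is a mapping holding the two counters named above.
    """
    created = sum(u.get("cache_creation_input_tokens") or 0 for u in usages)
    read = sum(u.get("cache_read_input_tokens") or 0 for u in usages)
    total = created + read
    return read / total if total else 0.0


# Hypothetical counters from three calls: one cache write, then two reads.
sample = [
    {"cache_creation_input_tokens": 4500, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 4500},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 4500},
]
print(f"{cache_hit_rate(sample):.1%}")  # 66.7%
```

A rate that stays near zero usually means the marked segment is below the minimum length or changes between calls.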

04 · What should you do?

Prompt Caching best practices:

**Structure your System Prompt**: place static, universal rules first (this part gets cached); place dynamic, per-call content last (this part is not cached). This maximizes cache hit rate.

**Mind the cache lifetime**: the 5-minute cache lifetime means that if your application has call intervals exceeding 5 minutes in some periods, the cache will expire and the next call must recompute. For applications with irregular call patterns, actual cache savings may be less than expected.

**Pair with Prompt Compression**: first trim the System Prompt (reduce total tokens), then enable Caching (compute those trimmed tokens only once). Combining both techniques yields the best results.

**Suitable for long-document RAG**: if you have fixed reference documents to inject into Context, place them in the System Prompt with Caching enabled to significantly reduce per-query costs.
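The "static first, dynamic last" layout might look like the following. This is a sketch under assumptions: `build_system_blocks` is a hypothetical helper, and the placeholder strings stand in for real prompt content. The returned list is shaped to be passed as the `system` argument of `client.messages.create`.

```python
from datetime import date


def build_system_blocks(static_rules, dynamic_context):
    """Return a `system` list: cached static block first, uncached dynamic block last."""
    blocks = [
        {
            "type": "text",
            "text": static_rules,
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ]
    if dynamic_context:
        # No cache_control: this text changes per call and stays outside the cache.
        blocks.append({"type": "text", "text": dynamic_context})
    return blocks


system = build_system_blocks(
    static_rules="[Static rules and fixed reference documents]",
    dynamic_context=f"Today's date: {date.today().isoformat()}",
)
print(["cache_control" in block for block in system])  # [True, False]
```

Keeping the date in its own trailing block means the cached text stays byte-for-byte identical from one day to the next.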

Real-World Example

A legal AI assistant application has a System Prompt containing a complete legal liability disclaimer and a behavioral guidelines document, totaling 4,500 tokens. The application's users ask a total of 20,000 questions per day.

Without Prompt Caching: 4,500 tokens × 20,000 calls = 90M tokens/day

With Prompt Caching (assuming a 98% hit rate):
First computation: 4,500 tokens × 400 calls (2% misses) = 1.8M tokens
Cached reads: 4,500 × 10% × 19,600 calls = 8.82M token-equivalents
Total: 10.62M token-equivalents (11.8% of original)

System Prompt costs drop from ~$27/day to ~$3.2/day (figures that imply an input price of about $0.30 per million tokens), approximately $710/month in savings. The calculation bills cache misses at the normal input price; with a 25% cache-write surcharge the total rises slightly, to about 11.07M token-equivalents (12.3%). There is no change to the user experience; the only change is adding a cache_control marker to API requests.
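The arithmetic in this example can be reproduced in a few lines. The function name `daily_token_equivalents` and the constant `PRICE_PER_MILLION` are hypothetical; the $0.30 price is not stated in the example and is only inferred from "~$27/day for 90M tokens". The `write_multiplier` default of 1.0 matches the example's treatment of misses.

```python
def daily_token_equivalents(prompt_tokens, calls, hit_rate,
                            read_multiplier=0.10, write_multiplier=1.0):
    """Return (uncached, cached) daily token-equivalents for the System Prompt."""
    misses = round(calls * (1 - hit_rate))
    hits = calls - misses
    uncached = prompt_tokens * calls
    cached = prompt_tokens * (misses * write_multiplier + hits * read_multiplier)
    return uncached, cached


PRICE_PER_MILLION = 0.30  # assumption: implied by ~$27/day for 90M tokens

uncached, cached = daily_token_equivalents(4_500, 20_000, 0.98)
print(f"uncached: {uncached / 1e6:.2f}M, cached: {cached / 1e6:.2f}M "
      f"({cached / uncached:.1%} of original)")
print(f"daily cost: ${uncached / 1e6 * PRICE_PER_MILLION:.2f} -> "
      f"${cached / 1e6 * PRICE_PER_MILLION:.2f}")
```

Passing `write_multiplier=1.25` shows how a surcharge on cache writes shifts the total, which is useful when checking the estimate against a real invoice.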

Common Misconceptions

✕ Misconception 1

Prompt Caching affects Claude's response quality because it's "using cached answers."

Prompt Caching caches the "computation result" of the System Prompt (the model's internal representation of the System Prompt), not Claude's answers. Each call still has Claude generate a fresh response based on this cached context plus your current question. Caching doesn't affect output diversity or quality, only computation costs.

✕ Misconception 2

Any System Prompt exceeding 1,024 tokens definitely warrants enabling Prompt Caching.

Whether it is worthwhile depends on call frequency and cache hit rate. If your application's call frequency is very low (e.g., only a few calls per hour), the 5-minute cache lifetime means most calls are cache misses that pay the cache-write price, so the benefit is minimal; and because cache writes are billed slightly above the normal input price, caching can even cost more than not caching. Prompt Caching is best suited to high-frequency applications (multiple calls per minute).
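The link between call frequency and hit rate can be roughed out numerically. The sketch below assumes calls arrive as a Poisson process, which real traffic only approximates; under that assumption a call is a hit when the gap since the previous call is shorter than the cache lifetime. The function name `expected_hit_rate` is hypothetical.

```python
import math


def expected_hit_rate(calls_per_hour, ttl_minutes=5.0):
    """P(next call arrives before the cache expires), assuming Poisson arrivals."""
    rate_per_minute = calls_per_hour / 60.0
    return 1.0 - math.exp(-rate_per_minute * ttl_minutes)


for calls_per_hour in (3, 12, 120):
    rate = expected_hit_rate(calls_per_hour)
    print(f"{calls_per_hour:>4} calls/hour -> {rate:.0%} expected hit rate")
```

With a 10% read price and an assumed 25% write surcharge, the break-even hit rate is 0.25 / 1.15, about 22%, so a few calls per hour sits close to the point where caching stops paying for itself.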

The Missing Link

Direct Impact

Prompt Caching has few notable trade-offs: it is a cost optimization with no impact on output quality and a very low implementation cost (just add a marker to API requests). The main consideration is cache expiration. If your application has long idle periods (over 5 minutes), the first call after the cache expires must write the cache again rather than read it. For applications with very irregular call patterns, actual savings may be less than expected. Additionally, caching only helps with the static portion of the System Prompt. If your application must generate a different System Prompt on every call, Prompt Caching does not apply.
