Prompt Caching

Core Concepts Intermediate

30-Second Version · For the impatient

An Anthropic API feature that lets frequently reused prompt prefixes (such as a fixed System Prompt or long background documents) be computed once; subsequent calls read the cached version at 90% lower input cost for those tokens and with reduced latency. For applications with a long fixed System Prompt, enabling Prompt Caching typically saves 20-40% of total API costs, depending on how much of each request is the cached prefix.

Full Explanation

01 · What is this?

How does Prompt Caching work: what is a "cache prefix" and why does it dramatically reduce costs?

Claude API billing is per Token — each API call's input token count determines input cost; output token count determines output cost.

In many applications, each API call's "first half" (System Prompt, background documents, role setup) is identical; only the "second half" (user's question) changes each time. Without Prompt Caching, every API call sends the complete System Prompt to Anthropic's servers for recomputation — even if identical.

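Prompt Caching works as follows: on the first API call, Anthropic's servers compute and store the "intermediate computation state" (KV Cache) for the marked prefix; subsequent calls with an identical prefix read this cache instead of recomputing it. A cache read costs 10% of the base input price, so those tokens cost 90% less on a cache hit. The call that writes the cache is billed slightly above the base input price (25% more for the default 5-minute cache), so the savings come from the hits that follow.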

The cache TTL is 5 minutes by default and is reset on each hit. As long as calls sharing the same prefix arrive at least once every 5 minutes, the cache stays effective.
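The TTL behavior can be pictured with a small simulation. This is only a sketch of the rule described above, not of Anthropic's actual cache implementation; the function name and the simplification that every call restarts the timer are assumptions.

```python
CACHE_TTL_SECONDS = 5 * 60  # 5-minute TTL, as described above


def simulate_cache_hits(call_times_seconds):
    """Return one boolean per call: True where the call would hit the cache.

    Simplified model (an assumption): the first call writes the cache, and a
    later call hits if it arrives within the TTL of the previous call.
    """
    hits = []
    expires_at = None  # no cache entry exists before the first call
    for t in call_times_seconds:
        hit = expires_at is not None and t <= expires_at
        hits.append(hit)
        # A hit refreshes the entry and a miss rewrites it, so the timer restarts either way.
        expires_at = t + CACHE_TTL_SECONDS
    return hits


# Calls at 0, 4, 8 and 20 minutes
print(simulate_cache_hits([0, 240, 480, 1200]))  # [False, True, True, False]
```

The last call misses because 12 minutes passed since the previous one, even though the earlier calls kept the cache alive.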

Example cost calculation: a 3,000-token System Prompt plus a 100-token user question at $0.003 per 1K input tokens. Without caching: (3,000 + 100) × $0.003/1K = $0.0093 per call. On a cache hit: 3,000 × $0.003/1K × 10% + 100 × $0.003/1K = $0.0009 + $0.0003 = $0.0012 per call, an 87% saving.
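The same arithmetic as a small function, which makes it easy to plug in your own token counts. The function and parameter names are illustrative, and the rate and the 10% read multiplier are taken from the example above rather than looked up from live pricing.

```python
def input_cost(system_tokens, user_tokens, rate_per_1k=0.003,
               cache_hit=False, cache_read_multiplier=0.10):
    """Input cost in dollars for one call, with or without a cache hit on the system prompt."""
    system_rate = rate_per_1k * (cache_read_multiplier if cache_hit else 1.0)
    return (system_tokens * system_rate + user_tokens * rate_per_1k) / 1000


uncached = input_cost(3000, 100)
cached = input_cost(3000, 100, cache_hit=True)
savings = 1 - cached / uncached

print(f"uncached: ${uncached:.4f}")  # uncached: $0.0093
print(f"cached:   ${cached:.4f}")    # cached:   $0.0012
print(f"savings:  {savings:.0%}")    # savings:  87%
```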

02 · Why does it exist?

How do you correctly implement Prompt Caching in the API? What are common implementation mistakes?

Correct implementation:

import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = anthropic.Anthropic()

user_question = "Your user's question"  # the part that changes on every call

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": "Your fixed System Prompt content (must exceed 1,024 tokens)",
            "cache_control": {"type": "ephemeral"}  # This enables caching
        }
    ],
    messages=[{"role": "user", "content": user_question}]
)

# Check cache status
print(response.usage.cache_creation_input_tokens)  # First call: tokens written to the cache
print(response.usage.cache_read_input_tokens)       # Subsequent calls: tokens read from the cache
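To make the two usage fields easier to act on, a small helper can turn them into a label for logging. This is a sketch: `cache_status` is a hypothetical helper, not part of the SDK, and it only assumes the usage object exposes the two attributes printed above (either of which may be missing or `None`).

```python
from types import SimpleNamespace


def cache_status(usage):
    """Classify a response's usage object as 'hit', 'write', or 'none'."""
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read > 0:
        return "hit"    # at least part of the prefix was read from the cache
    if created > 0:
        return "write"  # the prefix was computed and stored on this call
    return "none"       # caching did not apply (for example, prefix too short)


# Stand-in usage objects for illustration; in real code, pass response.usage
first_call = SimpleNamespace(cache_creation_input_tokens=3000, cache_read_input_tokens=0)
later_call = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=3000)
print(cache_status(first_call), cache_status(later_call))  # write hit
```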

Common implementation mistakes:

Mistake 1: System Prompt too short (under 1,024 tokens). Anthropic's minimum cacheable length is 1,024 tokens (the minimum is higher for some models); below it, the request is processed without caching, so adding cache_control has no effect.

Mistake 2: Cache marker placed on content that changes every time. cache_control covers the entire prompt prefix up to and including the marked block, and the cache only hits when that prefix is exactly identical. If the marked prefix contains dynamic user information, the cache will never hit.

Mistake 3: Not verifying cache is actually hitting. Check cache_read_input_tokens in the response usage field; if always 0, cache isn't hitting — check configuration.
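One way to avoid the first two mistakes is to build requests through a single function that keeps everything dynamic out of the cached prefix. The sketch below is an assumption-laden illustration: `build_request` and its parameters are hypothetical, and the token count is expected to be measured elsewhere and passed in.

```python
MIN_CACHEABLE_TOKENS = 1024  # threshold quoted above; may differ by model


def build_request(static_instructions, static_token_count, user_name, user_question):
    """Build Messages API arguments with dynamic data kept out of the cached prefix."""
    system_block = {"type": "text", "text": static_instructions}
    # Mistake 1: only mark the block when it is long enough to be cached.
    if static_token_count >= MIN_CACHEABLE_TOKENS:
        system_block["cache_control"] = {"type": "ephemeral"}
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "system": [system_block],
        # Mistake 2: per-user details go after the cached prefix, in the user turn.
        "messages": [
            {"role": "user", "content": f"[User: {user_name}]\n{user_question}"}
        ],
    }


req_a = build_request("Fixed instructions...", 3000, "Alice", "Summarize clause 4.")
req_b = build_request("Fixed instructions...", 3000, "Bob", "List the deadlines.")
print(req_a["system"] == req_b["system"])  # True: the prefix is identical across users
```

The resulting dictionary can be passed as `client.messages.create(**req_a)`.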

03 · How does it affect your decisions?

What application scenarios are best suited for Prompt Caching? How to calculate expected cost savings?

Best scenarios for Prompt Caching:

Multi-turn conversation applications — same System Prompt each round, only user messages and assistant replies accumulate. One of the highest-benefit caching scenarios since the System Prompt only needs computing once per conversation.

Knowledge base Q&A (RAG) — long knowledge base document in System Prompt; different users' questions use the same document as context. Knowledge base documents may be 50,000-100,000 tokens; caching savings are very significant.

Batch Processing — analyzing large volumes of documents (100 contracts, 1,000 emails) with identical analysis instructions. Instructions serve as cache prefix; each document as user input.

Cost savings estimation formula:

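Expected monthly savings = System Prompt tokens × call count × cache hit rate × 0.9 × input rate per token

Here 0.9 is the discount on cached reads; the estimate ignores the small extra cost of cache writes.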

Simplified estimate: 2,000-Token System Prompt, 10,000 monthly API calls, 90% cache hit rate (call frequency higher than once per 5 min), Sonnet 4.5 ($0.003/1K input): Monthly savings ≈ 2,000 × 10,000 × 0.9 × 0.9 × $0.003 / 1,000 = $48.6.
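The estimate above as a reusable function. The names are illustrative, and like the formula it ignores the extra cost of cache writes.

```python
def expected_monthly_savings(system_tokens, monthly_calls, hit_rate,
                             rate_per_1k, read_discount=0.9):
    """Approximate monthly savings in dollars from caching the system prompt."""
    return system_tokens * monthly_calls * hit_rate * read_discount * rate_per_1k / 1000


savings = expected_monthly_savings(
    system_tokens=2000, monthly_calls=10_000, hit_rate=0.9, rate_per_1k=0.003
)
print(f"${savings:.2f}")  # $48.60
```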

This shows: for high-frequency API applications, Prompt Caching savings are substantial — far exceeding setup costs (a few minutes of code changes).

04 · What should you do?

How to combine Prompt Caching with other cost optimization strategies (model downgrading, output length control) for best effect?

Prompt Caching is one of several main Claude API cost optimization tools, working better in combination:

Combination 1: Prompt Caching + Tiered routing

Use Haiku for classification routing (determining request complexity), then select Sonnet or Opus by complexity. Enable Prompt Caching separately for each model. This combination typically reduces costs 60-75%: Haiku's low cost handles routing, Prompt Caching reduces System Prompt costs for each model, tiered routing reduces high-cost model usage.
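A minimal sketch of the routing step, assuming the complexity label has already been produced by a cheap classification call. The model IDs, label names, and the `route` function are illustrative assumptions, not a prescribed setup.

```python
# Illustrative model IDs; check the current model list before relying on them.
MODEL_BY_COMPLEXITY = {
    "simple": "claude-haiku-4-5",
    "moderate": "claude-sonnet-4-5",
    "complex": "claude-opus-4-1",
}
DEFAULT_MODEL = "claude-sonnet-4-5"


def route(complexity, system_prompt, user_question):
    """Pick a model from a complexity label and build cached request arguments."""
    model = MODEL_BY_COMPLEXITY.get(complexity, DEFAULT_MODEL)
    return {
        "model": model,
        "max_tokens": 1024,
        # The same marked prefix is sent to whichever model is chosen;
        # each model builds and reads its own cache entry.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_question}],
    }


print(route("simple", "Fixed instructions...", "What is the notice period?")["model"])
```

Unknown labels fall back to the mid-tier model, which keeps a misclassification from silently escalating cost.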

Combination 2: Prompt Caching + Conversation history management

For long conversations, control history within reasonable bounds (compress old history with summaries) while enabling caching for System Prompt. Caching handles fixed prefix costs; history management controls dynamic part growth.
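History compression might look like the sketch below. It assumes message contents are plain strings and that a summary of the older turns has already been produced by some other step; `compress_history` is a hypothetical helper. The System Prompt is untouched, so its cached prefix stays valid.

```python
def compress_history(messages, summary, keep_last=4):
    """Replace all but the most recent turns with a summary folded into the first kept user turn."""
    if len(messages) <= keep_last:
        return list(messages)
    recent = list(messages[len(messages) - keep_last:])
    # Keep the turn order valid: the trimmed history should start with a user turn.
    while recent and recent[0]["role"] != "user":
        recent.pop(0)
    prefix = f"Summary of earlier conversation: {summary}"
    if not recent:
        return [{"role": "user", "content": prefix}]
    first = recent[0]
    recent[0] = {"role": "user", "content": f"{prefix}\n\n{first['content']}"}
    return recent


history = [
    {"role": "user", "content": "u1"}, {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "u2"}, {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "u3"}, {"role": "assistant", "content": "a3"},
]
trimmed = compress_history(history, "User asked about two contracts.", keep_last=3)
print(len(trimmed), trimmed[0]["role"])  # 2 user
```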

Combination 3: Prompt Caching + Batch API

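Anthropic's Batch API lets you submit many requests at once at 50% of real-time API cost (processing takes longer; suitable for non-real-time tasks). Combined with Prompt Caching, Batch Processing cost savings can reach 80-90%. Because batch requests are processed asynchronously, cache hits within a batch are best-effort rather than guaranteed.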
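A rough way to see where a figure in that range comes from is to multiply the two discounts. The sketch assumes the discounts stack multiplicatively and that every request hits the cache, which is optimistic; the function name and example token counts are illustrative.

```python
def cached_batch_cost(cached_tokens, fresh_tokens, rate_per_1k=0.003,
                      batch_multiplier=0.5, cache_read_multiplier=0.1):
    """Input cost in dollars for one batched request whose prefix hits the cache."""
    rate = rate_per_1k * batch_multiplier
    return (cached_tokens * rate * cache_read_multiplier + fresh_tokens * rate) / 1000


# 3,000-token cached instructions plus a 500-token document per request
baseline = (3000 + 500) * 0.003 / 1000
combined = cached_batch_cost(3000, 500)
print(f"baseline ${baseline:.4f}, combined ${combined:.4f}, saving {1 - combined / baseline:.0%}")
```

The saving shrinks as the uncached portion of each request grows, since only the batch discount applies to it.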

Practical impact on your API costs: if your monthly Claude API costs exceed $100, evaluate optimizations in this order: first check if Prompt Caching is enabled (5-minute setup, fastest savings); then analyze request distribution to evaluate tiered routing feasibility; finally consider Batch API for non-real-time tasks.

Real-World Example

A legal tech SaaS product using Claude for contract analysis — illustrating Prompt Caching ROI calculation in a real product:

Product background: core feature is "upload contract, automatically identify key clauses and risks." Each analysis call's System Prompt includes: role setup (100 tokens) + analysis framework description (500 tokens) + legal terminology dictionary (1,800 tokens) + output format specification (600 tokens) = 3,000 tokens total. Average uploaded contract: 5,000 tokens.

Before Prompt Caching: each API call input = 3,000 + 5,000 = 8,000 tokens. At Sonnet 4.5's $0.003/1K: input cost per call = $0.024. 5,000 monthly analyses: $120/month input costs.

After Prompt Caching: System Prompt on a cache hit: 3,000 × $0.003/1K × 10% = $0.0009. Contract portion (5,000 tokens, different each time): 5,000 × $0.003/1K = $0.015. Total input per call = $0.0159 when the call hits the cache. If all 5,000 monthly analyses hit: $79.5/month, a $40.5 saving (34% reduction). At a 90% hit rate the saving is somewhat smaller, because the misses pay the full System Prompt price.
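The sensitivity of this example to the hit rate can be checked with a few lines. This is a sketch with illustrative names; it treats a miss as costing the normal input price and ignores the cache-write premium. In this model, the $79.5 figure corresponds to a hit rate of 1.0.

```python
def monthly_input_cost(system_tokens, doc_tokens, calls, hit_rate,
                       rate_per_1k=0.003, read_multiplier=0.10):
    """Monthly input cost in dollars for a cached system prompt plus a per-call document."""
    hit_cost = system_tokens * rate_per_1k * read_multiplier / 1000
    miss_cost = system_tokens * rate_per_1k / 1000
    system_cost = hit_rate * hit_cost + (1 - hit_rate) * miss_cost
    doc_cost = doc_tokens * rate_per_1k / 1000
    return calls * (system_cost + doc_cost)


for rate in (0.0, 0.9, 1.0):
    cost = monthly_input_cost(3000, 5000, 5000, hit_rate=rate)
    print(f"hit rate {rate:.0%}: ${cost:.2f}/month")
# hit rate 0%: $120.00/month
# hit rate 90%: $83.55/month
# hit rate 100%: $79.50/month
```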

Engineering time to set up: about 15 minutes (a few lines of code). At roughly $40/month in savings, the change pays for itself within the first month.

Common Misconceptions

✕ Misconception 1

Prompt Caching is only for large enterprises; small-scale users can't benefit. In fact, Prompt Caching works for any API user with a fixed System Prompt over 1,024 tokens, regardless of scale. Even with only 1,000 monthly API calls, a 5,000-token System Prompt can see savings of around 40%, provided the calls arrive close enough together to hit the cache. The only hard threshold is the 1,024-token minimum. If your System Prompt falls just short, adding background documents or a knowledge base that the model actually needs can push it over the threshold.

✕ Misconception 2

Prompt Caching has a 5-minute limit, so it only works for high-frequency API calls; low-frequency applications can't benefit. In fact, the 5 minutes is the cache's automatic expiration time, and each hit resets the timer. If your application makes API calls every 3-4 minutes, the cache stays continuously effective. Even if some calls miss the cache (requiring recomputation), only those calls lack savings; other hits aren't affected. For intermittent use (like a few concentrated usage periods per day), Prompt Caching still provides savings throughout each active period.

The Missing Link

Direct Impact

Prompt Caching is almost purely beneficial, with no significant downside: setup is simple (a few lines of code), output quality is unaffected (what is cached is the input computation state, not the output), and usage is completely transparent (Claude behaves identically, it just costs less). The main trade-off to note is the 5-minute TTL: for very low-frequency API calls (more than 5 minutes between calls), the cache hit rate may be low and actual savings limited, especially since cache writes are billed slightly above the base input price. A secondary consideration: the System Prompt must exceed 1,024 tokens to cache. If your System Prompt is short but works well, artificially inflating it to reach the cache threshold may not be a good idea, since a longer System Prompt costs more on every cache miss and may not be more cost-effective than a short one.
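The low-frequency trade-off can be quantified with a break-even calculation. The sketch below assumes a cache write costs 1.25 times the base input price and a cache read 0.10 times; the write multiplier is not stated in this article and should be checked against current pricing. Function names are illustrative.

```python
def effective_multiplier(hit_rate, write_multiplier=1.25, read_multiplier=0.10):
    """Average price multiplier on cached-prefix tokens, relative to no caching (1.0)."""
    return hit_rate * read_multiplier + (1 - hit_rate) * write_multiplier


def break_even_hit_rate(write_multiplier=1.25, read_multiplier=0.10):
    """Hit rate at which caching costs the same as not caching."""
    return (write_multiplier - 1.0) / (write_multiplier - read_multiplier)


print(f"multiplier at 90% hits: {effective_multiplier(0.9):.3f}")  # 0.215
print(f"break-even hit rate:    {break_even_hit_rate():.1%}")      # 21.7%
```

Under these assumptions, caching only loses money when fewer than roughly one call in five hits the cache.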
