How does Prompt Caching work: what is a "cache prefix" and why does it dramatically reduce costs?
Claude API billing is per token — each API call's input token count determines input cost; output token count determines output cost.
In many applications, each API call's "first half" (System Prompt, background documents, role setup) is identical; only the "second half" (user's question) changes each time. Without Prompt Caching, every API call sends the complete System Prompt to Anthropic's servers for recomputation — even if identical.
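Prompt Caching works as follows: on the first API call, Anthropic's servers compute and store the System Prompt's "intermediate computation state" (KV Cache); subsequent calls read this cache instead of recomputing it. Cached tokens read on a hit are billed at 10% of the normal input rate, so their cost drops 90%. The call that writes the cache is billed at a premium (1.25× the normal input rate for the default 5-minute cache), so caching pays off from the first hit onward.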
Cache TTL is 5 minutes, reset on each hit. As long as consecutive API calls that share the same prefix arrive less than 5 minutes apart, the cache stays warm.
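To see how the sliding TTL translates into a hit rate, here is a minimal sketch that replays a list of call timestamps. The function name and the strict "less than TTL" boundary are assumptions made for illustration; it also assumes every call (hit or miss) leaves a fresh cache entry behind.

```python
CACHE_TTL_SECONDS = 5 * 60  # default TTL described above


def estimate_hit_rate(call_times, ttl=CACHE_TTL_SECONDS):
    """Estimate the fraction of calls that find a warm cache.

    call_times: sorted timestamps in seconds for calls sharing one prefix.
    Assumption: each call refreshes (or rewrites) the cache, so a call hits
    whenever it arrives less than `ttl` seconds after the previous call.
    """
    if not call_times:
        return 0.0
    hits = sum(
        1
        for previous, current in zip(call_times, call_times[1:])
        if current - previous < ttl
    )
    # The first call can never hit, so it counts in the denominator only.
    return hits / len(call_times)


# Five calls; the 380-second gap before the fourth call lets the cache expire.
print(estimate_hit_rate([0, 60, 120, 500, 560]))
```

Feeding real request logs into a function like this gives a rough hit rate to plug into the savings estimates later in this article.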
Example cost calculation: 3,000-token System Prompt + 100-token user question. Original: (3,000 + 100) × $0.003/1K = $0.0093/call. With caching, on a cache hit: 3,000 × $0.003/1K × 10% + 100 × $0.003/1K = $0.0009 + $0.0003 = $0.0012/call, an 87% saving.
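The arithmetic above can be reproduced with a short sketch. The function names are illustrative, and the 10% read multiplier and $0.003/1K rate are simply the figures used in this article.

```python
def input_cost(tokens, rate_per_1k, multiplier=1.0):
    """Cost in dollars of `tokens` input tokens at a per-1K rate."""
    return tokens / 1000 * rate_per_1k * multiplier


def per_call_costs(prefix_tokens, dynamic_tokens, rate_per_1k, read_multiplier=0.10):
    """Return (uncached cost, cost on a cache hit, fractional saving)."""
    uncached = input_cost(prefix_tokens + dynamic_tokens, rate_per_1k)
    cached = input_cost(prefix_tokens, rate_per_1k, read_multiplier) + input_cost(
        dynamic_tokens, rate_per_1k
    )
    return uncached, cached, 1 - cached / uncached


uncached, cached, saving = per_call_costs(3000, 100, 0.003)
print(f"uncached ${uncached:.4f}, cache hit ${cached:.4f}, saving {saving:.0%}")
# prints: uncached $0.0093, cache hit $0.0012, saving 87%
```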
How do you correctly implement Prompt Caching in the API? What are common implementation mistakes?
Correct implementation:
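    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    user_question = "Which clauses in this contract carry the most risk?"  # example input

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,  # required by the Messages API
        system=[
            {
                "type": "text",
                "text": "Your fixed System Prompt content (at least 1,024 tokens)",
                "cache_control": {"type": "ephemeral"},  # cache the prefix up to and including this block
            }
        ],
        messages=[{"role": "user", "content": user_question}],
    )

    # Check cache status
    print(response.usage.cache_creation_input_tokens)  # first call: non-zero (cache written)
    print(response.usage.cache_read_input_tokens)  # subsequent calls: non-zero (cache read)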
Common implementation mistakes:
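Mistake 1: System Prompt too short (under 1,024 tokens). The minimum cacheable prefix is 1,024 tokens on Sonnet-class models, and some models (such as Haiku) require more. Below the minimum, the request is processed normally without caching and without an error, so adding cache_control has no benefit.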
Mistake 2: Cache marker added to content that changes every time. cache_control covers the whole prefix up to and including the marked block, and the cache only hits when that prefix is exactly identical. If the marked prefix contains dynamic user information, the cache will never hit.
Mistake 3: Not verifying cache is actually hitting. Check cache_read_input_tokens in the response usage field; if always 0, cache isn't hitting — check configuration.
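A small helper makes the check in Mistake 3 routine. This is a sketch: the function name is hypothetical, and it reads only the two usage fields mentioned above. `SimpleNamespace` objects stand in for real response usage objects so the example runs without an API key.

```python
from types import SimpleNamespace


def cache_status(usage):
    """Classify a response's usage object as 'hit', 'write', or 'none'."""
    # Treat a missing or None field as zero.
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"  # at least part of the prefix was served from cache
    if created > 0:
        return "write"  # prefix was cached by this call, nothing read back
    return "none"  # caching not in effect (prefix too short, no marker, ...)


# Stand-ins for response.usage from a first call and a follow-up call.
first = SimpleNamespace(cache_creation_input_tokens=3000, cache_read_input_tokens=0)
later = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=3000)
print(cache_status(first), cache_status(later))
```

Logging this value for each request and alerting when it stays at "none" or "write" is one way to catch a prefix that silently changes between calls.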
What application scenarios are best suited for Prompt Caching? How to calculate expected cost savings?
Best scenarios for Prompt Caching:
Multi-turn conversation applications — same System Prompt each round, only user messages and assistant replies accumulate. One of the highest-benefit caching scenarios since the System Prompt only needs computing once per conversation.
Knowledge base Q&A (RAG) — long knowledge base document in System Prompt; different users' questions use the same document as context. Knowledge base documents may be 50,000-100,000 tokens; caching savings are very significant.
Batch processing — analyzing large volumes of documents (100 contracts, 1,000 emails) with identical analysis instructions. Instructions serve as cache prefix; each document as user input.
Cost savings estimation formula:
Expected monthly savings = System Prompt tokens × call count × cache hit rate × 0.9 × input rate per token
Simplified estimate: 2,000-token System Prompt, 10,000 monthly API calls, 90% cache hit rate (call frequency higher than once per 5 min), Sonnet 4.5 ($0.003/1K input): Monthly savings ≈ 2,000 × 10,000 × 0.9 × 0.9 × $0.003 / 1,000 = $48.6.
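The estimate above can be written as a small function. This is a sketch with illustrative names; the optional `write_premium` parameter is an addition not present in the formula, included on the assumption that cache misses pay Anthropic's 1.25× cache-write rate on the prefix.

```python
def expected_monthly_savings(
    prefix_tokens, calls, hit_rate, rate_per_1k, read_discount=0.9, write_premium=0.0
):
    """Monthly savings in dollars from caching a fixed prefix.

    write_premium: extra fraction of the input rate paid on cache misses
    (0.0 reproduces the simplified formula; 0.25 models a 1.25x write rate).
    """
    per_token = rate_per_1k / 1000
    saved_on_hits = prefix_tokens * calls * hit_rate * read_discount * per_token
    extra_on_misses = prefix_tokens * calls * (1 - hit_rate) * write_premium * per_token
    return saved_on_hits - extra_on_misses


# The simplified estimate from the text: about $48.6 per month.
print(round(expected_monthly_savings(2000, 10_000, 0.9, 0.003), 2))
# The same scenario with the write premium on misses: about $47.1 per month.
print(round(expected_monthly_savings(2000, 10_000, 0.9, 0.003, write_premium=0.25), 2))
```

At a high hit rate the write premium barely moves the result; it matters mainly when the hit rate is low.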
This shows: for high-frequency API applications, Prompt Caching savings are substantial — far exceeding setup costs (a few minutes of code changes).
How to combine Prompt Caching with other cost optimization strategies (model downgrading, output length control) for best effect?
Prompt Caching is one of several main Claude API cost optimization tools, working better in combination:
Combination 1: Prompt Caching + Tiered routing
Use Haiku for classification routing (determining request complexity), then select Sonnet or Opus by complexity. Enable Prompt Caching separately for each model. This combination typically reduces costs 60-75%: Haiku's low cost handles routing, Prompt Caching reduces System Prompt costs for each model, tiered routing reduces high-cost model usage.
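A minimal sketch of the routing step, under several assumptions: the tier labels are hypothetical, the Haiku and Opus model names are placeholders to replace with whatever identifiers your account uses, and the tier is assumed to come from a separate Haiku classification call that is not shown. The function only builds request parameters, so it runs without an API key.

```python
# Placeholder model identifiers; only "claude-sonnet-4-5" appears in this article.
MODEL_BY_TIER = {
    "simple": "claude-haiku-4-5",
    "standard": "claude-sonnet-4-5",
    "complex": "claude-opus-4-1",
}


def build_request(tier, system_prompt, user_question, max_tokens=1024):
    """Build Messages API parameters for the model matching a complexity tier."""
    model = MODEL_BY_TIER.get(tier, MODEL_BY_TIER["standard"])  # unknown tier: fall back
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Each model keeps its own cache, so every tier carries the marker.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_question}],
    }


params = build_request("simple", "Fixed system prompt...", "What is the notice period?")
print(params["model"])
```

The resulting dictionary can be passed as `client.messages.create(**params)`. Note that a rarely used tier may see few cache hits, since its prefix has to be requested often enough to stay within the TTL.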
Combination 2: Prompt Caching + Conversation history management
For long conversations, control history within reasonable bounds (compress old history with summaries) while enabling caching for System Prompt. Caching handles fixed prefix costs; history management controls dynamic part growth.
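One possible shape for the history-management half of this combination is sketched below. `compress_history` and its `summarize` callback are hypothetical names; in practice the summary would likely come from a cheap model call, which is not shown here.

```python
def compress_history(messages, keep_last=6, summarize=None):
    """Replace all but the most recent messages with a single summary message.

    messages: list of {"role": ..., "content": ...} dicts, oldest first.
    summarize: optional callable taking the old messages and returning a string.
    """
    if len(messages) <= keep_last:
        return list(messages)
    old, recent = messages[:-keep_last], messages[-keep_last:]
    if summarize is not None:
        summary = summarize(old)
    else:
        summary = f"[{len(old)} earlier messages omitted]"  # fallback placeholder
    summary_message = {"role": "user", "content": f"Summary of earlier conversation: {summary}"}
    return [summary_message] + recent


history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(10)
]
trimmed = compress_history(history, keep_last=4)
print(len(trimmed), trimmed[0]["content"])
```

The System Prompt stays outside this list, so trimming the history does not disturb the cached prefix.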
Combination 3: Prompt Caching + Batch API
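Anthropic's Batch API lets you submit many requests at once at 50% of real-time API cost (processing takes longer; suitable for non-real-time tasks). The batch discount stacks with Prompt Caching, so combined input-cost savings can reach 80-90% when most input tokens sit in the cached prefix. Cache hits inside a batch are best-effort, because requests are processed concurrently and in no guaranteed order.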
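As a sketch of how the two features fit together, the function below builds one batch entry per document, all sharing the same cache-marked instructions. The function name and the sample document IDs are illustrative; the entry shape (`custom_id` plus `params`) follows the Message Batches API.

```python
def build_batch_requests(documents, instructions, model="claude-sonnet-4-5", max_tokens=1024):
    """Build Message Batches entries that share one cacheable system prefix.

    documents: dict mapping a unique ID to the document text.
    """
    shared_system = [
        {
            "type": "text",
            "text": instructions,
            "cache_control": {"type": "ephemeral"},  # identical prefix in every entry
        }
    ]
    return [
        {
            "custom_id": doc_id,  # used to match results back to documents
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "system": shared_system,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for doc_id, text in documents.items()
    ]


requests = build_batch_requests(
    {"contract-001": "First contract text...", "contract-002": "Second contract text..."},
    instructions="Fixed analysis instructions...",
)
print([r["custom_id"] for r in requests])
```

The list can then be submitted with `client.messages.batches.create(requests=requests)` from the Anthropic Python SDK.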
Practical impact on your API costs: if your monthly Claude API costs exceed $100, evaluate optimizations in this order: first check if Prompt Caching is enabled (5-minute setup, fastest savings); then analyze request distribution to evaluate tiered routing feasibility; finally consider Batch API for non-real-time tasks.
A legal tech SaaS product using Claude for contract analysis — illustrating Prompt Caching ROI calculation in a real product:
Product background: core feature is "upload contract, automatically identify key clauses and risks." Each analysis call's System Prompt includes: role setup (100 tokens) + analysis framework description (500 tokens) + legal terminology dictionary (1,800 tokens) + output format specification (600 tokens) = 3,000 tokens total. Average uploaded contract: 5,000 tokens.
Before Prompt Caching: each API call input = 3,000 + 5,000 = 8,000 tokens. At Sonnet 4.5's $0.003/1K: input cost per call = $0.024. 5,000 monthly analyses: $120/month input costs.
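After Prompt Caching: System Prompt on cache hit: 3,000 × $0.003/1K × 10% = $0.0009. Contract portion (5,000 tokens, different each time): 5,000 × $0.003/1K = $0.015. Total input per call on a cache hit = $0.0159. If all 5,000 monthly analyses hit the cache: $79.5/month, a $40.5 saving (34% reduction). That is the best case. At a 90% cache hit rate, with the remaining calls paying the 1.25× cache-write rate on the System Prompt, input costs come to about $84.7/month, a saving of about $35.3 (29%).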
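The per-call figure of $0.0159 is the cost of a cache hit, so the monthly total depends on how many calls actually hit. A minimal sketch of the blended calculation follows; the function name is illustrative, and the 1.25× write multiplier on misses is an assumption based on Anthropic's pricing for the default 5-minute cache.

```python
def monthly_input_cost(
    prefix_tokens,
    dynamic_tokens,
    calls,
    rate_per_1k,
    hit_rate=1.0,
    read_multiplier=0.10,
    write_multiplier=1.25,
):
    """Monthly input cost in dollars with caching, blended over hits and misses."""
    per_token = rate_per_1k / 1000
    hit_cost = (prefix_tokens * read_multiplier + dynamic_tokens) * per_token
    miss_cost = (prefix_tokens * write_multiplier + dynamic_tokens) * per_token
    return calls * (hit_rate * hit_cost + (1 - hit_rate) * miss_cost)


baseline = (3000 + 5000) / 1000 * 0.003 * 5000  # no caching: $120/month
for rate in (1.0, 0.9):
    cost = monthly_input_cost(3000, 5000, 5000, 0.003, hit_rate=rate)
    print(f"hit rate {rate:.0%}: ${cost:.2f}/month, saving {1 - cost / baseline:.0%}")
```

Because the 5,000-token contract is different on every call, the achievable saving is capped by the share of input tokens that sit in the fixed prefix.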
Engineering time to set up: ~15 minutes (a few lines of code). At $40.5/month savings, this 15-minute investment pays back more than any reasonable hourly rate in the first month alone.
Prompt Caching is close to purely beneficial: setup is simple (a few lines of code), output quality is unaffected (what's cached is the input computation state, not the output), and usage is transparent (Claude behaves identically, just costs less).
The main trade-off is the 5-minute TTL. For very low-frequency API calls (more than 5 minutes between calls), the cache hit rate may be low and actual savings limited. Because cache writes are billed at a premium over the normal input rate, a prefix that is written but never read back costs slightly more than not caching at all.
A secondary consideration: the System Prompt must reach the 1,024-token minimum to be cached. If your System Prompt is short but works well, artificially inflating it to reach the threshold may not be a good idea. A longer System Prompt still incurs cache-write and cache-read costs and may not be more cost-effective than a short one.