
You're running a chatbot with a 4,000-token system prompt. Every message your user sends costs you the full 4,000 tokens in input — even though that system prompt hasn't changed once. Multiply that by 10,000 daily conversations, and you're burning money on tokens the model has already processed dozens of times before.
That's exactly the problem prompt caching solves. And most developers don't encounter it until they've already spent a few hundred dollars they didn't need to.
Prompt caching isn't new, but the way Claude, Gemini, and OpenAI implement it is different enough that a real comparison actually matters. This article shows you how each provider handles caching, what you'll actually save (with real numbers), how to turn it on in your code, and the limitations that will catch you off guard if you skip them.
What Prompt Caching Actually Does
Prompt caching lets the AI model store and reuse the computed state of a repeated portion of your prompt — specifically, the key-value (KV) attention cache generated when processing those tokens — so it doesn't have to reprocess them from scratch on every request.
Think of it this way: the first time you send a 5,000-token system prompt, the model processes every token, builds internal representations, and generates a response. Without caching, the next request starts from zero — the model reprocesses all 5,000 tokens again. With caching, the model saves the intermediate state from that first pass and reuses it. The second request skips the heavy computation for those tokens and only processes what's new.
The practical result: you pay significantly less for cached tokens than for fresh input tokens, because the model isn't doing the same work twice.
This isn't a gimmick. It's the same principle behind CPU caching, database query caches, and CDN edge caching. The model is expensive to run; reusing computed state is just smart engineering.
Which AI Models Support Prompt Caching — and What It Costs
As of 2026, three major providers support prompt caching: Anthropic (Claude), Google (Gemini), and OpenAI. Each handles it differently, and the pricing structures are not equivalent.
Claude — Cache Write and Read Pricing
Anthropic's implementation is the most explicit. You opt in by adding `cache_control` markers to specific blocks in your prompt. Claude then distinguishes between two token types.
Cache write tokens are charged the first time a cacheable block is processed. These cost *more* than standard input tokens — typically 25% above base price — because the model has to do the work of computing and storing the KV cache.
Cache read tokens are charged on every subsequent request that hits that cached block. These cost significantly less — currently around 90% less than standard input tokens on most Claude models.
The math: pay a small premium once, then pay almost nothing for all future hits. For Claude Sonnet 4, standard input runs $3 per million tokens. Cache writes are roughly $3.75 per million, and cache reads drop to approximately $0.30 per million.
Cache TTL on Claude is 5 minutes. If no request touches the cached content within 5 minutes, the cache evicts and the next request pays cache write prices again. For high-traffic applications this rarely matters, but for low-volume or bursty workloads it's important to factor in. The minimum cacheable block size is 1,024 tokens for most Claude models, and 2,048 tokens for Claude Haiku.
Gemini — Implicit Context Caching
Google takes a different approach with Gemini's context caching. Rather than requiring explicit markers in your API call, Gemini caching works on an explicit API object model — you create a cached content object, get a cache ID, and pass that ID in subsequent requests.
The pricing model is storage-based rather than write/read-based. You pay to store the cached tokens per hour, plus a reduced input price when you reference the cache. Gemini 1.5 Pro charges a fraction of the normal input token cost for cache hits, with a minimum cache size of 32,768 tokens.
That 32K minimum is the critical difference from Claude. Gemini caching is purpose-built for *large context* use cases — long documents, codebases, full conversation histories. If your system prompt is a few thousand tokens, Gemini's caching won't help you. If it's a 100-page document or a 50,000-token codebase, it's extremely effective.
The cache duration is configurable (default 1 hour, extendable), which makes it more predictable for cost planning than Claude's 5-minute TTL.
OpenAI — Automatic Prompt Caching
OpenAI's approach is the most hands-off. Prompt caching on GPT-4o and newer models happens automatically — you don't add any special parameters or create cache objects. OpenAI detects repeated prefixes in your prompts and caches them server-side without any opt-in from you.
The discount is applied automatically to the portion of input tokens served from cache. Currently, cached input tokens on GPT-4o cost 50% less than standard input tokens. It's a smaller discount than Claude's cache reads, but it requires zero code changes — which is genuinely useful for teams that want savings without architectural changes.
The tradeoff: you have limited visibility into whether a given request actually hit the cache, and you can't force cache behavior or guarantee hits. It's entirely managed by OpenAI's infrastructure. The minimum prefix length is 1,024 tokens, scoped per API key.
Here's how the three providers compare:
| Provider | Opt-in Required | Min Tokens | Cache Hit Discount | TTL | Best For |
|---|---|---|---|---|---|
| Claude | Yes (cache_control) | 1,024–2,048 | ~90% off input | 5 minutes | Chatbots, agents, repeated system prompts |
| Gemini | Yes (cached content API) | 32,768 | ~75% off input | 1 hour (configurable) | Large docs, RAG, long context |
| OpenAI | No (automatic) | 1,024 | ~50% off input | Managed by OpenAI | Any use case with repeated prefixes |
When Prompt Caching Saves You Money (and When It Doesn't)
Prompt caching delivers real savings in a specific set of situations. Treating it like a universal cost switch will lead to disappointment.
Chatbots with fixed system prompts are the clearest win. If every conversation starts with the same 2,000–10,000 token system prompt defining the assistant's persona, rules, and context, that entire block becomes cacheable. Across thousands of conversations per day, the savings compound fast.
RAG pipelines benefit when the retrieved context is large and reused. If you're injecting the same product documentation or knowledge base into every query, caching that document block means you pay cache write prices once per TTL window, not once per query.
Multi-turn agents that pass the same tool definitions, instructions, or accumulated conversation history back repeatedly are strong candidates. In agentic frameworks where the system context grows with each step, you can cache the static portions and pay full price only for the dynamic parts.
Code assistants that prepend large codebases or file contents to every query may be the biggest beneficiaries. A 50,000-token codebase injected into every \"explain this function\" request becomes very expensive without caching.
On the other hand, if every request has a unique, never-repeated prompt — creative writing tasks, one-off queries, random user inputs without a shared prefix — there's no repeated content to cache and no savings to realize.
Caching also doesn't help when your prompts are below the minimum token threshold. A 500-token system prompt won't cache on any of the three providers. This catches more teams than expected — many prompts *feel* long but don't actually hit 1,024 tokens.
Low-traffic applications need to account for TTL carefully. If you have 10 users a day, Claude's 5-minute TTL may expire between most requests, meaning you pay cache write prices almost every time with very few cache reads to offset the premium.
Real Cost Calculations: Before and After Caching
Numbers make this concrete. Here's a realistic scenario for a mid-size customer support chatbot.
Setup: System prompt of 3,500 tokens (policies, persona, product knowledge), average user message of 50 tokens, average assistant response of 200 tokens, 5,000 daily conversations, and 4 turns per conversation on average.
Without caching (Claude Sonnet 4 at $3/M input tokens):
Each turn processes the full 3,500-token system prompt plus accumulated conversation history. By turn 4, you're processing roughly 3,500 + conversation history + new message — averaging around 4,050 input tokens per turn.
Total input tokens per day: 5,000 conversations × 4 turns × 4,050 tokens ≈ 81,000,000 tokens
Daily input cost: 81M × $3 / 1M = $243/day → roughly $7,300/month
With caching (cache reads at $0.30/M):
The 3,500-token system prompt is cached after the first request in each 5-minute window. For a 5,000-conversation/day chatbot with reasonable traffic overlap, assume an 80% cache hit rate on the system prompt.
- Cached system prompt tokens: 5,000 × 4 × 3,500 × 0.80 = 56M tokens at $0.30/M = $16.80
- Cache write tokens (20% miss rate): 56M × 0.20 = 11.2M at $3.75/M = $42
- Dynamic tokens (messages + history): 5,000 × 4 × 550 = 11M at $3/M = $33
Total daily cost: ~$92/day → roughly $2,750/month
That's a 62% cost reduction on a realistic mid-traffic chatbot. Push the system prompt larger, increase traffic volume, or improve cache hit rates and you can reach the 85–90% range that Anthropic's documentation describes.
Before you run your own numbers, use the
Caching Gotchas Nobody Warns You About
Most tutorials cover how to turn caching on. They skip the ways it quietly fails.
The 5-minute TTL trap on Claude is the most common surprise. If you're building a low-traffic application or a batch job that runs every 10–15 minutes, you'll almost never hit a warm cache. Worse, you'll pay cache write prices — which are *higher* than standard input — almost every request with rare reads to offset them. Run the math for your specific traffic pattern before assuming savings.
The prefix must be identical. This applies across all three providers. Any dynamic content in your cached block — a timestamp, user ID, version string, even an inconsistent trailing newline — drops your cache hit rate to zero. This is a surprisingly common bug in production systems.
Gemini's 32K minimum is a real architectural constraint. If you're migrating from Claude or OpenAI and expect Gemini to cache a 5,000-token system prompt, it won't. Plan your document injection strategy accordingly — Gemini caching rewards large-context architectures, not typical prompt engineering.
Cache writes cost more than standard input. Claude's cache write pricing is 25% above standard rates. At very low cache hit rates — under 20% — you might spend slightly more than without caching. This is especially relevant during active development when you're iterating on prompts and not accumulating enough hits to break even.
Tool definitions don't always cache cleanly. Depending on how you structure Claude API calls, function/tool definitions may or may not qualify as cacheable content depending on their position in the message structure. Test explicitly and always verify against the usage response — don't assume.
Multi-modal caching has its own rules. Images and documents can often be cached, but token counting for non-text content works differently across providers. If your application uses PDF or image inputs heavily, verify caching behavior for those content types separately from text.
Key Takeaways
- Prompt caching reuses computed attention state rather than reprocessing identical token blocks from scratch — the model stores the heavy lifting, you pay almost nothing to reuse it.
- Claude gives explicit control with the highest discount (~90%) but a 5-minute TTL. Gemini suits large documents (32K+ tokens) with configurable, longer TTLs. OpenAI caches automatically with a 50% discount and zero configuration.
- The biggest wins come from fixed system prompts, shared RAG context blocks, and repeated tool definitions in agentic pipelines.
- Cache write tokens cost *more* than standard input — low-traffic apps or frequently-modified prompts can end up spending more, not less.
- Structure prompts so the static portion always comes first and remains byte-for-byte identical across every request.
Questions People Actually Ask
Does prompt caching work across different users or just within one session?
Cache hits are scoped per API key, not per user session. A system prompt cached by one user's request can serve a completely different user's request from the same API key — which is exactly why it's so valuable for multi-tenant applications. In practice, each user effectively subsidizes the cache for every other user hitting the same endpoint.
What happens if my prompt changes slightly — does the whole cache invalidate?
Yes. Caching works on prefix matching, so any modification to the cached portion forces a full cache rewrite. Structure your prompts so the most stable, universal content appears earliest in the sequence. Dynamic personalization, user-specific context, and anything that varies per request belongs *after* the cached block.
Is prompt caching the same as semantic caching?
No — these are distinct techniques. Prompt caching stores the exact computed KV state of identical token sequences inside the model API. Semantic caching (offered by middleware tools like Langchain or custom Redis layers) stores full responses and serves them when a new query is semantically similar to a prior one. They can be layered together, but semantic caching happens at the application layer and doesn't interact with the model's internal attention mechanism.
Can I cache few-shot examples in my prompt?
Yes, and this is one of the highest-ROI applications of caching. If you have 20 few-shot examples totaling 3,000 tokens at the start of your prompt, those are prime caching candidates. On Claude, place the `cache_control` marker after your examples block. On OpenAI, put the examples in the system message and keep them perfectly static. The savings per request are small, but across high-volume workloads they add up quickly.
How do I actually verify caching is working?
Each provider exposes this differently in the response. Claude's response includes `cache_creation_input_tokens` and `cache_read_input_tokens` in the `usage` field — a non-zero `cache_read_input_tokens` confirms a hit. OpenAI exposes `prompt_tokens_details.cached_tokens` in the usage object. Gemini shows cache usage in the response metadata. Always log these fields when first deploying caching — don't assume it's working just because you added the right parameters.
Does prompt caching reduce response quality or speed?
No — and this surprises some developers. The cached KV state is mathematically equivalent to recomputing it fresh. You get the same response quality you'd get without caching. Response latency may actually improve on cache hits since the model skips significant computation for the cached tokens, though this effect varies by provider and isn't guaranteed.


