
Anthropic added prompt caching in 2024. OpenAI followed. By 2026 it is a default on any serious LLM provider. Most teams still leave half the savings on the table because they only cache the obvious thing. Here are the four patterns that stack.

Pattern 1 · system prompt

The easiest win. Mark the system prompt as cacheable. Every subsequent call reuses the cached prefix. Typical savings: 30-50% of total token cost on chatty support agents.
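A minimal sketch of pattern 1, Anthropic-style: pass the system prompt as a content block carrying a `cache_control` breakpoint instead of a bare string. The model name and prompt text here are placeholders; the dict mirrors what you would hand to `client.messages.create(...)`.

```python
# Placeholder system prompt -- in practice this is the large, stable block
# you want every call to reuse.
SYSTEM_PROMPT = "You are a support agent for Acme. Follow the support playbook."

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model id
        "max_tokens": 1024,
        # A list of content blocks instead of a plain string, so the block
        # can carry the cache breakpoint.
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("How do I reset my password?")
```

Every call built this way shares the same system prefix, so the second and later calls within the TTL hit the cache.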

Pattern 2 · static RAG context

If your RAG retrieves from a relatively stable corpus, the top-5 chunks are the same for many similar queries. Cache those chunks as a prefix block. Typical savings: 20-30% on top of pattern 1.
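One way to sketch pattern 2: put the retrieved chunks first as their own content blocks, set the cache breakpoint on the last chunk, and append the per-query text after it so only the stable context lands in the cached prefix. The chunk and query strings are illustrative.

```python
def build_rag_request(chunks: list[str], query: str) -> dict:
    # Static-ish retrieved context goes first; the breakpoint on the last
    # chunk caches everything up to and including it.
    context_blocks = [{"type": "text", "text": c} for c in chunks]
    context_blocks[-1]["cache_control"] = {"type": "ephemeral"}
    # The per-query text comes after the breakpoint, outside the cache.
    query_block = {"type": "text", "text": f"\n\nQuestion: {query}"}
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model id
        "max_tokens": 1024,
        "messages": [
            {"role": "user", "content": context_blocks + [query_block]},
        ],
    }
```

This only pays off when similar queries actually retrieve the same top chunks in the same order; a changed chunk list is a new prefix and a cache miss.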

Pattern 3 · tool schemas

Tool definitions (function schemas) are large and static across calls. Mark them cacheable. Typical savings: 10-15% on agentic workloads with many tools.
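Pattern 3 needs only one breakpoint: on Anthropic-style APIs, a `cache_control` marker on the final tool definition caches the entire tools array before it. The tool names and schemas below are made up for illustration.

```python
TOOLS = [
    {
        "name": "get_order",
        "description": "Look up an order by its id.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "refund_order",
        "description": "Refund an order, fully or partially.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount": {"type": "number"},
            },
            "required": ["order_id"],
        },
    },
]
# Breakpoint on the final tool: all tool definitions become cached prefix.
TOOLS[-1]["cache_control"] = {"type": "ephemeral"}

def build_tool_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model id
        "max_tokens": 1024,
        "tools": TOOLS,
        "messages": [{"role": "user", "content": user_message}],
    }
```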

Pattern 4 · few-shot examples

If your prompt has few-shot examples (classification, extraction), they do not change per call. Cache. Typical savings: 10-20% on extraction-heavy pipelines.
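One way to cache few-shot examples is to ship them as a fixed user/assistant message prefix, with the breakpoint on the last example turn. The invoice examples below are invented; only the structure matters.

```python
# Invented few-shot pairs for an invoice-extraction task.
FEW_SHOT = [
    ("Invoice 0042 from Acme Corp, total $450.00, due 2026-03-01",
     '{"vendor": "Acme Corp", "amount": 450.0, "due": "2026-03-01"}'),
    ("Invoice 0043 from Globex, total $99.90, due 2026-04-15",
     '{"vendor": "Globex", "amount": 99.9, "due": "2026-04-15"}'),
]

def build_extraction_messages(document: str) -> list[dict]:
    msgs = []
    for doc, extraction in FEW_SHOT:
        msgs.append({"role": "user", "content": doc})
        msgs.append({"role": "assistant", "content": extraction})
    # Rewrite the final example turn as content blocks so it can carry the
    # cache breakpoint; everything up to here becomes the cached prefix.
    msgs[-1] = {
        "role": "assistant",
        "content": [{
            "type": "text",
            "text": FEW_SHOT[-1][1],
            "cache_control": {"type": "ephemeral"},
        }],
    }
    # The live document stays outside the cached prefix.
    msgs.append({"role": "user", "content": document})
    return msgs
```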

Two gotchas

  • Cache TTL is short: ~5 min on Anthropic, ~10 min on OpenAI. Low-traffic systems miss the cache constantly. If traffic is sparse or bursty, pre-warm with a background keep-alive request.
  • Pricing models differ · Anthropic bills cache writes at ~25% over the base input rate; OpenAI adds no write surcharge. Budget for it.

Measure token cost per 1,000 production queries before and after. If your bill is not 60%+ lower, you missed a pattern. Every one of our 2026 RAG deployments hits or exceeds that number.
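A back-of-envelope version of that measurement, under assumed Anthropic-style multipliers (cache writes bill at 1.25x the base input rate, cache reads at 0.1x). The price and token counts are illustrative, not a quote.

```python
def cost_per_1k_queries(prefix_toks: int, dynamic_toks: int,
                        hit_rate: float, price_per_mtok: float = 3.0) -> float:
    """Dollar cost of 1,000 queries with a cached prefix of prefix_toks."""
    n = 1_000
    hits = round(n * hit_rate)
    misses = n - hits
    billed = (
        misses * prefix_toks * 1.25   # cache writes on misses
        + hits * prefix_toks * 0.10   # cache reads on hits
        + n * dynamic_toks * 1.00     # uncached per-query tokens
    )
    return billed * price_per_mtok / 1e6

def baseline_per_1k_queries(prefix_toks: int, dynamic_toks: int,
                            price_per_mtok: float = 3.0) -> float:
    """Same 1,000 queries with no caching at all."""
    return 1_000 * (prefix_toks + dynamic_toks) * price_per_mtok / 1e6

# Example: 4k-token cached prefix, 500 dynamic tokens, 95% hit rate.
cached = cost_per_1k_queries(prefix_toks=4_000, dynamic_toks=500, hit_rate=0.95)
base = baseline_per_1k_queries(prefix_toks=4_000, dynamic_toks=500)
savings = 1 - cached / base  # ~0.75 with these numbers
```

With a large prefix and a high hit rate the savings clear 60% comfortably; a low hit rate can even go slightly negative because of the write surcharge, which is exactly what the keep-alive gotcha above is about.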

By

Dezso Mezo

Founder, DField Solutions

I've shipped production products from fintech to creator tooling · for startups and enterprises, from Budapest to San Francisco.
