Why your AI agent leaks money · 6 prompt-cache wins worth doing this week
Six prompt-cache patterns. Real before/after numbers. Most agents leave 60-80% on the table. Fix it this week.
Three months. Six different agent codebases, all opened with the same brief: 'cut the LLM bill'. Six times the same problem: the system prompt and the tool definitions get re-billed on every call, because caching is either off entirely or wired to the system prompt only.
These six patterns stack. Each one gives you 10-30%; stacked, they land at a 60-80% cost cut on a typical agent (~18k prompt tokens, 4k calls/day, Sonnet 4.5). Real shipped agent: monthly bill down from 980 EUR to 220 EUR, for two engineering days of work.
Cheapest win. In the Anthropic SDK, mark the last block of the system prompt with `cache_control: { type: "ephemeral" }`. The cache has a 5-minute TTL, and every subsequent call reuses the cached prefix. Typical save: 30-50% of total token cost when the system prompt is over 2k tokens.
A 12-tool agent typically carries 4-6k tokens of tool definitions, static across calls. Drop a cache breakpoint at the end of the tools block for another 10-15% off the bill.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  // Breakpoint 1: cache the static tool definitions (end of the tools block).
  tools: tools.map((t, i, arr) =>
    i === arr.length - 1 ? { ...t, cache_control: { type: "ephemeral" } } : t
  ),
  // Breakpoint 2: cache the static system prompt + knowledge base.
  system: [
    { type: "text", text: STATIC_INSTRUCTIONS },
    { type: "text", text: COMPANY_KB, cache_control: { type: "ephemeral" } },
  ],
  messages,
});
```

Anthropic's Files API lets you upload large static documents (a wiki, a product catalogue, a 60-page manual), get back a file_id, and reference the file by ID on every call. The platform caches the prefix automatically. On a 60-page PDF we measured daily context cost dropping from ~7 EUR to ~0.40 EUR.
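In practice this means uploading once (via the SDK's beta files namespace) and then building a document block that points at the returned ID. A minimal sketch; the `fileDocBlock` helper name is ours, and the exact upload call may differ by SDK version:

```typescript
// Upload once, out of band, e.g. (hypothetical path, beta API surface may vary):
//   const file = await anthropic.beta.files.upload({ file: ... });

// Build the document content block that references the uploaded file
// and marks it cacheable, so the large document prefix is reused.
function fileDocBlock(fileId: string) {
  return {
    type: "document" as const,
    source: { type: "file" as const, file_id: fileId },
    cache_control: { type: "ephemeral" as const },
  };
}

// Then send fileDocBlock(file.id) as the first content block of the user message
// on every call; only the file reference travels over the wire, not the document.
```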
Long chats (developer copilots, multi-turn support agents) re-bill the entire history on each turn. If your conversation has a stable prefix (system + few-shot + retrieved context), put the cache breakpoint at the end of the stable block, not on the last user turn. That's a 20-40% save on chat agents.
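A minimal sketch of that placement, assuming you can split the history into a stable prefix and the dynamic tail (the `markStablePrefix` helper and its types are ours, not SDK names):

```typescript
type Block = { type: "text"; text: string; cache_control?: { type: "ephemeral" } };
type Msg = { role: "user" | "assistant"; content: Block[] };

// Mark the final content block of the stable prefix as cacheable;
// everything after it (the moving conversation tail) stays unmarked.
function markStablePrefix(stable: Msg[], dynamic: Msg[]): Msg[] {
  if (stable.length === 0) return dynamic;
  const last = stable[stable.length - 1];
  const blocks = last.content.map((b, i, arr) =>
    i === arr.length - 1 ? { ...b, cache_control: { type: "ephemeral" as const } } : b
  );
  return [...stable.slice(0, -1), { ...last, content: blocks }, ...dynamic];
}
```

If the breakpoint sat on the last user turn instead, every new turn would change the prefix and every call would be a cache miss.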
Classification or extraction agents carry 3-5k tokens of few-shot examples, all static. Put them in their own block and mark it cacheable: another 10-20%. Watch the hash: ordering and whitespace count, and even one swapped example invalidates the cache.
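One cheap guard: fingerprint the serialized few-shot block and assert on it in CI, so accidental reordering or whitespace drift shows up before it silently kills your hit ratio. A sketch with an illustrative helper name:

```typescript
import { createHash } from "node:crypto";

// Fingerprint the few-shot block exactly as it is sent to the API.
// Order and whitespace both feed the hash, mirroring how the cache
// key treats the prompt prefix byte-for-byte.
function fewShotFingerprint(examples: { input: string; output: string }[]): string {
  const serialized = JSON.stringify(examples);
  return createHash("sha256").update(serialized).digest("hex");
}
```

Pin the expected hash in a test; if a teammate swaps two examples, the test fails instead of the cache.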
Cache TTL is ~5 min on Anthropic, ~10 min on OpenAI. If you serve 200 calls/day spread across 24 hours, your hit ratio dies. A 1-minute cron that sends a no-op user message against the stable prefix keeps the cache warm. Cost: under 0.10 EUR/day. Save: 30-50% of normal call token costs.
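A keep-warm ping can be as small as this sketch. The interval and the loosely-typed client parameter are our assumptions; the model and prompt blocks are the ones from the examples above:

```typescript
const KEEP_WARM_MS = 4 * 60 * 1000; // re-ping before the ~5 min TTL lapses

// `client` is the Anthropic SDK instance; typed structurally here
// so the sketch stays self-contained.
async function keepWarm(
  client: { messages: { create: (req: object) => Promise<unknown> } },
  system: object[],
  tools: object[]
) {
  await client.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1, // minimal completion: we only need the prefix re-cached
    system,
    tools,
    messages: [{ role: "user", content: "ping" }],
  });
}

// Run from a scheduler, e.g.:
// setInterval(() => keepWarm(anthropic, SYSTEM_BLOCKS, TOOLS).catch(console.error), KEEP_WARM_MS);
```

The ping pays cache-read prices on the prefix plus one output token, which is where the under-0.10-EUR/day figure comes from.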
Production support agent for a Hungarian client. 3800 calls/day, ~14k token average prompt, Sonnet 4.5.
Measure across 1000 production calls before and after. If you are not below 50% of the original bill, you skipped a pattern. The 60-80% number is real, not marketing.
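To measure, compute the effective input cost per call from the usage block the API returns. A sketch using Sonnet 4.5 list prices per million input tokens (input $3, cache write at 1.25x = $3.75, cache read at 0.1x = $0.30); check current pricing before relying on these numbers:

```typescript
type Usage = {
  input_tokens: number;                 // uncached input tokens
  cache_creation_input_tokens: number;  // tokens written to the cache (1.25x)
  cache_read_input_tokens: number;      // tokens served from the cache (0.1x)
};

// Effective input cost of one call in USD, from the response's usage block.
function inputCostUSD(u: Usage): number {
  return (
    (u.input_tokens * 3 +
      u.cache_creation_input_tokens * 3.75 +
      u.cache_read_input_tokens * 0.3) / 1_000_000
  );
}
```

Sum this over 1000 calls before and after wiring the breakpoints; the ratio of the two sums is your actual saving, independent of traffic mix.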

Founder, DField Solutions
I've shipped production products from fintech to creator-tooling · for startups and enterprises, from Budapest to San Francisco.
Let's talk about your project. 30 minutes, no strings.