DField SolutionsMérnöki stúdió · Budapest
Loading · Töltődik
Skip to content

Prompt caching

Related service AI solutions

DEFINITION

Major model providers (Anthropic, OpenAI, Google) let you mark the rarely-changing front of the prompt (system prompt, document context, tool definitions) as cacheable on their side. A follow-up call within roughly 5 minutes that reuses the same prefix can cut input-token cost by up to 90 percent and roughly halve time-to-first-token. It pays off when the prefix is at least a few thousand tokens and many calls share it, for example a support bot, a RAG pipeline, or a code-review agent. It does not pay off when the prompt is unique per call (user-level personalisation injected into the middle of the prefix) or when context is only a few hundred tokens. Architect the prompt so the stable bulk is at the front and the volatile user turn at the back.

RELATED TERMS06
  • Context Engineering

    The successor to prompt engineering: deliberately curating what enters the model's context window - system prompt, retrieved docs, tools, memory. Goal is max accuracy on the fewest tokens. A model only knows what you put in front of it.

  • AI Gateway

    A proxy layer between your app and LLM providers (OpenAI, Anthropic): routing, retries, caching, rate-limits, key management, cost tracking and failover. One place to see your whole AI bill - and no lock-in to a single vendor.

  • Model Routing

    Send each request to the cheapest model that can handle it: a small model for easy queries, a frontier model for hard ones - often decided by a classifier. Cuts inference cost dramatically, frequently 5-10× on real traffic.

  • Graph RAG

    A RAG variant that retrieves over a knowledge graph (entities + relationships) instead of flat text chunks. Lets the model answer multi-hop questions ("how is X connected to Y?") that pure vector search misses.

  • Agent Memory

    How an AI agent persists state across turns and sessions: short-term (the context window), long-term (a vector store / DB of facts), and episodic. The difference between an agent that forgets and one that learns your business.

  • Synthetic Data

    Model-generated training and eval data for when real data is scarce, sensitive (GDPR), or imbalanced. Useful, but you must check quality and diversity - otherwise you bake the model's own blind spots into your system.

MENTIONED IN THE BLOG08