A long, stable system prompt re-billed on every request is money on fire. How prompt caching works, where the cache boundary belongs, and the structuring discipline that got us a big cost and latency cut without changing behavior.
Our agent's system prompt was 4,800 tokens: tool definitions, formatting rules, a few examples, policy text. Stable across every request, and re-processed — and re-billed — on every single call. At our volume that was the largest line on the bill and a chunk of every request's latency, all spent re-reading text that never changed. Prompt caching fixed it, and the only real work was structuring the prompt so the cache could do its job.
When you mark a prefix of your prompt as cacheable, the provider stores the model's internal representation (the processed KV state) of that prefix. On the next request with the identical prefix, it reuses that state instead of recomputing it. You're billed a steep discount for cached input tokens (often ~10% of the normal rate) and you skip the compute for re-processing them — so both cost and time-to-first-token drop.
The critical mechanic: caching works on an exact-match prefix. The cache hits only for the contiguous span from the start of the prompt up to where it still matches byte-for-byte. The first token that differs ends the cacheable region.
This is the whole discipline. Order your prompt so everything that's constant across requests comes first, and everything that varies comes last:
[ system instructions ] ← stable ┐
[ tool / function definitions] ← stable │ cacheable prefix
[ few-shot examples ] ← stable │
[ policy / formatting rules] ← stable ┘ ← cache breakpoint here
─────────────────────────────
[ retrieved context ] ← varies
[ conversation history ] ← varies
[ user's current message ] ← varies
Put one dynamic token near the top — a timestamp, a request ID, the user's name interpolated into the system prompt — and you've poisoned the entire prefix. The cache matches up to that token and no further. We had a Current date: ... line at the top of the system prompt; moving it below the cache breakpoint took our hit rate from near-zero to ~95%.
With providers that use explicit cache markers, you tag the last stable block:
messages = [{
"role": "system",
"content": [
{"type": "text", "text": TOOL_DEFS + POLICY + EXAMPLES,
"cache_control": {"type": "ephemeral"}}, # ← cache up to here
],
}, {
"role": "user",
"content": dynamic_context + user_message, # not cached, changes per request
}]
Everything from the start of the prompt through the marked block is cached; everything after is processed fresh each call. You want the breakpoint as far down as possible while staying before the first thing that varies.
Caches expire — typically a short TTL (minutes) refreshed on each hit. The implication: caching pays off when requests sharing a prefix arrive close together in time. A high-traffic endpoint with a shared system prompt is the ideal case; the cache stays warm continuously. A low-traffic endpoint may see the cache expire between requests, so you pay the (small) cache-write premium without reaping the read discount.
Check the hit rate the provider reports (cache-read vs cache-write tokens). If you're mostly writing and rarely reading, either your prefix isn't actually stable (find the poisoning token) or your traffic is too sparse for caching to help.
For multi-turn agents, the conversation history is append-only — each turn adds to the end, the beginning stays identical. That's a perfect caching shape. Place the cache breakpoint at the end of the prior turn so each new request re-uses the entire conversation so far as a cached prefix and only processes the new turn. Long agent loops benefit the most, because the cached prefix grows while the per-turn fresh tokens stay small.
Prompt caching is almost free to adopt and almost free to ruin. Structure every prompt as [stable] then [dynamic], set the cache breakpoint at the boundary, keep dynamic tokens (dates, IDs, names) strictly below it, and watch the cache-read ratio. Done right, you stop paying full price to re-read the same instructions thousands of times a day for nothing.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
Free memory is a lie and load average doesn't see memory stalls. How Pressure Stall Information gives you a direct, early signal of memory contention — and how we wired it into alerts and autoscaling.
Least privilege fails when it's a one-time audit that locks things down until something breaks, then gets reverted. The iterative, log-driven approach that tightens permissions safely — and the policies we stopped writing by hand.
Explore more articles in this category
You can't improve retrieval you don't measure. The offline eval harness that lets us change embeddings, chunking, and rerankers with confidence instead of vibes — with the metrics that actually predict production quality.
Parsing model output with a regex and a prayer doesn't survive contact with traffic. The validation layers that keep structured LLM output reliable — constrained decoding, schema validation, and the repair loop.
They solve different problems. RAG injects knowledge; fine-tuning changes behavior. The decision criteria, the hybrid pattern, and what we'd do over.
Evergreen posts worth revisiting.