A long, stable system prompt re-billed on every request is money on fire. How prompt caching works, where the cache boundary belongs, and the structuring discipline that got us a big cost and latency cut without changing behavior.

Prompt Caching for Production LLM Apps — Cutting Cost and Latency at the Token Layer

Our agent's system prompt was 4,800 tokens: tool definitions, formatting rules, a few examples, policy text. Stable across every request, and re-processed — and re-billed — on every single call. At our volume that was the largest line on the bill and a chunk of every request's latency, all spent re-reading text that never changed. Prompt caching fixed it, and the only real work was structuring the prompt so the cache could do its job.

What prompt caching actually does #

When you mark a prefix of your prompt as cacheable, the provider stores the model's internal representation (the processed KV state) of that prefix. On the next request with the identical prefix, it reuses that state instead of recomputing it. Cached input tokens are billed at a steep discount (often ~10% of the normal rate) and the compute for re-processing them is skipped, so both cost and time-to-first-token drop.

The critical mechanic: caching works on an exact-match prefix. The cache hits only for the contiguous span from the start of the prompt up to where it still matches byte-for-byte. The first token that differs ends the cacheable region.
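The prefix rule is easy to check by hand. The sketch below is only an illustration: it uses whitespace-split words as stand-in tokens (a real provider uses its own tokenizer), and `cacheable_prefix_len` is a hypothetical helper name, not part of any API.

```python
from typing import Sequence


def cacheable_prefix_len(previous: Sequence[str], current: Sequence[str]) -> int:
    """Return how many leading tokens two prompts share.

    Only this leading span can be served from cache; the first
    mismatch ends it, even if later tokens line up again.
    """
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        shared += 1
    return shared


# Whitespace "tokens" stand in for a real tokenizer here.
req_a = "You are a support agent . Follow policy . User: reset my password".split()
req_b = "You are a support agent . Follow policy . User: cancel my order".split()
req_c = "Date: 2025-01-02 You are a support agent . Follow policy . User: hi".split()
req_d = "Date: 2025-01-03 You are a support agent . Follow policy . User: hi".split()

stable_first = cacheable_prefix_len(req_a, req_b)   # shared instructions match
dynamic_first = cacheable_prefix_len(req_c, req_d)  # the date differs at token 2
```

With the stable text first, the shared span covers all of the instructions. With a varying value near the top, it covers one token, even though almost everything after it is identical.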

The one rule: stable content first, dynamic content last #

This is the whole discipline. Order your prompt so everything that's constant across requests comes first, and everything that varies comes last:

[ system instructions      ]  ← stable      ┐
[ tool / function definitions]  ← stable      │ cacheable prefix
[ few-shot examples        ]  ← stable      │
[ policy / formatting rules]  ← stable      ┘  ← cache breakpoint here
─────────────────────────────
[ retrieved context        ]  ← varies
[ conversation history     ]  ← varies
[ user's current message   ]  ← varies

Put one dynamic token near the top (a timestamp, a request ID, the user's name interpolated into the system prompt) and you've poisoned everything after it. The cache matches up to that token and no further. We had a Current date: ... line at the top of the system prompt; moving it below the cache breakpoint took our hit rate from near-zero to ~95%.
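One way to catch a poisoning token before it reaches production is to fingerprint the text above the breakpoint and confirm it does not change between requests. This is a minimal sketch under assumptions: `prefix_fingerprint` and `build_prompt` are hypothetical helpers, and the prompt strings are placeholders.

```python
import hashlib
from datetime import date


def prefix_fingerprint(stable_prefix: str) -> str:
    """Short hash of the text that sits above the cache breakpoint."""
    return hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()[:12]


def build_prompt(today: date, user_message: str, date_in_prefix: bool) -> tuple[str, str]:
    """Return (stable_prefix, dynamic_suffix) for one request."""
    instructions = "You are a support agent. Follow the policy below.\n"
    date_line = f"Current date: {today.isoformat()}\n"
    if date_in_prefix:
        # The poisoning layout: a changing value at the top of the prefix.
        return date_line + instructions, user_message
    # The fixed layout: the date travels with the dynamic content.
    return instructions, date_line + user_message


day1, day2 = date(2025, 1, 2), date(2025, 1, 3)

# One fingerprint per request; a stable prefix yields a single distinct value.
poisoned = {prefix_fingerprint(build_prompt(d, "hi", True)[0]) for d in (day1, day2)}
clean = {prefix_fingerprint(build_prompt(d, "hi", False)[0]) for d in (day1, day2)}
```

Logging this fingerprint alongside each request makes the failure visible: more than one distinct value for the same prompt version means something dynamic has leaked above the breakpoint.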

Setting the cache breakpoint #

With providers that use explicit cache markers, you tag the last stable block:

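# Placeholder strings; in a real app these are your full, stable prompt sections.
TOOL_DEFS = "...tool definitions...\n"
POLICY = "...policy and formatting rules...\n"
EXAMPLES = "...few-shot examples...\n"
dynamic_context = "...retrieved context for this request...\n"
user_message = "...the user's current message..."

# The exact request shape varies by provider; this shows the content-block style.
messages = [{
    "role": "system",
    "content": [
        {"type": "text", "text": TOOL_DEFS + POLICY + EXAMPLES,
         "cache_control": {"type": "ephemeral"}},   # ← cache up to here
    ],
}, {
    "role": "user",
    "content": dynamic_context + user_message,       # not cached, changes per request
}]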

Everything from the start of the prompt through the marked block is cached; everything after is processed fresh each call. You want the breakpoint as far down as possible while staying before the first thing that varies.

TTL and hit rate #

Caches expire — typically a short TTL (minutes) refreshed on each hit. The implication: caching pays off when requests sharing a prefix arrive close together in time. A high-traffic endpoint with a shared system prompt is the ideal case; the cache stays warm continuously. A low-traffic endpoint may see the cache expire between requests, so you pay the (small) cache-write premium without reaping the read discount.
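The effect of traffic density on a TTL-based cache can be sketched with a small simulation. The five-minute TTL and the arrival patterns below are assumptions chosen for illustration, and `simulate_cache` is a hypothetical helper; check your provider's documented TTL before drawing conclusions.

```python
def simulate_cache(arrival_times: list[float], ttl_seconds: float) -> tuple[int, int]:
    """Count (hits, writes) for requests that all share one prefix.

    Assumes the TTL is refreshed on every hit and restarted on every write.
    """
    hits = writes = 0
    expires_at = None  # no cache entry yet
    for t in sorted(arrival_times):
        if expires_at is not None and t < expires_at:
            hits += 1
        else:
            writes += 1
        expires_at = t + ttl_seconds  # refreshed by a hit, created by a write
    return hits, writes


TTL = 300.0  # assumed five-minute TTL, for illustration only

busy = [i * 10.0 for i in range(60)]     # one request every 10 seconds
sparse = [i * 600.0 for i in range(6)]   # one request every 10 minutes

busy_result = simulate_cache(busy, TTL)
sparse_result = simulate_cache(sparse, TTL)
```

The busy endpoint writes once and then reads on every later request. The sparse endpoint never gets a hit, because each request arrives after the previous entry has expired.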

Check the hit rate the provider reports (cache-read vs cache-write tokens). If you're mostly writing and rarely reading, either your prefix isn't actually stable (find the poisoning token) or your traffic is too sparse for caching to help.
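A rough way to turn those counters into a single number is sketched below. The `Usage` field names are hypothetical; map them to whatever your provider's usage object actually reports.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Per-request token counts. Field names are hypothetical."""
    cache_read_tokens: int
    cache_write_tokens: int
    uncached_input_tokens: int


def cache_read_ratio(usages: list[Usage]) -> float:
    """Fraction of cache-eligible tokens that were read rather than written."""
    read = sum(u.cache_read_tokens for u in usages)
    written = sum(u.cache_write_tokens for u in usages)
    eligible = read + written
    return read / eligible if eligible else 0.0


# One cold request that writes the prefix, then three that read it.
window = [
    Usage(cache_read_tokens=0, cache_write_tokens=4800, uncached_input_tokens=350),
    Usage(cache_read_tokens=4800, cache_write_tokens=0, uncached_input_tokens=410),
    Usage(cache_read_tokens=4800, cache_write_tokens=0, uncached_input_tokens=290),
    Usage(cache_read_tokens=4800, cache_write_tokens=0, uncached_input_tokens=505),
]
ratio = cache_read_ratio(window)
```

A ratio that stays low over a busy window points at an unstable prefix; a ratio that is low only during quiet periods points at TTL expiry.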

Conversation history: cache the growing prefix #

For multi-turn agents, the conversation history is append-only — each turn adds to the end, the beginning stays identical. That's a perfect caching shape. Place the cache breakpoint at the end of the prior turn so each new request re-uses the entire conversation so far as a cached prefix and only processes the new turn. Long agent loops benefit the most, because the cached prefix grows while the per-turn fresh tokens stay small.
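Moving the breakpoint forward each turn might look like the sketch below. It assumes the same content-block message shape and `cache_control` marker used in the breakpoint section; `with_history_breakpoint` is a hypothetical helper, and keeping only one marker in the history is a simplifying assumption, since marker limits and lookup behavior differ by provider.

```python
import copy

CACHE_MARK = {"type": "ephemeral"}  # same marker shape as in the breakpoint section


def with_history_breakpoint(history: list[dict], new_user_text: str) -> list[dict]:
    """Return messages with the cache marker on the last prior message.

    `history` is the append-only list of earlier turns, each shaped as
    {"role": ..., "content": [{"type": "text", "text": ...}]}.
    """
    messages = copy.deepcopy(history)  # leave the caller's history untouched
    for message in messages:
        for block in message["content"]:
            block.pop("cache_control", None)  # drop markers from older turns
    if messages:
        # Everything up to and including the prior turn becomes the cached prefix.
        messages[-1]["content"][-1]["cache_control"] = dict(CACHE_MARK)
    messages.append({"role": "user", "content": [{"type": "text", "text": new_user_text}]})
    return messages


history = [
    {"role": "user", "content": [{"type": "text", "text": "Where is my order?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "It shipped yesterday."}]},
]
request = with_history_breakpoint(history, "Can I change the address?")
```

Only the new user message sits below the marker, so the fresh tokens per turn stay small no matter how long the conversation grows.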

What it bought us #

Cost: on cache hits, the 4,800-token stable prefix dropped to ~10% of its previous input cost per request; on a system-prompt-heavy workload that was a large fraction of the total bill.
Latency: time-to-first-token improved noticeably because the model skips re-processing the cached prefix; the bigger the stable prefix, the bigger the win.
Behavior: zero change. Caching is a pure cost/latency optimization; the model sees the identical prompt, so its output is the same as it would be without caching.
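The cost line can be worked through as arithmetic. In the sketch below only the 4,800-token prefix and the ~10% cached rate come from the article; the price per million tokens and the 400 fresh tokens per request are made-up values, and cache-write premiums are ignored.

```python
def input_cost(prefix_tokens: int, fresh_tokens: int, price_per_mtok: float,
               cached_rate: float = 1.0) -> float:
    """Input cost of one request in currency units.

    cached_rate is the fraction of the normal price charged for the prefix:
    1.0 means no caching, 0.10 means a cache hit billed at ~10%.
    """
    prefix_cost = prefix_tokens * cached_rate * price_per_mtok / 1_000_000
    fresh_cost = fresh_tokens * price_per_mtok / 1_000_000
    return prefix_cost + fresh_cost


PRICE = 3.00    # hypothetical price per million input tokens
PREFIX = 4_800  # stable prefix size from the article
FRESH = 400     # hypothetical dynamic tokens per request

uncached = input_cost(PREFIX, FRESH, PRICE)
cached = input_cost(PREFIX, FRESH, PRICE, cached_rate=0.10)
saving = 1 - cached / uncached  # fraction of input cost removed on a cache hit
print(f"uncached={uncached:.6f} cached={cached:.6f} saving={saving:.1%}")
```

The saving depends on the ratio of stable to fresh tokens, not on the price: the larger the prefix relative to the dynamic part, the closer the input bill gets to the cached rate.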

The discipline, restated #

Prompt caching is almost free to adopt and almost free to ruin. Structure every prompt as [stable] then [dynamic], set the cache breakpoint at the boundary, keep dynamic tokens (dates, IDs, names) strictly below it, and watch the cache-read ratio. Done right, you stop paying full price to re-read the same instructions thousands of times a day for nothing.

About Kiril Urbonas
