We cut LLM inference cost 47% over a quarter while improving p95 latency. Six changes, ranked by what each one actually delivered.
Our LLM inference bill peaked at about $14k/month last summer. Same workload now runs us about $7,500/month with better p95 latency. Below is what each change contributed, in rough order of impact. None of these are clever; the cumulative effect is.
Three numbers, tracked weekly:
Without these, "we should optimize cost" is a directionless conversation. The metrics dashboard has been the most-viewed dashboard in engineering for two quarters running.
We were calling gpt-4o for everything. After one week of categorising actual user queries, most fell into three buckets:
gpt-4o-mini with no quality dropgpt-4o-mini with a schemagpt-4oWe routed each path explicitly. About 70% of total volume went to mini, 30% to the larger model. Quality (measured by our internal eval) was unchanged. Cost dropped roughly 22% from this alone.
The trap is to use one model for everything because it's simpler to reason about. Once you have an eval set, the question "does this route work on the cheaper model" is empirical, not philosophical.
A surprising fraction of queries are restatements of recent queries. Different wording, same intent. We added a semantic cache:
Hit rate stabilised around 18%. Cost dropped accordingly. p95 latency dropped 35% on cache hits (cache lookup is ~3ms; LLM call is ~1.2s).
Tuning the similarity threshold matters. We started at 0.85 and got false positives (different intents collapsed onto the same cached answer). 0.92 is conservative enough to feel safe; 0.95 had too low a hit rate to be worth the infrastructure.
Our prompts had grown organically. We did a token audit on each prompt, expecting maybe 1-2k tokens of input. Reality was 4-6k for most production routes. Most of that was:
We trimmed prompts to the essential structure and re-ran eval to verify quality didn't drop. Some routes lost 800-1500 tokens of input. At our scale, that's real money.
We now run a quarterly prompt audit. Anything that's grown more than 20% from baseline gets reviewed.
Streaming responses doesn't directly reduce cost (you pay for the same tokens), but it dramatically improves perceived latency, and it lets us cancel mid-generation if the user navigates away. About 4% of streamed requests get cancelled before completion; we don't pay for the unsent tokens.
We didn't enable streaming everywhere — some routes consume the response programmatically and don't benefit. But for any user-facing chat or completion, streaming was free latency and small cost savings.
For our highest-volume path (the classification route after Change 1 sent it to gpt-4o-mini), we benchmarked Anthropic's claude-haiku and a few open-weights options. claude-haiku was a touch cheaper at our volume with comparable quality.
We didn't migrate fully — vendor diversity has reliability value — but we route ~30% of the classification path to Anthropic and 70% to OpenAI. The split also acts as a hot failover: if either provider has an outage, we shift weight in real time.
Some of our requests aren't user-facing. They're back-office classifications: "categorize this incoming email." Those don't need < 1s response. We batch them, hit the OpenAI batch API (50% off list price), accept the 24h SLA.
This required identifying which routes truly didn't need real-time response — about 15% of total request volume turned out to qualify. The hard part wasn't technical; it was getting product to confirm "yes, this can wait up to 24 hours."
We tried these and gave up:
gpt-4o-mini quality up to gpt-4o for our hardest route. Quality matched on average but p99 was wildly variable. We needed predictability more than peak performance.The 47% cost reduction has stabilised, but we're not done. Areas we know we're leaving money on the table:
Build cost-per-request and cost-per-token dashboards before you optimise. Without them, you're guessing.
Right-size the model first. The single biggest lever is "don't use the most expensive model for tasks where the cheaper one is identical." Categorise your actual production queries and route accordingly.
Cache aggressively but conservatively. Semantic cache hits are pure win when the threshold is right; with the threshold too loose, you serve subtly wrong answers and lose user trust faster than you save cost.
Eval your changes, every time. The temptation is to skip eval on "obviously safe" changes. The savings dashboard makes them feel safer than they are. Eval is what lets you ship cost optimisations without quality regressions sneaking through.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.
Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.
Explore more articles in this category
Embedding indexes degrade silently. The signals that catch drift, how often to re-embed, and the operational patterns we built after one quiet quality regression.
Streaming LLM responses is easy until the client disconnects, the model stalls, or the user cancels. The patterns that keep streaming responsive without leaking spend.
When LLMs can call tools that change real state, the design decisions that matter most are about what's gated, what's automatic, and what triggers a human checkpoint.
Evergreen posts worth revisiting.