We cut LLM inference cost 47% over a quarter while improving p95 latency. Six changes, ranked by what each one actually delivered.

On this page

AI Inference Cost Optimization

Our LLM inference bill peaked at about $14k/month last summer. Same workload now runs us about $7,500/month with better p95 latency. Below is what each change contributed, in rough order of impact. None of these are clever; the cumulative effect is.

What we measure #

Three numbers, tracked weekly:

Cost per request (USD): total LLM spend divided by request count
Cost per output token: useful for comparing across providers
p95 latency: cost optimisations that hurt latency > 10% get reverted

Without these, "we should optimize cost" is a directionless conversation. The metrics dashboard has been the most-viewed dashboard in engineering for two quarters running.

Change 1: Right-size the model per route (~22% saved)#

We were calling gpt-4o for everything. After one week of categorising actual user queries, most fell into three buckets:

Simple classification ("is this question about billing or technical?") — could use gpt-4o-mini with no quality drop
Structured extraction ("pull the order ID from this message") — could use gpt-4o-mini with a schema
Open-ended generation ("draft a response to this customer issue") — needed gpt-4o

We routed each path explicitly. About 70% of total volume went to mini, 30% to the larger model. Quality (measured by our internal eval) was unchanged. Cost dropped roughly 22% from this alone.

The trap is to use one model for everything because it's simpler to reason about. Once you have an eval set, the question "does this route work on the cheaper model" is empirical, not philosophical.

Change 2: Semantic caching (~14% saved)#

A surprising fraction of queries are restatements of recent queries. Different wording, same intent. We added a semantic cache:

Embed the user's query
Look up nearest neighbours in a small Redis-backed cache from the last 24 hours
If the nearest neighbour is above a similarity threshold (cosine > 0.92) AND the LLM context (retrieved docs) is the same, return the cached response

Hit rate stabilised around 18%. Cost dropped accordingly. p95 latency dropped 35% on cache hits (cache lookup is ~3ms; LLM call is ~1.2s).

Tuning the similarity threshold matters. We started at 0.85 and got false positives (different intents collapsed onto the same cached answer). 0.92 is conservative enough to feel safe; 0.95 had too low a hit rate to be worth the infrastructure.

Change 3: Trim prompt fat (~5% saved)#

Our prompts had grown organically. We did a token audit on each prompt, expecting maybe 1-2k tokens of input. Reality was 4-6k for most production routes. Most of that was:

Boilerplate "you are a helpful assistant" preamble that the user never saw and which the model would have inferred anyway
Examples (few-shot) that had been added during testing and never re-evaluated
Instructions that contradicted each other (mostly unnoticed)

We trimmed prompts to the essential structure and re-ran eval to verify quality didn't drop. Some routes lost 800-1500 tokens of input. At our scale, that's real money.

We now run a quarterly prompt audit. Anything that's grown more than 20% from baseline gets reviewed.

Change 4: Streaming where useful (~minor cost, big UX win)#

Streaming responses doesn't directly reduce cost (you pay for the same tokens), but it dramatically improves perceived latency, and it lets us cancel mid-generation if the user navigates away. About 4% of streamed requests get cancelled before completion; we don't pay for the unsent tokens.

We didn't enable streaming everywhere — some routes consume the response programmatically and don't benefit. But for any user-facing chat or completion, streaming was free latency and small cost savings.

Change 5: Switch to provider with better $/token for the bulk path (~3% saved)#

For our highest-volume path (the classification route after Change 1 sent it to gpt-4o-mini), we benchmarked Anthropic's claude-haiku and a few open-weights options. claude-haiku was a touch cheaper at our volume with comparable quality.

We didn't migrate fully — vendor diversity has reliability value — but we route ~30% of the classification path to Anthropic and 70% to OpenAI. The split also acts as a hot failover: if either provider has an outage, we shift weight in real time.

Change 6: Batch where latency tolerates it (~3% saved)#

Some of our requests aren't user-facing. They're back-office classifications: "categorize this incoming email." Those don't need < 1s response. We batch them, hit the OpenAI batch API (50% off list price), accept the 24h SLA.

This required identifying which routes truly didn't need real-time response — about 15% of total request volume turned out to qualify. The hard part wasn't technical; it was getting product to confirm "yes, this can wait up to 24 hours."

What didn't help #

We tried these and gave up:

Fine-tuning on cheaper models to bring gpt-4o-mini quality up to gpt-4o for our hardest route. Quality matched on average but p99 was wildly variable. We needed predictability more than peak performance.
Aggressive max-tokens limits on output. Saved tokens but truncated responses occasionally. Customer support hated it. Reverted.
Compressing input via summarisation before the actual call. Added a second LLM call which negated most of the savings, plus introduced a new error mode (lossy summary).

What we still don't have right #

The 47% cost reduction has stabilised, but we're not done. Areas we know we're leaving money on the table:

Long contexts. A few specialised routes use 80k+ token contexts. They're expensive per call. We've been experimenting with retrieval to cut context length but haven't shipped it.
Reasoning models for our hardest paths. Anthropic and OpenAI both offer reasoning-focused models that are slower but sometimes give better-quality answers per dollar. Eval is mid-flight.
Per-customer cost tracking. We bill flat-rate; some power users cost us 10x as much as average. Useful information for product, but we haven't operationalised it.

What I'd tell a team starting #

Build cost-per-request and cost-per-token dashboards before you optimise. Without them, you're guessing.

Right-size the model first. The single biggest lever is "don't use the most expensive model for tasks where the cheaper one is identical." Categorise your actual production queries and route accordingly.

Cache aggressively but conservatively. Semantic cache hits are pure win when the threshold is right; with the threshold too loose, you serve subtly wrong answers and lose user trust faster than you save cost.

Eval your changes, every time. The temptation is to skip eval on "obviously safe" changes. The savings dashboard makes them feel safer than they are. Eval is what lets you ship cost optimisations without quality regressions sneaking through.

Best Practices: AI Inference Cost Optimization

AI Inference Cost Optimization

What we measure #

Change 1: Right-size the model per route (~22% saved)#

Change 2: Semantic caching (~14% saved)#

Change 3: Trim prompt fat (~5% saved)#

Change 4: Streaming where useful (~minor cost, big UX win)#

Change 5: Switch to provider with better $/token for the bulk path (~3% saved)#

Change 6: Batch where latency tolerates it (~3% saved)#

What didn't help #

What we still don't have right #

What I'd tell a team starting #

Stay Updated

Real-World RAG Incidents: Lessons from a Production Rollout

What We Learned Running Weekly Game Days on Our CI/CD Pipeline

More from AI

Token Budgeting for Long-Context Prompts: What to Cut First

Multi-Provider LLM Gateways: Routing, Fallback, and Cost Control

Streaming LLM Responses: SSE, Backpressure, and Cancellation

Token Budgeting for Long-Context Prompts: What to Cut First

Multi-Provider LLM Gateways: Routing, Fallback, and Cost Control

Streaming LLM Responses: SSE, Backpressure, and Cancellation

Choosing an Embedding Model: Dimensions, Cost, and MTEB Reality

AWS Graviton Migration: What Broke and What We Saved

Serverless Cold Starts: Measuring and Fixing Them on Lambda

About Kiril Urbonas

You might have missed

GitOps with Argo CD: Best Practices for 2025

Process Management and Monitoring in Linux

Linux Performance Tuning for Containers and Kubernetes Nodes