We cut our monthly LLM bill from $11,200 to $2,300 with seven specific changes. The ones that worked, the ones that didn't, and what we'd do first.
A year ago we were spending around $11,200/month on LLM API calls (mostly OpenAI, some Anthropic). After working through it methodically, we're now at $2,300/month for roughly the same workload. This post is the seven changes that got us there, ranked by impact, with the "we tried this and it didn't help" notes too.
Workload breakdown when we started:
gpt-4 (the original), 5% on gpt-3.5-turboThe biggest spend was a customer-facing assistant that answered questions about our product. ~60% of total cost.
We were using gpt-4 for everything because someone benchmarked it once and it was best. Different tasks have different difficulty, and most don't need the most capable model.
We re-benchmarked each feature against several models:
| Feature | Old | New | Quality Δ | Cost Δ |
|---|---|---|---|---|
| Customer assistant (RAG) | gpt-4 | gpt-4o-mini | -2% | -94% |
| Email categorizer | gpt-4 | gpt-4o-mini | -1% | -94% |
| Doc summarizer | gpt-4 | gpt-4o | -3% | -85% |
| Agentic task runner | gpt-4 | gpt-4o (with fallback to gpt-4) | +1% | -78% |
| Internal search query rewriter | gpt-3.5 | gpt-4o-mini | +5% | similar |
The biggest savings came from realizing classification and RAG-with-good-context don't need GPT-4. They need consistent output, and gpt-4o-mini (or Claude Haiku) does that for a fraction of the cost.
The "agentic task runner" needed careful handling. Cheap models would sometimes get stuck; we built a fallback: if the cheap model returns "I'm not confident" or hits a retry limit, escalate to GPT-4 for that task. Most tasks (~85%) finish on the cheap model.
Estimated saving: ~$5,500/month. The biggest single change.
Anthropic and OpenAI both added prompt caching: repeated prefixes (system prompt, few-shot examples) are billed at lower rates (50-90% off the cached portion).
We restructured our prompts to put the long stable parts first:
[CACHED — stable system prompt + examples + tool definitions, ~3000 tokens]
---
[NOT CACHED — user query + retrieved context, ~1500 tokens]
Before caching, every call billed for all 4500 input tokens. After caching, ~3000 tokens are at the cached rate (10% of normal for Anthropic's cache hits).
For our customer assistant, this dropped per-call cost by ~50%.
Estimated saving: ~$1,800/month.
Our RAG pipeline was retrieving 10 chunks and sending all 10 to the model. We added a re-ranker that scores the 10 against the query; we now send only the top 4 to the LLM.
Less context = fewer input tokens = less cost. Quality stayed flat (or improved slightly — less noise for the model to filter through).
Average input tokens dropped from ~2,200 to ~1,100. ~50% reduction in input cost on RAG queries.
Estimated saving: ~$1,400/month.
We had no max_tokens set. Some responses were 1500 tokens. Most should be 200.
We set per-task max_tokens based on the task:
Two effects: capped output cost, and forced the model to be concise (the prompts were updated to say "respond in N words"). Quality didn't suffer; users like shorter responses.
Estimated saving: ~$400/month.
For classification tasks, we don't need the model to keep generating after it has produced the category. We:
This stops the model mid-generation, avoiding tokens we don't use. For tasks where the model would otherwise generate explanations after the category, savings are real.
This works because OpenAI/Anthropic bill on tokens generated, even those not delivered. Closing the stream stops the meter.
Estimated saving: ~$300/month.
For background tasks (re-summarizing old docs, generating internal search indexes), we use OpenAI's Batch API: half-price for processing within 24 hours.
Most "urgent" features stayed real-time. About 20% of our LLM volume moved to batch.
Estimated saving: ~$600/month.
For the customer assistant, ~12% of queries were near-duplicates of previously-asked questions. We added a query-similarity cache: if a new query is semantically very close to a recent answered query, return the cached answer.
Implementation:
Cache hit rate: ~12%. Each hit saves a full LLM call.
Estimated saving: ~$400/month.
A few changes that sounded good but didn't deliver:
Self-hosting open-source models. We benchmarked Llama-3-70B on H100 instances. The throughput was OK but the cost (GPU rental) ended up similar to OpenAI's gpt-4o-mini for our patterns. Plus operational overhead. Not worth it for our scale; might be different at 10x our volume.
Distillation: fine-tuning a smaller model on GPT-4 outputs. Spent two weeks on this for one specific task. The fine-tuned model was 70% as good. The remaining 30% gap mattered for our use case (it was a customer-facing classifier where wrong answers hurt). Reverted.
Aggressive prompt compression (using a smaller model to compress context before passing to the bigger model). The compression itself costs tokens and loses information. Marginal at best.
Switching providers based on per-call cost. Tried routing each request to whichever provider was cheapest at that moment. The gain was small (most providers price similarly), and the operational complexity of multi-provider routing wasn't justified.
Visibility was as important as the changes:
The dashboards live in Grafana, fed from Datadog APM (we wrap every LLM call with span attributes for tokens and cost).
The 500-token user prompt that cost $50. A user typed a request like "summarize this:" followed by 100k tokens of pasted text. Our token budget didn't catch it; the LLM call cost $50. We added per-call hard input-token caps; anything beyond gets truncated with an explanatory message.
The agent that looped. A bug in an agent caused it to repeat its own output back to itself, growing the context each iteration. After 80 iterations, one task had cost $200. Per-task cumulative-token caps now stop this.
The SaaS feature that 100x'd its volume overnight. A customer enabled a feature heavily, generating 100k LLM calls in a day. Our daily cost jumped 50x. We added per-customer rate limits to prevent runaway costs from individual customers.
The cheapest token is the one you don't send. Before optimizing model selection, check if you're sending unnecessary input.
Use the right model for each task. This is the biggest lever. Don't run GPT-4 on classification tasks.
Add observability first. You can't optimize what you can't see. Per-task cost dashboards make the wins obvious.
Set per-task token caps. Hard limits prevent surprise bills from edge cases.
Cache when possible. Both prompt caching (provider-side) and response caching (your-side) compound.
Don't chase exotic optimizations early. Self-hosting, fine-tuning, multi-provider routing — these are big projects with marginal payoff at small scale. Hit the easy wins first.
Most teams I've talked to pay 3-10x more for LLM inference than they need to. The optimizations aren't exotic. They're: pick the right model, send less context, set output limits, cache where you can, and watch the bill.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.
Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.
Explore more articles in this category
Embedding indexes degrade silently. The signals that catch drift, how often to re-embed, and the operational patterns we built after one quiet quality regression.
Streaming LLM responses is easy until the client disconnects, the model stalls, or the user cancels. The patterns that keep streaming responsive without leaking spend.
When LLMs can call tools that change real state, the design decisions that matter most are about what's gated, what's automatic, and what triggers a human checkpoint.
Evergreen posts worth revisiting.