We've shipped all three patterns — fine-tuning, RAG, and long-context prompting — to production over the past 12 months. They are not interchangeable. Each solves a different problem, costs differently, and fails differently. Here's the framework we use now to pick which one fits a given task, with the numbers to back it.
**Fine-tuning.** You take a base model and continue training it on your data. The result is a model that "knows" your data shape, tone, or task pattern as a learned behavior.
**RAG (retrieval-augmented generation).** You search a corpus at runtime and inject the relevant chunks into the prompt context.
**Long-context prompting.** You put the entire relevant content (or close to it) into a single large prompt.
Question: Does the corpus fit comfortably in the model's effective context?
├─ Yes (e.g., < 50k tokens, model handles 200k+):
│ └─ Use long-context unless cost is a constraint.
└─ No:
├─ Q: Is the task about retrieving facts/info from documents?
│ └─ Yes → Use RAG.
└─ Q: Is the task about producing a specific format/tone/behavior consistently?
└─ Yes → Fine-tune.
Special case: combine RAG + fine-tuning when you need both fact retrieval and consistent output format.
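The branching above can be encoded as a small helper. This is a sketch with illustrative thresholds (the "fits comfortably" factor of 4 comes from the < 50k-of-200k example in the tree; `Task` and its fields are our names, not anything from a real library):

```python
from dataclasses import dataclass

@dataclass
class Task:
    corpus_tokens: int         # total tokens in the corpus
    context_window: int        # model's usable context window
    needs_fact_retrieval: bool
    needs_consistent_format: bool

def pick_pattern(task: Task) -> str:
    """Encode the decision tree. Thresholds are illustrative, not tuned."""
    # Branch 1: corpus fits comfortably (well under the window, with headroom)
    if task.corpus_tokens * 4 <= task.context_window:
        return "long-context"
    # Branch 2: corpus too big -- decide by what the task is about
    if task.needs_fact_retrieval and task.needs_consistent_format:
        return "rag + fine-tune"
    if task.needs_fact_retrieval:
        return "rag"
    if task.needs_consistent_format:
        return "fine-tune"
    raise ValueError("no branch matches; revisit the task definition")
```

The point of writing it down is that the first question is about the corpus, not the task; only when the corpus is too large does the task's nature matter.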
The framework is simple but each branch has caveats. Below: real workloads where we picked each.
Task: classify incoming tickets into one of 24 internal categories.
We tried RAG first (retrieve similar historical tickets, pass them to the LLM, ask it to classify). It worked but was inconsistent: the model would occasionally invent new category names.
We fine-tuned gpt-4o-mini on 8,000 hand-labeled examples.
| Metric | RAG (gpt-4o-mini) | Fine-tuned (gpt-4o-mini) |
|---|---|---|
| Accuracy on test set | 91% | 96% |
| Output schema violations | 4.2% | 0.1% |
| p95 latency | 800ms | 240ms |
| Cost per request | $0.0021 | $0.0008 |
Fine-tuning won decisively. The format consistency alone justified it.
When to choose fine-tuning: the behavior (tone, structure, classification) matters more than retrieving facts.
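For a classification fine-tune like this, each labeled example becomes one line of chat-format JSONL (the format OpenAI's fine-tuning API accepts). A minimal sketch; the category names here are hypothetical stand-ins for three of the 24:

```python
import json

# Hypothetical subset of the 24 internal categories
CATEGORIES = ["billing", "login_issue", "feature_request"]

def to_training_record(ticket_text: str, label: str) -> str:
    """Serialize one hand-labeled ticket as a chat-format JSONL line."""
    assert label in CATEGORIES, f"unknown category: {label}"
    record = {
        "messages": [
            {"role": "system",
             "content": "Classify the ticket into exactly one category: "
                        + ", ".join(CATEGORIES)},
            {"role": "user", "content": ticket_text},
            {"role": "assistant", "content": label},
        ]
    }
    return json.dumps(record)

line = to_training_record("I can't sign in after the password reset.", "login_issue")
```

Because the assistant turn in every training example is a bare category name, the tuned model learns that the only acceptable output is a bare category name, which is where the 0.1% schema-violation rate comes from.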
Task: answer arbitrary questions about our internal docs (~280k pages).
Fine-tuning was a non-starter (corpus too large, updates too frequent). Long-context was a non-starter (280k pages × ~500 tokens = 140M tokens; doesn't fit anywhere reasonable).
We built RAG. The system is detailed in our earlier post; the key results:
| Metric | RAG (final) |
|---|---|
| Recall@8 | 95% |
| Hallucination rate | 5.7% |
| Cost per query | $0.0029 |
| Source citations | yes |
When to choose RAG: corpus too big to fit in context, freshness matters, source attribution is required.
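The runtime shape of the pattern is simple regardless of which vector store sits underneath. A minimal sketch, assuming embeddings are precomputed elsewhere and using plain cosine similarity over an in-memory list (a stand-in for the real index; all names are ours):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, index, k=8):
    """index: list of (chunk_text, embedding). Top-k chunks by similarity."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Inject retrieved chunks; instruct the model to cite its sources."""
    context = "\n---\n".join(chunks)
    return ("Answer using only the context below. Cite the source chunk.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

The k=8 default mirrors the Recall@8 metric above: the quality ceiling of the whole system is set by whether the right chunk is in those eight.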
Task: summarize a PR diff and the files it touches into a 200-word review.
We tried RAG: chunk the PR, retrieve relevant pieces, ask the LLM. It consistently missed cross-file reasoning because RAG retrieves piecewise; judging whether a change in one file breaks another file's contract requires seeing both files in full.
We tried fine-tuning. The base model could do the task; fine-tuning hurt because the "right" review depends on context that's never the same across PRs.
We finally tried just stuffing the full PR + relevant files into a 100k-token prompt and asking the LLM directly.
| Metric | RAG | Fine-tuned | Long-context |
|---|---|---|---|
| Reviewer "useful" rating | 64% | 51% | 89% |
| Catches cross-file issues | 38% | 22% | 71% |
| Cost per PR | $0.018 | $0.012 | $0.082 |
| p95 latency | 4s | 2s | 11s |
Long-context dominated on quality at much higher cost. We accepted that for this workload because it is low-volume: at ~$0.08 per PR, the absolute spend stays small.
When to choose long-context: small-to-medium corpora that fit cleanly, cross-document reasoning required, cost-per-request not the binding constraint.
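The assembly step is mostly bookkeeping, but one decision matters: fail loudly when the PR exceeds the budget rather than silently truncating, because truncation reintroduces the piecewise-view problem RAG had. A sketch under our own naming, with a crude chars/4 heuristic standing in for a real tokenizer:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose and code.
    return len(text) // 4

def build_review_prompt(diff: str, files: dict[str, str],
                        budget: int = 100_000) -> str:
    """Concatenate the full diff and the full text of every touched file."""
    parts = [f"PR diff:\n{diff}"]
    for path, content in files.items():
        parts.append(f"File: {path}\n{content}")
    parts.append("Write a 200-word review. Flag cross-file contract breaks.")
    prompt = "\n\n".join(parts)
    if approx_tokens(prompt) > budget:
        raise ValueError("PR too large for the context budget; split the review")
    return prompt
```

In production you would count tokens with the model's real tokenizer, but the structure is the point: everything the reviewer needs is in one context, in full.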
For one workload, a customer-facing assistant that answers product questions, we use both: retrieval supplies the facts, and a fine-tuned model enforces the tone and output format. The fine-tune doesn't store facts; the retrieval doesn't enforce style. Together they cover both axes.
For analytics summaries over a quarter's worth of metrics, we use retrieval to select the relevant metric series, then pass everything selected into a single long-context prompt. This avoided the cost of stuffing every metric into long-context, while still allowing the cross-metric reasoning that pure RAG would have missed.
For our workloads, normalized to $/1k operations:
| Pattern | $/1k operations | Note |
|---|---|---|
| RAG (gpt-4o-mini, ~3k ctx) | $2.90 | typical |
| Fine-tuned (gpt-4o-mini) | $0.80 | per-request only; training cost amortized |
| Long-context (gpt-4o, ~50k ctx) | $80–120 | ~28–41× the RAG cost |
| Combined RAG + FT | $2.20 | savings from smaller fine-tune model |
Fine-tuning's per-request cost is shockingly low. The hidden cost is training (~$300–800 per round in our experience) and the discipline of maintaining a labeled dataset.
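The amortization is worth making explicit. Using the table's numbers, a sketch of the break-even point at which a training round is repaid by the lower per-request cost:

```python
def fine_tune_breakeven(training_cost: float,
                        rag_per_1k: float = 2.90,
                        ft_per_1k: float = 0.80) -> float:
    """Requests needed before fine-tuning's training cost is repaid
    by its per-request savings over RAG (defaults from our table)."""
    savings_per_request = (rag_per_1k - ft_per_1k) / 1000
    return training_cost / savings_per_request

# At a $500 training round, break-even lands around 238k requests.
```

That is why the pattern suits high-volume, stable tasks like ticket classification: at low volume, or with a dataset that needs frequent relabeling, the training rounds never pay themselves back.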
Our first fine-tune attempt was a "company knowledge" model. We trained it on internal docs hoping it would memorize them. It didn't — it half-remembered, half-hallucinated. Fine-tuning is for behavior, not facts.
We built RAG for the PR review task. It missed cross-file issues constantly. We blamed retrieval quality, the cross-encoder, and the prompts for weeks. The actual problem: the task structurally needs everything in one context.
If the question is "given this whole thing, do X," RAG is the wrong primitive.
Our first cost analysis suggested long-context was insanely expensive. True per request — but for low-volume tasks, the absolute cost is fine. Cost analyses must include volume.
Our first fine-tune had no proper eval set. We thought it was working great. It wasn't — the few cases we'd manually checked weren't representative. Always build a held-out eval set before tuning.
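A held-out eval doesn't need to be elaborate to catch this; it needs to exist and to track the failure modes you care about. A minimal harness for the classification case, assuming `classify` is any callable from text to label (the valid-label set here is a hypothetical subset):

```python
def evaluate(classify, eval_set):
    """eval_set: list of (text, gold_label) pairs held out from training.
    Returns (accuracy, schema_violation_rate)."""
    valid = {"billing", "login_issue", "feature_request"}  # hypothetical
    correct = violations = 0
    for text, gold in eval_set:
        pred = classify(text)
        if pred not in valid:
            violations += 1      # invented category: a schema violation
        elif pred == gold:
            correct += 1
    n = len(eval_set)
    return correct / n, violations / n
```

Tracking schema violations separately from accuracy matters: our RAG baseline's 4.2% violation rate was invisible to an accuracy-only metric, because an invented category name is simply scored wrong rather than flagged as a different class of failure.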
Two trends are shifting these tradeoffs:
Cheaper long-context. As context costs drop, more workloads become viable for long-context. The break-even with RAG narrows monthly.
Better small fine-tunes. Open models (Llama, Qwen, etc.) fine-tune faster and cheaper. Expect more workloads to favor a small fine-tuned local model over an API call.
That said, the decision framework hasn't changed — only the boundaries between branches.
Before any of the above, one question matters most: what fails when this is wrong?
The cost of being wrong determines how much engineering you should spend. Many teams over-engineer for non-critical paths and under-engineer for critical ones. Match the investment to the consequence.
Pick the simplest pattern that handles your failure mode. Move to a more complex pattern only when the simpler one demonstrably fails.
We have five production workloads across these patterns: ticket classification (fine-tuned), internal docs Q&A (RAG), PR review (long-context), the customer-facing assistant (RAG + fine-tune), and analytics summaries (RAG + long-context). Only two of them combine patterns; the rest are best served by one of the three, picked deliberately. The biggest mistake we made was assuming there was a "best" approach. There isn't; there's the right approach for the task.