We've shipped all three patterns to production. They're not interchangeable. Here's the framework we now use to decide which approach fits a given task.

Fine-Tuning vs RAG vs Long-Context: A Decision Framework With Numbers

We've shipped all three patterns — fine-tuning, RAG, and long-context prompting — to production over the past 12 months. They are not interchangeable. Each solves a different problem, costs differently, and fails differently. Here's the framework we use now to pick which one fits a given task, with the numbers to back it.

What Each Pattern Actually Does #

Fine-Tuning #

You take a base model and continue training it on your data. The result is a model that "knows" your data shape, tone, or task pattern as a learned behavior.

Good at: consistent format/tone across responses, classification on a specific schema, behavior the base model resists or does inconsistently.
Bad at: holding facts. Fine-tuning is bad at fact storage. The model still hallucinates.

RAG (Retrieval-Augmented Generation)#

You search a corpus at runtime and inject relevant chunks into the prompt context.

Good at: answering questions about a specific document set; updating with new info instantly (just re-index); citing sources.
Bad at: large-scope reasoning across documents; tasks where the form of the response is non-obvious.

Long-Context Prompting #

You shove the entire relevant content (or close to it) into a single large prompt.

Good at: tasks where you don't know in advance what info is relevant; cross-document analysis; small-to-medium corpora.
Bad at: cost (every request pays for the full context); large corpora; needle-in-haystack at the extremes.

The Decision Framework #

code

Question: Does the corpus fit comfortably in the model's effective context?
├─ Yes (e.g., < 50k tokens, model handles 200k+):
│   └─ Use long-context unless cost is a constraint.
└─ No:
    ├─ Q: Is the task about retrieving facts/info from documents?
    │   └─ Yes → Use RAG.
    └─ Q: Is the task about producing a specific format/tone/behavior consistently?
        └─ Yes → Fine-tune.

Special case: combine RAG + fine-tuning when you need both fact retrieval and consistent output format.

The framework is simple but each branch has caveats. Below: real workloads where we picked each.

Workload 1: Customer Support Triage (Fine-Tuning)#

Task: classify incoming tickets into one of 24 internal categories.

Volume: ~12k tickets/day
Required: very consistent output format (structured), domain-specific vocabulary
Latency target: < 500ms

We tried RAG first (retrieve historical tickets, pass to LLM, ask it to classify). It worked but was inconsistent — model would invent new category names occasionally.

We fine-tuned gpt-4o-mini on 8,000 hand-labeled examples.

Metric	RAG (gpt-4o-mini)	Fine-tuned (gpt-4o-mini)
Accuracy on test set	91%	96%
Output schema violations	4.2%	0.1%
p95 latency	800ms	240ms
Cost per request	$0.0021	$0.0008

Fine-tuning won decisively. The format consistency alone justified it.

When to choose fine-tuning: the behavior (tone, structure, classification) matters more than retrieving facts.

Workload 2: Internal Knowledge Q&A (RAG)#

Task: answer arbitrary questions about our internal docs (~280k pages).

Volume: ~190k queries/day
Required: cite sources; never invent
Updates: docs change daily; hourly index refresh acceptable

Fine-tuning was a non-starter (corpus too large, updates too frequent). Long-context was a non-starter (280k pages × ~500 tokens = 140M tokens; doesn't fit anywhere reasonable).

We built RAG. The system is detailed in our earlier post; the key choices:

Hybrid search (dense + BM25 + cross-encoder rerank)
Strict prompt instructing "cite or refuse"
Confidence threshold for "I don't know" responses

Metric	RAG (final)
Recall@8	95%
Hallucination rate	5.7%
Cost per query	$0.0029
Source citations	yes

When to choose RAG: corpus too big to fit in context, freshness matters, source attribution is required.

Workload 3: Code Review Summary (Long-Context)#

Task: summarize a PR diff and the files it touches into a 200-word review.

Volume: ~600 PRs/day
Required: cross-file reasoning ("this change to A breaks the contract used in B")
Corpus: a single PR's worth of files, typically 5–50k tokens

We tried RAG: chunk the PR, retrieve relevant pieces, ask LLM. It missed the cross-file reasoning consistently because RAG retrieves piecewise; understanding "B's contract" required seeing both A and B fully.

We tried fine-tuning. The base model could do the task; fine-tuning hurt because the "right" review depends on context that's never the same across PRs.

We finally tried just stuffing the full PR + relevant files into a 100k-token prompt and asking the LLM directly.

Metric	RAG	Fine-tuned	Long-context
Reviewer "useful" rating	64%	51%	89%
Catches cross-file issues	38%	22%	71%
Cost per PR	$0.018	$0.012	$0.082
p95 latency	4s	2s	11s

Long-context dominated quality at higher cost. We accepted the cost for this workload because:

PRs are infrequent (600/day) compared to chat workloads (190k/day)
Quality matters more than latency for a review summary
Total bill (~$50/day) was well within budget for the use case

When to choose long-context: small-to-medium corpora that fit cleanly, cross-document reasoning required, cost-per-request not the binding constraint.

When We Combined Approaches #

RAG + Fine-Tuning #

For one workload — a customer-facing assistant that answers product questions — we use both:

RAG retrieves relevant docs (correctness)
Fine-tuning teaches consistent tone, refusal patterns, and our specific style (behavior)

The fine-tune doesn't store facts. The retrieval doesn't enforce style. Together they handle both axes.

Long-Context + RAG #

For analytics summaries over a quarter's worth of metrics:

RAG narrows from "all metrics" to "metrics relevant to the question"
Long-context handles the resulting 30–80k tokens of relevant data

This avoided the cost of stuffing everything in long-context, while still allowing cross-metric reasoning that pure RAG would have missed.

Cost Reality Check #

For our workloads, normalized to $/1k operations:

Pattern	$/1k operations	Note
RAG (gpt-4o-mini, ~3k ctx)	$2.90	typical
Fine-tuned (gpt-4o-mini)	$0.80	per-request only; training cost amortized
Long-context (gpt-4o, ~50k ctx)	$80–120	10–40× higher
Combined RAG + FT	$2.20	savings from smaller fine-tune model

Fine-tuning's per-request cost is shockingly low. The hidden cost is training (~$300–800 per round in our experience) and the discipline of maintaining a labeled dataset.

What We Got Wrong #

1. Fine-Tuned For Facts #

Our first fine-tune attempt was a "company knowledge" model. We trained it on internal docs hoping it would memorize them. It didn't — it half-remembered, half-hallucinated. Fine-tuning is for behavior, not facts.

2. Used RAG When We Needed Long-Context #

We built RAG for the PR review task. It missed cross-file issues constantly. We blamed retrieval quality, cross-encoder, prompts — for weeks. The actual problem: the task structurally needs everything in one context.

If the question is "given this whole thing, do X," RAG is the wrong primitive.

3. Compared Costs Naively #

Our first cost analysis suggested long-context was insanely expensive. True per request — but for low-volume tasks, the absolute cost is fine. Cost analyses must include volume.

4. Skipped Eval Sets #

Our first fine-tune had no proper eval set. We thought it was working great. It wasn't — the few cases we'd manually checked weren't representative. Always build a held-out eval set before tuning.

A Practical Checklist for Each Pattern #

Before Fine-Tuning, Confirm #

You have ≥ 1,000 high-quality labeled examples
The task is about behavior/format, not fact retrieval
You have an eval set (200+ items, held out)
You're willing to retrain when behavior needs to change
The base model gets the task "almost right" but inconsistent

Before Building RAG, Confirm #

The corpus is too large for long-context
You need source attribution (or fact freshness)
You're willing to invest in retrieval quality (not just dense embeddings)
You can build an eval set for retrieval (recall@k) and end-to-end quality

Before Going Long-Context, Confirm #

The relevant corpus per request fits comfortably (with margin)
Cost per request × volume is within budget
Latency budget tolerates 5–20s responses
The task benefits from cross-document reasoning

What's Coming #

Two trends are shifting these tradeoffs:

Cheaper long-context. As context costs drop, more workloads become viable for long-context. The break-even with RAG narrows monthly.
Better small fine-tunes. Open models (Llama, Qwen, etc.) fine-tune faster and cheaper. Expect more workloads to favor a small fine-tuned local model over an API call.

That said, the decision framework hasn't changed — only the boundaries between branches.

The Question To Ask First #

Before any of the above, one question matters most: what fails when this is wrong?

Wrong category in support triage → annoying ticket reroute
Wrong fact in customer Q&A → user trust damage
Wrong code review summary → human reviewer notices

The cost of being wrong determines how much engineering you should spend. Many teams over-engineer for non-critical paths and under-engineer for critical ones. Match the investment to the consequence.

Pick the simplest pattern that handles your failure mode. Move to a more complex pattern only when the simpler one demonstrably fails.

In Practice #

We have:

14 production fine-tunes (all classification or formatting tasks)
3 RAG systems (knowledge bases of varying sizes)
2 long-context workflows (review summaries, periodic analyses)

Out of those, only the customer-facing assistant uses two patterns combined. Most workloads are best served by one of the three, picked deliberately. The biggest mistake we made was assuming there was a "best" approach. There isn't — there's the right approach for the task.

Fine-Tuning vs RAG vs Long-Context: A Decision Framework With Numbers

Stay Updated

Database Connection Pooling at Scale: PgBouncer, RDS Proxy, Application Pool

Postgres Autovacuum — Tuning From Production Stalls

More from AI

MLOps — Model Registry vs MLflow Tracking, And When You Need Both

Agentic Ops — When (and When Not) to Use AI Agents for Incident Response

LLM Evals That Actually Predict Production Quality

About Kiril Urbonas

You might have missed

GitOps with Argo CD: Best Practices for 2025

Linux Performance Tuning for Containers and Kubernetes Nodes

AWS Lambda and Serverless Best Practices for Production