Fine-tuning is rarely the right answer. We've fine-tuned three times in two years; few-shot or RAG was correct for everything else. Here are the decision criteria.
When teams want a model to "do something specific to our domain," the first instinct is often "let's fine-tune." After running LLM features in production for a couple of years, we've found fine-tuning to be the right answer maybe 10% of the time. Few-shot prompting or retrieval-augmented generation usually wins. This post is the decision tree we now use, with the actual experiments behind it.
Few-shot prompting: include 2-10 examples in the prompt itself. The model "learns" the pattern from the examples in context. No training. Works at runtime.
Retrieval-augmented generation (RAG): store relevant content in a knowledge base, retrieve at query time, include in prompt as context. The model uses the retrieved info. No training.
Fine-tuning: actually train the model (or a LoRA adapter on top of it) on your specific data. Training takes hours to days. Inference uses the customized model.
Each has different strengths. The wrong choice for the problem wastes weeks or months.
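To make the first two approaches concrete, here is a minimal sketch of how each one assembles a prompt. The function names, the prompt layout, and the word-overlap retriever are assumptions for illustration only; a production RAG system would more likely retrieve with embeddings or a search index.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Few-shot: put worked input/output pairs in the prompt, then the new input."""
    parts = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def retrieve(docs: List[str], query: str, k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank documents by words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(docs: List[str], query: str, k: int = 2) -> str:
    """RAG: fetch relevant content at query time and include it as context."""
    context = "\n".join(f"- {d}" for d in retrieve(docs, query, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Neither function touches model weights, which is why both approaches can be changed and redeployed in seconds.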
Our decision tree:
Does the task need information that's not in the base model?
Does the task have a stable, narrow output format that the base model doesn't follow well?
Does the task require domain-specific reasoning the base model lacks?
Is the volume high enough that prompt-token cost dominates?
Does the task work reliably with few-shot already?
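One way to read those questions is as a small function. The five questions come from the list above; the branch outcomes and their ordering are our assumptions, inferred from the rest of this post, and the `Task` fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Task:
    needs_external_info: bool     # information that's not in the base model
    strict_format_gap: bool       # stable, narrow format the base model misses
    lacks_domain_reasoning: bool  # domain-specific reasoning the base model lacks
    prompt_cost_dominates: bool   # volume high enough that prompt tokens dominate
    few_shot_reliable: bool       # few-shot already works reliably


def recommend(task: Task) -> str:
    """Return a starting recommendation. Branch outcomes are assumptions."""
    if task.needs_external_info:
        # Missing information is a retrieval problem before it is a training problem.
        return "rag"
    if task.few_shot_reliable and not task.prompt_cost_dominates:
        return "few-shot"
    if (
        task.strict_format_gap
        or task.lacks_domain_reasoning
        or task.prompt_cost_dominates
    ):
        # Only a candidate: measure the gap against few-shot before training.
        return "consider fine-tuning"
    return "few-shot"
```

Note that the function never returns a plain "fine-tune": even when the answers point that way, the result is a prompt to measure the gap first.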
The first time someone says "we should fine-tune," the answer is usually "have you tried RAG or few-shot first?"
Few-shot and RAG have advantages that compound:
Faster iteration. Changing a prompt is seconds. Changing a fine-tuned model is hours.
More controllable. When few-shot output is wrong, edit the prompt. When fine-tuned output is wrong, you may need to retrain (or prompt-engineer the fine-tuned model to compensate).
Easier to debug. "The model used this context to produce this output" is more inspectable than "the trained weights produced this output."
Cheaper. Inference cost is similar to base. No training cost. No GPU rental for fine-tuning.
Works on the latest models. When OpenAI/Anthropic ship a new model, your few-shot prompts work immediately. Fine-tunes have to be redone.
The trade-off: prompts are longer, each call is slower, and you can't bake in subtle behavior that's hard to express in examples.
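The longer-prompt cost can be put in rough numbers. The sketch below estimates how many calls it takes before the tokens spent on in-prompt examples add up to a one-time training cost. All figures in the usage lines are hypothetical, and the calculation deliberately ignores retraining, hosting, and engineer time, so it flatters fine-tuning.

```python
import math


def break_even_calls(
    training_cost_usd: float,
    extra_prompt_tokens: int,
    usd_per_million_input_tokens: float,
) -> int:
    """Calls needed before per-call example tokens cost as much as one training run."""
    if extra_prompt_tokens <= 0 or usd_per_million_input_tokens <= 0:
        raise ValueError("token count and price must be positive")
    # Cost of the extra tokens per call is tokens * price / 1e6; rearranged to
    # keep the intermediate values exact.
    return math.ceil(
        training_cost_usd
        * 1_000_000
        / (extra_prompt_tokens * usd_per_million_input_tokens)
    )


# Hypothetical numbers: a $500 training run, 1,500 tokens of examples per call,
# and $2 per million input tokens.
calls = break_even_calls(500.0, 1500, 2.0)
print(calls)  # 166667
```

Below that volume, the longer prompt is the cheaper option even before operational overhead is counted.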
Three times in two years:
Case 1: A specific format with strict requirements. A customer needed a model that always produced output in a specific structured format with hundreds of fields. Few-shot worked 95% of the time; the 5% failure rate was unacceptable for the use case. Fine-tuning got it to 99.5%. Worth it.
Case 2: A language model for a niche programming syntax. Our customer's product had a custom DSL. The base models knew nothing about it. Few-shot with examples got reasonable but not great output. Fine-tuning on a corpus of valid DSL programs improved quality dramatically and cut prompt size (no need for examples).
Case 3: Tone and voice for a customer-facing product. A product needed the model to consistently adopt a specific tone. Few-shot prompts helped but were inconsistent — the tone varied with the input. Fine-tuning on a curated dataset of "good" tone examples fixed it.
In each case, we had tried few-shot first, hit a quality ceiling that few-shot couldn't break, and concluded fine-tuning was needed.
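Hitting a ceiling only means something if the ceiling is measured. For a structured-output case like case 1, a compliance check can be as simple as the sketch below. It assumes the format is JSON with a set of required top-level fields; the field names in the usage lines are hypothetical, and a real check for hundreds of fields would likely validate types and nesting as well.

```python
import json
from typing import Iterable, Set


def is_valid(output: str, required_fields: Set[str]) -> bool:
    """True if the output parses as a JSON object containing every required field."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_fields.issubset(data)


def compliance_rate(outputs: Iterable[str], required_fields: Set[str]) -> float:
    """Fraction of model outputs that satisfy the format."""
    outputs = list(outputs)
    if not outputs:
        raise ValueError("no outputs to score")
    return sum(is_valid(o, required_fields) for o in outputs) / len(outputs)


# Hypothetical outputs: two valid, one missing a field, one not JSON at all.
sample = [
    '{"id": 1, "name": "a"}',
    '{"id": 2}',
    "not json",
    '{"id": 3, "name": "c", "extra": true}',
]
print(compliance_rate(sample, {"id", "name"}))  # 0.5
```

Running the same check on few-shot and fine-tuned outputs gives the two numbers to compare.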
Cases where we considered or attempted fine-tuning and reverted:
A categorization task (classify support tickets into 30+ categories). Few-shot with carefully chosen examples got 91% accuracy. We fine-tuned to push it higher; got 93%. The cost: training pipeline, monitoring drift, periodic retraining. The quality gain wasn't worth the operational overhead. Reverted to few-shot.
A summarization task with specific length requirements. Few-shot with a max_tokens limit worked. We attempted fine-tuning to "really nail the style"; the result wasn't measurably better than few-shot.
Domain-specific Q&A (we initially tried fine-tuning a model on our docs). RAG was strictly better — the model could see the actual docs at query time, not just patterns from training. We pivoted to RAG; quality was much higher.
The pattern: fine-tuning is a heavier hammer than the problem usually requires.
When fine-tuning is the right call:
Curate the data carefully. Fine-tune quality is data quality. 1,000 high-quality examples beat 10,000 noisy ones. Have a domain expert review the dataset.
Use LoRA, not full fine-tuning. For most cases, LoRA gives 95% of the benefit at a fraction of the cost. We've never needed full fine-tuning in production.
Start small. Fine-tune on 500-1000 examples first. See if it improves. Scale up only if the small experiment shows promise.
Hold out an eval set that's never seen during training. Compare fine-tuned vs base model on this set; this is your honest improvement metric.
Plan for retraining. Production data shifts; fine-tunes go stale. Have a pipeline that re-trains on new data periodically.
Monitor drift. A fine-tuned model that was great at launch can degrade quietly as inputs change. Alert on quality metrics.
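The hold-out and drift points above can be sketched in a few lines. This is an illustration under assumptions, not our pipeline: the hash-based split, the exact-match metric, and the 0.02 tolerance are all placeholder choices, and `predict` stands in for whatever calls your base or fine-tuned model.

```python
import hashlib
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (input text, expected output)


def is_held_out(text: str, eval_percent: int = 10) -> bool:
    """Deterministic split: hashing the input keeps each example on the same
    side of the train/eval boundary across runs."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < eval_percent


def split(examples: Sequence[Example]) -> Tuple[List[Example], List[Example]]:
    train = [e for e in examples if not is_held_out(e[0])]
    held_out = [e for e in examples if is_held_out(e[0])]
    return train, held_out


def accuracy(predict: Callable[[str], str], eval_set: Sequence[Example]) -> float:
    """Exact-match accuracy; run once per model on the same held-out set."""
    if not eval_set:
        raise ValueError("empty eval set")
    return sum(predict(x) == y for x, y in eval_set) / len(eval_set)


def drift_alert(
    recent_scores: Sequence[float], baseline: float, tolerance: float = 0.02
) -> bool:
    """True when the mean recent score falls more than `tolerance` below baseline."""
    if not recent_scores:
        return False
    return sum(recent_scores) / len(recent_scores) < baseline - tolerance
```

The improvement worth reporting is `accuracy(fine_tuned, held_out) - accuracy(base, held_out)`, computed on examples neither the training run nor the prompt author has seen.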
Sometimes the right answer is a combination:
For the customer-facing case (case 3 above), we ended up with a fine-tuned model + RAG over current product docs + a few in-context examples for very specific request shapes. Each piece contributed something the others couldn't.
For our actually-fine-tuned cases:
For comparison, a few-shot or RAG approach has zero training cost and minimal engineer time after the initial integration.
Try few-shot first. Spend a day on a careful few-shot prompt. Measure quality. If it's good enough, you're done.
Try RAG next if the task involves information not in the base model. Most "we should fine-tune on our docs" cases are actually RAG cases.
Quantify the gap. Fine-tuning is reasonable when you have a specific quality gap that few-shot/RAG can't close, and you can measure how big the gap is. "We fine-tune because we're a serious AI company" is not a reason.
Use LoRA, not full fine-tuning. Almost always.
Plan for retraining. Fine-tunes age. The pipeline to retrain is part of the cost.
Don't fine-tune for tasks that change quickly. Information that's fresh today is stale tomorrow. RAG handles freshness; fine-tuning doesn't.
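On the LoRA point, the arithmetic shows where the savings come from. LoRA freezes each adapted weight matrix and trains two low-rank factors in its place, so a d_out × d_in matrix contributes rank × (d_out + d_in) trainable parameters. The dimensions and rank below are hypothetical, and trainable-parameter count is only a proxy for training cost.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for the two LoRA factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer: a 4096 x 4096 projection adapted at rank 8.
full = full_params(4096, 4096)
lora = lora_params(4096, 4096, rank=8)
print(full, lora, f"{lora / full:.2%}")  # 16777216 65536 0.39%
```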
The fine-tuning marketing pitch is appealing — your custom model, your domain expertise, your competitive moat. The reality is that few-shot and RAG cover most cases at a fraction of the operational cost. Fine-tuning has its place, but a smaller place than the marketing suggests. Most teams over-fit (pun intended) and end up maintaining a fine-tune that doesn't outperform a well-engineered prompt.
Save fine-tuning for the cases where it's clearly needed. The rest of the time, simpler approaches win.