Fine-tuning is rarely the right answer. We've fine-tuned three times in two years; few-shot or RAG was correct for everything else. The decision criteria.

On this page

Fine-Tuning vs Few-Shot: When to Use Each

When teams want a model to "do something specific to our domain," the first instinct is often "let's fine-tune." After running real LLM features in production for a couple of years, fine-tuning has been the right answer maybe 10% of the time. Few-shot prompting or retrieval-augmented generation usually wins. This post is the decision tree we now use, with the actual experiments behind it.

The three approaches #

Few-shot prompting: include 2-10 examples in the prompt itself. The model "learns" the pattern from the examples in context. No training. Works at runtime.

Retrieval-augmented generation (RAG): store relevant content in a knowledge base, retrieve at query time, include in prompt as context. The model uses the retrieved info. No training.

Fine-tuning: actually train the model (or a LoRA adapter on top of it) on your specific data. Training takes hours to days. Inference uses the customized model.

Each has different strengths. The wrong choice for the problem wastes weeks or months.

Decision criteria, in priority order #

Our decision tree:

Does the task need information that's not in the base model?
- "Answer questions about our internal docs" → RAG (the docs aren't in the base model).
- "Always cite the latest data on X" → RAG (the data updates faster than fine-tuning cycles).
Does the task have a stable, narrow output format that the base model doesn't follow well?
- "Always respond with this specific JSON schema" → Few-shot first; fine-tune if few-shot doesn't reliably produce the format.
- "Match this house style of writing" → Few-shot first; fine-tune if subtle stylistic patterns matter.
Does the task require domain-specific reasoning the base model lacks?
- "Diagnose specific patterns in our medical-imaging data" → Probably fine-tune, possibly with RAG.
- "Translate between two niche programming dialects" → Fine-tune.
Is the volume high enough that prompt-token cost dominates?
- "Millions of calls/day with the same long preamble" → Fine-tuning lets you skip the few-shot examples in the prompt, saving tokens. Cost-driven.
Does the task work reliably with few-shot already?
- If yes → Use few-shot. Cheaper, simpler, more flexible.

The first time someone says "we should fine-tune" the answer is usually "have you tried RAG or few-shot first."

Why we default to few-shot or RAG #

Few-shot and RAG have advantages that compound:

Faster iteration. Changing a prompt is seconds. Changing a fine-tuned model is hours.

More controllable. When few-shot output is wrong, edit the prompt. When fine-tuned output is wrong, you may need to retrain (or prompt-engineer the fine-tuned model to compensate).

Easier to debug. "The model used this context to produce this output" is more inspectable than "the trained weights produced this output."

Cheaper. Inference cost is similar to base. No training cost. No GPU rental for fine-tuning.

Works on the latest models. When OpenAI/Anthropic ship a new model, your few-shot prompts work immediately. Fine-tunes have to be redone.

The cost: prompts are longer, inference is slower per call, and you can't bake in subtle behavior that's hard to express in examples.

When we've actually fine-tuned #

Three times in two years:

Case 1: A specific format with strict requirements. A customer needed a model that always produced output in a specific structured format with hundreds of fields. Few-shot worked 95% of the time; the 5% failure rate was unacceptable for the use case. Fine-tuning got it to 99.5%. Worth it.

Case 2: A language model for a niche programming syntax. Our customer's product had a custom DSL. The base models knew nothing about it. Few-shot with examples got reasonable but not great output. Fine-tuning on a corpus of valid DSL programs improved quality dramatically and cut prompt size (no need for examples).

Case 3: Tone and voice for a customer-facing product. A product needed the model to consistently adopt a specific tone. Few-shot prompts helped but were inconsistent — the tone varied with the input. Fine-tuning on a curated dataset of "good" tone examples fixed it.

In each case, we had tried few-shot first, hit a quality ceiling that few-shot couldn't break, and concluded fine-tuning was needed.

What we tried fine-tuning and abandoned #

Cases where we considered or attempted fine-tuning and reverted:

A categorization task (classify support tickets into 30+ categories). Few-shot with carefully chosen examples got 91% accuracy. We fine-tuned to push it higher; got 93%. The cost: training pipeline, monitoring drift, periodic retraining. The quality gain wasn't worth the operational overhead. Reverted to few-shot.

A summarization task with specific length requirements. Few-shot with max_tokens limit worked. Fine-tuning was attempted to "really nail the style"; the result wasn't measurably better than few-shot.

Domain-specific Q&A (we initially tried fine-tuning a model on our docs). RAG was strictly better — the model could see the actual docs at query time, not just patterns from training. We pivoted to RAG; quality was much higher.

The pattern: fine-tuning is a heavier hammer than the problem usually requires.

How to do fine-tuning well #

When fine-tuning is the right call:

Curate the data carefully. Fine-tune quality is data quality. 1,000 high-quality examples beat 10,000 noisy ones. Have a domain expert review the dataset.

Use LoRA, not full fine-tuning. For most cases, LoRA gives 95% of the benefit at a fraction of the cost. We've never needed full fine-tuning in production.

Start small. Fine-tune on 500-1000 examples first. See if it improves. Scale up only if the small experiment shows promise.

Hold out an eval set that's never seen during training. Compare fine-tuned vs base model on this set; this is your honest improvement metric.

Plan for retraining. Production data shifts; fine-tunes go stale. Have a pipeline that re-trains on new data periodically.

Monitor drift. A fine-tuned model that was great at launch can degrade quietly as inputs change. Alert on quality metrics.

Hybrid: fine-tuned model + RAG + few-shot #

Sometimes the right answer is a combination:

Fine-tune a base model for tone / format / style
Use RAG to inject current factual information
Use few-shot for the specific task structure

For the customer-facing case (case 3 above), we ended up with a fine-tuned model + RAG over current product docs + a few in-context examples for very specific request shapes. Each piece contributed something the others couldn't.

Cost reality #

For our actually-fine-tuned cases:

Training: ~$100-400 per training run (depending on dataset size and model)
Recurring training (every 3 months as data grows): same cost per run
Engineer time: ~1 week initial setup, ~1 day per re-training
Hosting: ~$50-200/month for fine-tuned model serving (OpenAI hosted) or self-hosted

For comparison, a few-shot or RAG approach has zero training cost and minimal engineer time after the initial integration.

What I'd tell a team thinking about fine-tuning #

Try few-shot first. Spend a day on a careful few-shot prompt. Measure quality. If it's good enough, you're done.

Try RAG next if the task involves information not in the base model. Most "we should fine-tune on our docs" cases are actually RAG cases.

Quantify the gap. Fine-tuning is reasonable when you have a specific quality gap that few-shot/RAG can't close, and you can measure how big the gap is. "We fine-tune because we're a serious AI company" is not a reason.

Use LoRA, not full fine-tuning. Almost always.

Plan for retraining. Fine-tunes age. The pipeline to retrain is part of the cost.

Don't fine-tune for tasks that change quickly. Information that's fresh today is stale tomorrow. RAG handles freshness; fine-tuning doesn't.

The fine-tuning marketing pitch is appealing — your custom model, your domain expertise, your competitive moat. The reality is that few-shot and RAG cover most cases at a fraction of the operational cost. Fine-tuning has its place, but a smaller place than the marketing suggests. Most teams over-fit (pun intended) and end up maintaining a fine-tune that doesn't outperform a well-engineered prompt.

Save fine-tuning for the cases where it's clearly needed. The rest of the time, simpler approaches win.

Fine-tuning vs Few-Shot Learning: When to Use Each Approach

Fine-Tuning vs Few-Shot: When to Use Each

The three approaches #

Decision criteria, in priority order #

Why we default to few-shot or RAG #

When we've actually fine-tuned #

What we tried fine-tuning and abandoned #

How to do fine-tuning well #

Hybrid: fine-tuned model + RAG + few-shot #

Cost reality #

What I'd tell a team thinking about fine-tuning #

Stay Updated

What We Learned Running Weekly Game Days on Our CI/CD Pipeline

A Pragmatic Multi-Region Strategy for Small Teams

More from AI

Token Budgeting for Long-Context Prompts: What to Cut First

Multi-Provider LLM Gateways: Routing, Fallback, and Cost Control

Streaming LLM Responses: SSE, Backpressure, and Cancellation

Token Budgeting for Long-Context Prompts: What to Cut First

Multi-Provider LLM Gateways: Routing, Fallback, and Cost Control

Streaming LLM Responses: SSE, Backpressure, and Cancellation

Choosing an Embedding Model: Dimensions, Cost, and MTEB Reality

Agent Memory: Short-Term, Long-Term, and When You Need Neither

Guardrails for Production LLMs: Input and Output Filtering That Holds

You might have missed

GitOps with Argo CD: Best Practices for 2025

Process Management and Monitoring in Linux

Linux Performance Tuning for Containers and Kubernetes Nodes

About Kiril Urbonas