Six months running RAG in production taught us that the retrieval step matters far more than the model. Concrete techniques that moved the needle, with before/after numbers.
We run a customer-facing RAG system over ~280k internal documents. After six months in production, we did a quarter-long quality push and brought our answer-with-source-mismatch rate from 14.2% to 5.7% without changing the LLM. Every win came from the retrieval side.
Our baseline setup:

- Embeddings: text-embedding-3-small (1536 dimensions)
- Generation: gpt-4o-mini with a vanilla "answer using only the context below" prompt
- Evaluation: quality measured against 1,200 hand-labeled QA pairs
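Recall@k over labeled pairs reduces to a few lines. A minimal sketch (the `recall_at_k` helper is ours, not from a library), assuming each labeled pair records the ids of its relevant documents:

```python
def recall_at_k(results: list[list[str]], relevant: list[set[str]], k: int = 8) -> float:
    """Fraction of queries where at least one relevant doc appears in the top-k."""
    hits = sum(1 for res, rel in zip(results, relevant) if set(res[:k]) & rel)
    return hits / len(results)
```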
Fixed-token chunking cut sentences in half. We switched to semantic chunking using a recursive splitter that respects markdown headings, paragraph breaks, and code fences.
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

headers = [("##", "h2"), ("###", "h3")]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
recursive = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=80,
    separators=["\n\n", "\n", ". ", "? ", "! ", " "],
)

def chunk(doc):
    # Split on headings first, then recursively within each section
    sections = md_splitter.split_text(doc.text)
    chunks = []
    for s in sections:
        for piece in recursive.split_text(s.page_content):
            chunks.append({
                "text": piece,
                # Heading metadata travels with every chunk
                "metadata": {**doc.metadata, **s.metadata},
            })
    return chunks
```
Recall@8 jumped from 76% → 84%. Simple, free.
Pure vector search misses queries with rare keywords (model numbers, product codes). We added BM25 in parallel and combined the two with reciprocal rank fusion:
```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge several ranked lists into one."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```
Then we re-rank the fused top-25 with a cross-encoder (bge-reranker-v2-m3) and take the top-8. Adds ~120ms but recall@8 went to 95%.
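The BM25 side needs no heavy machinery. For illustration, a minimal in-memory Okapi BM25 ranker with the usual `k1`/`b` defaults (a sketch with naive whitespace tokenization, not our production index):

```python
import math
from collections import Counter

def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Rank document indices by Okapi BM25 score for a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency per term
    df: Counter = Counter()
    for toks in tokenized:
        df.update(set(toks))

    def score(toks: list[str]) -> float:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        return s

    return sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
```

This is exactly where exact-match terms like model numbers win: a query containing a rare product code pulls the document that mentions it to the top, even when the embedding similarity is mediocre.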
Users ask "what's the latency for X?" but our docs say "p99 response time." A small LLM call rewrites the user's query into 2-3 alternative phrasings:
```python
import json

REWRITE_PROMPT = """Generate 3 alternative phrasings of this query that an internal doc might use.
Output JSON: {{"queries": ["...", "...", "..."]}}.
Query: {q}"""  # braces doubled so str.format leaves the JSON example intact

alts = json.loads(llm_small(REWRITE_PROMPT.format(q=user_query)))["queries"]
all_results = [retrieve(q) for q in [user_query, *alts]]
fused = rrf(all_results)
```
Costs ~$0.0001 per query (about 10¢ per 1,000 queries). Worth it.
The biggest hallucination reduction came from a prompt change, not retrieval:
```text
You are answering using ONLY the numbered context snippets below.
For every factual claim, append a citation like [3] referring to the snippet.
If the answer is not in the context, reply exactly:
"I don't have that information."
Do not paraphrase claims that lack a citation.

Context:
[1] {chunk_1}
[2] {chunk_2}
...
```
Adding the "do not paraphrase claims that lack a citation" sentence dropped hallucinations by ~6 percentage points on its own.
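A side benefit of the bracketed-citation format is that answers become machine-checkable. A sketch of a post-hoc validator (the `invalid_citations` helper is hypothetical, assuming the `[n]` convention from the prompt above):

```python
import re

def invalid_citations(answer: str, num_snippets: int) -> set[int]:
    """Return citation indices in the answer that don't map to a provided snippet."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return {c for c in cited if not 1 <= c <= num_snippets}
```

Any non-empty result is a signal to regenerate or to flag the answer for review.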
If the top-1 re-ranker score is below 0.35, we don't even call the LLM — we return the "I don't have that information" answer directly. This caught most of the residual hallucinations on out-of-scope queries.
```python
top_score = reranker_scores[0]
if top_score < 0.35:
    # Retrieval isn't confident enough -- skip the LLM entirely
    return {
        "answer": "I don't have that information.",
        "confidence": "low",
        "sources": [],
    }
```
| Metric | Before | After |
|---|---|---|
| Recall @ 8 | 76% | 95% |
| Answer correct (LLM-graded) | 71% | 88% |
| Hallucination rate | 14.2% | 5.7% |
| p95 latency | 1.8s | 2.1s |
| Cost per query | $0.0021 | $0.0029 |
We also tried a larger embedding model (text-embedding-3-large): +1% recall at 2× the cost. Not worth it.