Six months running RAG in production taught us that the retrieval step matters far more than the model. Concrete techniques that moved the needle, with before/after numbers.
We run a customer-facing RAG system over ~280k internal documents. After six months in production, we did a quarter-long quality push and brought our answer-with-source-mismatch rate from 14.2% to 5.7% without changing the LLM. Every win came from the retrieval side.
Our baseline setup:

- Embeddings: text-embedding-3-small (1536 dimensions)
- Generation: gpt-4o-mini with a vanilla "answer using only the context below" prompt
- Evaluation: quality measured against 1,200 hand-labeled QA pairs
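Recall@k over labeled pairs reduces to a few lines. A minimal sketch (the `recall_at_k` helper is ours, not from a library), assuming each labeled pair records the ids of its relevant documents:

```python
def recall_at_k(results: list[list[str]], relevant: list[set[str]], k: int = 8) -> float:
    """Fraction of queries where at least one relevant doc appears in the top-k."""
    hits = sum(1 for res, rel in zip(results, relevant) if set(res[:k]) & rel)
    return hits / len(results)
```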
Fixed-token chunking cut sentences in half. We switched to semantic chunking using a recursive splitter that respects markdown headings, paragraph breaks, and code fences.
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter

headers = [("##", "h2"), ("###", "h3")]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
recursive = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=80,
    separators=["\n\n", "\n", ". ", "? ", "! ", " "],
)

def chunk(doc):
    # Split on headings first, then recursively within each section
    sections = md_splitter.split_text(doc.text)
    chunks = []
    for s in sections:
        for piece in recursive.split_text(s.page_content):
            chunks.append({
                "text": piece,
                # Heading metadata travels with every chunk
                "metadata": {**doc.metadata, **s.metadata},
            })
    return chunks
```
Recall@8 jumped from 76% → 84%. Simple, free.
Pure vector search misses queries with rare keywords (model numbers, product codes). We added BM25 in parallel and combined the two with reciprocal rank fusion:
```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge several ranked lists into one."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```
Then we re-rank the fused top-25 with a cross-encoder (bge-reranker-v2-m3) and take the top-8. Adds ~120ms but recall@8 went to 95%.
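The BM25 side needs no heavy machinery. For illustration, a minimal in-memory Okapi BM25 ranker with the usual `k1`/`b` defaults (a sketch with naive whitespace tokenization, not our production index):

```python
import math
from collections import Counter

def bm25_rank(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[int]:
    """Rank document indices by Okapi BM25 score for a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency per term
    df: Counter = Counter()
    for toks in tokenized:
        df.update(set(toks))

    def score(toks: list[str]) -> float:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        return s

    return sorted(range(n), key=lambda i: score(tokenized[i]), reverse=True)
```

This is exactly where exact-match terms like model numbers win: a query containing a rare product code pulls the document that mentions it to the top, even when the embedding similarity is mediocre.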
Users ask "what's the latency for X?" but our docs say "p99 response time." A small LLM call rewrites the user's query into 2-3 alternative phrasings:
```python
import json

REWRITE_PROMPT = """Generate 3 alternative phrasings of this query that an internal doc might use.
Output JSON: {{"queries": ["...", "...", "..."]}}.
Query: {q}"""  # braces doubled so str.format leaves the JSON example intact

alts = json.loads(llm_small(REWRITE_PROMPT.format(q=user_query)))["queries"]
all_results = [retrieve(q) for q in [user_query, *alts]]
fused = rrf(all_results)
```
Costs ~$0.0001 per query (about 10¢ per 1,000 queries). Worth it.
The biggest hallucination reduction came from a prompt change, not retrieval:
```text
You are answering using ONLY the numbered context snippets below.
For every factual claim, append a citation like [3] referring to the snippet.
If the answer is not in the context, reply exactly:
"I don't have that information."
Do not paraphrase claims that lack a citation.

Context:
[1] {chunk_1}
[2] {chunk_2}
...
```
Adding the "do not paraphrase claims that lack a citation" sentence dropped hallucinations by ~6 percentage points on its own.
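A side benefit of the bracketed-citation format is that answers become machine-checkable. A sketch of a post-hoc validator (the `invalid_citations` helper is hypothetical, assuming the `[n]` convention from the prompt above):

```python
import re

def invalid_citations(answer: str, num_snippets: int) -> set[int]:
    """Return citation indices in the answer that don't map to a provided snippet."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return {c for c in cited if not 1 <= c <= num_snippets}
```

Any non-empty result is a signal to regenerate or to flag the answer for review.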
If the top-1 re-ranker score is below 0.35, we don't even call the LLM — we return the "I don't have that information" answer directly. This caught most of the residual hallucinations on out-of-scope queries.
```python
top_score = reranker_scores[0]
if top_score < 0.35:
    # Retrieval isn't confident enough -- skip the LLM entirely
    return {
        "answer": "I don't have that information.",
        "confidence": "low",
        "sources": [],
    }
```
| Metric | Before | After |
|---|---|---|
| Recall @ 8 | 76% | 95% |
| Answer correct (LLM-graded) | 71% | 88% |
| Hallucination rate | 14.2% | 5.7% |
| p95 latency | 1.8s | 2.1s |
| Cost per query | $0.0021 | $0.0029 |
We also tried a larger embedding model (text-embedding-3-large): +1% recall at 2× the cost. Not worth it.