We've now shipped four production RAG applications: a customer support assistant, an internal knowledge search, a product documentation Q&A, and an analytics-summary tool. Each one taught us something different. This is the end-to-end pattern we'd use for the next one — including the parts we wish we'd known on the first.
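The flow is always the same:
documents → chunk → embed → store → retrieve (dense + BM25) → re-rank → prompt → LLM answer, or a refusal when confidence is low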
The interesting questions are at each arrow.
Default chunking strategies (split every N tokens) work for prose. They fail for technical documentation where structure matters: code blocks get split mid-function, tables lose their headers, headings end up in different chunks than their content.
We landed on a recursive splitter that respects markdown structure:
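A minimal sketch of the idea, not the production splitter. It assumes ATX-style (`#`) headings and triple-backtick fences, and it measures size in characters rather than tokens to stay dependency-free. The names `split_sections`, `split_blocks`, `split_markdown`, and `max_chars` are illustrative. It splits on headings first, never inside a fenced code block, and only falls back to paragraph boundaries when a section is too large.

```python
import re

FENCE = "`" * 3  # markdown code-fence marker
HEADING = re.compile(r"^#{1,6}\s")

def split_sections(text):
    """Split markdown at headings, ignoring '#' lines inside code fences."""
    sections, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if HEADING.match(line) and not in_fence and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

def split_blocks(section):
    """Split a section on blank lines, keeping each fenced code block whole."""
    blocks, current, in_fence = [], [], False
    for line in section.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if not line.strip() and not in_fence:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks

def split_markdown(text, max_chars=2000):
    """Heading-first split; oversized sections are re-split on block boundaries."""
    chunks = []
    for section in split_sections(text):
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        heading = section.splitlines()[0] if HEADING.match(section) else ""
        buf = ""
        for block in split_blocks(section):
            candidate = f"{buf}\n\n{block}" if buf else block
            if len(candidate) > max_chars and buf and buf != heading:
                chunks.append(buf)
                # Repeat the heading so the next chunk keeps its context
                candidate = f"{heading}\n\n{block}" if heading else block
            buf = candidate
        if buf:
            chunks.append(buf)
    return chunks
```

A single block larger than `max_chars` (a long code listing, say) is kept whole rather than cut, which is the trade-off this sketch makes in favor of structure.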
For our documentation corpus, this single change improved retrieval recall@5 by ~12 percentage points compared to fixed-token chunking.
Chunk size is also a knob. Smaller chunks give more precise retrieval but less context per chunk; larger chunks give more context but less precision. We use 512-768 tokens with a 128-token overlap between consecutive chunks.
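To make the size and overlap numbers concrete, here is a minimal sliding-window sketch over an already-tokenized document. The name `window_tokens` is hypothetical, and tokens are plain list items here; a real pipeline would use the embedding model's tokenizer.

```python
def window_tokens(tokens, size=512, overlap=128):
    """Return consecutive windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # with the defaults, 512 - 128 = 384
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
        start += step
    return windows
```

With the defaults, each window repeats the last 128 tokens of the previous one, so a sentence that straddles a boundary appears intact in at least one chunk.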
We tried text-embedding-3-small (OpenAI), text-embedding-3-large, and a few open-weights options. Recall differences were measurable:
| Model | Recall@5 | Cost per 1M tokens |
|---|---|---|
| text-embedding-3-small (1536d) | 84% | $0.02 |
| text-embedding-3-large (3072d) | 91% | $0.13 |
| bge-large-en-v1.5 (self-hosted) | 88% | ~$0.04 (compute) |
For us, text-embedding-3-large justified its cost. The +7pp recall translated to noticeably better answers in eval. For higher-volume or less quality-sensitive applications, text-embedding-3-small is fine.
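One common way to compute recall@k for comparisons like this is sketched below. It assumes a hit-rate style definition (a query counts as a hit when at least one labeled relevant chunk appears in the top k); the data shapes and the name `recall_at_k` are assumptions for illustration.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of queries with at least one relevant ID in the top-k results.

    retrieved: {query_id: [doc_id, ...]} ranked best first
    relevant:  {query_id: {doc_id, ...}} hand-labeled relevant IDs
    """
    if not relevant:
        return 0.0
    hits = sum(
        1 for qid, gold in relevant.items()
        if gold & set(retrieved.get(qid, [])[:k])
    )
    return hits / len(relevant)
```

Running the same labeled queries through each embedding model and comparing this number keeps the comparison independent of the LLM.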
We've used pgvector, Pinecone, and Qdrant. For most teams, pgvector is the right answer.
Pinecone is tempting for its simplicity but the cost adds up at our volume. Qdrant has the best raw performance but the operational overhead is real.
Pure dense retrieval misses queries with rare keywords (specific product codes, version numbers, error messages). We run dense and BM25 in parallel and combine with reciprocal rank fusion:
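    def rrf(rankings, k=60):
        """Fuse ranked lists of doc IDs with reciprocal rank fusion."""
        scores = {}
        for ranking in rankings:
            for i, doc_id in enumerate(ranking):
                # i is 0-based, so the 1-based rank is i + 1
                scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + i + 1)
        # Highest fused score first
        return sorted(scores, key=scores.get, reverse=True)

    # vector_search and bm25_search are assumed to return doc IDs, best first
    dense_results = vector_search(query)
    bm25_results = bm25_search(query)
    fused = rrf([dense_results, bm25_results])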
Hybrid retrieval improved our recall@5 by another 4-6 percentage points across all four applications.
After hybrid retrieval, we have ~25 candidate chunks. Sending all 25 to the LLM is wasteful (cost) and noisy (quality). Re-ranking picks the best ones.
We use a small cross-encoder (bge-reranker-v2-m3) to score each candidate against the query. It's slower than retrieval (~80ms for 25 candidates) but consistently improves answer quality.
Re-ranking is the lever most often left on the table. Many RAG systems retrieve 25 chunks, send the first 8-10 to the LLM, and never re-rank. Adding re-ranking lifted our recall@5 by a further 8-10 percentage points.
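The re-ranking step itself is small. The sketch below keeps the scorer pluggable: `score_fn` is a hypothetical callable that takes `(query, chunk_text)` and returns a relevance score, which in our setup would wrap the cross-encoder. The toy `overlap_score` is only a stand-in so the example runs without a model; it is not how a cross-encoder scores.

```python
def rerank(query, candidates, score_fn, top_n=8):
    """Score every candidate against the query and keep the top_n by score."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

def overlap_score(query, text):
    """Stand-in scorer: share of query words that also appear in the chunk."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0

candidates = [
    "Restart the agent with the reset flag",
    "Billing is invoiced monthly",
    "Error E42 means the agent lost its lease",
]
# Each result is a (score, chunk_text) pair, best first
top = rerank("what does error E42 mean", candidates, overlap_score, top_n=2)
```

Returning the scores alongside the chunks matters later: the top score is what a confidence gate would check before calling the LLM.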
The prompt we send to the LLM follows a fixed structure:
You are answering using ONLY the numbered context snippets below.
For every claim, append a citation like [3] referring to the snippet number.
If the answer is not in the context, reply exactly:
"I don't have that information in the documentation."
Do not paraphrase claims that lack a citation.
Context:
[1] {chunk 1}
[2] {chunk 2}
...
Question: {user query}
The "do not paraphrase claims that lack a citation" sentence is critical. Adding it dropped our hallucination rate by ~6 percentage points.
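Assembling that prompt is plain string formatting. A minimal sketch, where `build_prompt` and its arguments are illustrative names and the instruction text is copied from the template above:

```python
INSTRUCTIONS = (
    "You are answering using ONLY the numbered context snippets below.\n"
    "For every claim, append a citation like [3] referring to the snippet number.\n"
    "If the answer is not in the context, reply exactly:\n"
    "\"I don't have that information in the documentation.\"\n"
    "Do not paraphrase claims that lack a citation."
)

def build_prompt(chunks, question):
    """Number the chunks from 1 so citations like [3] map back to a chunk."""
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return f"{INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion: {question}"
```

Keeping the numbering in one function means the citation numbers in the answer can be mapped back to chunk IDs without guessing.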
If the top re-ranker score is below a threshold, we don't even call the LLM. We return:
    {
      "answer": "I don't have that information in the documentation.",
      "confidence": "low",
      "sources": []
    }
This catches out-of-scope queries before they become hallucinations. We set the threshold at 0.35 on the re-ranker's normalized relevance score, tuned empirically against our eval set.
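The gate can be sketched as follows. `answer_or_refuse` and `generate` are hypothetical names: `generate` stands for whatever calls the LLM, and `scored_chunks` is assumed to be a list of `(score, chunk)` pairs sorted best first, with scores on the same scale the threshold was tuned on.

```python
def fallback():
    """Build a fresh refusal payload for low-confidence queries."""
    return {
        "answer": "I don't have that information in the documentation.",
        "confidence": "low",
        "sources": [],
    }

def answer_or_refuse(query, scored_chunks, generate, threshold=0.35):
    """Skip the LLM call when the best re-ranker score is below the threshold."""
    if not scored_chunks or scored_chunks[0][0] < threshold:
        return fallback()
    chunks = [chunk for _, chunk in scored_chunks]
    return generate(query, chunks)
```

Because the check runs before generation, a refused query costs only retrieval and re-ranking.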
A few things we'd do differently from the start:
Treating chunking as solved. Default chunking is fine for blog posts, terrible for technical docs. Spend the time to get it right.
Skipping re-ranking. "It's another step, costs more latency." But it's the cheapest +5-10pp recall lift available.
Trusting one eval question type. Our first eval was 50 fact-lookup questions. We optimized for those. Production turned out to also have lots of "compare X and Y" and "summarize the docs about Z" — different shapes, different optimal retrieval strategies. Eval breadth matters.
Logging only retrieval, not the actual prompt. When debugging "why did the model give a vague answer," we needed to see exactly what was sent. Logging the chunks isn't enough — we needed the assembled prompt.
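A minimal sketch of logging the assembled prompt as one structured record per request, using only the standard library. The field names and the `log_prompt` helper are assumptions; adapt them to whatever log pipeline is already in place.

```python
import json
import logging
import time

logger = logging.getLogger("rag.prompts")

def log_prompt(request_id, query, chunk_ids, prompt):
    """Emit one JSON line containing the exact prompt text sent to the LLM."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "query": query,
        "chunk_ids": chunk_ids,
        "prompt": prompt,  # the full assembled prompt, not just the chunks
    }
    line = json.dumps(record, ensure_ascii=False)
    logger.info(line)
    return line
```

Storing the chunk IDs next to the prompt makes it possible to tell a retrieval problem from a prompt-assembly problem in a single lookup.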
Things people don't talk about but that matter in production:
Embedding regeneration when the model upgrades. When you swap text-embedding-3-small for large, you have to re-embed every document. For 280k documents, that's a 6-hour batch job. Plan for it.
Stale documents. When a doc is updated, you need to re-chunk and re-embed. Our pipeline runs nightly; for higher-frequency updates you want streaming.
Citation accuracy. The model sometimes hallucinates citation numbers (citing chunk [7] when the answer came from chunk [3]). We post-process responses to verify citations match the actual content. Mismatches get flagged.
Multi-language. Embedding models vary in their multilingual support. If your corpus has mixed languages, test specifically — text-embedding-3-small does better on multilingual than its predecessors, but bge-m3 is purpose-built for it.
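The citation check mentioned above can be sketched as a post-processing pass. This version rests on two assumptions for illustration: citations appear as `[n]` markers, and a cheap word-overlap test stands in for whatever content comparison is actually used. `check_citations` and `min_overlap` are hypothetical names.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")
WORD = re.compile(r"[a-z0-9]+")

def check_citations(answer, chunks, min_overlap=0.3):
    """Return (sentence, citation_number, reason) for each suspect citation."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cited = [int(n) for n in CITATION.findall(sentence)]
        # Compare the sentence text without its citation markers
        words = set(WORD.findall(CITATION.sub("", sentence).lower()))
        for n in cited:
            if not 1 <= n <= len(chunks):
                flagged.append((sentence, n, "no such snippet"))
                continue
            chunk_words = set(WORD.findall(chunks[n - 1].lower()))
            overlap = len(words & chunk_words) / len(words) if words else 0.0
            if overlap < min_overlap:
                flagged.append((sentence, n, "low overlap with snippet"))
    return flagged
```

Word overlap is a blunt instrument and will miss paraphrases; it is meant as a first filter that decides which responses deserve a closer look.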
For our customer support assistant (most-tuned of the four):
| Metric | Value |
|---|---|
| Documents | 280k |
| Recall @ 8 (after re-ranker) | 95% |
| Hallucination rate (LLM-graded) | 5.7% |
| p95 query latency | 2.1s |
| Cost per query | $0.0029 |
The hallucination rate dropping from ~14% (before re-ranker + confidence threshold + strict prompt) to 5.7% was the most impactful improvement.
Build an eval set first. 100-200 hand-labeled QA pairs. Without it, every change is a guess.
Get retrieval right before touching the model. The temptation is to "use a smarter LLM" when answers are bad. Usually the LLM is fine; retrieval is the bottleneck.
Always re-rank. The cheapest quality lift.
Make "I don't know" easy. Most production hallucinations come from the model trying to be helpful when it shouldn't.
Build the system to log the actual prompt sent. The first time something goes wrong, you'll be glad you have it.
The pattern above isn't novel. It's just the result of shipping four of these and learning what's worth doing well. Most of the failures we've seen at other teams come from skipping one of these steps. Each of them sounds optional. None of them are.