Building RAG Applications: A Complete Guide to Retrieval-Augmented Generation

We've now shipped four production RAG applications: a customer support assistant, an internal knowledge search, a product documentation Q&A, and an analytics-summary tool. Each one taught us something different. This is the end-to-end pattern we'd use for the next one — including the parts we wish we'd known on the first.

The shape of every RAG system

The flow is always the same:

1. Ingest documents → chunk → embed → store
2. At query time: embed query → retrieve relevant chunks → optionally re-rank
3. Build prompt with retrieved chunks → call LLM → return response

The interesting questions are at each arrow.

Step 1: Chunking, the part most teams underestimate

Default chunking strategies (split every N tokens) work for prose. They fail for technical documentation where structure matters: code blocks get split mid-function, tables lose their headers, headings end up in different chunks than their content.

We landed on a recursive splitter that respects markdown structure:

- First, split by H2 headers (preserving heading + body together)
- Within sections > 800 tokens, split by paragraph
- Never split mid-code-block

For our documentation corpus, this single change improved retrieval recall@5 by ~12 percentage points compared to fixed-token chunking.
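A minimal sketch of the splitting rules above, not the production splitter. It assumes whitespace word counts as a stand-in for a real tokenizer, and it greedily packs paragraphs back together up to the limit so a heading stays with the text that follows it. The names `count_tokens`, `split_paragraphs`, and `chunk_markdown` are hypothetical.

```python
import re

FENCE = "`" * 3   # markdown code-fence marker
MAX_TOKENS = 800  # sections larger than this are split by paragraph


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count approximates the real token count.
    return len(text.split())


def split_paragraphs(section: str) -> list[str]:
    """Split on blank lines, but never inside a fenced code block."""
    paragraphs, current, in_code = [], [], False
    for line in section.splitlines():
        if line.lstrip().startswith(FENCE):
            in_code = not in_code
        if not line.strip() and not in_code:
            if current:
                paragraphs.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs


def chunk_markdown(doc: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    # Split just before each H2 line so the heading stays with its body.
    sections = [s.strip() for s in re.split(r"(?m)^(?=## )", doc) if s.strip()]
    chunks = []
    for section in sections:
        if count_tokens(section) <= max_tokens:
            chunks.append(section)
            continue
        # Oversized section: pack whole paragraphs up to the limit.
        current, size = [], 0
        for para in split_paragraphs(section):
            n = count_tokens(para)
            if current and size + n > max_tokens:
                chunks.append("\n\n".join(current))
                current, size = [], 0
            current.append(para)
            size += n
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```

A single paragraph or code block larger than the limit is kept whole here, and a line starting with `## ` inside a code block would be treated as a heading. A real splitter would need to handle both cases.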

The chunk size is also a knob. Smaller chunks = more precise retrieval, less context per chunk; larger chunks = more context, less precision. We use 512-768 tokens with a 128-token overlap between consecutive chunks.
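To make the overlap arithmetic concrete, here is a small sketch over a flat token list. It ignores document structure and only illustrates how size and overlap interact; `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(tokens: list[str], size: int = 512, overlap: int = 128) -> list[list[str]]:
    """Cut tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # each window starts this many tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks
```

With the defaults, each new window starts 384 tokens after the previous one, so roughly a quarter of every chunk repeats text from its neighbour. That redundancy is the storage and embedding cost of the overlap.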

Step 2: Embeddings, model choice matters more than people say

We tried OpenAI's text-embedding-3-small and text-embedding-3-large, plus a few open-weights options. Recall differences were measurable:

| Model | Recall@5 | Cost per 1M tokens |
| --- | --- | --- |
| text-embedding-3-small (1536d) | 84% | $0.02 |
| text-embedding-3-large (3072d) | 91% | $0.13 |
| bge-large-en-v1.5 (self-hosted) | 88% | ~$0.04 (compute) |

For us, text-embedding-3-large justified its cost. The +7pp recall translated to noticeably better answers in eval. For higher-volume or less quality-sensitive applications, text-embedding-3-small is fine.
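For reference, a recall@k comparison like this can be computed with a few lines. This is a sketch under one common definition (the fraction of labeled relevant chunks that appear in the top k); other definitions, such as hit rate, exist. The eval-set shape and the `search_fn` callback are assumptions.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(eval_set, search_fn, k: int = 5) -> float:
    """Average recall@k over (query, relevant_ids) pairs.

    search_fn is a hypothetical callback: query -> ranked list of chunk ids.
    """
    scores = [recall_at_k(search_fn(query), relevant, k) for query, relevant in eval_set]
    return sum(scores) / len(scores)
```

Running `mean_recall_at_k` once per embedding model over the same eval set gives numbers that are directly comparable.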

Step 3: Vector storage, keep it simple

We've used pgvector, Pinecone, and Qdrant. For most teams, pgvector is the right answer:

- It piggybacks on your existing Postgres infrastructure
- Hybrid queries (vector + SQL filters) are first-class
- 280k vectors at 1536 dimensions is comfortable on a single instance

Pinecone is tempting for its simplicity but the cost adds up at our volume. Qdrant has the best raw performance but the operational overhead is real.
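To illustrate the hybrid-query point for pgvector, here is a sketch that builds a filtered nearest-neighbour query. The `<=>` operator is pgvector's cosine distance; the table `chunks` and the columns `product`, `content`, and `embedding` are hypothetical, and the `%s` placeholders assume a psycopg-style driver.

```python
def build_hybrid_query(embedding: list[float], product: str, limit: int = 25):
    """Return (sql, params) for a vector search restricted by an ordinary SQL filter."""
    # pgvector accepts a text literal such as '[0.1,0.2,0.3]' cast to vector.
    vector_literal = "[" + ",".join(str(x) for x in embedding) + "]"
    sql = (
        "SELECT id, content "
        "FROM chunks "                          # hypothetical table name
        "WHERE product = %s "                   # plain SQL filter
        "ORDER BY embedding <=> %s::vector "    # cosine distance, nearest first
        "LIMIT %s"
    )
    return sql, (product, vector_literal, limit)
```

The pair would be passed to a cursor's `execute(sql, params)`. The filter and the vector ordering live in one statement, which is the appeal over a separate vector store.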

Step 4: Hybrid retrieval, dense + BM25

Pure dense retrieval misses queries with rare keywords (specific product codes, version numbers, error messages). We run dense and BM25 in parallel and combine with reciprocal rank fusion:

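```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of doc ids."""
    scores = {}
    for ranking in rankings:
        for i, doc_id in enumerate(ranking):
            # enumerate() is 0-based, so the rank is i + 1; k damps the top ranks.
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + i + 1)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# vector_search and bm25_search are assumed to return doc ids in rank order.
dense_results = vector_search(query)
bm25_results = bm25_search(query)
fused = rrf([dense_results, bm25_results])
```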

Hybrid retrieval improved our recall@5 by another 4-6 percentage points across all four applications.
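A small worked example of the fusion, with the two retrievers run concurrently. The two search functions here are hypothetical stubs returning fixed rankings, and the thread pool is just one way to run them in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def rrf(rankings, k=60):
    # Same fusion as above: each list contributes 1 / (k + rank) per document.
    scores = {}
    for ranking in rankings:
        for i, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + i + 1)
    return sorted(scores, key=scores.get, reverse=True)


def vector_search(query):
    # Hypothetical stub standing in for dense retrieval.
    return ["a", "b", "c"]


def bm25_search(query):
    # Hypothetical stub standing in for keyword retrieval.
    return ["c", "a", "d"]


def hybrid_search(query):
    # Run both retrievers concurrently, then fuse their rankings.
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense = pool.submit(vector_search, query)
        sparse = pool.submit(bm25_search, query)
        return rrf([dense.result(), sparse.result()])


print(hybrid_search("error E1234"))  # ['a', 'c', 'b', 'd']
```

Documents that appear in both lists ("a" and "c") rise above documents that rank well in only one, which is the behaviour that rescues rare-keyword queries.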

Step 5: Re-ranking, the most underused step

After hybrid retrieval, we have ~25 candidate chunks. Sending all 25 to the LLM is wasteful (cost) and noisy (quality). Re-ranking picks the best ones.

We use a small cross-encoder (bge-reranker-v2-m3) to score each candidate against the query. It's slower than retrieval (~80ms for 25 candidates) but consistently improves answer quality.

Re-ranking is the lever that's most often left on the table. Most RAG systems retrieve 25 chunks, send 8-10 to the LLM, and never rerank. Adding re-ranking lifted our recall@5 by a further 8-10 percentage points.
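The re-ranking step itself is small. In this sketch the scorer is pluggable: in practice `score_fn` would wrap a cross-encoder (for example through sentence-transformers' `CrossEncoder.predict`), while the toy `overlap_score` below only exists to make the example runnable. The `top_n=8` default is an assumption.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 8,  # assumption: how many chunks go on to the LLM
) -> list[tuple[float, str]]:
    """Score every candidate against the query and keep the best top_n."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, text: str) -> float:
    # Toy stand-in for a cross-encoder: share of query words found in the text.
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0
```

Keeping the scores alongside the text is deliberate: the top score is what a confidence threshold would later inspect.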

Step 6: Prompt structure that anchors the answer

The prompt we send to the LLM follows a fixed structure:

```text
You are answering using ONLY the numbered context snippets below.
For every claim, append a citation like [3] referring to the snippet number.
If the answer is not in the context, reply exactly:
  "I don't have that information in the documentation."
Do not paraphrase claims that lack a citation.

Context:
[1] {chunk 1}
[2] {chunk 2}
...

Question: {user query}
```

The "do not paraphrase claims that lack a citation" sentence is critical. Adding it dropped our hallucination rate by ~6 percentage points.
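Assembling that prompt from a list of retrieved chunks could look like the sketch below. The wording is copied from the template above; `build_prompt` and its signature are hypothetical.

```python
FALLBACK = "I don't have that information in the documentation."

INSTRUCTIONS = (
    "You are answering using ONLY the numbered context snippets below.\n"
    "For every claim, append a citation like [3] referring to the snippet number.\n"
    "If the answer is not in the context, reply exactly:\n"
    f'  "{FALLBACK}"\n'
    "Do not paraphrase claims that lack a citation."
)


def build_prompt(question: str, chunks: list[str]) -> str:
    """Number the chunks from 1 and place them between the instructions and the question."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return f"{INSTRUCTIONS}\n\nContext:\n{context}\n\nQuestion: {question}"
```

Numbering from 1 in the same order the chunks are stored makes it possible to map a cited `[n]` back to `chunks[n - 1]` later.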

Step 7: Confidence thresholds

If the top re-ranker score is below a threshold, we don't even call the LLM. We return:

```json
{
  "answer": "I don't have that information in the documentation.",
  "confidence": "low",
  "sources": []
}
```

This catches out-of-scope queries before they become hallucinations. We set the threshold at 0.35 on the re-ranker's relevance score, tuned empirically against our eval set.
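A sketch of that gate, assuming the re-ranker hands back `(score, text)` pairs and that `call_llm` is whatever function sends the prompt. Both names, and the `"high"` confidence label on the success path, are assumptions; only the low-confidence response shape comes from the example above.

```python
FALLBACK = "I don't have that information in the documentation."


def answer_or_fallback(query, scored_chunks, call_llm, threshold=0.35):
    """Skip the LLM entirely when the best re-ranker score is below the threshold."""
    # No candidates at all counts as the lowest possible score.
    top_score = max((score for score, _ in scored_chunks), default=float("-inf"))
    if top_score < threshold:
        return {"answer": FALLBACK, "confidence": "low", "sources": []}
    chunks = [text for _, text in scored_chunks]
    # Assumption: call_llm(query, chunks) returns the answer text.
    return {"answer": call_llm(query, chunks), "confidence": "high", "sources": chunks}
```

Because the check runs before the model call, out-of-scope queries cost one retrieval and one re-rank, not a generation.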

What we got wrong on iteration 1

A few things we'd do differently from the start:

Treating chunking as solved. Default chunking is fine for blog posts, terrible for technical docs. Spend the time to get it right.

Skipping re-ranking. "It's another step, costs more latency." But it's the cheapest +5-10pp recall lift available.

Trusting one eval question type. Our first eval was 50 fact-lookup questions. We optimized for those. Production turned out to also have lots of "compare X and Y" and "summarize the docs about Z" — different shapes, different optimal retrieval strategies. Eval breadth matters.

Logging only retrieval, not the actual prompt. When debugging "why did the model give a vague answer," we needed to see exactly what was sent. Logging the chunks isn't enough — we needed the assembled prompt.
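For the last point, a minimal sketch of logging the assembled prompt as one JSON line per request. The logger name, field names, and `log_prompt` helper are assumptions; the point is only that the full prompt string is stored, not just the chunk ids.

```python
import json
import logging
import time

logger = logging.getLogger("rag.prompts")  # hypothetical logger name


def log_prompt(query: str, chunk_ids: list[str], prompt: str) -> str:
    """Write the exact prompt sent to the LLM, plus the ids of the chunks in it."""
    record = {
        "ts": time.time(),
        "query": query,
        "chunk_ids": chunk_ids,
        "prompt": prompt,  # the assembled text, exactly as sent
    }
    line = json.dumps(record, ensure_ascii=False)
    logger.info(line)
    return line
```

One JSON object per line keeps the log greppable by query and easy to replay against a different model.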

Operational details that bite

Things people don't talk about but that matter in production:

Embedding regeneration when the model upgrades. When you swap text-embedding-3-small for large, you have to re-embed every document. For 280k documents, that's a 6-hour batch job. Plan for it.

Stale documents. When a doc is updated, you need to re-chunk and re-embed. Our pipeline runs nightly; for higher-frequency updates you want streaming.

Citation accuracy. The model sometimes hallucinates citation numbers (citing chunk [7] when the answer came from chunk [3]). We post-process responses to verify citations match the actual content. Mismatches get flagged.

Multi-language. Embedding models vary in their multilingual support. If your corpus has mixed languages, test specifically — text-embedding-3-small does better on multilingual than its predecessors, but bge-m3 is purpose-built for it.
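The citation check mentioned above can start out very simple. This sketch flags citations that point at a snippet number that does not exist, or whose sentence shares few words with the cited snippet. The word-overlap heuristic, the 0.3 cutoff, and the assumption that each citation sits inside its sentence before the final punctuation are all illustrative choices, not a description of the real post-processor.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def check_citations(answer: str, chunks: list[str], min_overlap: float = 0.3):
    """Return (citation_number, reason) for every citation that looks wrong."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        # Words of the claim itself, with the [n] markers removed.
        claim = words(CITATION.sub(" ", sentence))
        for match in CITATION.finditer(sentence):
            n = int(match.group(1))
            if not 1 <= n <= len(chunks):
                flagged.append((n, "no such snippet"))
                continue
            overlap = len(claim & words(chunks[n - 1])) / len(claim) if claim else 0.0
            if overlap < min_overlap:
                flagged.append((n, "low overlap with cited snippet"))
    return flagged
```

Lexical overlap will miss paraphrased but correct citations, so flagged items are candidates for review, not proven errors.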

Numbers for our customer support assistant (the most tuned of the four):

| Metric | Value |
| --- | --- |
| Documents | 280k |
| Recall@8 (after re-ranker) | 95% |
| Hallucination rate (LLM-graded) | 5.7% |
| p95 query latency | 2.1s |
| Cost per query | $0.0029 |

The hallucination rate dropping from ~14% (before re-ranker + confidence threshold + strict prompt) to 5.7% was the most impactful improvement.

What I'd tell a team starting

Build an eval set first. 100-200 hand-labeled QA pairs. Without it, every change is a guess.

Get retrieval right before touching the model. The temptation is to "use a smarter LLM" when answers are bad. Usually the LLM is fine; retrieval is the bottleneck.

Always re-rank. The cheapest quality lift.

Make "I don't know" easy. Most production hallucinations come from the model trying to be helpful when it shouldn't.

Build the system to log the actual prompt sent. The first time something goes wrong, you'll be glad you have it.

The pattern above isn't novel. It's just the result of shipping four of these and learning what's worth doing well. Most of the failures we've seen at other teams come from skipping one of these steps. Each of them sounds optional. None of them are.

About Kiril Urbonas

You might have missed

GitOps with Argo CD: Best Practices for 2025

Process Management and Monitoring in Linux

Linux Performance Tuning for Containers and Kubernetes Nodes