Pure vector search misses exact-keyword queries. Pure BM25 misses semantic ones. Combining them with reciprocal rank fusion is the simplest large win in RAG retrieval.
The first version of every RAG system we've shipped used pure vector search — embed the query, find the nearest chunks in cosine-similarity space, send them to the LLM. Works great for queries that match documents semantically. Works terribly for queries that depend on specific identifiers, version numbers, error codes, or product names. We've watched users type the exact name of a feature into our support assistant and get back chunks about something else.
The fix that consistently moves recall is hybrid search — running BM25 (the classic keyword-based ranker) alongside vector search and fusing the results. Across four production RAG features, hybrid added 4–7 percentage points of recall@10 over vector-only. This post is how we run it.
Vector embeddings encode meaning, not surface form. That's their strength most of the time and their weakness for a specific class of queries:
Identifiers — error codes, SKUs, customer IDs, file paths. "ERR_TLS_HANDSHAKE_FAILED" and "the handshake failed during TLS negotiation" describe similar concepts but the embedding distance can be surprisingly large. Pure vector retrieval ranks the prose version first; the user wanted the error code one.
Exact phrases — quoted strings, function names, command-line flags. "max_connections" should match documents that contain that literal config key, not just documents that talk about connection pooling abstractly.
Rare terms — niche product names, internal jargon. The embedding model hasn't seen them often enough; the vector is close to lots of unrelated things.
For these, BM25 — which counts term overlap weighted by inverse document frequency — does much better.
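To make the term-overlap idea concrete, here is a minimal sketch of textbook BM25 over a toy corpus. It is not what Postgres full-text search computes internally; the whitespace tokenizer, the `k1` and `b` defaults, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with a textbook BM25 formula."""
    # Naive tokenizer (assumption): lowercase and split on whitespace,
    # so an identifier like ERR_TLS_HANDSHAKE_FAILED stays a single token.
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents each term appears in.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a larger inverse-document-frequency weight.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the handshake failed during tls negotiation",
    "ERR_TLS_HANDSHAKE_FAILED is raised when the certificate chain is rejected",
    "connection pooling reduces handshake overhead",
]
scores = bm25_scores("ERR_TLS_HANDSHAKE_FAILED", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
```

Only the document containing the literal error code gets a non-zero score here, which is the behavior the identifier case needs.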
BM25 has the symmetric problem. It looks at words, not meaning:
"how do I fix a 502 error" doesn't match a doc that uses "bad gateway response" because the two share no keywords.
These are exactly where vector search shines.
The two methods have complementary failure modes, which is why combining them works.
Reciprocal rank fusion (RRF) is the simplest way to combine ranked lists from multiple search systems. For each candidate document, compute:
score(doc) = Σ over each list ( 1 / (k + rank_in_that_list) )
k is a smoothing constant (60 is the canonical default). Documents that rank highly in multiple lists score high; documents that rank highly in only one still get partial credit.
No tuning of relative weights, no need to normalize raw scores from different systems, no calibration — just rankings.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for i, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + i + 1)  # i is 0-based, ranks are 1-based
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense_results = vector_search(query)  # list of doc_ids, best first
bm25_results = bm25_search(query)     # list of doc_ids, best first
fused = rrf([dense_results, bm25_results])[:10]
That's the whole algorithm. Six lines.
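A worked example helps show how the arithmetic plays out. The sketch below uses a variant of the function that returns the raw scores, with two made-up rankings; the doc IDs `a` through `d` are placeholders, not real data.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Same fusion as above, but return the per-document scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return scores


dense = ["a", "b", "c"]  # hypothetical vector-search ranking
bm25 = ["c", "a", "d"]   # hypothetical BM25 ranking

scores = rrf_scores([dense, bm25])
order = sorted(scores, key=scores.get, reverse=True)

# a: 1/61 + 1/62  (rank 1 in dense, rank 2 in BM25)
# c: 1/63 + 1/61  (rank 3 in dense, rank 1 in BM25)
# b: 1/62         (dense only)
# d: 1/63         (BM25 only)
for doc_id in order:
    print(doc_id, round(scores[doc_id], 5))
```

The fused order is `a, c, b, d`: both documents that appear in two lists outrank the ones that appear in only one, even though `b` sat above `c` in the dense ranking.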
Backend. Postgres with pgvector (for embeddings) plus Postgres's built-in full-text search via tsvector and ts_rank_cd (for BM25-ish ranking). Both indexes on the same table; both queries against the same data. No separate Elasticsearch needed for our scale.
Pipeline at query time:
Document table.
We added re-ranking on top later (a small cross-encoder scores fused results against the query), but RRF on its own moved the needle enough to justify the complexity.
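For the Postgres setup described above, the two queries might look roughly like the sketch below. The table name `chunks`, the columns `id`, `embedding`, and `tsv`, the `'english'` text search configuration, and the psycopg-style named parameters are all assumptions for illustration, not our actual schema. `<=>` is pgvector's cosine-distance operator, and `websearch_to_tsquery` and `ts_rank_cd` are standard Postgres full-text functions.

```python
# Hypothetical schema: chunks(id text, embedding vector, tsv tsvector)

VECTOR_SQL = """
    SELECT id
    FROM chunks
    ORDER BY embedding <=> %(qvec)s::vector
    LIMIT %(limit)s
"""

KEYWORD_SQL = """
    SELECT id
    FROM chunks, websearch_to_tsquery('english', %(qtext)s) AS q
    WHERE tsv @@ q
    ORDER BY ts_rank_cd(tsv, q) DESC
    LIMIT %(limit)s
"""


def to_pgvector_literal(vec: list[float]) -> str:
    """Format an embedding as the text form pgvector accepts, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(x) for x in vec) + "]"


def fetch_ids(conn, sql: str, params: dict) -> list[str]:
    """Run one ranking query on a DB-API connection and return doc IDs, best first."""
    with conn.cursor() as cur:
        cur.execute(sql, params)
        return [row[0] for row in cur.fetchall()]


def hybrid_rankings(conn, query_text: str, query_vec: list[float], limit: int = 50) -> list[list[str]]:
    """Return [dense_ranking, keyword_ranking], ready to pass to an RRF function."""
    dense = fetch_ids(conn, VECTOR_SQL, {"qvec": to_pgvector_literal(query_vec), "limit": limit})
    keyword = fetch_ids(conn, KEYWORD_SQL, {"qtext": query_text, "limit": limit})
    return [dense, keyword]
```

Each query fetches more candidates than the final top 10 (50 here, an arbitrary choice) so that fusion has something to work with when the two lists disagree.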
A small eval set of 200 hand-labeled queries, comparing top-10 retrieval against the labeled relevant chunks:
| Method | Recall@10 |
|---|---|
| Pure vector (text-embedding-3-large) | 84% |
| Pure BM25 (Postgres FTS) | 71% |
| RRF fusion | 91% |
| RRF + cross-encoder rerank | 95% |
The vector vs BM25 gap is real (vector wins overall) but the fusion beats either alone by a meaningful margin. The fused result is better than vector for keyword-heavy queries AND better than BM25 for semantic ones, because RRF lets each method "vote" for the docs it found.
The rerank step adds another 4pp on top but adds latency (~80ms) and infra complexity. Whether it's worth it depends on the workload.
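If you want to reproduce this kind of comparison on your own data, the metric itself is small. The sketch below assumes recall@10 means the fraction of labeled relevant chunks that appear in the top 10, averaged over queries; the function names and the shape of the eval set are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the labeled relevant chunks found in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


# Toy eval set: two queries with made-up chunk IDs.
runs = [
    (["c1", "c7", "c3"], {"c1", "c3"}),  # both relevant chunks retrieved
    (["c9", "c2"], {"c2", "c5"}),        # one of two retrieved
]
score = mean_recall_at_k(runs)
```

Running the same labeled set through each retriever and comparing the resulting scores gives a table like the one above.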
Two parallel queries instead of one. In practice:
Negligible cost for the recall lift.
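One way to issue the two queries concurrently is a thread pool, sketched below. The two search functions are stand-ins that return fixed lists; in a real system they would be I/O-bound database calls, which is the case where threads help.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Search = Callable[[str], list[str]]


def run_parallel(query: str, searches: list[Search]) -> list[list[str]]:
    """Run every search on its own thread; return rankings in the order given."""
    with ThreadPoolExecutor(max_workers=len(searches)) as pool:
        futures = [pool.submit(search, query) for search in searches]
        return [future.result() for future in futures]


# Hypothetical stand-ins for the real retrievers.
def fake_vector_search(query: str) -> list[str]:
    return ["a", "b", "c"]


def fake_bm25_search(query: str) -> list[str]:
    return ["c", "a", "d"]


rankings = run_parallel("max_connections", [fake_vector_search, fake_bm25_search])
```

With this shape, the wall-clock cost is roughly that of the slower query rather than the sum of the two.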
Running vector and BM25 against different document sets. If they have different ingestion pipelines (different chunking, different normalization), the fused list returns documents that don't exist in one or the other — looks weird, debugging is awkward. Same source, same chunks.
Weighting BM25 to compensate for "vector is better." With RRF you don't need weighting. If you find yourself adding scalar weights to one side, you're rebuilding what RRF already does.
Forgetting to normalize text consistently. Both pipelines need to apply the same tokenization, lowercasing, stemming. Mismatches mean BM25 ranks differ for queries that look the same after normalization.
Skipping eval. Hybrid sounds obviously better, but the win is workload-dependent. Run the eval against your actual labeled set. For our customer support assistant the gap was 7pp; for an internal doc search it was closer to 3pp.
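For the normalization pitfall, the simplest guard is a single function that both ingestion and query code import, so the two sides cannot drift apart. The sketch below is an assumption about what such a function might contain; it covers Unicode folding, lowercasing, and whitespace cleanup, and leaves stemming to whatever the search backend provides.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Shared text normalization, applied at both index time and query time."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace


indexed = normalize("  Max_Connections\t= 100 ")
queried = normalize("max_connections = 100")
```

Both calls produce the same string, so a query typed one way still matches a chunk that was stored another way.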
A few patterns we tried and dropped:
Hybrid search is one of those changes where the implementation is small but the recall improvement is consistent. If your RAG system has been pure-vector and you've noticed certain query types just don't work — error codes, version numbers, specific product names — this is the cheapest fix you'll find.