Pure vector search misses exact-keyword queries. Pure BM25 misses semantic ones. Combining them with reciprocal rank fusion is the simplest large win in RAG retrieval.

Hybrid Search: Combining BM25 and Embeddings for Better RAG

The first version of every RAG system we've shipped used pure vector search — embed the query, find the nearest chunks in cosine-similarity space, send them to the LLM. Works great for queries that match documents semantically. Works terribly for queries that depend on specific identifiers, version numbers, error codes, or product names. We've watched users type the exact name of a feature into our support assistant and get back chunks about something else.

The fix that consistently moves recall is hybrid search: running BM25 (the classic keyword-based ranker) alongside vector search and fusing the results. Across four production RAG features, hybrid added roughly 3 to 7 percentage points of recall@10 over vector-only, depending on the workload. This post describes how we run it.

Where vector search fails

Vector embeddings encode meaning, not surface form. That's their strength most of the time and their weakness for a specific class of queries:

Identifiers: error codes, SKUs, customer IDs, file paths. "ERR_TLS_HANDSHAKE_FAILED" and "the handshake failed during TLS negotiation" describe similar concepts, but the embedding distance between them can be surprisingly large. For a query containing the error code, pure vector retrieval can rank a prose chunk first when the user wanted the chunk that contains the code itself.
Exact phrases: quoted strings, function names, command-line flags. "max_connections" should match documents that contain that literal config key, not just documents that discuss connection pooling in the abstract.
Rare terms: niche product names, internal jargon. The embedding model hasn't seen them often enough, so their vectors end up close to lots of unrelated things.

For these, BM25 — which counts term overlap weighted by inverse document frequency — does much better.
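
To make the inverse-document-frequency point concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is an illustration, not what our Postgres setup runs: the `k1` and `b` defaults, the whitespace tokenization, and the three toy documents are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a large IDF, so an exact identifier match dominates.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (hypothetical), lowercased and split on whitespace.
docs = [
    "the handshake failed during tls negotiation".split(),
    "err_tls_handshake_failed raised when the certificate chain is invalid".split(),
    "connection pooling and max_connections tuning".split(),
]
scores = bm25_scores(["err_tls_handshake_failed"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the document that contains the literal error code gets a non-zero score. The prose description of the same failure scores zero, which is exactly the behavior vector search lacks for identifier queries.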

Where BM25 fails

BM25 has the symmetric problem. It looks at words, not meaning:

"how do I fix a 502 error" doesn't match a doc that uses "bad gateway response" — no shared keywords.
Synonyms, paraphrases, abbreviations all defeat it.
Cross-language queries are hopeless.

These are exactly where vector search shines.

The two methods have complementary failure modes, which is why combining them works.

Reciprocal rank fusion (RRF)

RRF is the simplest way to combine ranked lists from multiple search systems. For each candidate document, compute:

score(doc) = Σ over each list  ( 1 / (k + rank_in_that_list) )

k is a smoothing constant (60 is the canonical default), and rank_in_that_list starts at 1; a list that doesn't contain the document contributes nothing. Documents that rank highly in multiple lists score high; documents that rank highly in only one still get partial credit.

No tuning of relative weights, no need to normalize raw scores from different systems, no calibration — just rankings.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for i, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + i + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

dense_results = vector_search(query)         # list of doc_ids ranked
bm25_results  = bm25_search(query)           # list of doc_ids ranked
fused = rrf([dense_results, bm25_results])[:10]

That's the whole algorithm. Six lines.
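
A worked example helps show how the "partial credit" plays out. The sketch below uses a variant that returns the scores as well as the order; the document IDs and the two rankings are made up for illustration.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Return the RRF score of every document that appears in any ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


# Hypothetical results for one query.
dense = ["a", "b", "c"]   # vector search order
bm25 = ["c", "a", "d"]    # keyword search order

scores = rrf_scores([dense, bm25])
fused = sorted(scores, key=scores.get, reverse=True)

# a: 1/61 + 1/62 ≈ 0.0325   (rank 1 dense, rank 2 keyword)
# c: 1/63 + 1/61 ≈ 0.0323   (rank 3 dense, rank 1 keyword)
# b: 1/62        ≈ 0.0161   (dense only)
# d: 1/63        ≈ 0.0159   (keyword only)
# fused == ["a", "c", "b", "d"]
```

Documents found by both systems ("a" and "c") score roughly twice as high as documents found by only one, regardless of what the raw similarity or keyword scores were.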

How we run it in production

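Backend. Postgres with pgvector (for embeddings) plus Postgres's built-in full-text search via tsvector and ts_rank_cd (for BM25-ish ranking). "BM25-ish" is deliberate: ts_rank_cd scores matches by frequency and proximity but uses no corpus-wide statistics such as inverse document frequency, so it approximates BM25 rather than implementing it. Both indexes live on the same table, and both queries run against the same data. No separate Elasticsearch needed at our scale.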

Pipeline at query time:

1. Run both queries in parallel against the same Document table.
   - Vector query returns top-25 by cosine similarity.
   - FTS query returns top-25 by ts_rank_cd.
2. Fuse with RRF (k=60).
3. Take top 8–10 for the LLM context.
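
The two queries might look roughly like the sketch below. Treat it as an assumption-heavy illustration rather than our exact SQL: the column names (`id`, `embedding`, `search_vector`), the `'english'` text search configuration, and the psycopg-style `%s` placeholders are all hypothetical. The `<=>` operator is pgvector's cosine distance, and `ts_rank_cd` and `plainto_tsquery` are standard Postgres functions.

```python
# Nearest chunks by cosine distance (smaller distance = more similar).
VECTOR_SQL = """
SELECT id
FROM document
ORDER BY embedding <=> %s::vector
LIMIT %s
"""

# Keyword matches ranked by cover density.
FTS_SQL = """
SELECT id
FROM document, plainto_tsquery('english', %s) AS q
WHERE search_vector @@ q
ORDER BY ts_rank_cd(search_vector, q) DESC
LIMIT %s
"""


def vector_ids(cur, query_embedding: list[float], limit: int = 25) -> list:
    """Run the vector query on a DB-API cursor and return ranked chunk IDs."""
    # pgvector accepts a text literal such as '[0.1,0.2,0.3]' cast to vector.
    literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    cur.execute(VECTOR_SQL, (literal, limit))
    return [row[0] for row in cur.fetchall()]


def fts_ids(cur, query_text: str, limit: int = 25) -> list:
    """Run the full-text query on a DB-API cursor and return ranked chunk IDs."""
    cur.execute(FTS_SQL, (query_text, limit))
    return [row[0] for row in cur.fetchall()]
```

Both functions return plain lists of IDs in rank order, which is all RRF needs; the raw distances and ts_rank_cd values never leave the database.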

We later added re-ranking on top (a small cross-encoder scores the fused results against the query), but RRF on its own improved recall enough to justify the extra complexity of running two retrievers.

What we measured

A small eval set of 200 hand-labeled queries, comparing top-10 retrieval against the labeled relevant chunks:

Method	Recall@10
Pure vector (text-embedding-3-large)	84%
Pure BM25 (Postgres FTS)	71%
RRF fusion	91%
RRF + cross-encoder rerank	95%

The vector vs BM25 gap is real (84% vs 71%, so vector wins overall), but fusion beats either alone by a meaningful margin: 7pp over vector-only. The fused result is better than vector for keyword-heavy queries AND better than BM25 for semantic ones, because RRF lets each method "vote" for the docs it found.

The rerank step adds another 4pp on top but adds latency (~80ms) and infra complexity. Whether it's worth it depends on the workload.
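
For reference, recall@10 can be computed along these lines. This is a sketch under one common definition (the fraction of labeled relevant chunks that appear in the top k, averaged over queries); the function names and the tiny example data are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the labeled relevant chunks found in the top-k results."""
    if not relevant:
        raise ValueError("each eval query needs at least one relevant chunk")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Two hypothetical eval queries.
runs = [
    (["a", "b", "c"], {"a", "z"}),  # found 1 of 2 relevant chunks -> 0.5
    (["d", "e"], {"d"}),            # found 1 of 1 relevant chunk  -> 1.0
]
mean_recall = mean_recall_at_k(runs)  # 0.75
```

Running the same function over the output of each retrieval method on the same labeled set is what produces a comparison table like the one above.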

Latency

Two parallel queries instead of one. In practice:

Vector search (pgvector with HNSW index): ~12ms p50
FTS search (Postgres tsvector): ~8ms p50
Both in parallel: ~14ms p50 (max of the two plus tiny RRF overhead)

Negligible cost for the recall lift.
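
One way to get the "max of the two" behavior is to issue both queries concurrently. The sketch below uses asyncio with stand-in coroutines; `vector_search` and `fts_search` here are placeholders that sleep for roughly the latencies quoted above and return made-up IDs, not real database calls.

```python
import asyncio


async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(0.012)  # placeholder for the ~12 ms pgvector query
    return ["a", "b", "c"]


async def fts_search(query: str) -> list[str]:
    await asyncio.sleep(0.008)  # placeholder for the ~8 ms full-text query
    return ["c", "a", "d"]


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


async def hybrid_search(query: str, top_n: int = 10) -> list[str]:
    # gather() runs both coroutines concurrently, so the wall-clock time is
    # close to the slower query rather than the sum of the two.
    dense, keyword = await asyncio.gather(vector_search(query), fts_search(query))
    return rrf([dense, keyword])[:top_n]


result = asyncio.run(hybrid_search("ERR_TLS_HANDSHAKE_FAILED"))
```

With a real async driver the two stand-ins would each hold their own connection from the pool, since a single Postgres connection runs one query at a time.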

Common mistakes

Running vector and BM25 against different document sets. RRF merges results by document ID, so both indexes must be built from the same chunks. If the two sides have different ingestion pipelines (different chunking, different normalization), the same passage shows up under different IDs or exists in only one index; the fused list looks odd and is awkward to debug. Same source, same chunks.

Weighting BM25 to compensate for "vector is better." Plain RRF works on ranks, not raw scores, so there is nothing to rescale between the two systems. If you find yourself adding scalar weights to one side, you're reintroducing the tuning that RRF lets you skip.

Forgetting to normalize text consistently. Index-time and query-time text need the same tokenization, lowercasing, and stemming, and both pipelines should ingest the same cleaned text. Mismatches mean BM25 ranks differ for queries that look the same after normalization.
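
If any tokenization happens in application code rather than inside the database, the safest pattern is a single function shared by the indexing path and the query path. The sketch below is a hypothetical helper, not part of our pipeline: it lowercases and splits on non-word characters but keeps underscores so identifiers survive, and it deliberately does no stemming.

```python
import re
import unicodedata

# Letters, digits and underscores stay together so that identifiers such as
# max_connections or ERR_TLS_HANDSHAKE_FAILED remain single tokens.
_TOKEN = re.compile(r"[a-z0-9_]+")


def normalize(text: str) -> list[str]:
    """Shared normalizer: call it at index time AND at query time."""
    text = unicodedata.normalize("NFKC", text).lower()
    return _TOKEN.findall(text)


doc_tokens = normalize("Set Max_Connections = 100 in postgresql.conf")
query_tokens = normalize("max_connections")
```

Because both sides go through the same function, the query token `max_connections` matches the document token exactly, whatever casing the user typed.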

Skipping eval. Hybrid sounds obviously better, but the win is workload-dependent. Run the eval against your actual labeled set. For our customer support assistant the gap was 7pp; for an internal doc search it was closer to 3pp.

What we'd skip

A few patterns we tried and dropped:

Learned weights for combining scores. The fixed RRF with k=60 worked as well as anything we trained.
Query expansion with synonyms before BM25. Too much complexity for marginal gain.
Two separate vector indexes (different models) fused. Doesn't move the needle the way orthogonal methods (vector + keyword) do.

What to read next

Building RAG applications: a complete guide — the end-to-end RAG pattern this fits into
Embedding models comparison: choosing the right model — the vector side of hybrid
Vector databases for AI: Pinecone vs Weaviate vs Chroma vs pgvector — where these indexes live
Embeddings drift detection — when "similar enough" stops being similar — what happens after you've got retrieval working

Hybrid search is one of those changes where the implementation is small but the recall improvement is consistent. If your RAG system has been pure-vector and you've noticed certain query types just don't work — error codes, version numbers, specific product names — this is the cheapest fix you'll find.

About Kiril Urbonas