Embedding indexes degrade silently. The signals that catch drift, how often to re-embed, and the operational patterns we built after one quiet quality regression.
We index a few hundred thousand documents with embeddings for our RAG features. The vectors are stable — bge-large produces the same output for the same input — but the quality of retrieval drifts over time as the corpus changes, as queries change, as the world changes. We noticed a quality regression a few months back that had been growing for weeks before any metric caught it. This post is the signals we built afterwards.
Two distinct kinds of drift:
Index drift — the embedded documents and the live queries are increasingly different in shape. The index was built on documentation from two years ago; today's questions are about things the docs don't cover. Retrieval scores stay high (the model finds the closest match) but the closest match isn't relevant anymore. Quality drops without any individual document changing.
Embedding model drift — the underlying model has been updated or replaced. Vectors generated by the new model live in a different space than the old vectors. Comparing them gives noise.
Both are real. Both need monitoring. Index drift is the slower, sneakier one — model drift you usually know about (you deployed it).
Five things we now track:
Top-1 retrieval score distribution. For every live query, we record the cosine similarity of the top retrieved chunk. We track this distribution daily — mean, p50, p25, p10.
A healthy index has retrieval scores clustered above a threshold (for bge-large, top-1 above ~0.65 is "we have a good match"). If the distribution shifts left over time — more queries returning matches below 0.65 — the index is getting worse for the live workload.
This is the leading indicator. Caught the quality regression in our case.
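A sketch of what the daily aggregation can look like, assuming each live query's top-1 cosine similarity lands in a simple log (the log shape, field names, and threshold default here are ours for illustration):

```python
import numpy as np
from datetime import date

# Hypothetical in-memory log; in practice this is a metrics table or time-series store.
# Each record: {"day": date, "top1_score": float}
query_log: list[dict] = []

def record_top1(score: float, day: date | None = None) -> None:
    """Log the cosine similarity of the top retrieved chunk for one live query."""
    query_log.append({"day": day or date.today(), "top1_score": score})

def daily_top1_distribution(day: date, threshold: float = 0.65) -> dict:
    """Aggregate one day's top-1 scores into the numbers we chart."""
    scores = np.array([r["top1_score"] for r in query_log if r["day"] == day])
    if scores.size == 0:
        return {}
    return {
        "mean": float(scores.mean()),
        "p50": float(np.percentile(scores, 50)),
        "p25": float(np.percentile(scores, 25)),
        "p10": float(np.percentile(scores, 10)),
        # Fraction of queries whose best match fell below the "good match" bar.
        "below_threshold": float((scores < threshold).mean()),
    }
```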
Distance between query and top-k spread. When you retrieve top-5 chunks, healthy retrieval has a clear winner (top-1 much higher than top-5). Drift shows up as a flatter distribution — top-5 is close to top-1, meaning the model isn't confident about which chunk best matches. We track score[0] - score[4] per query.
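The spread is cheap to compute at retrieval time; a minimal sketch, where `scores` is assumed to be the top-5 similarities in descending order and the flatness threshold is illustrative, not a tuned value:

```python
def topk_spread(scores: list[float], flat_threshold: float = 0.05) -> tuple[float, bool]:
    """score[0] - score[4] for the top-5 results, plus a 'retrieval looks flat' flag.

    `scores` is assumed to be the top-5 cosine similarities in descending order;
    the 0.05 flatness threshold is illustrative, not a tuned value.
    """
    spread = scores[0] - scores[-1]
    return spread, spread < flat_threshold
```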
Citation rate. When the RAG-generated answer cites the retrieved chunks, what fraction of generated answers cite at least one chunk? A healthy system is around 90-95%. Drops here mean the LLM isn't finding usable context in the retrieved chunks (often because the chunks aren't relevant).
Refusal rate. What fraction of answers are "I don't have that information in the documentation"? Should be stable; a rising refusal rate often means the index doesn't cover what users are asking now.
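Both rates are simple counts over the day's generated answers; a sketch, assuming each answer record carries its cited chunk IDs and final text (field names and refusal markers here are hypothetical stand-ins):

```python
REFUSAL_MARKERS = (
    "i don't have that information",
    "not covered in the documentation",
)

def answer_rates(answers: list[dict]) -> dict:
    """Citation and refusal rates over one day's generated answers.

    Each record is assumed to look like {"text": str, "cited_chunk_ids": list[str]};
    the refusal markers above are stand-ins for whatever phrasing your prompt produces.
    """
    if not answers:
        return {}
    cited = sum(1 for a in answers if a["cited_chunk_ids"])
    refused = sum(
        1 for a in answers
        if any(marker in a["text"].lower() for marker in REFUSAL_MARKERS)
    )
    return {
        "citation_rate": cited / len(answers),   # healthy: roughly 0.90-0.95
        "refusal_rate": refused / len(answers),  # should stay stable over time
    }
```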
User feedback signals. Thumbs-down rate on responses, follow-up reformulation rate (user asks the same thing differently because the first response was bad). Lagging signal, but reliable.
We chart all five on the same dashboard. Visual inspection catches drift faster than alerts on individual metrics; one number moving in isolation might be noise, but three of five moving together is signal.
The specific incident: top-1 retrieval scores had been drifting downward for ~6 weeks. The mean had dropped from 0.71 to 0.62. Nobody noticed, because nothing was watching that distribution yet.
What had changed: a new feature had launched 8 weeks earlier; users were asking questions about it; the docs hadn't been updated to reflect the launch. Existing chunks were the closest match for new queries but weren't actually relevant.
Fix: identify the queries that had low top-1 scores, group them by topic, write or improve the docs for those topics, re-embed.
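A sketch of the first two steps, assuming the query log also keeps the query text and its embedding; clustering the query embeddings with scikit-learn is one simple way to get rough topic groups, not necessarily the only one:

```python
import numpy as np
from sklearn.cluster import KMeans

def low_score_topics(records: list[dict], threshold: float = 0.65, n_topics: int = 10):
    """Group poorly served queries into rough topic buckets for doc work.

    Each record is assumed to hold the query text, its embedding, and the top-1
    retrieval score. Clustering the query embeddings is one simple way to surface
    topic groups; n_topics is a starting guess, not a tuned value.
    """
    low = [r for r in records if r["top1_score"] < threshold]
    if not low:
        return {}
    vectors = np.array([r["embedding"] for r in low])
    labels = KMeans(n_clusters=min(n_topics, len(low)), n_init="auto").fit_predict(vectors)
    groups: dict[int, list[str]] = {}
    for record, label in zip(low, labels):
        groups.setdefault(int(label), []).append(record["query"])
    return groups  # hand each bucket to whoever owns that area of the docs
```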
The fix took a day. The detection had taken 6 weeks because we weren't watching for it. Post-incident, the dashboards above became standard.
Different content changes at different rates, so re-embedding is driven by what actually changed rather than by a fixed schedule.
We use change-data-capture on the content store: any document that's been updated since the last embed gets re-embedded. Nightly batch job runs the diff.
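A sketch of what that nightly job can look like; the content store and vector index here are placeholder interfaces, not a particular library's API:

```python
from datetime import datetime, timezone

def nightly_reembed(content_store, vector_index, embed_fn, last_run: datetime) -> int:
    """Re-embed only the documents updated since the last successful run.

    content_store.changed_since, doc.chunks, and vector_index.upsert are
    placeholder interfaces; swap in whatever your content store and vector
    database actually expose.
    """
    changed = content_store.changed_since(last_run)
    for doc in changed:
        for chunk in doc.chunks():
            vector_index.upsert(
                id=chunk.id,
                vector=embed_fn(chunk.text),
                metadata={"doc_id": doc.id, "embedded_at": datetime.now(timezone.utc)},
            )
    return len(changed)
```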
For embedding model upgrades (e.g., moving from bge-base to bge-large), it's a full re-embed of everything. That's a multi-hour batch job at our scale; we run it during off-hours with a feature flag controlling which index version production reads from. Zero-downtime cutover.
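The cutover mechanism can be as small as an indirection on the index name; a sketch with a hypothetical flag store and flag key:

```python
def active_index_name(flag_store, default: str = "docs_index_v1") -> str:
    """Resolve which index version production reads from.

    flag_store.get and the flag key are placeholders for whatever feature-flag
    system you run; flipping the flag moves reads to the freshly built index
    without a deploy.
    """
    return flag_store.get("rag.active_index", default)

def retrieve(query_vector, flag_store, indexes: dict, k: int = 5):
    """Query whichever index version the flag currently points at."""
    index = indexes[active_index_name(flag_store)]
    return index.search(query_vector, top_k=k)
```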
When evaluating a new embedding model, we run it as a shadow index for a few weeks before cutting over. Live queries are embedded with both the production model and the candidate; both indexes are queried; both result sets are stored for side-by-side comparison.
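A sketch of the dual-path query during the shadow period; prod and candidate here are placeholder objects that bundle an embed function and an index handle, not a specific library's API:

```python
def shadow_query(query: str, prod, candidate, results_log: list, k: int = 5):
    """Embed with both models, query both indexes, log both result sets.

    Only the production results ever serve traffic; the candidate results are
    logged purely for offline comparison.
    """
    prod_hits = prod.index.search(prod.embed(query), top_k=k)
    cand_hits = candidate.index.search(candidate.embed(query), top_k=k)
    results_log.append({"query": query, "prod": prod_hits, "candidate": cand_hits})
    return prod_hits
```

In practice the candidate path can run off the request path (a queue or async task) so the shadow adds no user-facing latency.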
The shadow index costs extra (double the embedding API spend, double the storage) but the alternative — cut over and hope it's better — has burned us before.
After the shadow period, we cut the production index over if the candidate wins. The shadow index keeps running for another week as a fallback, then is decommissioned.
A few patterns we considered and abandoned:
Anomaly detection on individual query embeddings. "Is this query unusual?" ML-based, complex to tune, generates false positives. Aggregate metrics (the five above) catch the same drift more reliably.
Per-document embedding drift. "Has this document's embedding shifted from last week?" For deterministic models, the answer is no unless the document changed. Not useful as a signal.
Real-time drift alerts. Drift is slow. Daily aggregates catch it; minute-by-minute alerting introduces noise without value.
A few events trigger a full end-to-end re-embed, embedding model upgrades chief among them. In practice that works out to about 3-4 full re-embeds per year. Each is a planned event, not a panic response.
Costs and overhead: a full re-embed is either an API batch job (e.g. with text-embedding-3-small) or ~3 hours of GPU time for self-hosted bge-large. The marginal cost is small relative to the cost of a quiet quality regression. Worth doing.
Track top-1 retrieval score distribution daily. The single most useful drift signal.
Multiple signals, not one. Top-1 score, top-k spread, citation rate, refusal rate, user feedback. Together they detect what any one misses.
Re-embed changed documents nightly. Cheap; catches editorial updates.
Shadow index for model upgrades. Don't cut over blind.
Plan for full re-embed events. A few times a year is normal. Have the pipeline ready.
Watch refusal rate. Rising refusal rate often means the index doesn't cover current questions — a content gap, not a model gap.
Embedding drift is a silent failure mode of every RAG system. The vectors don't change; the world does. Without active monitoring, you find out about quality regressions weeks after they start. With the dashboards above, days. The cost of building them is small; the cost of skipping them shows up in the wrong place.