Embedding indexes degrade silently. The signals that catch drift, how often to re-embed, and the operational patterns we built after one quiet quality regression.
We index a few hundred thousand documents with embeddings for our RAG features. The vectors are stable — bge-large produces the same output for the same input — but the quality of retrieval drifts over time as the corpus changes, as queries change, as the world changes. We noticed a quality regression a few months back that had been growing for weeks before any metric caught it. This post is the signals we built afterwards.
Two distinct kinds of drift:
Index drift — the embedded documents and the live queries are increasingly different in shape. The index was built on documentation from two years ago; today's questions are about things the docs don't cover. Retrieval scores stay high (the model finds the closest match) but the closest match isn't relevant anymore. Quality drops without any individual document changing.
Embedding model drift — the underlying model has been updated or replaced. Vectors generated by the new model live in a different space than the old vectors. Comparing them gives noise.
Both are real. Both need monitoring. Index drift is the slower, sneakier one — model drift you usually know about (you deployed it).
Five things we now track:
Top-1 retrieval score distribution. For every live query, we record the cosine similarity of the top retrieved chunk. We track this distribution daily — mean, p50, p25, p10.
A healthy index has retrieval scores clustered above a threshold (for bge-large, top-1 above ~0.65 is "we have a good match"). If the distribution shifts left over time — more queries returning matches below 0.65 — the index is getting worse for the live workload.
This is the leading indicator. Caught the quality regression in our case.
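A sketch of what the daily aggregation can look like, assuming each live query's top-1 cosine similarity lands in a simple log (the log shape, field names, and threshold default here are ours for illustration):

```python
import numpy as np
from datetime import date

# Hypothetical in-memory log; in practice this is a metrics table or time-series store.
# Each record: {"day": date, "top1_score": float}
query_log: list[dict] = []

def record_top1(score: float, day: date | None = None) -> None:
    """Log the cosine similarity of the top retrieved chunk for one live query."""
    query_log.append({"day": day or date.today(), "top1_score": score})

def daily_top1_distribution(day: date, threshold: float = 0.65) -> dict:
    """Aggregate one day's top-1 scores into the numbers we chart."""
    scores = np.array([r["top1_score"] for r in query_log if r["day"] == day])
    if scores.size == 0:
        return {}
    return {
        "mean": float(scores.mean()),
        "p50": float(np.percentile(scores, 50)),
        "p25": float(np.percentile(scores, 25)),
        "p10": float(np.percentile(scores, 10)),
        # Fraction of queries whose best match fell below the "good match" bar.
        "below_threshold": float((scores < threshold).mean()),
    }
```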
Distance between query and top-k spread. When you retrieve top-5 chunks, healthy retrieval has a clear winner (top-1 much higher than top-5). Drift shows up as a flatter distribution — top-5 is close to top-1, meaning the model isn't confident about which chunk best matches. We track score[0] - score[4] per query.
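The spread is cheap to compute at retrieval time; a minimal sketch, where `scores` is assumed to be the top-5 similarities in descending order and the flatness threshold is illustrative, not a tuned value:

```python
def topk_spread(scores: list[float], flat_threshold: float = 0.05) -> tuple[float, bool]:
    """score[0] - score[4] for the top-5 results, plus a 'retrieval looks flat' flag.

    `scores` is assumed to be the top-5 cosine similarities in descending order;
    the 0.05 flatness threshold is illustrative, not a tuned value.
    """
    spread = scores[0] - scores[-1]
    return spread, spread < flat_threshold
```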
Citation rate. When the RAG-generated answer cites the retrieved chunks, what fraction of generated answers cite at least one chunk? A healthy system is around 90-95%. Drops here mean the LLM isn't finding usable context in the retrieved chunks (often because the chunks aren't relevant).
Refusal rate. What fraction of answers are "I don't have that information in the documentation"? Should be stable; a rising refusal rate often means the index doesn't cover what users are asking now.
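Both rates are simple counts over the day's generated answers; a sketch, assuming each answer record carries its cited chunk IDs and final text (field names and refusal markers here are hypothetical stand-ins):

```python
REFUSAL_MARKERS = (
    "i don't have that information",
    "not covered in the documentation",
)

def answer_rates(answers: list[dict]) -> dict:
    """Citation and refusal rates over one day's generated answers.

    Each record is assumed to look like {"text": str, "cited_chunk_ids": list[str]};
    the refusal markers above are stand-ins for whatever phrasing your prompt produces.
    """
    if not answers:
        return {}
    cited = sum(1 for a in answers if a["cited_chunk_ids"])
    refused = sum(
        1 for a in answers
        if any(marker in a["text"].lower() for marker in REFUSAL_MARKERS)
    )
    return {
        "citation_rate": cited / len(answers),   # healthy: roughly 0.90-0.95
        "refusal_rate": refused / len(answers),  # should stay stable over time
    }
```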
User feedback signals. Thumbs-down rate on responses, follow-up reformulation rate (user asks the same thing differently because the first response was bad). Lagging signal, but reliable.
We chart all five on the same dashboard. Visual inspection catches drift faster than alerts on individual metrics; one number moving in isolation might be noise, but three of five moving together is signal.
The specific incident: top-1 retrieval scores had been drifting downward for ~6 weeks. The mean had dropped from 0.71 to 0.62. Nobody noticed, because nothing was watching that distribution yet.
What had changed: a new feature had launched 8 weeks earlier; users were asking questions about it; the docs hadn't been updated to reflect the launch. Existing chunks were the closest match for new queries but weren't actually relevant.
Fix: identify the queries that had low top-1 scores, group them by topic, write or improve the docs for those topics, re-embed.
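A sketch of the first two steps, assuming the query log also keeps the query text and its embedding; clustering the query embeddings with scikit-learn is one simple way to get rough topic groups, not necessarily the only one:

```python
import numpy as np
from sklearn.cluster import KMeans

def low_score_topics(records: list[dict], threshold: float = 0.65, n_topics: int = 10):
    """Group poorly served queries into rough topic buckets for doc work.

    Each record is assumed to hold the query text, its embedding, and the top-1
    retrieval score. Clustering the query embeddings is one simple way to surface
    topic groups; n_topics is a starting guess, not a tuned value.
    """
    low = [r for r in records if r["top1_score"] < threshold]
    if not low:
        return {}
    vectors = np.array([r["embedding"] for r in low])
    labels = KMeans(n_clusters=min(n_topics, len(low)), n_init="auto").fit_predict(vectors)
    groups: dict[int, list[str]] = {}
    for record, label in zip(low, labels):
        groups.setdefault(int(label), []).append(record["query"])
    return groups  # hand each bucket to whoever owns that area of the docs
```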
The fix took a day. The detection had taken 6 weeks because we weren't watching for it. Post-incident, the dashboards above became standard.
Different content changes at different rates, so re-embedding is driven by what actually changed rather than by a fixed schedule.
We use change-data-capture on the content store: any document that's been updated since the last embed gets re-embedded. Nightly batch job runs the diff.
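A sketch of what that nightly job can look like; the content store and vector index here are placeholder interfaces, not a particular library's API:

```python
from datetime import datetime, timezone

def nightly_reembed(content_store, vector_index, embed_fn, last_run: datetime) -> int:
    """Re-embed only the documents updated since the last successful run.

    content_store.changed_since, doc.chunks, and vector_index.upsert are
    placeholder interfaces; swap in whatever your content store and vector
    database actually expose.
    """
    changed = content_store.changed_since(last_run)
    for doc in changed:
        for chunk in doc.chunks():
            vector_index.upsert(
                id=chunk.id,
                vector=embed_fn(chunk.text),
                metadata={"doc_id": doc.id, "embedded_at": datetime.now(timezone.utc)},
            )
    return len(changed)
```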
For embedding model upgrades (e.g., moving from bge-base to bge-large), it's a full re-embed of everything. That's a multi-hour batch job at our scale; we run it during off-hours with a feature flag controlling which index version production reads from. Zero-downtime cutover.
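The cutover mechanism can be as small as an indirection on the index name; a sketch with a hypothetical flag store and flag key:

```python
def active_index_name(flag_store, default: str = "docs_index_v1") -> str:
    """Resolve which index version production reads from.

    flag_store.get and the flag key are placeholders for whatever feature-flag
    system you run; flipping the flag moves reads to the freshly built index
    without a deploy.
    """
    return flag_store.get("rag.active_index", default)

def retrieve(query_vector, flag_store, indexes: dict, k: int = 5):
    """Query whichever index version the flag currently points at."""
    index = indexes[active_index_name(flag_store)]
    return index.search(query_vector, top_k=k)
```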
When evaluating a new embedding model, we run it as a shadow index for a few weeks before cutting over. Live queries are embedded with both the production model and the candidate; both indexes are queried; both result sets are stored for side-by-side comparison.
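A sketch of the dual-path query during the shadow period; prod and candidate here are placeholder objects that bundle an embed function and an index handle, not a specific library's API:

```python
def shadow_query(query: str, prod, candidate, results_log: list, k: int = 5):
    """Embed with both models, query both indexes, log both result sets.

    Only the production results ever serve traffic; the candidate results are
    logged purely for offline comparison.
    """
    prod_hits = prod.index.search(prod.embed(query), top_k=k)
    cand_hits = candidate.index.search(candidate.embed(query), top_k=k)
    results_log.append({"query": query, "prod": prod_hits, "candidate": cand_hits})
    return prod_hits
```

In practice the candidate path can run off the request path (a queue or async task) so the shadow adds no user-facing latency.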
The shadow index costs extra (double the embedding API spend, double the storage) but the alternative — cut over and hope it's better — has burned us before.
After the shadow period, we cut the production index over if the candidate wins. The shadow index keeps running for another week as a fallback, then is decommissioned.
A few patterns we considered and abandoned:
Anomaly detection on individual query embeddings. "Is this query unusual?" ML-based, complex to tune, generates false positives. Aggregate metrics (the five above) catch the same drift more reliably.
Per-document embedding drift. "Has this document's embedding shifted from last week?" For deterministic models, the answer is no unless the document changed. Not useful as a signal.
Real-time drift alerts. Drift is slow. Daily aggregates catch it; minute-by-minute alerting introduces noise without value.
A few events trigger a full end-to-end re-embed, embedding model upgrades chief among them. In practice that works out to about 3-4 full re-embeds per year. Each is a planned event, not a panic response.
Costs and overhead: a full re-embed is either an API batch job (e.g. with text-embedding-3-small) or ~3 hours of GPU time for self-hosted bge-large. The marginal cost is small relative to the cost of a quiet quality regression. Worth doing.
Track top-1 retrieval score distribution daily. The single most useful drift signal.
Multiple signals, not one. Top-1 score, top-k spread, citation rate, refusal rate, user feedback. Together they detect what any one misses.
Re-embed changed documents nightly. Cheap; catches editorial updates.
Shadow index for model upgrades. Don't cut over blind.
Plan for full re-embed events. A few times a year is normal. Have the pipeline ready.
Watch refusal rate. Rising refusal rate often means the index doesn't cover current questions — a content gap, not a model gap.
Embedding drift is a silent failure mode of every RAG system. The vectors don't change; the world does. Without active monitoring, you find out about quality regressions weeks after they start. With the dashboards above, days. The cost of building them is small; the cost of skipping them shows up in the wrong place.