A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.

On this page

Real-World RAG Incidents: Lessons from a Production Rollout

When we first rolled out a RAG-based assistant for our internal SRE team, nothing in the vendor docs really prepared us for the messy parts.

Incident: Cached Wrong Answers #

The first painful incident happened on a Monday morning. A runbook query returned an outdated PostgreSQL failover procedure because:

We cached answers aggressively to save tokens.
The underlying runbook in Git had been updated over the weekend.
Our invalidation logic only watched the vector store, not the source repo.

How We Fixed It #

We changed our cache key to include the document commit hash.
We added a background job that compares Git commits against stored vectors.
We updated the runbook template to show the last updated date in the answer.

Incident: Embeddings Going Silent #

Two weeks later, we saw a spike in “no relevant context found” errors during incident calls. The vector DB was healthy; the problem turned out to be:

A new data source with HTML-heavy content.
We were chunking purely by character count.
The relevant text was split across three different chunks.

Changes We Made #

Switched to semantic + heading based chunking with overlap.
Added a metric for “chunks per query” and “distance of top-1 match”.
Logged a sample of low-quality retrievals for manual review.

Checklist for RAG in Production #

Track cache hit rate, LLM error rate, and no-context rate.
Store the retrieved chunk IDs alongside each answer.
Regularly sample answers and review them with the owning team.

The marketing pages sold RAG as magic. In reality it behaves more like a database: if you don’t design for drift, invalidation, and observability, it will betray you at the worst moment.

Real-World RAG Incidents: Lessons from a Production Rollout

Real-World RAG Incidents: Lessons from a Production Rollout

Incident: Cached Wrong Answers #

How We Fixed It #

Incident: Embeddings Going Silent #

Changes We Made #

Checklist for RAG in Production #

Stay Updated

Embedding Models Comparison: Choosing the Right Model for Your Use Case

What We Learned Running Weekly Game Days on Our CI/CD Pipeline

More from AI

MLOps — Model Registry vs MLflow Tracking, And When You Need Both

Agentic Ops — When (and When Not) to Use AI Agents for Incident Response

LLM Evals That Actually Predict Production Quality

MLOps — Model Registry vs MLflow Tracking, And When You Need Both

Agentic Ops — When (and When Not) to Use AI Agents for Incident Response

LLM Evals That Actually Predict Production Quality

Multi-Provider LLM Routing — Failover, Cost Routing, and Load Balancing

Hybrid Search — Combining BM25 and Embeddings for Better RAG

LLM Streaming UX — Backpressure, Cancellation, Partial Results

About Kiril Urbonas

You might have missed

GitOps with Argo CD: Best Practices for 2025

Linux Performance Tuning for Containers and Kubernetes Nodes

Terraform Cloud Cost Controls: Budgets, Policies, and Tagging

Real-World RAG Incidents: Lessons from a Production Rollout

Incident: Cached Wrong Answers#

How We Fixed It#

Incident: Embeddings Going Silent#

Changes We Made#

Checklist for RAG in Production#

Stay Updated

Embedding Models Comparison: Choosing the Right Model for Your Use Case

What We Learned Running Weekly Game Days on Our CI/CD Pipeline

More from AI

MLOps — Model Registry vs MLflow Tracking, And When You Need Both

Agentic Ops — When (and When Not) to Use AI Agents for Incident Response

LLM Evals That Actually Predict Production Quality

About Kiril Urbonas

You might have missed

GitOps with Argo CD: Best Practices for 2025

Linux Performance Tuning for Containers and Kubernetes Nodes

Terraform Cloud Cost Controls: Budgets, Policies, and Tagging

Incident: Cached Wrong Answers #

How We Fixed It #

Incident: Embeddings Going Silent #

Changes We Made #

Checklist for RAG in Production #