A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.

Real-World RAG Incidents: Lessons from a Production Rollout

When we first rolled out a RAG-based assistant for our internal SRE team, nothing in the vendor docs really prepared us for the messy parts.

Incident: Cached Wrong Answers

The first painful incident happened on a Monday morning. A runbook query returned an outdated PostgreSQL failover procedure because:

We cached answers aggressively to save tokens.
The underlying runbook in Git had been updated over the weekend.
Our invalidation logic only watched the vector store, not the source repo.

How We Fixed It

We changed our cache key to include the document commit hash.
We added a background job that compares each runbook's latest Git commit against the commit recorded with its stored vectors.
We updated the runbook template so that each answer shows the runbook's last-updated date.
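
A minimal sketch of the first two fixes, assuming each retrieved document is identified by its repo path and the Git commit hash it was indexed at. The function names, the path-to-commit mapping, and the SHA-256 key format are illustrative assumptions rather than the production implementation.

```python
import hashlib


def cache_key(query: str, doc_commits: dict[str, str]) -> str:
    """Build a cache key from the query plus the commit hash of every source document.

    `doc_commits` maps a document path to the commit hash it was indexed at
    (a hypothetical shape chosen for this sketch).
    """
    # Normalize case and whitespace so trivially different queries share a key.
    normalized = " ".join(query.lower().split())
    # Sort so the key does not depend on retrieval order.
    parts = [normalized] + [
        f"{path}@{commit}" for path, commit in sorted(doc_commits.items())
    ]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def stale_documents(
    repo_commits: dict[str, str], indexed_commits: dict[str, str]
) -> list[str]:
    """Return paths whose latest commit in Git differs from the indexed commit.

    Documents that exist in the repo but were never indexed are also reported.
    """
    return sorted(
        path
        for path, commit in repo_commits.items()
        if indexed_commits.get(path) != commit
    )
```

Because the key depends on the retrieved documents, it can only be computed after retrieval, so a cache built this way saves generation tokens rather than retrieval work. When a runbook changes, its commit hash changes, the key changes, and the old answer is simply never looked up again.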

Incident: Embeddings Going Silent

Two weeks later, we saw a spike in “no relevant context found” errors during incident calls. The vector DB was healthy; the problem turned out to be:

We had added a new data source with HTML-heavy content.
We were chunking purely by character count.
As a result, the relevant text was split across three different chunks.
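
To make the failure mode concrete, here is a small sketch of character-count chunking. The HTML snippet and the 30-character chunk size are made up for illustration; real chunk sizes are far larger, but the effect on markup-heavy text is the same.

```python
def chunk_by_chars(text: str, size: int) -> list[str]:
    """Naive fixed-size chunking: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical HTML-heavy snippet: markup pads the text, and the cut points
# land wherever the character count says, not where the sentence ends.
sentence = "Promote the replica, then repoint the service."
html = f"<div class='step'><p>{sentence}</p></div>"

chunks = chunk_by_chars(html, 30)
for chunk in chunks:
    print(repr(chunk))
```

No single chunk contains the whole instruction, so none of them embeds close to a query about failover, even though the vector DB itself is working correctly.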

Changes We Made

Switched to semantic and heading-based chunking with overlap.
Added metrics for “chunks per query” and “distance of top-1 match”.
Logged a sample of low-quality retrievals for manual review.
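
The heading-based part of the new chunking can be sketched as follows. This is a simplified illustration, not the exact pipeline: it assumes the HTML has already been converted to Markdown-style text with `#` headings, it uses paragraph-level overlap, and it leaves out the semantic (embedding-based) splitting entirely. The function names and default limits are hypothetical.

```python
import re

# A heading is a line starting with 1-6 '#' characters followed by text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def _paragraphs(lines: list[str]) -> list[str]:
    """Split a block of lines into non-empty paragraphs on blank lines."""
    body = "\n".join(lines)
    return [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]


def split_sections(text: str) -> list[tuple[str, list[str]]]:
    """Group paragraphs under the most recent heading line."""
    sections, heading, lines = [], "", []
    for line in text.splitlines():
        if HEADING.match(line):
            sections.append((heading, _paragraphs(lines)))
            heading, lines = line.strip(), []
        else:
            lines.append(line)
    sections.append((heading, _paragraphs(lines)))
    # Drop sections that have no body text.
    return [(h, ps) for h, ps in sections if ps]


def chunk_document(text: str, max_chars: int = 800, overlap: int = 1) -> list[str]:
    """Pack paragraphs into chunks that never cross a heading boundary.

    Every chunk is prefixed with its section heading, and consecutive chunks
    in the same section share the last `overlap` paragraphs.
    """
    chunks = []
    for heading, paragraphs in split_sections(text):
        current: list[str] = []
        for para in paragraphs:
            size = sum(len(p) for p in current)
            if current and size + len(para) > max_chars:
                chunks.append("\n\n".join([heading] + current).strip())
                # Carry the tail of the previous chunk forward as overlap.
                current = current[-overlap:] if overlap > 0 else []
            current.append(para)
        chunks.append("\n\n".join([heading] + current).strip())
    return chunks
```

Repeating the heading in every chunk keeps the section's topic attached to each piece of text, and the overlap means a procedure that straddles a chunk boundary still appears intact in at least one chunk, as long as it fits within the overlapped paragraphs.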

Checklist for RAG in Production

Track cache hit rate, LLM error rate, and no-context rate.
Store the retrieved chunk IDs alongside each answer.
Regularly sample answers and review them with the owning team.
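
One way to cover the first two checklist items is to log a small record per answer and derive the rates from those records. The sketch below keeps everything in memory purely for illustration; the class names and fields are assumptions, and a real deployment would more likely send these values to whatever metrics and logging stack is already in place.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AnswerRecord:
    """What was asked, what was answered, and which chunks backed the answer."""
    query: str
    answer: str
    chunk_ids: list[str]
    top1_distance: Optional[float] = None  # distance of the best match, if any
    cache_hit: bool = False
    llm_error: bool = False


@dataclass
class RagStats:
    """In-memory collector for the checklist rates."""
    records: list[AnswerRecord] = field(default_factory=list)

    def log(self, record: AnswerRecord) -> None:
        self.records.append(record)

    def _rate(self, predicate: Callable[[AnswerRecord], bool]) -> float:
        if not self.records:
            return 0.0
        return sum(1 for r in self.records if predicate(r)) / len(self.records)

    def cache_hit_rate(self) -> float:
        return self._rate(lambda r: r.cache_hit)

    def llm_error_rate(self) -> float:
        return self._rate(lambda r: r.llm_error)

    def no_context_rate(self) -> float:
        # An answer with no retrieved chunks counts as "no context".
        return self._rate(lambda r: not r.chunk_ids)
```

Storing `chunk_ids` with each answer is what makes the third checklist item practical: when a reviewer flags a bad answer, the owning team can see exactly which chunks it was built from.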

The marketing pages sold RAG as magic. In reality it behaves more like a database: if you don’t design for drift, invalidation, and observability, it will betray you at the worst moment.

About Kiril urbonas
