A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.
When we first rolled out a RAG-based assistant for our internal SRE team, nothing in the vendor docs really prepared us for the messy parts.
The first painful incident happened on a Monday morning. A runbook query returned an outdated PostgreSQL failover procedure because:
Two weeks later, we saw a spike in “no relevant context found” errors during incident calls. The vector DB was healthy; the problem turned out to be:
The marketing pages sold RAG as magic. In reality it behaves more like a database: if you don’t design for drift, invalidation, and observability, it will betray you at the worst moment.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.
A practical GitHub Actions monorepo CI guide built around a real scaling problem: long queues, noisy failures, and developers waiting 40 minutes for feedback.
Explore more articles in this category
Tracking experiments and shipping models are different problems. The MLOps tooling assumes one solution; production splits them. The patterns we use.
AI agents for incident triage sound great in demos. We've tried it in production. The patterns that earn their keep, the ones that backfire, and where humans still beat agents.
Most LLM eval suites correlate poorly with what real users experience. The eval patterns we run that move with prod metrics — and the ones that lied to us.