Run retrieval-augmented generation at scale: chunking, caching, and observability.
RAG (retrieval-augmented generation) powers many LLM apps. Here’s how to run it reliably in production.
Best practice: track p95 latency, cache hit rate, and cost per query, and set alerts on each so you can spot regressions and iterate.
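As a rough illustration of those three metrics, here is a minimal in-memory sketch. The `RagMetrics` class, its method names, and the alert thresholds are all hypothetical and chosen for this example; a production setup would more likely export these values to whatever metrics backend you already run.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RagMetrics:
    """Hypothetical in-memory tracker for latency, cache hits, and cost."""
    latencies_ms: List[float] = field(default_factory=list)
    cache_hits: int = 0
    total_cost_usd: float = 0.0

    def record(self, latency_ms: float, cache_hit: bool, cost_usd: float) -> None:
        """Record one completed query."""
        self.latencies_ms.append(latency_ms)
        self.cache_hits += int(cache_hit)
        self.total_cost_usd += cost_usd

    @property
    def queries(self) -> int:
        return len(self.latencies_ms)

    def p95_latency_ms(self) -> float:
        # Nearest-rank percentile: the smallest sample with at least
        # 95% of all samples at or below it. Integer math avoids
        # floating-point rounding when computing the rank.
        ordered = sorted(self.latencies_ms)
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]

    def cache_hit_rate(self) -> float:
        return self.cache_hits / self.queries

    def cost_per_query_usd(self) -> float:
        return self.total_cost_usd / self.queries

    def alerts(self, max_p95_ms: float, min_hit_rate: float,
               max_cost_usd: float) -> List[str]:
        """Return a message for each threshold that is breached."""
        if self.queries == 0:
            return []  # nothing recorded yet, so nothing to alert on
        messages = []
        if self.p95_latency_ms() > max_p95_ms:
            messages.append(f"p95 latency {self.p95_latency_ms():.0f} ms > {max_p95_ms:.0f} ms")
        if self.cache_hit_rate() < min_hit_rate:
            messages.append(f"cache hit rate {self.cache_hit_rate():.2f} < {min_hit_rate:.2f}")
        if self.cost_per_query_usd() > max_cost_usd:
            messages.append(f"cost per query ${self.cost_per_query_usd():.4f} > ${max_cost_usd:.4f}")
        return messages
```

Call `record(...)` once per query in the request path, then evaluate `alerts(...)` on a schedule. The thresholds you pass in are assumptions to tune against your own traffic, not recommended values.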