Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.
How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.
A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.
I spent 3 weeks chasing an answer-quality regression that turned out to be a tokenizer mismatch in a library upgrade. Here's what I learned about evaluating RAG.
We changed a system prompt for what we thought was a tone improvement and broke a customer-critical extraction overnight. The version control and regression tests we built next.
A DR runbook nobody reads is worse than no runbook. The shape that finally got ours executed correctly under pressure.
We replaced 47 percentile threshold alerts with 3 SLO burn-rate alerts. The on-call rotation gets paged less and catches more.
We mapped every byte that ends up in our production containers. The map showed three places trust was implicit. Each became a control.