Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.
Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.
A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.
How AI agents are moving from read-only copilots to autonomous automation with guardrails. Best practices for approval gates and rollback.
I spent 3 weeks chasing an answer-quality regression that turned out to be a tokenizer mismatch in a library upgrade. Here's what I learned about evaluating RAG.
We changed a system prompt for what we thought was a tone improvement and broke a customer-critical extraction overnight. The version control and regression tests we built next.
A DR runbook nobody reads is worse than no runbook. The shape that finally got ours executed correctly under pressure.
We replaced 47 percentile threshold alerts with 3 SLO burn-rate alerts. The on-call rotation gets paged less and catches more.
We mapped every byte that ends up in our production containers. The map showed three places trust was implicit. Each became a control.
We expanded from one Kubernetes cluster to four across two regions. The traffic-routing layer was the hardest piece. Here's what we tried, what worked, and what we'd do again.
We had Datadog for app metrics, Loki for logs, and zero useful insight into what our LLM service was actually doing. Here's the observability stack we built specifically for model serving.
Platform teams own the systems that EVERY service depends on. Our incident response playbook for when the foundation cracks.
We had three months of slow drift between our Terraform code and AWS reality. Here's the daily-cron + Slack workflow that closed the gap.