Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.
A different angle on DR: the planning process — RTO/RPO conversations, dependency mapping, and what we learned about prioritizing what to recover.
Defining monitoring as code: dashboards, alerts, and SLOs in Git. The patterns that survived the migration from clicked-together monitoring.
We cut our largest playbook's runtime from 14 minutes to 4 minutes. The specific changes that mattered, plus the ones that didn't.
A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.
We tried Pulumi for a quarter and went back to Terraform. Both are real options. Why we picked one and what would change our mind.
We test infrastructure code with three layers: validation, plan review, and integration tests. The setup that catches real bugs without slowing down PRs.
We have a private module registry with ~25 modules used across 12 accounts. Versioning, interface design, and the over-modularization mistake we keep making.