Practical articles on AI, DevOps, Cloud, Linux, and infrastructure engineering.
Building visibility into cloud costs that actually drives action. The dashboards we look at, the alerts that fire, and the queries we run.
How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.
A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.
We track the four DORA metrics plus a handful of others. The trade-off between what's measurable and what's meaningful, and how we use the numbers.
Design for region failure. Active/passive and active/active, data replication, and failover testing.
We've run canary deploys on most services for two years. The mechanics are easy; the metrics that decide "promote or roll back" are where the real design work is.
We use blue-green for stateful services where canary doesn't fit. The actual mechanics, the data-layer subtleties, and when blue-green isn't the right answer.
We collect ~800 GB of logs per day across our fleet. The shape of our logging stack, what we keep, what we drop, and what we'd build differently.
A working Prometheus stack for a 40-node cluster: what we deploy, what we tune, and what we wish we'd known about cardinality two years ago.