A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.
Our worst incident of last year started with a simple question: “Why is there an EC2 instance we can't find in Terraform?”
```bash terraform plan -detailed-exitcode || echo "Drift detected" ```
Drift still happens, but on-call no longer learns about it at the worst possible moment.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.
Our CI was 73% green at the worst point. People trusted it less than coin flips. Six things we did to get to 96%, in rough order of impact.
Explore more articles in this category
We launched Backstage in October. Six months in, 80% of services are catalogued, on-boarding takes a third of the time, and we mostly know what owns what.
We ran Pulumi in TypeScript and Terraform in HCL side by side across 60+ services. Each won different categories of work. Here's the breakdown.
How we shipped three schema migrations with zero customer impact. Expand-then-contract, dual-writes, and the rollback plan we never had to use — but tested anyway.
Evergreen posts worth revisiting.