A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.
Our worst incident of last year started with a simple question: “Why is there an EC2 instance we can't find in Terraform?”
```bash terraform plan -detailed-exitcode || echo "Drift detected" ```
Drift still happens, but on-call no longer learns about it at the worst possible moment.
Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.
A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.
Explore more articles in this category
A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.
Write Ansible playbooks that are idempotent, readable, and maintainable for config management.
How to implement Backstage with real templates, scorecards, and golden paths so internal platform work reduces delivery friction.