A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.

On this page

How We Stopped Terraform Drift from Surprising On-Call

Our worst incident of last year started with a simple question: “Why is there an EC2 instance we can't find in Terraform?”

The Incident #

An engineer had manually patched a production ASG during an outage.
Months later, we scaled out and Terraform attempted to “fix” the ASG back to its old shape.
That change reduced capacity during a peak event.

Fix 1: Guardrails in Terraform #

Enabled Terraform Cloud with VCS-driven runs only.
Turned off direct access to production AWS credentials except via break-glass.

Fix 2: Regular Drift Detection #

Added a nightly job:

```bash terraform plan -detailed-exitcode || echo "Drift detected" ```

Wired the exit code into Slack with a short summary.

Fix 3: Runbook and Culture #

Wrote a runbook for “emergency changes” that includes:
- ticket link,
- time-bounded console access,
- follow-up PR to codify the change in Terraform.

Drift still happens, but on-call no longer learns about it at the worst possible moment.

How We Stopped Terraform Drift from Surprising On-Call

How We Stopped Terraform Drift from Surprising On-Call

The Incident #

Fix 1: Guardrails in Terraform #

Fix 2: Regular Drift Detection #

Fix 3: Runbook and Culture #

Stay Updated

Best Practices: Kubernetes Cluster Upgrade Strategy

Real-World RAG Incidents: Lessons from a Production Rollout

More from Infrastructure

pg_stat_statements — Postgres Query Analysis Without Guessing

Terraform Module Versioning and Shared Registries

Postgres Query Plans — Reading Them and the Indexes We Wish We'd Added Sooner

pg_stat_statements — Postgres Query Analysis Without Guessing

Terraform Module Versioning and Shared Registries

Postgres Query Plans — Reading Them and the Indexes We Wish We'd Added Sooner

Database Backups — Testing Restores, Not Just Taking Them

Pipeline Observability — Why CI Failures Don't Trigger Alerts (And Should)

Burn-Rate Alerting — The SLO Discipline That Prevents Alert Fatigue

About Kiril Urbonas

You might have missed

GitOps with Argo CD: Best Practices for 2025

Prompt Engineering Best Practices: Maximizing LLM Performance

AI Agents in DevOps: From Copilots to Autonomous Automation in 2025

How We Stopped Terraform Drift from Surprising On-Call

The Incident#

Fix 1: Guardrails in Terraform#

Fix 2: Regular Drift Detection#

Fix 3: Runbook and Culture#

Stay Updated

Best Practices: Kubernetes Cluster Upgrade Strategy

Real-World RAG Incidents: Lessons from a Production Rollout

More from Infrastructure

pg_stat_statements — Postgres Query Analysis Without Guessing

Terraform Module Versioning and Shared Registries

Postgres Query Plans — Reading Them and the Indexes We Wish We'd Added Sooner

About Kiril Urbonas

You might have missed

GitOps with Argo CD: Best Practices for 2025

Prompt Engineering Best Practices: Maximizing LLM Performance

AI Agents in DevOps: From Copilots to Autonomous Automation in 2025

The Incident #

Fix 1: Guardrails in Terraform #

Fix 2: Regular Drift Detection #

Fix 3: Runbook and Culture #