State drift is silent until a deploy fails or an outage reveals it. The scheduled plan-and-diff pipeline that surfaces console hotfixes and manual edits while they're still cheap to reconcile.

On this page

Terraform Drift Detection in CI — Catching Out-of-Band Changes Before They Bite

Terraform's promise is that your state file describes reality. Reality disagrees the moment someone fixes something in the console at 2am during an incident, or a different tool touches the same resource, or a provider silently mutates a default. That gap — drift — stays invisible until your next apply either reverts the emergency fix or fails in a confusing way. We built drift detection into CI so drift surfaces in hours, not at the worst possible moment.

The mechanism: refresh-only plan, parse the exit code #

terraform plan has a detailed exit code mode that's purpose-built for this:

bash.bash

terraform plan -detailed-exitcode -refresh-only
# exit 0 = no changes (state matches reality)
# exit 1 = error
# exit 2 = drift detected (state differs from reality)

-refresh-only is the important flag: it compares state against the real infrastructure without proposing to change anything from your config. It answers exactly one question — "has reality drifted from what state records?" — which is what drift detection is, distinct from "does my config differ from state" (a normal pending change).

yaml.yaml

# Scheduled drift check
on:
  schedule:
    - cron: "0 */6 * * *"   # every 6 hours

jobs:
  drift:
    steps:
      - run: terraform init -input=false
      - name: Detect drift
        id: plan
        run: |
          set +e
          terraform plan -detailed-exitcode -refresh-only -no-color > plan.txt 2>&1
          echo "exit=$?" >> "$GITHUB_OUTPUT"
      - name: Alert on drift
        if: steps.plan.outputs.exit == '2'
        run: ./notify-drift.sh plan.txt

Schedule it; don't wait for a deploy #

The whole point is to decouple detection from deployment. If you only learn about drift when someone runs apply, you learn about it at the moment it's most disruptive — mid-deploy, with a reviewer staring at a plan full of surprising reverts. A scheduled check (every few hours) catches drift while the person who caused it still remembers doing it.

Make the alert actionable, not noisy #

A raw plan diff is unreadable in an alert. Extract the resource addresses that drifted and who/what likely touched them:

bash.bash

# pull just the changed resource addresses from the plan
grep -E '^\s+# .* will be' plan.txt | sed 's/# //; s/ will be.*//'

Pair this with CloudTrail / audit-log lookups for those resources to attribute the change. "RDS parameter group prod-pg was modified by console:alice@ at 02:14" is an actionable alert. "Plan shows 47 changes" is noise people learn to ignore.

Have a reconciliation policy, not just detection #

Detecting drift is half the job. The team needs a decided answer for each drift:

Adopt it: the out-of-band change was correct — update the Terraform config to match, so the next apply doesn't revert it. (Most common for legitimate hotfixes.)
Revert it: the change was unauthorized or wrong — let Terraform restore the declared state.
Import it: a resource was created outside Terraform — terraform import it under management.

The anti-pattern is detecting drift and doing nothing, because then your next real deploy carries a pile of unrelated reverts and the reviewer can't tell intended changes from accidental ones. Drift should be reconciled to zero between deploys, so every apply plan contains only what that deploy intends.

Guardrails that reduce drift at the source #

Detection is reactive; these reduce how often drift happens at all:

Restrict console write access in production. Read-only by default; break-glass roles for emergencies that are audited and reviewed.
prevent_destroy and ignore_changes on resources legitimately mutated outside Terraform (e.g. autoscaling-managed desired counts), so they don't register as drift forever.
One owner per resource. Drift is often two tools fighting over the same resource. Decide which system owns it.

What we got out of it #

Median time-to-detect for drift went from "next deploy, days later" to under 6 hours.
The "surprise revert" class of incident — a deploy quietly undoing an emergency fix — stopped, because hotfixes get adopted into config within a day.
Plans got readable again: because drift is reconciled continuously, deploy-time plans show only the intended change instead of an archaeology of accumulated console edits.

Drift detection isn't about preventing manual changes — sometimes the 2am console fix is exactly right. It's about making sure those changes get noticed and folded back into code before Terraform and reality diverge far enough to cause an outage.

Terraform Drift Detection in CI — Catching Out-of-Band Changes Before They Bite

Terraform Drift Detection in CI — Catching Out-of-Band Changes Before They Bite

The mechanism: refresh-only plan, parse the exit code #

Schedule it; don't wait for a deploy #

Make the alert actionable, not noisy #

Have a reconciliation policy, not just detection #

Guardrails that reduce drift at the source #

What we got out of it #

Stay Updated

RAG Retrieval Evaluation — Building an Offline Eval Harness Before You Ship

Kubernetes Pod Disruption Budgets — Surviving Node Drains Without an Outage

More from Infrastructure

Observability — Correlating Logs, Metrics, and Traces in Anger

Database Sharding — The Choices We Wish We'd Made Earlier

Postgres Logical Replication for Zero-Downtime Major Upgrades

Observability — Correlating Logs, Metrics, and Traces in Anger

Database Sharding — The Choices We Wish We'd Made Earlier

Postgres Logical Replication for Zero-Downtime Major Upgrades

pg_stat_statements — Postgres Query Analysis Without Guessing

CI Pipeline Caching That Actually Pays Off

Agentic Ops — When (and When Not) to Use AI Agents for Incident Response

About Kiril Urbonas

You might have missed

GitOps with Argo CD: Best Practices for 2025

Prompt Engineering Best Practices: Maximizing LLM Performance

Process Management and Monitoring in Linux

Terraform Drift Detection in CI — Catching Out-of-Band Changes Before They Bite

The mechanism: refresh-only plan, parse the exit code#

Schedule it; don't wait for a deploy#

Make the alert actionable, not noisy#

Have a reconciliation policy, not just detection#

Guardrails that reduce drift at the source#

What we got out of it#

Stay Updated

RAG Retrieval Evaluation — Building an Offline Eval Harness Before You Ship

Kubernetes Pod Disruption Budgets — Surviving Node Drains Without an Outage

More from Infrastructure

Observability — Correlating Logs, Metrics, and Traces in Anger

Database Sharding — The Choices We Wish We'd Made Earlier

Postgres Logical Replication for Zero-Downtime Major Upgrades

About Kiril Urbonas

You might have missed

GitOps with Argo CD: Best Practices for 2025

Prompt Engineering Best Practices: Maximizing LLM Performance

Process Management and Monitoring in Linux

The mechanism: refresh-only plan, parse the exit code #

Schedule it; don't wait for a deploy #

Make the alert actionable, not noisy #

Have a reconciliation policy, not just detection #

Guardrails that reduce drift at the source #

What we got out of it #