State drift is silent until a deploy fails or an outage reveals it. The scheduled plan-and-diff pipeline that surfaces console hotfixes and manual edits while they're still cheap to reconcile.
Terraform's promise is that your state file describes reality. Reality disagrees the moment someone fixes something in the console at 2am during an incident, or a different tool touches the same resource, or a provider silently mutates a default. That gap — drift — stays invisible until your next apply either reverts the emergency fix or fails in a confusing way. We built drift detection into CI so drift surfaces in hours, not at the worst possible moment.
terraform plan has a detailed exit code mode that's purpose-built for this:
terraform plan -detailed-exitcode -refresh-only
# exit 0 = no changes (state matches reality)
# exit 1 = error
# exit 2 = drift detected (state differs from reality)
-refresh-only is the important flag: it compares state against the real infrastructure without proposing to change anything from your config. It answers exactly one question — "has reality drifted from what state records?" — which is what drift detection is, distinct from "does my config differ from state" (a normal pending change).
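As a rough illustration of how a script might consume those three exit codes, the sketch below maps them to labels. The function name `classify_plan_exit` and the label strings are hypothetical, not part of Terraform.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate the exit status of
# `terraform plan -detailed-exitcode` into a label a script can branch on.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;   # state matches reality
    2) echo "drift" ;;   # state differs from reality
    *) echo "error" ;;   # exit 1, or anything unexpected
  esac
}

# Usage (assumes terraform is installed and initialised in this directory):
#   terraform plan -detailed-exitcode -refresh-only -no-color > plan.txt 2>&1
#   status=$(classify_plan_exit "$?")
```

Treating every unrecognised code as an error is a deliberate choice here: a failed check should never be mistaken for a clean one.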
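# Scheduled drift check
on:
  schedule:
    - cron: "0 */6 * * *" # every 6 hours
jobs:
  drift:
    runs-on: ubuntu-latest # assumed runner; Terraform and cloud credentials must be available to the job
    steps:
      - uses: actions/checkout@v4
      - run: terraform init -input=false
      - name: Detect drift
        id: plan
        run: |
          # don't abort on a non-zero exit; exit 2 is the signal we want to capture
          set +e
          terraform plan -detailed-exitcode -refresh-only -no-color > plan.txt 2>&1
          echo "exit=$?" >> "$GITHUB_OUTPUT"
      - name: Alert on drift
        if: steps.plan.outputs.exit == '2'
        run: ./notify-drift.sh plan.txt
      - name: Fail on plan error
        # without this, exit 1 (a broken check) would pass silently
        if: steps.plan.outputs.exit == '1'
        run: |
          cat plan.txt
          exit 1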
The whole point is to decouple detection from deployment. If you only learn about drift when someone runs apply, you learn about it at the moment it's most disruptive — mid-deploy, with a reviewer staring at a plan full of surprising reverts. A scheduled check (every few hours) catches drift while the person who caused it still remembers doing it.
A raw plan diff is unreadable in an alert. Extract the resource addresses that drifted and who/what likely touched them:
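# pull just the changed resource addresses from the plan
# (-refresh-only output says "has changed" / "has been deleted"; a normal plan says "will be" / "must be")
grep -E '^[[:space:]]*# .* (has changed|has been deleted|will be|must be)' plan.txt \
  | sed -E 's/^[[:space:]]*# //; s/ (has changed|has been deleted|will be|must be).*//'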
Pair this with CloudTrail / audit-log lookups for those resources to attribute the change. "RDS parameter group prod-pg was modified by console:alice@ at 02:14" is an actionable alert. "Plan shows 47 changes" is noise people learn to ignore.
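One way to turn the drifted addresses into a short, readable alert body is sketched below. The function names, the message wording, and the sample input are illustrative assumptions, and the parsing assumes the "has changed" / "has been deleted" wording that a refresh-only plan prints. The CloudTrail helper uses `aws cloudtrail lookup-events` and assumes the AWS CLI is configured and that the resource's name is what CloudTrail recorded as `ResourceName`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read plan text on stdin, print a compact alert body.
summarize_drift() {
  local addrs count
  # grep exits non-zero when nothing matches; "|| true" keeps strict shells happy
  addrs=$(grep -E '^[[:space:]]*# .* (has changed|has been deleted)' \
    | sed -E 's/^[[:space:]]*# //; s/ (has changed|has been deleted).*//' || true)
  if [ -z "$addrs" ]; then
    echo "No drifted resources found in plan output"
    return 0
  fi
  count=$(printf '%s\n' "$addrs" | wc -l | tr -d ' ')
  echo "Drift detected in ${count} resource(s):"
  printf '%s\n' "$addrs" | sed 's/^/  - /'
}

# Hypothetical helper: list recent CloudTrail events for one resource name.
# Never called automatically; needs AWS credentials.
who_touched() {
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=ResourceName,AttributeValue=$1" \
    --max-results 5 \
    --query 'Events[].[EventTime,Username,EventName]' \
    --output text
}
```

Typical use would be `summarize_drift < plan.txt`, followed by `who_touched <resource-name>` for each entry worth attributing.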
Detecting drift is half the job. The team needs a decided answer for each drift:
terraform import it under management.
The anti-pattern is detecting drift and doing nothing, because then your next real deploy carries a pile of unrelated reverts and the reviewer can't tell intended changes from accidental ones. Drift should be reconciled to zero between deploys, so every apply plan contains only what that deploy intends.
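The "reconciled to zero between deploys" rule could be enforced with a small pre-deploy gate, sketched here. The function name `drift_gate`, its messages, and the option to pass a substitute command as arguments are assumptions, made so the logic can be exercised without real infrastructure.

```shell
#!/usr/bin/env bash
# Hypothetical pre-deploy gate: run a drift check and refuse to continue
# unless it reports a clean state. With no arguments it runs the
# refresh-only plan; arguments replace that command (handy for testing).
drift_gate() {
  local rc=0
  if [ "$#" -eq 0 ]; then
    set -- terraform plan -detailed-exitcode -refresh-only -no-color
  fi
  # Output is discarded here; redirect to a file instead to keep the diff.
  "$@" > /dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "no drift: safe to deploy"; return 0 ;;
    2) echo "drift detected: reconcile before deploying"; return 1 ;;
    *) echo "drift check failed (exit $rc)"; return 2 ;;
  esac
}

# Usage in a deploy script (assumes an initialised working directory):
#   drift_gate && terraform apply
```

Returning distinct codes for "drift" and "check failed" lets the caller page someone for a broken check without treating it as drift.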
Detection is reactive; these reduce how often drift happens at all:
prevent_destroy and ignore_changes on resources legitimately mutated outside Terraform (e.g. autoscaling-managed desired counts), so they don't register as drift forever.
Drift detection isn't about preventing manual changes; sometimes the 2am console fix is exactly right. It's about making sure those changes get noticed and folded back into code before Terraform and reality diverge far enough to cause an outage.