We had three months of slow drift between our Terraform code and AWS reality. Here's the daily-cron + Slack workflow that closed the gap.
About a year ago we found a security group rule in production that wasn't in our Terraform code. It had been added through the AWS console during a 2 AM incident six weeks earlier. The engineer fixing the incident did the right thing — fix first, file the followup. The followup never happened. The rule stayed.
That triggered a project to figure out, programmatically, when our actual infrastructure differs from our declared infrastructure. This post is the workflow we ended up with. It's not glamorous. It catches roughly one drift per week and we close most within a day.
The workflow is a cron job that runs terraform plan against every environment every morning, in read-only mode, and reports the diff. If the plan would make any change at all, we treat that as drift — either the code has unmerged changes, or someone changed reality.
That's it. That's the entire thing. The interesting parts are the operational details that make it actually work in practice.
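The decision rule above is small enough to write down. The following is a minimal sketch, not the code we run: the names `PlanStatus` and `classify_plan_exit` are made up for illustration, and it relies only on the documented exit codes of `terraform plan -detailed-exitcode` (0 = no changes, 1 = error, 2 = changes present).

```python
from enum import Enum


class PlanStatus(Enum):
    CLEAN = "clean"  # exit 0: plan succeeded and found nothing to change
    ERROR = "error"  # exit 1: the plan itself failed
    DRIFT = "drift"  # exit 2: plan succeeded and would change something


def classify_plan_exit(exit_code: int) -> PlanStatus:
    """Map a `terraform plan -detailed-exitcode` exit code to a status."""
    if exit_code == 0:
        return PlanStatus.CLEAN
    if exit_code == 2:
        return PlanStatus.DRIFT
    # Anything else is a failed run, which is not the same as "no drift".
    return PlanStatus.ERROR
```

Keeping "error" separate from "clean" matters: a plan that fails to run tells you nothing about whether drift exists.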
Our environments are split across three AWS accounts (dev, staging, prod). Each has its own Terraform configuration in a directory. The cron is a GitHub Actions workflow that runs once a day:
name: drift-detection
on:
  schedule:
    - cron: '0 6 * * *' # 06:00 UTC daily
  workflow_dispatch:
jobs:
  drift:
    strategy:
      fail-fast: false
      matrix:
        env: [dev, staging, prod]
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@<sha>
      - uses: aws-actions/configure-aws-credentials@<sha>
        with:
          role-to-assume: arn:aws:iam::${{ vars.ACCT_ID }}:role/drift-detector
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@<sha>
        with:
          # Disable the wrapper script so the raw exit code reaches the shell.
          terraform_wrapper: false
      - name: terraform init
        run: terraform -chdir=environments/${{ matrix.env }} init -input=false
      - name: terraform plan
        id: plan
        shell: bash
        run: |
          # exit 0 = no changes, 1 = error, 2 = changes
          set +e
          terraform -chdir=environments/${{ matrix.env }} plan \
            -input=false -lock=false -detailed-exitcode -out=tfplan 2>&1 | tee plan.txt
          # PIPESTATUS[0] is terraform's exit code; $? would be tee's.
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
      - name: parse plan
        if: steps.plan.outputs.exit_code == '2'
        run: |
          terraform -chdir=environments/${{ matrix.env }} show -json tfplan \
            | jq -r '.resource_changes[] | select(.change.actions != ["no-op"]) | "\(.change.actions[0]) \(.address)"' \
            > drift-summary.txt
      - name: post to slack
        if: steps.plan.outputs.exit_code == '2'
        uses: slackapi/slack-github-action@<sha>
        env:
          # Assumes the v1-style inputs used below; the secret name is an assumption.
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
        with:
          channel-id: 'C012345678'
          payload: |
            {
              "text": "Drift detected in ${{ matrix.env }}",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Drift detected in ${{ matrix.env }}*\nLink: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
                  }
                }
              ]
            }
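For readers who would rather not maintain a jq one-liner, the filter in the parse step can be sketched in Python. This is an illustrative equivalent, not what the workflow runs; `summarize_drift` is a hypothetical name, and the input is the JSON printed by `terraform show -json tfplan`. Unlike the jq filter, which prints only the first action, this version joins all actions so a replacement shows up as `delete+create`.

```python
import json


def summarize_drift(plan_json: str) -> list[str]:
    """Return one "<actions> <address>" line per resource the plan would change."""
    plan = json.loads(plan_json)
    lines = []
    # `resource_changes` can be absent, e.g. when only outputs changed.
    for rc in plan.get("resource_changes") or []:
        actions = rc["change"]["actions"]
        if actions == ["no-op"]:
            continue
        # A replacement is reported as two actions, e.g. ["delete", "create"].
        lines.append(f"{'+'.join(actions)} {rc['address']}")
    return lines


if __name__ == "__main__":
    import sys

    # Usage: terraform show -json tfplan | python summarize_drift.py
    print("\n".join(summarize_drift(sys.stdin.read())))
```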
The role drift-detector has read-only IAM permissions. It cannot apply, only plan. The state lock is bypassed (-lock=false) because read-only plans don't need it.
The drift-detector IAM role gets ReadOnlyAccess plus a few specific read-only actions that Terraform's plan needs (e.g., iam:GetRole, kms:DescribeKey). It cannot create, modify, or delete anything.
This matters because the cron runs unattended at 6 AM. If the role had write permissions and someone compromised the GitHub token, an attacker could terraform apply against production. Read-only permissions cap the damage to "they can read your infra config."
We also use OIDC federation (role-to-assume) so there are no static AWS credentials in GitHub secrets — the role is assumed via the GitHub Actions OIDC provider with a tight trust policy.
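To make "tight trust policy" concrete, here is a sketch of what such a policy might look like, built as a Python dict so it can be printed as JSON. It is not our actual policy: the account ID, the repository name `example-org/infra`, and the choice to restrict to the `main` branch are all placeholder assumptions. The claim names (`aud`, `sub`) and the `sts:AssumeRoleWithWebIdentity` action are the standard ones for the GitHub Actions OIDC provider.

```python
import json


def github_oidc_trust_policy(account_id: str, repo: str, branch: str = "main") -> dict:
    """Build an IAM trust policy that only one repo and branch can assume."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    f"{provider}:aud": "sts.amazonaws.com",
                    # Scheduled runs execute on the default branch, so pinning
                    # the subject to one branch ref is usually enough.
                    f"{provider}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                }
            },
        }],
    }


if __name__ == "__main__":
    # Placeholder account ID and repo name, for illustration only.
    print(json.dumps(github_oidc_trust_policy("123456789012", "example-org/infra"), indent=2))
```

Using `StringEquals` on the full `sub` claim, rather than a wildcard match on the repo, means a workflow on a feature branch or a fork cannot assume the role.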
The Slack message is brief on purpose. It says "drift detected in $env, link to job." It doesn't try to summarize the diff in chat — the diff is often long and full of false-positive-looking lines that need context.
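The link goes to the GitHub Actions run, which has the full plan output as an artifact. Whoever is on rotation that day clicks the link, reads the plan, and decides how to resolve it, typically by updating the code to match reality or by reverting reality to match the code.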
The triage takes about 10-15 minutes per drift event in practice. We get one drift event per week, on average.
After a year of running this, the patterns:
Manual tweaks during incidents. The single most common source. Engineer fixes a thing in the AWS console at 2 AM, doesn't get back to it. Caught the next morning.
Auto-generated tags. AWS adds tags via service integrations sometimes (e.g., aws:cloudformation:stack-name on resources created indirectly). These show up as drift. We add lifecycle.ignore_changes entries for these specific tag keys on affected resources, never for tags as a whole.
Drift from another tool. Crossplane or a Kubernetes operator creating resources that Terraform also manages. We've cleaned up the boundaries — Terraform owns network/IAM, Crossplane owns app-shaped things — so this is rarer now.
Default value changes in providers. The AWS provider occasionally changes defaults; resources don't change, but the plan reads them differently. Pin provider versions in versions.tf and bump deliberately.
Tags added by external services. Datadog, Vanta, AWS Backup all add their own tags. We scope ignore_changes at the module level for these.
It's tempting to silence false-positive drift with lifecycle.ignore_changes = [...]. We do this — but carefully. Two rules:
1. Never ignore tags wholesale. Ignoring tags entirely means a malicious actor adding a tag wouldn't show up in drift.
2. Every ignore_changes block has an inline comment explaining why. PRs that add ignore_changes without a comment get rejected.
Example:
resource "aws_s3_bucket" "logs" {
  bucket = "company-logs-prod"
  # Datadog adds a `dd_managed=true` tag via its AWS integration.
  # The tag is informational and we don't manage it from Terraform.
  lifecycle {
    ignore_changes = [tags["dd_managed"]]
  }
}
The comment is what makes this maintainable. Six months later someone wonders why this is here; the comment answers.
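Both rules could in principle be checked mechanically before a human reviews the PR. The sketch below is a hypothetical line-based lint, not something described above as part of our process: `lint_ignore_changes` is an invented name, it uses regular expressions rather than a real HCL parser, and it only handles `ignore_changes` lists written on a single line.

```python
import re

IGNORE_RE = re.compile(r"^\s*ignore_changes\s*=\s*(.*)$")
COMMENT_RE = re.compile(r"^\s*(#|//)")
# Matches a bare `tags` or `tags_all` element, but not `tags["key"]`.
WHOLESALE_TAGS_RE = re.compile(r"[\[,]\s*tags(_all)?\s*(?=[,\]])")


def lint_ignore_changes(hcl: str) -> list[str]:
    """Flag wholesale tag ignores and ignore_changes lines with no comment."""
    problems = []
    lines = hcl.splitlines()
    for i, line in enumerate(lines):
        match = IGNORE_RE.match(line)
        if not match:
            continue
        value = match.group(1)
        if WHOLESALE_TAGS_RE.search(value):
            problems.append(f"line {i + 1}: ignores tags wholesale")
        # Accept a trailing comment on the same line...
        has_comment = "#" in value or "//" in value
        # ...or a comment directly above, skipping blanks and `lifecycle {`.
        j = i - 1
        while not has_comment and j >= 0:
            prev = lines[j].strip()
            if prev == "" or prev.startswith("lifecycle"):
                j -= 1
                continue
            has_comment = bool(COMMENT_RE.match(lines[j]))
            break
        if not has_comment:
            problems.append(f"line {i + 1}: ignore_changes without an explanatory comment")
    return problems
```

A check like this cannot judge whether the comment is a good explanation, so it would complement review rather than replace it.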
Drift in resources Terraform doesn't manage. If we never imported it into state, it could diverge wildly without ever showing up. We mitigate this with a separate "unmanaged-resource scanner" that lists S3 buckets, IAM roles, and security groups in each account and compares to what Terraform knows about. Anything in AWS but not in state gets flagged.
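The comparison at the heart of such a scanner is a set difference. The sketch below shows it for S3 buckets only and is an illustration rather than our scanner: the function names are hypothetical, the state is assumed to be the JSON from `terraform show -json` run against the state (not a plan file), and the set of actual bucket names is assumed to come from somewhere else, for example an S3 `ListBuckets` call.

```python
def managed_bucket_names(state: dict) -> set[str]:
    """Collect bucket names of every managed aws_s3_bucket in a state JSON."""
    names: set[str] = set()

    def walk(module: dict) -> None:
        for res in module.get("resources", []):
            # Skip data sources; only managed resources count as "owned".
            if res.get("type") == "aws_s3_bucket" and res.get("mode", "managed") == "managed":
                bucket = res.get("values", {}).get("bucket")
                if bucket:
                    names.add(bucket)
        # Resources declared inside modules live under child_modules.
        for child in module.get("child_modules", []):
            walk(child)

    walk(state.get("values", {}).get("root_module", {}))
    return names


def unmanaged_buckets(actual: set[str], state: dict) -> list[str]:
    """Buckets that exist in the account but not in Terraform state."""
    return sorted(actual - managed_bucket_names(state))
```

The same shape would apply to IAM roles and security groups, with a different resource type and identifying attribute for each.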
Drift in runtime configuration (e.g., a Lambda environment variable changed via the console). Terraform sees this as drift, but the alert is delayed by up to 24 hours. For high-sensitivity resources we have a CloudTrail-based real-time alert that fires within minutes for any console-driven change. Slower drift detection is fine for less-sensitive things.
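One way such a real-time check might decide whether a CloudTrail record deserves an alert is sketched below. It is an assumption-laden illustration, not a description of our alerting: the function names and the `SENSITIVE_SOURCES` list are hypothetical, and it relies on two CloudTrail record fields, `readOnly` and `sessionCredentialFromConsole` (the latter is only present when the call came from a console session).

```python
# Hypothetical list of services considered high-sensitivity.
SENSITIVE_SOURCES = {"ec2.amazonaws.com", "iam.amazonaws.com", "lambda.amazonaws.com"}


def is_console_write(event: dict) -> bool:
    """True for a mutating API call made from an AWS Management Console session."""
    # CloudTrail marks Describe/Get/List-style calls as read-only.
    if event.get("readOnly", False):
        return False
    # The field is omitted unless the value is "true".
    return str(event.get("sessionCredentialFromConsole", "")).lower() == "true"


def should_alert(event: dict) -> bool:
    """Alert only on console-driven writes to a sensitive service."""
    return event.get("eventSource") in SENSITIVE_SOURCES and is_console_write(event)
```

Filtering on the console-session field keeps legitimate writes from CI or from Terraform itself out of the alert stream.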
Drift in services we can't automate. We have a handful of legacy Route53 records that were created before our Terraform setup; they're documented but not in code. We accept this for now.
About a week of focused work for the first environment, then a couple of days each for the others. The hardest part wasn't the cron itself — it was getting our Terraform state clean enough that plan returned 0 changes when reality was actually unchanged.
When we first ran it against prod, the plan showed 47 "changes" — almost all false positives from drift the team had silently lived with. We worked through those over a sprint, either updating code to match reality or reverting reality to match code. Once we were at 0 baseline drift, the cron became useful.
If we'd skipped that cleanup phase, the daily Slack message would have been so noisy it'd have been ignored.
There are commercial drift detection tools (env0, Spacelift, Terraform Cloud's drift detection feature). They're fine; some are very good. We picked the homemade GitHub Actions version because:
If we expand to multi-cloud or want fancier triage workflows, we'd revisit. For a single-cloud Terraform shop, the homemade version is hard to beat for cost.
Get to baseline-zero drift first, before you turn on the alerts. Otherwise the alerts are noise, the team learns to ignore them, and a real drift event slips through.
Daily is the right cadence. Hourly is too noisy. Weekly is too slow — by the time someone looks at a week-old drift, the engineer who caused it has lost the context.
Send the alert to a single dedicated Slack channel watched by the rotation, not to a general engineering channel. Drift alerts are operational, not informational.
The most underrated part of this entire workflow is the read-only IAM role. Spend the hour to set up OIDC federation and a properly-scoped role. The cron is going to run forever; the worst thing it can do should be "report drift," not "apply unauthorized changes."