We had three months of slow drift between our Terraform code and AWS reality. Here's the daily-cron + Slack workflow that closed the gap.

Practical Guide: Infrastructure Drift Detection Workflow

About a year ago we found a security group rule in production that wasn't in our Terraform code. It had been added through the AWS console during a 2 AM incident six weeks earlier. The engineer fixing the incident did the right thing — fix first, file the followup. The followup never happened. The rule stayed.

That triggered a project to figure out, programmatically, when our actual infrastructure differs from our declared infrastructure. This post is the workflow we ended up with. It's not glamorous. It catches roughly one drift per week and we close most within a day.

The basic shape

The workflow is a cron job that runs terraform plan against every environment every morning, in read-only mode, and reports the diff. If the plan would make any change at all, we treat that as drift — either the code has unmerged changes, or someone changed reality.

That's it. That's the entire thing. The interesting parts are the operational details that make it actually work in practice.

The cron job itself

Our environments are split across three AWS accounts (dev, staging, prod). Each has its own Terraform configuration in a directory under environments/. The cron is a GitHub Actions workflow that runs daily at 06:00 UTC:

yaml.yaml

name: drift-detection
on:
  schedule:
    - cron: '0 6 * * *'  # 06:00 UTC daily
  workflow_dispatch:

jobs:
  drift:
    strategy:
      fail-fast: false
      matrix:
        env: [dev, staging, prod]
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@<sha>
      - uses: aws-actions/configure-aws-credentials@<sha>
        with:
          role-to-assume: arn:aws:iam::${{ vars.ACCT_ID }}:role/drift-detector
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@<sha>
        with:
          # Skip the wrapper script so piped `show -json` output reaches jq unmodified.
          terraform_wrapper: false
      - name: terraform init
        run: terraform -chdir=environments/${{ matrix.env }} init -input=false
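      - name: terraform plan
        id: plan
        shell: bash
        run: |
          # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present.
          # Disable errexit so a non-zero plan does not abort before we record it.
          set +e
          terraform -chdir=environments/${{ matrix.env }} plan \
            -lock=false -input=false -detailed-exitcode -out=tfplan 2>&1 | tee plan.txt
          # PIPESTATUS[0] is terraform's exit code; a plain $? would be tee's.
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
      - name: parse plan
        if: steps.plan.outputs.exit_code == '2'
        run: |
          terraform -chdir=environments/${{ matrix.env }} show -json tfplan \
            | jq -r '.resource_changes[]? | select(.change.actions != ["no-op"]) | "\(.change.actions | join(",")) \(.address)"' \
            > drift-summary.txt
      - name: upload plan output
        if: steps.plan.outputs.exit_code != '0'
        uses: actions/upload-artifact@<sha>
        with:
          name: drift-${{ matrix.env }}
          path: |
            plan.txt
            drift-summary.txt
      - name: post to slack
        if: steps.plan.outputs.exit_code == '2'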
        uses: slackapi/slack-github-action@<sha>
        with:
          channel-id: 'C012345678'
          payload: |
            {
              "text": "Drift detected in ${{ matrix.env }}",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Drift detected in ${{ matrix.env }}*\nLink: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
                  }
                }
              ]
            }
        env:
          # Assumed secret name; the bot token must be supplied for channel-id posting.
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}

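The role drift-detector has read-only IAM permissions. It cannot apply, only plan. The state lock is skipped (-lock=false) because a plan that is never applied does not need to hold it. The trade-off is that a run overlapping a real apply can report a transient diff.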

The role permissions matter

The drift-detector IAM role gets ReadOnlyAccess plus a few specific read-only actions that Terraform's plan needs (e.g., iam:GetRole, kms:DescribeKey). It cannot create, modify, or delete anything.

This matters because the cron runs unattended at 6 AM. If the role had write permissions and someone compromised the GitHub token, an attacker could terraform apply against production. Read-only permissions cap the damage to "they can read your infra config."

We also use OIDC federation (role-to-assume) so there are no static AWS credentials in GitHub secrets — the role is assumed via the GitHub Actions OIDC provider with a tight trust policy.
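
For illustration, a role and trust policy along those lines might look like the sketch below. The repository name example-org/infra and the main branch are placeholders, not our real values, and the sketch assumes the GitHub OIDC provider is already registered in the account.

```hcl
# Hypothetical names throughout: "example-org/infra" and the main branch
# are placeholders for illustration only.

# Look up the GitHub Actions OIDC provider already registered in the account.
data "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"
}

# Trust policy: only workflows from one repository and one ref may assume the role.
data "aws_iam_policy_document" "drift_detector_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [data.aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/infra:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "drift_detector" {
  name               = "drift-detector"
  assume_role_policy = data.aws_iam_policy_document.drift_detector_trust.json
}

# AWS-managed read-only policy; no write actions are attached.
resource "aws_iam_role_policy_attachment" "read_only" {
  role       = aws_iam_role.drift_detector.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}
```

Pinning sub to a single repository and ref is what keeps the trust policy tight. A wildcard such as repo:example-org/* would let any repository in the organization assume the role. Note that a workflow_dispatch run started from another branch carries a different sub and would be refused under this policy.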

What the Slack post looks like

The Slack message is brief on purpose. It says "drift detected in $env, link to job." It doesn't try to summarize the diff in chat — the diff is often long and full of false-positive-looking lines that need context.

The link goes to the GitHub Actions run, which has the full plan output as an artifact. Whoever is on rotation that day clicks the link, reads the plan, and either:

Files an issue ("here's the drift, owner is team X")
Opens a PR to update Terraform to match reality (if the change should stay)
Reverts the drift via terraform apply (if the change shouldn't stay)

The triage takes about 10-15 minutes per drift event in practice. We get one drift event per week, on average.

Common drift sources we see

After a year of running this, the patterns:

Manual tweaks during incidents. The single most common source. Engineer fixes a thing in the AWS console at 2 AM, doesn't get back to it. Caught the next morning.

Auto-generated tags. AWS sometimes adds tags via service integrations (e.g., aws:cloudformation:stack-name on resources created indirectly). These show up as drift. We add lifecycle.ignore_changes entries for those specific tag keys on affected resources, never for tags as a whole.

Drift from another tool. Crossplane or a Kubernetes operator creating resources that Terraform also manages. We've cleaned up the boundaries — Terraform owns network/IAM, Crossplane owns app-shaped things — so this is rarer now.

Default value changes in providers. The AWS provider occasionally changes defaults; resources don't change, but the plan reads them differently. Pin provider versions in versions.tf and bump deliberately.

Tags added by external services. Datadog, Vanta, AWS Backup all add their own tags. We scope ignore_changes at the module level for these.
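
For the provider-default case above, a pinned versions.tf might look like this minimal sketch. The version numbers are placeholders chosen for illustration, not the ones we run.

```hcl
# versions.tf (version numbers below are illustrative placeholders)
terraform {
  # Allow only patch releases of the Terraform CLI within this minor series.
  required_version = "~> 1.9.0"

  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Exact pin: the provider only changes when this line changes in a PR.
      version = "5.60.0"
    }
  }
}
```

Committing .terraform.lock.hcl alongside it keeps the cron and every engineer's machine on the same provider build, so a provider bump shows up as a reviewed diff rather than as surprise drift.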

The "ignore_changes" trap

It's tempting to silence false-positive drift with lifecycle.ignore_changes = [...]. We do this — but carefully. Two rules:

Only ignore specific keys, never tags wholesale. Ignoring tags entirely means a malicious actor adding a tag wouldn't show up in drift.
Every ignore_changes block has an inline comment explaining why. PRs that add ignore_changes without a comment get rejected.

Example:

hcl.hcl

resource "aws_s3_bucket" "logs" {
  bucket = "company-logs-prod"

  # Datadog adds a `dd_managed=true` tag via its AWS integration.
  # The tag is informational and we don't manage it from Terraform.
  lifecycle {
    ignore_changes = [tags["dd_managed"]]
  }
}

The comment is what makes this maintainable. Six months later someone wonders why this is here; the comment answers.

What we still don't catch

Drift in resources Terraform doesn't manage. If we never imported it into state, it could diverge wildly without ever showing up. We mitigate this with a separate "unmanaged-resource scanner" that lists S3 buckets, IAM roles, and security groups in each account and compares to what Terraform knows about. Anything in AWS but not in state gets flagged.

Drift in runtime configuration (e.g., a Lambda environment variable changed via the console). Terraform sees this as drift, but the alert is delayed by up to 24 hours. For high-sensitivity resources we have a CloudTrail-based real-time alert that fires within minutes for any console-driven change. Slower drift detection is fine for less-sensitive things.

Drift in services we can't automate. We have a handful of legacy Route53 records that were created before our Terraform setup; they're documented but not in code. We accept this for now.
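
As a rough idea of what the CloudTrail-based alert mentioned above can look like, here is a sketch of an EventBridge rule that matches security group rule changes and forwards them to an SNS topic. The resource names are hypothetical, and it assumes a CloudTrail trail is already recording management events in the account. It does not filter for console-driven changes, so as written it would also fire on Terraform applies.

```hcl
# Hypothetical names; assumes a CloudTrail trail already logs management events.

# Match security group rule changes recorded by CloudTrail.
resource "aws_cloudwatch_event_rule" "sg_rule_change" {
  name        = "sg-rule-change-alert"
  description = "Security group rule changes recorded by CloudTrail"

  event_pattern = jsonencode({
    source        = ["aws.ec2"]
    "detail-type" = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["ec2.amazonaws.com"]
      eventName = [
        "AuthorizeSecurityGroupIngress",
        "AuthorizeSecurityGroupEgress",
        "RevokeSecurityGroupIngress",
        "RevokeSecurityGroupEgress",
      ]
    }
  })
}

resource "aws_sns_topic" "drift_alerts" {
  name = "drift-alerts"
}

# EventBridge needs explicit permission to publish to the topic.
data "aws_iam_policy_document" "allow_eventbridge" {
  statement {
    effect    = "Allow"
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.drift_alerts.arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "drift_alerts" {
  arn    = aws_sns_topic.drift_alerts.arn
  policy = data.aws_iam_policy_document.allow_eventbridge.json
}

# Send every matching event to the topic.
resource "aws_cloudwatch_event_target" "to_sns" {
  rule = aws_cloudwatch_event_rule.sg_rule_change.name
  arn  = aws_sns_topic.drift_alerts.arn
}
```

The same pattern extends to other high-sensitivity resources by changing eventSource and eventName. Narrowing it to human changes would need an extra condition on the caller identity, which is left out here.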

How long it took to set up

About a week of focused work for the first environment, then a couple of days each for the others. The hardest part wasn't the cron itself — it was getting our Terraform state clean enough that plan returned 0 changes when reality was actually unchanged.

When we first ran it against prod, the plan showed 47 "changes" — almost all false positives from drift the team had silently lived with. We worked through those over a sprint, either updating code to match reality or reverting reality to match code. Once we were at 0 baseline drift, the cron became useful.

If we'd skipped that cleanup phase, the daily Slack message would have been so noisy it'd have been ignored.

A note on tooling alternatives

There are commercial drift detection tools (env0, Spacelift, Terraform Cloud's drift detection feature). They're fine; some are very good. We picked the homemade GitHub Actions version because:

We already had GitHub Actions
The total maintenance is ~1 hour per quarter
We didn't need the multi-cloud view that the commercial tools offer

If we expand to multi-cloud or want fancier triage workflows, we'd revisit. For a single-cloud Terraform shop, the homemade version is hard to beat for cost.

What I'd tell a team starting this

Get to baseline-zero drift first, before you turn on the alerts. Otherwise the alerts are noise, the team learns to ignore them, and a real drift event slips through.

Daily is the right cadence. Hourly is too noisy. Weekly is too slow — by the time someone looks at a week-old drift, the engineer who caused it has lost the context.

Send the alert to a single dedicated Slack channel watched by the rotation, not to a general engineering channel. Drift alerts are operational, not informational.

The most underrated part of this entire workflow is the read-only IAM role. Spend the hour to set up OIDC federation and a properly-scoped role. The cron is going to run forever; the worst thing it can do should be "report drift," not "apply unauthorized changes."

About Kiril Urbonas