How we organize Terraform state across 12 AWS accounts and 40+ services. Backends, locking, partitioning, and the migration we got wrong twice.
We have ~40 services across 12 AWS accounts (dev, staging, prod, plus per-team accounts). Terraform manages most of the infrastructure. State management is the part of Terraform that has caused us the most operational pain — more than any specific resource type or provider quirk. This is what we've landed on after a few migrations and a couple of near-misses.
A Terraform state file is the source of truth for "what does Terraform think exists." If it disagrees with reality (because someone clicked something in the AWS console, or because the file got corrupted, or because two terraform apply runs happened concurrently), the next plan is wrong and the next apply will do the wrong thing.
Treat state files like databases:
Most Terraform pain comes from skipping at least one of those four.
We use the S3 backend with DynamoDB for locking. Setup:
terraform {
  backend "s3" {
    bucket         = "company-tf-state-prod"
    key            = "services/payments/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tf-state-locks"
    encrypt        = true
  }
}
The S3 bucket has:
The DynamoDB table is tf-state-locks with primary key LockID. Locking is per-state-file, not global. Two engineers can run apply against different state files concurrently; against the same state file, the second run cannot acquire the lock and fails (or waits, if -lock-timeout is set).
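As a rough illustration of the backing resources, the bucket and lock table could be declared along these lines. This is a minimal sketch, not our actual configuration: the resource labels (state, locks), the on-demand billing mode, and the public access block are assumptions. Only the bucket name, the table name, and the LockID key come from the setup above.

```hcl
# Sketch only: labels, billing mode, and the public access block are assumptions.
resource "aws_s3_bucket" "state" {
  bucket = "company-tf-state-prod"
}

# Versioning keeps every prior copy of a state file, which is what allows rollback.
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# State files contain secrets, so the bucket should never be public.
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend expects a string hash key named exactly "LockID".
resource "aws_dynamodb_table" "locks" {
  name         = "tf-state-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

These resources have to live in a configuration whose own state is stored somewhere else (or bootstrapped with local state first), since the backend cannot store state in a bucket that does not exist yet.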
We tried Terraform Cloud briefly. It works fine, but the per-seat pricing didn't justify the additional features for our team. S3 + DynamoDB is not strictly free, but state files are small and lock traffic is tiny, so the cost is negligible if you already have an AWS account.
The hardest decision is how to split state files. Too few = one apply touches half your infrastructure, blast radius is huge. Too many = endless dependencies and terraform_remote_state lookups.
Our partitioning rules:
services/payments/dev, services/payments/staging, and services/payments/prod are three separate state files.
Shared state is split the same way: shared/networking/prod, shared/iam/prod.
Cross-state references use terraform_remote_state:
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "company-tf-state-prod"
    key    = "shared/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_security_group" "app" {
  vpc_id = data.terraform_remote_state.networking.outputs.vpc_id
}
The networking state's outputs become the contract. Changing an output is a breaking change for every consumer.
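For the lookup above to resolve, the networking configuration has to declare a matching output. A minimal sketch of the producing side follows; the resource label aws_vpc.main and the CIDR block are hypothetical, and only the output name vpc_id comes from the consumer shown above.

```hcl
# In the shared networking configuration (sketch; "main" and the CIDR are hypothetical).
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# The output name is the contract: consumers read it as
# data.terraform_remote_state.networking.outputs.vpc_id
output "vpc_id" {
  description = "ID of the shared VPC. Renaming or removing this output breaks every consuming state."
  value       = aws_vpc.main.id
}
```

Adding a new output is safe. Renaming or removing one means updating every consumer first.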
We've moved the line between shared and per-service state a few times. Current rules:
Shared state (one per env):
Per-service state:
The principle: shared state contains things that change rarely (months apart) and have many consumers. Per-service state contains things that change frequently (deploys) and only the service team owns.
DynamoDB locking is non-optional. We've twice had incidents from concurrent applies (before we standardized):
Lock timeout is the default (no auto-release). If a lock is stuck (CI killed mid-apply), we manually delete the lock entry from DynamoDB. We documented this and it happens maybe once a quarter.
A state file with 5,000 resources is workable. A state file with 50,000 resources is painful — every plan/apply is slow, refresh takes minutes, errors are hard to find.
We try to keep state files under ~500 resources. If a service grows beyond that, we split it (e.g., compute and storage into separate states). Splitting is the painful operation; we plan for it before crossing the threshold rather than after.
Splitting state is the hardest Terraform operation. We've done it a dozen times; here's the procedure:
Use terraform state mv with the -state and -state-out flags to move resources from the old state to the new state, one resource at a time:
terraform state mv -state=old.tfstate -state-out=new.tfstate \
  aws_s3_bucket.data aws_s3_bucket.data
Run terraform plan against both states. Expect zero changes in both. If either shows changes, the split is wrong; revert.
Point any terraform_remote_state references at the new state.
The "expect zero changes" check is critical. We once split a state and the new plan showed it would destroy and recreate a database, because we'd also accidentally changed an attribute during the split. We caught it in plan and never applied. Always run plan first.
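On newer Terraform versions the same move can also be expressed in configuration instead of with terraform state mv. This is an alternative technique, not the procedure described above: a removed block (Terraform 1.7+) makes the old configuration forget the resource without destroying it, and an import block (Terraform 1.5+) adopts it into the new one. The bucket name below is hypothetical.

```hcl
# --- Old configuration: stop tracking the bucket, but do not destroy it. ---
removed {
  from = aws_s3_bucket.data

  lifecycle {
    destroy = false
  }
}

# --- New configuration (separate directory and state file). ---
# For aws_s3_bucket the import ID is the bucket name; this one is hypothetical.
import {
  to = aws_s3_bucket.data
  id = "example-payments-data"
}

resource "aws_s3_bucket" "data" {
  bucket = "example-payments-data"
}
```

The zero-changes rule still applies. The plan on the old side should show only the resource being forgotten, and the plan on the new side should show only the import, with nothing to add, change, or destroy.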
We migrated from a single AWS account to multi-account about two years ago. The Terraform state migration was painful.
The plan: re-apply each module against the new account, which would create new resources. Then destroy the resources in the old account.
What actually happened: we ran apply with new-account credentials against an existing state file that still referenced old-account resource IDs, and the results were inconsistent. Some resources tried to update in place (changing region or account doesn't actually work; Terraform tried to update the ARN field and failed). Others tried to create new resources and hit conflicts.
What we should have done:
terraform import).
We did roughly the right thing, just messily and out of order. The lesson: state files are tied to a backend location and a set of resource IDs. Changing both at once is a multi-step operation; never try to do it as one big apply.
terraform refresh (deprecated in favor of terraform apply -refresh-only) re-reads actual resource state from AWS and updates the state file. terraform plan -refresh-only shows the same differences without writing them. By default, plan does a refresh first.
We disable refresh on plans in CI for performance (-refresh=false), then run a periodic full refresh on a schedule. This makes plan-on-PR fast (~30s vs 5min for big states), but it means a PR plan can be computed against stale state until the next scheduled refresh.
The schedule: nightly job runs terraform plan -refresh-only against every state file and posts diffs to Slack. Drift surfaces within 24 hours.
Beyond refresh, we run driftctl against accounts to detect resources that exist in AWS but not in Terraform. Most drift comes from:
The driftctl report goes into a weekly review. Each item is either: import to TF, accept and ignore, or delete.
S3 versioning has saved us twice. Both times, a terraform state rm was run that shouldn't have been (the engineer thought they were removing one resource; they removed a module path that contained 30). We rolled back via:
aws s3api list-object-versions --bucket tf-state --prefix path/to/state
aws s3api copy-object --copy-source "bucket/path/to/state?versionId=XXX" --bucket bucket --key path/to/state
Without versioning, both incidents would have required hours of manual state reconstruction. With versioning, ~5 minutes each.
Terraform state contains sensitive values: database passwords, API keys, etc. Anyone with read access to the state file can see them.
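One common way to limit who can read those values is to scope IAM access by state key prefix, so a team can only read and write its own state files. The policy below is an illustrative sketch, not our actual policy: the data source label and the choice of prefix are assumptions, and the bucket name and key layout are taken from the backend example earlier.

```hcl
# Sketch only: restricts a role to the payments state prefix.
data "aws_iam_policy_document" "payments_state" {
  # Allow listing, but only under the team's own prefix.
  statement {
    sid       = "ListOwnPrefix"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::company-tf-state-prod"]

    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["services/payments/*"]
    }
  }

  # Allow reading and writing state objects under that prefix only.
  statement {
    sid       = "ReadWriteOwnState"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::company-tf-state-prod/services/payments/*"]
  }
}
```

Marking an output or variable as sensitive = true only hides it from CLI output. The value is still stored in plain text inside the state file, so access control on the bucket is what actually protects it.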
Our rules:
Multi-region state replication. Our state buckets are in us-east-1. If that region has a long outage, we can't run Terraform until it comes back. Acceptable risk for now; mitigation would be cross-region replication, which we haven't bothered with.
State file diff visualization. When a state file changes, we don't have a diff in the PR. We rely on terraform plan output. A nice-to-have would be a "what changed in state since last apply" view; haven't built it.
Per-service Terraform versions. All states are pinned to the same Terraform version. Upgrading TF means coordinating across all services. We've stayed disciplined about this; some teams haven't and it bit them.
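Version pinning of this kind is usually expressed with required_version in each configuration. A minimal sketch follows; the version numbers are placeholders, not the versions we run.

```hcl
terraform {
  # Placeholder constraint: allows 1.6.x patch releases only.
  required_version = "~> 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # placeholder
    }
  }
}
```

With this in place, running a mismatched Terraform binary fails immediately with a version constraint error instead of silently upgrading the state file format.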
Pick S3 + DynamoDB if you're on AWS. Don't fight this. It costs almost nothing, it works, and it's the documented path.
Partition by service-and-environment from day one. Splitting later is painful. The instinct to "just put it all in one state for now" creates technical debt that compounds.
Enable versioning on the state bucket before you write any state to it. State recovery has saved us multiple times; it's not optional.
Always run terraform plan after a state move and expect zero changes. If plan shows changes, the move is wrong. This rule has caught dozens of mistakes.
Document the split-state procedure and run it once on staging before doing it in prod. Splitting state is a multi-step process with sharp edges. Practice helps.
State management is the unsexy part of Terraform that determines whether your team scales smoothly or constantly fights itself. Get the partitioning right and the rest gets a lot easier.