One Terraform state file per environment sounds obvious until you watch a dev plan touch a prod resource. Here's how we actually isolate state and the mistakes we made getting there.
About two years ago I worked on a team where someone ran terraform apply against staging and accidentally destroyed an RDS instance in production. The state files for both environments were in the same S3 bucket, named identically except for a directory prefix. The engineer had cd'd into the wrong folder, typed apply, and answered yes to a destroy plan that referenced what they thought was staging.
Nothing apocalyptic happened — we restored from a snapshot in 40 minutes — but the close call drove a project to lock down environment isolation properly. This post is what stuck.
The pattern most teams start with: one S3 backend, one DynamoDB lock table, separate keys per environment.
# environments/dev/main.tf
terraform {
  backend "s3" {
    bucket         = "my-tfstate"
    key            = "environments/dev/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "my-tfstate-locks"
  }
}

# environments/prod/main.tf (same bucket, different key)
terraform {
  backend "s3" {
    bucket         = "my-tfstate"
    key            = "environments/prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "my-tfstate-locks"
  }
}
This works in the sense that terraform apply in dev/ touches the dev state file and apply in prod/ touches the prod one. The directories are separate. The locks are separate. Most workflows are fine.
What it doesn't protect against: a human with credentials valid for both environments running the wrong command in the wrong directory. The accident I described above happened with this exact setup.
Three changes to the cheap shared-bucket setup above, in order of importance:
First, separate AWS accounts per environment. This is the change that did the most heavy lifting. Production resources live in a production AWS account, staging in a staging account, dev in a dev account.
The provider block in each environment carries an explicit allowed_account_ids allowlist:
provider "aws" {
  region              = "us-east-1"
  allowed_account_ids = ["123456789012"] # production only
}
If anyone tries to apply this configuration with credentials for a different account, Terraform refuses before doing anything. This single check would have prevented the original incident: an engineer holding staging credentials who runs apply from the prod directory gets an account mismatch error instead of a destroy plan.
Second, each account hosts its own Terraform state bucket. Staging's bucket lives in the staging account, prod's in the prod account. The bucket policies allow access only from within the same account, full stop.
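One way to express "same account only" is a bucket policy that denies any principal whose account differs from the bucket's. The following is a minimal sketch, not our exact policy: the variable names and resource labels are hypothetical, and it assumes the aws:PrincipalAccount condition key is sufficient for your setup (AWS service principals that need the bucket would require their own exceptions).

```hcl
# Hypothetical inputs: the account that owns this state bucket, and its name.
variable "account_id" {
  type = string
}

variable "state_bucket_name" {
  type = string
}

resource "aws_s3_bucket" "tfstate" {
  bucket = var.state_bucket_name
}

# Deny every S3 action to principals that do not belong to this account.
data "aws_iam_policy_document" "same_account_only" {
  statement {
    sid     = "DenyOtherAccounts"
    effect  = "Deny"
    actions = ["s3:*"]

    resources = [
      aws_s3_bucket.tfstate.arn,
      "${aws_s3_bucket.tfstate.arn}/*",
    ]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "StringNotEquals"
      variable = "aws:PrincipalAccount"
      values   = [var.account_id]
    }
  }
}

resource "aws_s3_bucket_policy" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  policy = data.aws_iam_policy_document.same_account_only.json
}
```

An explicit Deny is used here because it wins over any Allow granted elsewhere, so a permissive IAM policy in another account cannot reopen the bucket.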
The directory structure looks like this:
infra/
├── modules/
│   ├── networking/
│   ├── eks/
│   └── rds/
└── environments/
    ├── dev/
    │   ├── main.tf
    │   ├── backend.tf    # points at dev account's bucket
    │   └── provider.tf   # allowed_account_ids = dev account
    ├── staging/
    │   ├── main.tf
    │   ├── backend.tf
    │   └── provider.tf
    └── prod/
        ├── main.tf
        ├── backend.tf
        └── provider.tf
The shared modules in modules/ are environment-agnostic. The environment-specific composition lives under each environments/{env}/.
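To make the split concrete, here is roughly what the prod directory's backend.tf and main.tf could contain. This is an illustrative sketch: the bucket and lock table names, and the environment and vpc_id module inputs and outputs, are hypothetical and not taken from our real code.

```hcl
# environments/prod/backend.tf
# State lives in a bucket owned by the prod account (name is hypothetical).
terraform {
  backend "s3" {
    bucket         = "example-prod-tfstate"
    key            = "terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-prod-tfstate-locks"
  }
}

# environments/prod/main.tf
# The environment directory only composes shared modules and supplies
# environment-specific values; the modules themselves stay agnostic.
module "networking" {
  source      = "../../modules/networking"
  environment = "prod" # hypothetical module input
}

module "rds" {
  source      = "../../modules/rds"
  environment = "prod"                    # hypothetical module input
  vpc_id      = module.networking.vpc_id  # hypothetical module output
}
```

Because the key no longer needs an environment prefix, nothing in the prod backend can be turned into the dev backend by editing one path segment.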
Engineers don't have static AWS keys. They authenticate to our SSO provider, which lets them assume a role in the appropriate account. The role for dev allows broad access; the role for staging is more restricted; the role for prod is restricted further and limited to a small group.
# what an engineer types day-to-day
aws sso login --profile prod
aws --profile prod s3 ls          # this only works for the production team
cd infra/environments/prod
AWS_PROFILE=prod terraform plan   # uses the assumed prod role
The role for prod requires MFA on assumption, with a 1-hour session lifetime. Plan-only operations are allowed broadly; apply operations require an additional explicit approval policy that Terraform Cloud (or in our case, a homemade equivalent) enforces.
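The MFA and session-length requirements can be sketched as a role trust policy. This assumes a plain IAM role assumed via sts:AssumeRole; with an SSO product the same constraints may instead be configured in the identity provider. The role name and the trusted_principal_arns variable are hypothetical.

```hcl
# Hypothetical input: the principals allowed to assume the prod role.
variable "trusted_principal_arns" {
  type = list(string)
}

data "aws_iam_policy_document" "prod_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = var.trusted_principal_arns
    }

    # Only allow assumption when the caller authenticated with MFA.
    condition {
      test     = "Bool"
      variable = "aws:MultiFactorAuthPresent"
      values   = ["true"]
    }
  }
}

resource "aws_iam_role" "prod_operator" {
  name                 = "prod-operator" # hypothetical name
  assume_role_policy   = data.aws_iam_policy_document.prod_assume.json
  max_session_duration = 3600 # seconds; caps sessions at 1 hour
}
```

The approval gate on apply is a separate control and is not shown here; it lives in whatever runs Terraform on the team's behalf.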
A few real issues during the migration:
Cross-environment data references stopped working. Some of our Terraform code had data "aws_ssm_parameter" lookups that crossed account boundaries — staging reading a value from a prod-account parameter store. When we split accounts, those lookups failed. The fix was to copy the values into each environment's parameter store, or use terraform_remote_state with explicit cross-account read permissions where unavoidable.
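The "copy the value into each environment's parameter store" fix might look like the sketch below. The parameter name and the shared_endpoint variable are hypothetical; the point is that the value is written once per account, and every lookup stays inside that account.

```hcl
# Hypothetical input: a value that used to be read from another account.
variable "shared_endpoint" {
  type = string
}

# Write the value into this environment's own parameter store.
resource "aws_ssm_parameter" "shared_endpoint" {
  name  = "/platform/shared-endpoint" # hypothetical parameter name
  type  = "String"
  value = var.shared_endpoint
}

# Consumers in the same account read it locally, with no
# cross-account permissions involved.
data "aws_ssm_parameter" "shared_endpoint" {
  name = aws_ssm_parameter.shared_endpoint.name
}

output "shared_endpoint" {
  value     = data.aws_ssm_parameter.shared_endpoint.value
  sensitive = true
}
```

The cost of this approach is duplication: when the value changes, each environment has to be updated, which is the trade we accepted in exchange for no cross-account reads.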
Karpenter / Crossplane-style "operators that create infrastructure" got complicated. We had a controller running in dev that was supposed to provision dev-account resources. Splitting accounts meant the controller needed cross-account IAM roles. Fixable, but not free.
Module-internal aws_caller_identity lookups assumed a single account. A few modules had data "aws_caller_identity" "current" and used the result to construct ARNs. When the modules were used in different accounts, the ARNs were correctly different — usually fine, occasionally not. We audited these and made the assumptions explicit (passing account IDs as variables instead of inferring them).
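Making the account assumption explicit can be as small as the following sketch, in which a module takes the account ID as an input rather than inferring it. The variable names and the KMS ARN are illustrative only.

```hcl
# The module is told which account it targets instead of calling
# data "aws_caller_identity" internally.
variable "account_id" {
  type        = string
  description = "AWS account ID the ARNs in this module should point at."

  validation {
    condition     = can(regex("^[0-9]{12}$", var.account_id))
    error_message = "account_id must be a 12-digit AWS account ID."
  }
}

variable "region" {
  type = string
}

# Hypothetical input used to build an example ARN.
variable "kms_key_id" {
  type = string
}

locals {
  # The ARN is built from explicit inputs, so a reviewer can see
  # which account it refers to from the calling code alone.
  kms_key_arn = "arn:aws:kms:${var.region}:${var.account_id}:key/${var.kms_key_id}"
}

output "kms_key_arn" {
  value = local.kms_key_arn
}
```

Each environment then passes its own account ID, typically the same value that appears in its allowed_account_ids list.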
Patterns people bring to the team when asking for help:
Workspaces (terraform workspace) share a backend. If you're using terraform workspace select prod, your prod state lives in the same place as your dev state. We don't allow this for environments. Workspaces are fine for ephemeral previews; they're not isolation.
[…] random_password resources stored in state.
[…] terraform import cycle, not a rewind.
Multi-account adds friction. Logging is split across accounts (we use a centralised CloudTrail aggregation in a shared "audit" account). Cost reports are per account (Cost Explorer with a payer account). Some monitoring tools need to be configured per account.
The friction is real. Worth it. Once a year there's an incident in a peer org where someone trashes prod from a dev terminal, and we get to say "that can't happen here." That's worth quite a lot of friction.
If you can use AWS Organizations and separate accounts: do it from day one. Refactoring later is genuinely painful (we did it; it took a quarter).
If you can't: at minimum, separate state buckets, separate IAM roles per environment, the allowed_account_ids provider check on every config, and a written rule that "apply" against prod requires a second human to be on a call. The rule is the lever; the tooling enforces it.
The only Terraform structure I've seen reliably prevent the kind of incident that started this story is account-level isolation. Anything weaker has worked sometimes and failed sometimes. Account isolation has, for us, always worked.