Practical patterns for Terraform modules at scale: versioning, composition, testing, and avoiding the monolith trap.
After managing infrastructure for 50+ microservices with Terraform, we've learned which module patterns scale and which become nightmares. Here's what works.
Our first approach was one massive Terraform repo with everything in it. Plan took 12 minutes. A typo in a dev variable once triggered a production change. We split it up.
We organize modules in three layers:
modules/
  base/      # VPC, subnets, DNS zones
  platform/  # EKS cluster, RDS, ElastiCache
  service/   # Per-service: ALB, task def, IAM role
Each layer depends only on the layer below via remote state data sources:
data "terraform_remote_state" "platform" {
  backend = "s3"
  config = {
    bucket = "terraform-state-prod"
    key    = "platform/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_lb_target_group" "service" {
  vpc_id = data.terraform_remote_state.platform.outputs.vpc_id
  # ...
}
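For that lookup to resolve, the platform layer has to publish vpc_id as a root-level output, because terraform_remote_state only exposes root module outputs. A minimal sketch of what that could look like, assuming the platform layer reads the VPC from the base layer and re-exports it. The base/terraform.tfstate key and the output description are assumptions for illustration:

```hcl
# platform/main.tf (sketch)
# Read the base layer's state; the key below is an assumed path that
# mirrors the platform key used by the service layer.
data "terraform_remote_state" "base" {
  backend = "s3"
  config = {
    bucket = "terraform-state-prod"
    key    = "base/terraform.tfstate" # assumption, not a confirmed path
    region = "us-east-1"
  }
}

# platform/outputs.tf (sketch)
# Re-export the VPC ID so the service layer only ever reads platform state.
output "vpc_id" {
  description = "VPC ID passed through from the base layer"
  value       = data.terraform_remote_state.base.outputs.vpc_id
}
```

Re-exporting keeps each layer talking only to the one directly below it, at the cost of one pass-through output per shared value.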
We publish reusable modules to a private registry with semantic versioning:
module "service" {
  source  = "app.terraform.io/ourorg/service/aws"
  version = "~> 2.0"

  name        = "payment-api"
  environment = "production"
  cpu         = 512
  memory      = 1024
}
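The version argument accepts several constraint forms, and they differ in how much drift they allow. A short sketch comparing three of them; the module labels are hypothetical and the other arguments are elided:

```hcl
# Accepts any 2.x release at or above 2.0, but never 3.0.
module "service_minor_range" {
  source  = "app.terraform.io/ourorg/service/aws"
  version = "~> 2.0"
  # ...
}

# Accepts only 2.0.x patch releases (>= 2.0.0, < 2.1.0).
module "service_patch_range" {
  source  = "app.terraform.io/ourorg/service/aws"
  version = "~> 2.0.0"
  # ...
}

# Accepts exactly one release; every upgrade needs an edit here.
module "service_exact_pin" {
  source  = "app.terraform.io/ourorg/service/aws"
  version = "2.0.3" # hypothetical release number
  # ...
}
```

The first form only stays safe if the module's releases keep breaking changes out of minor versions.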
Rules we follow:
Constrain module versions with a range (~> 2.0), not exact pins.
Instead of one module with 40 variables and 15 conditional blocks, we compose small modules:
module "alb" {
  source = "./modules/alb"
  # ...
}

module "ecs_service" {
  source           = "./modules/ecs-service"
  target_group_arn = module.alb.target_group_arn
  # ...
}

module "monitoring" {
  source       = "./modules/cloudwatch-alarms"
  service_name = module.ecs_service.name
  # ...
}
Each module does one thing. Connecting them is explicit, not hidden behind flags.
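That wiring works only if each small module declares the matching outputs and inputs. A sketch of the interfaces the three modules would need, assuming the resources inside them are named "this" (a hypothetical name) and defined elsewhere in each module:

```hcl
# modules/alb/outputs.tf (sketch)
output "target_group_arn" {
  description = "ARN of the target group the service registers with"
  value       = aws_lb_target_group.this.arn # resource name is assumed
}

# modules/ecs-service/variables.tf (sketch)
variable "target_group_arn" {
  description = "Target group to attach the service to"
  type        = string
}

# modules/ecs-service/outputs.tf (sketch)
output "name" {
  description = "Name of the ECS service, used by monitoring"
  value       = aws_ecs_service.this.name # resource name is assumed
}

# modules/cloudwatch-alarms/variables.tf (sketch)
variable "service_name" {
  description = "Service the alarms are scoped to"
  type        = string
}
```

Because every connection is a typed variable fed by a named output, a missing or renamed link fails at terraform validate instead of surfacing during an apply.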
We test modules with terraform validate, tflint, and integration tests:
# In CI pipeline
cd modules/service
terraform init -backend=false
terraform validate
tflint --init
tflint
# Integration test (creates real resources, then destroys)
cd tests/
go test -v -timeout 30m ./...
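The integration suite above runs through go test. For cheaper plan-level checks, Terraform 1.6 and later also ship a native terraform test command. The sketch below shows that alternative and is not what the pipeline above runs. It assumes the service module accepts the four inputs shown earlier and exposes a hypothetical service_name output derived directly from the name input:

```hcl
# modules/service/tests/inputs.tftest.hcl (sketch)

# Inputs shared by every run block in this file.
variables {
  name        = "payment-api"
  environment = "dev" # assumed to be an accepted value
  cpu         = 512
  memory      = 1024
}

run "plan_with_sample_inputs" {
  # Plan only: no resources are created, but the provider
  # still needs credentials to build the plan.
  command = plan

  assert {
    # service_name is a hypothetical output; it must be known at
    # plan time for this condition to be evaluated.
    condition     = output.service_name == "payment-api"
    error_message = "service_name output should echo the name input"
  }
}
```

Running terraform init followed by terraform test from modules/service picks up files ending in .tftest.hcl in the tests/ directory.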
Terraform at scale is a software engineering problem, not just an infrastructure problem. Treat your modules like libraries.