Practical patterns for Terraform modules at scale: versioning, composition, testing, and avoiding the monolith trap.
After managing infrastructure for 50+ microservices with Terraform, we've learned which module patterns scale and which become nightmares. Here's what works.
Our first approach was one massive Terraform repo with everything in it. A single plan took 12 minutes, and a typo in a dev variable once triggered a production change. So we split it up.
We organize modules in three layers:
modules/
  base/       # VPC, subnets, DNS zones
  platform/   # EKS cluster, RDS, ElastiCache
  service/    # Per-service: ALB, task def, IAM role
Each layer depends only on the layer below via remote state data sources:
data "terraform_remote_state" "platform" {
  backend = "s3"
  config = {
    bucket = "terraform-state-prod"
    key    = "platform/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_lb_target_group" "service" {
  vpc_id = data.terraform_remote_state.platform.outputs.vpc_id
  # ...
}
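For that lookup to resolve, the platform layer has to publish the values that services read. Here is a minimal sketch of what the platform side might look like, assuming it reads the base layer the same way and re-exports vpc_id. The file name and the base/terraform.tfstate key are assumptions for illustration, not our actual layout:

```hcl
# platform/outputs.tf (hypothetical file name)

# The platform layer reads the base layer's state...
data "terraform_remote_state" "base" {
  backend = "s3"
  config = {
    bucket = "terraform-state-prod"
    key    = "base/terraform.tfstate" # assumed key, mirroring the platform one
    region = "us-east-1"
  }
}

# ...and re-exports what the service layer needs, so a service never
# has to reach past the platform layer.
output "vpc_id" {
  description = "VPC ID, passed through from the base layer."
  value       = data.terraform_remote_state.base.outputs.vpc_id
}
```

Only outputs declared in the root module of a configuration are visible through terraform_remote_state, so anything a higher layer needs has to be surfaced as an explicit output like this.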
We publish reusable modules to a private registry with semantic versioning:
module "service" {
  source  = "app.terraform.io/ourorg/service/aws"
  version = "~> 2.0"

  name        = "payment-api"
  environment = "production"
  cpu         = 512
  memory      = 1024
}
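On the module side, treating the module as a versioned library usually means declaring what it is compatible with and rejecting bad inputs early. A rough sketch of what could sit inside the published module follows; the version ranges and the list of allowed environments are assumptions chosen for illustration, not values from our registry:

```hcl
# versions.tf inside the module (hypothetical constraints)
terraform {
  required_version = ">= 1.5.0" # assumed minimum Terraform version

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0, < 6.0" # assumed provider range
    }
  }
}

# variables.tf inside the module
variable "environment" {
  description = "Deployment environment for the service."
  type        = string

  validation {
    # Assumed set of environments; adjust to your own naming.
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "The environment must be one of: dev, staging, production."
  }
}
```

With constraints like these, a caller on an incompatible provider or passing an unknown environment fails at init or plan time rather than halfway through an apply.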
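Rules we follow:
Pin with a pessimistic constraint (~> 2.0), not an exact version, so consumers pick up compatible 2.x releases without editing every caller.
Instead of one module with 40 variables and 15 conditional blocks, we compose small modules: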
module "alb" {
  source = "./modules/alb"
  # ...
}

module "ecs_service" {
  source           = "./modules/ecs-service"
  target_group_arn = module.alb.target_group_arn
  # ...
}

module "monitoring" {
  source       = "./modules/cloudwatch-alarms"
  service_name = module.ecs_service.name
  # ...
}
Each module does one thing. Connecting them is explicit, not hidden behind flags.
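For contrast, the flag-driven style tends to look something like the sketch below. The variable, resource, and output names are made up for illustration; the point is how a single boolean leaks into every reference downstream:

```hcl
# Anti-pattern sketch: one module whose behavior is switched by flags.
variable "name" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

variable "enable_alb" {
  type    = bool
  default = true
}

resource "aws_lb" "this" {
  # The resource exists zero or one times depending on the flag.
  count = var.enable_alb ? 1 : 0

  name               = "${var.name}-alb"
  load_balancer_type = "application"
  subnets            = var.subnet_ids
}

output "alb_arn" {
  # Every consumer now has to handle the "flag off" case.
  value = var.enable_alb ? aws_lb.this[0].arn : null
}
```

Multiply that by 15 conditionals and the module's real behavior depends on a combination of flags that nobody has tested together. With composition, leaving out the ALB simply means not declaring the alb module.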
We test modules with terraform validate, tflint, and integration tests:
# In CI pipeline
cd modules/service
terraform init -backend=false
terraform validate
tflint --init
tflint
# Integration test (creates real resources, then destroys)
cd tests/
go test -v -timeout 30m ./...
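The tflint --init step installs whatever plugins the TFLint config file declares. A minimal sketch of such a config is below, assuming a .tflint.hcl in the directory where tflint runs; the plugin version and the choice of extra rules are assumptions, so substitute a current release and your own rule set:

```hcl
# .tflint.hcl (hypothetical contents)

# Terraform-language rules that ship with TFLint.
plugin "terraform" {
  enabled = true
  preset  = "recommended"
}

# AWS-specific ruleset; `tflint --init` downloads it.
plugin "aws" {
  enabled = true
  version = "0.30.0" # assumed version; pin to a current release
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

# Rules outside the recommended preset that suit library-style modules.
rule "terraform_documented_variables" {
  enabled = true
}

rule "terraform_naming_convention" {
  enabled = true
}
```

Pinning the plugin version keeps lint results reproducible between a laptop and CI.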
Terraform at scale is a software engineering problem, not just an infrastructure problem. Treat your modules like libraries.