AWS bill grew 40% YoY for two years before we got serious. Tagging, scoped budgets, and a weekly review meeting did 80% of the work.
For two years our AWS bill grew about 40% year over year while engineering headcount grew maybe 15%. The bill hit a number where the CFO scheduled a meeting. After that meeting we got our cost-per-engineer back to where it had been three years earlier, then held it there.
The work wasn't clever. Most of it was tagging things consistently, building budgets that made cost visible to the teams generating it, and a weekly meeting that turned the data into decisions. Below is the actual workflow.
Three things, in roughly equal measure:
Team tag. Cost Explorer showed the total bill but not the per-team breakdown that would let teams act on it.
The cure for all three was the same shape: tag it, budget it at the right granularity, and review it on a cadence the teams could act on.
The first move was to enforce a small required tag set on every resource. Five tags:
Team (slug, e.g., payments, data-platform)
Service (slug, e.g., checkout-api, etl-pipeline)
Env (prod, staging, dev)
Owner (a single email address for "who do I ping about this resource")
CostCenter (the finance code; matters for the chargeback story later)
We enforced these via:
Team=untagged-found-2026-04-25 if it remains untagged. The tag tells everyone it slipped through; that team gets a Slack ping.
Backfilling was a few weeks of work. We wrote a script that joined CloudTrail (who created the resource) with the resource's current state, inferred a likely team from the creator's email, and proposed tags. A human reviewed the batch, corrected misses, and applied. About 80% of resources got correctly tagged automatically. The rest needed a human to ask the team.
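The proposal step of a backfill like this can be sketched in a few lines. This is a minimal illustration, not the script we ran: the `EMAIL_TO_TEAM` lookup, the input shapes, and the `Proposal` type are all hypothetical, and it assumes the creator emails have already been pulled out of CloudTrail into a plain dictionary.

```python
from dataclasses import dataclass

REQUIRED_TAGS = ("Team", "Service", "Env", "Owner", "CostCenter")

# Hypothetical lookup from creator email to team slug. In practice this
# would come from a directory or org-chart export.
EMAIL_TO_TEAM = {
    "alice@example.com": "payments",
    "bob@example.com": "data-platform",
}


@dataclass
class Proposal:
    resource_id: str
    proposed_tags: dict
    team_inferred: bool  # False means a human has to ask around


def propose_tags(resources, creators, email_to_team=EMAIL_TO_TEAM):
    """Propose Team/Owner tags for resources missing required tags.

    resources: list of dicts shaped like {"id": str, "tags": dict}
    creators:  dict of resource id -> creator email (assumed to be
               extracted from CloudTrail beforehand)
    """
    proposals = []
    for res in resources:
        tags = res.get("tags", {})
        if all(key in tags for key in REQUIRED_TAGS):
            continue  # already fully tagged, nothing to propose
        email = creators.get(res["id"])
        team = email_to_team.get(email)
        proposed = dict(tags)  # never overwrite tags that already exist
        if team is not None:
            proposed.setdefault("Team", team)
            proposed.setdefault("Owner", email)
        proposals.append(Proposal(res["id"], proposed, team is not None))
    return proposals
```

The output is only a proposal. Nothing is applied until a human has reviewed the batch, and entries with `team_inferred=False` are the ones that need a question to a real person.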
The tagging rule we now enforce: a resource without all five tags is treated as billable to a default "untagged" team that the platform team owns. Teams quickly learn it's better to tag their resources than to have the platform team chase them.
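Expressed as code, the rule is a small function. This is a sketch under stated assumptions: the `"untagged"` slug and the function names are illustrative, and an empty tag value is treated the same as a missing one.

```python
REQUIRED_TAGS = ("Team", "Service", "Env", "Owner", "CostCenter")
DEFAULT_TEAM = "untagged"  # illustrative slug for the platform-owned bucket


def missing_tags(tags):
    """Return the required tag keys that are absent or empty."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


def billing_team(tags):
    """Bill to the resource's team only if all five tags are present."""
    return DEFAULT_TEAM if missing_tags(tags) else tags["Team"]
```

Note that a resource with a valid Team tag but no CostCenter still lands in the default bucket. That strictness is the point: partial tagging is not rewarded.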
Single account-level budget = useless. Per-team-per-environment budget = signal.
We created budgets in AWS Budgets:
resource "aws_budgets_budget" "team_env" {
  for_each = local.team_env_pairs # {payments-prod, payments-staging, data-platform-prod, ...}

  name              = "${each.value.team}-${each.value.env}"
  budget_type       = "COST"
  limit_amount      = each.value.amount
  limit_unit        = "USD"
  time_unit         = "MONTHLY"
  time_period_start = "2024-01-01_00:00"

  # Tag filters use the form "user:<TagKey>$<TagValue>". In Terraform, "$${"
  # is the escape for a literal "${", so build the string with format().
  cost_filter {
    name = "TagKeyValue"
    values = [
      format("user:Team$%s", each.value.team),
      format("user:Env$%s", each.value.env),
    ]
  }

  # Alert the team lead when the forecast crosses 80% of the budget.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = [each.value.team_lead_email]
    subscriber_sns_topic_arns  = [aws_sns_topic.cost_alerts.arn]
  }
}
The budget amounts came from a quarter of historical data plus a 15% growth allowance. They get reviewed quarterly. Each team's lead is the subscriber for their team's budget alerts.
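As a worked example of that sizing rule, here is a minimal sketch. It assumes "a quarter of historical data" means the average of the last three monthly totals; the function name and the sample figures are illustrative.

```python
from statistics import mean

GROWTH_ALLOWANCE = 0.15  # the 15% allowance described above


def monthly_budget(last_quarter_costs, growth=GROWTH_ALLOWANCE):
    """Average three months of spend and add the growth allowance.

    Assumption: the baseline is the mean of the three months, not the
    maximum or the most recent month.
    """
    if len(last_quarter_costs) != 3:
        raise ValueError("expected exactly three monthly totals")
    return round(mean(last_quarter_costs) * (1 + growth), 2)


# Illustrative figures: three months of spend for one team/env pair.
print(monthly_budget([1000, 1100, 1200]))  # 1265.0
```

Using the mean keeps one unusually expensive month from inflating the budget for the whole next quarter, at the cost of a tighter limit for teams that are genuinely growing fast.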
Within a month of these going live, we got our first useful alerts: "data-platform-staging is forecasted at 130% of budget." The data team knew immediately why — a backfill job had been left running. They killed it.
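To get an intuition for what a "forecasted at 130%" alert means, a simple run-rate projection is enough. The sketch below is not how AWS Budgets computes its forecast; it is a linear extrapolation of month-to-date spend, with hypothetical function names and figures.

```python
import calendar
from datetime import date

ALERT_THRESHOLD_PCT = 80  # matches the notification threshold above


def forecast_pct_of_budget(spend_to_date, budget, today):
    """Project month-end spend at the current daily run rate.

    Returns the projection as a percentage of the monthly budget.
    """
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    projected = spend_to_date * days_in_month / today.day
    return projected * 100 / budget


def should_alert(spend_to_date, budget, today, threshold=ALERT_THRESHOLD_PCT):
    return forecast_pct_of_budget(spend_to_date, budget, today) > threshold


# Illustrative: $650 spent by April 15 against a $1,000 monthly budget.
pct = forecast_pct_of_budget(650, 1000, date(2024, 4, 15))
print(f"forecast: {pct:.0f}% of budget")  # forecast: 130% of budget
```

A run-rate projection like this overreacts early in the month and to one-off spikes, which is one reason a forecast-based alert is a prompt to look rather than proof of a problem.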
This was the single highest-leverage thing we did: a 30-minute meeting every Friday, attended by:
The meeting reviews:
The meeting is short on purpose. 30 minutes, fixed agenda, no presentations. If a team is over budget, we discuss it for 5 minutes; if more is needed, that team takes it offline.
The structure is what made it work. Before this meeting, cost discussions were vague and happened at the wrong altitude. The weekly cadence kept everyone aware; the team-by-team breakdown made accountability concrete.
The big wins, roughly in order of dollars saved:
Idle dev/staging resources. Two reserved RDS instances in dev that nobody had used in 4 months. ~$300/month each. We started running a weekly "idle resource" report that flags untouched instances; teams either justify or terminate.
Over-provisioned EC2 instances. Several instances were sized "to be safe" three years ago and never resized. Running at 8% CPU. CloudWatch + a script identified them; right-sizing recovered ~$1.5k/month total.
Forgotten data transfer. A staging job was reading from S3 in another region. Cross-region data transfer is $0.02/GB; the job was reading ~5 TB/day. ~$3000/month for a job nobody had reviewed in a year. Moved it to same-region S3.
EBS snapshots from 2022. Hundreds of orphan snapshots. ~$400/month. A lifecycle policy now expires snapshots over 90 days.
Forgotten S3 buckets. A bucket from a 2023 experiment containing 12 TB of generated data. ~$280/month for storage of data that nobody had read in over a year. Tagging revealed it; we archived to Glacier.
Total monthly savings from these five categories alone: ~$5,800 ($600 + $1,500 + $3,000 + $400 + $280). None of them were structurally interesting; we just hadn't been looking.
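The first two categories both came out of simple reports. A minimal sketch of that kind of check is below. The thresholds, field names, and input shape are assumptions for illustration; in practice the activity timestamps and CPU averages would come from CloudWatch rather than a hand-built list.

```python
from datetime import datetime, timedelta, timezone

IDLE_AFTER = timedelta(days=30)  # assumed cutoff for "untouched"
LOW_CPU_PCT = 10.0               # assumed cutoff for "over-provisioned"


def weekly_report(instances, now):
    """Split instances into idle and oversized candidates.

    instances: list of dicts with "id", "last_activity" (aware datetime)
               and "avg_cpu_pct" (float). These field names are made up.
    """
    idle, oversized = [], []
    for inst in instances:
        if now - inst["last_activity"] > IDLE_AFTER:
            idle.append(inst["id"])       # justify or terminate
        elif inst["avg_cpu_pct"] < LOW_CPU_PCT:
            oversized.append(inst["id"])  # candidate for right-sizing
    return {"idle": idle, "oversized": oversized}


now = datetime(2024, 6, 7, tzinfo=timezone.utc)
report = weekly_report(
    [
        {"id": "db-dev-1", "last_activity": now - timedelta(days=120), "avg_cpu_pct": 0.5},
        {"id": "web-1", "last_activity": now - timedelta(hours=1), "avg_cpu_pct": 8.0},
        {"id": "api-1", "last_activity": now - timedelta(hours=1), "avg_cpu_pct": 55.0},
    ],
    now,
)
print(report)  # {'idle': ['db-dev-1'], 'oversized': ['web-1']}
```

The report only produces candidates. The decision to terminate or resize stays with the owning team, which is why the Owner tag matters.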
After the easy wins, the medium-sized stuff:
Each of these required a couple of hours of analysis and a Terraform PR. None were dramatic individually; together they added another ~$3k/month.
The behaviour change happened fastest among teams whose lead was on the budget notification email. Once cost was personal — "your team is at 95% of budget, with 8 days to go" — the team responded. Before that, cost was abstract.
Tagging compliance plateaued at about 95%, never quite hitting 100%. The remaining 5% are mostly auto-created resources from AWS services we haven't fully scripted around. We accept the gap and treat the untagged residue as platform-team cost.
Don't try to optimize before you can measure. The budget structure and the tagging are what make optimization possible. Without them, you're chasing anecdotes.
Pick five tags max. We see teams try to enforce ten or more, and adoption tanks. Five is the right size — enough to cover ownership and chargeback, few enough that engineers will remember.
The weekly meeting is the lever. Without it, the dashboards exist but nobody acts on them. With it, every week is a small forcing function for cost awareness.
The order matters: tagging first (so you can measure), budgets second (so you can alert), weekly meeting third (so you can act). Skip any layer and the next one is harder to make work.
Headcount: +15%. Revenue: +40%. AWS spend: +3%. Cost per engineer: -10%.
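The cost-per-engineer figure follows directly from the other two growth rates. A quick check, with an illustrative helper name:

```python
def per_unit_change(total_growth, unit_growth):
    """Change in cost per unit when the total and the unit count both grow."""
    return (1 + total_growth) / (1 + unit_growth) - 1


# AWS spend +3%, headcount +15%.
change = per_unit_change(0.03, 0.15)
print(f"{change:+.1%}")  # -10.4%
```

Spend grew, but more slowly than the team did, so the per-engineer figure fell by roughly 10%.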
The compounding from staying disciplined is real. The first year of doing this was the hardest; the second year it was just "what we do on Fridays."