AWS bill grew 40% YoY for two years before we got serious. Tagging, scoped budgets, and a weekly review meeting did 80% of the work.

AWS Cost Control with Tagging and Budgets

For two years our AWS bill grew about 40% year over year while engineering headcount grew maybe 15%. The bill hit a number where the CFO scheduled a meeting. After that meeting we got our cost-per-engineer back to where it had been three years earlier, then held it there.

The work wasn't clever. Most of it was tagging things consistently, building budgets that made cost visible to the teams generating it, and a weekly meeting that turned the data into decisions. Below is the actual workflow.

What was wrong before

Three things, in roughly equal measure:

We had no idea who owned what. About a third of resources had no Team tag. Cost Explorer showed the total bill but not the per-team breakdown that would let teams act on it.
Budgets existed at the account level, not the team level. The total budget alarm fired sometimes; nobody knew which team caused the spike.
Cost was a once-a-quarter conversation in finance, not a once-a-week conversation in engineering. By the time anyone looked, up to 90 days of spend had already been incurred.

The cure for all three was the same shape: tag it, budget it at the right granularity, and review it on a cadence the teams could act on.

Step 1: tag everything

The first move was to enforce a small required tag set on every resource. Five tags:

Team (slug, e.g., payments, data-platform)
Service (slug, e.g., checkout-api, etl-pipeline)
Env (prod, staging, dev)
Owner (a single email address for "who do I ping about this resource")
CostCenter (the finance code; matters for the chargeback story later)

We enforced these via:

Terraform module enforcement. Our shared modules require these as inputs. CI fails on a missing tag input.
AWS Config rules. Resources created outside Terraform get flagged. If a resource is still untagged after 24 hours, it is tagged automatically with a dated marker such as Team=untagged-found-2026-04-25. The tag tells everyone it slipped through; that team gets a Slack ping.
Cost Allocation Tags activated in Billing. The five tags above are activated as cost allocation tags. Without that, they don't show up in Cost Explorer.
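
The required-tag check itself is simple enough to sketch. The following is a minimal illustration in Python, not the actual CI or Config rule logic: the function name `tag_problems` and the loose email pattern are assumptions, and only the five tag names and the three Env values come from the list above.

```python
import re

REQUIRED_TAGS = ("Team", "Service", "Env", "Owner", "CostCenter")
ALLOWED_ENVS = {"prod", "staging", "dev"}
# Deliberately loose email check (an assumption): one "@", no whitespace, a dot in the domain.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def tag_problems(tags: dict[str, str]) -> list[str]:
    """Return human-readable problems with a resource's tags (empty list = compliant)."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    env = tags.get("Env")
    if env and env not in ALLOWED_ENVS:
        problems.append(f"Env must be one of {sorted(ALLOWED_ENVS)}, got {env!r}")
    owner = tags.get("Owner")
    if owner and not EMAIL_RE.match(owner):
        problems.append(f"Owner must be a single email address, got {owner!r}")
    return problems
```

A check like this could run over planned resources in CI and fail the build whenever the returned list is non-empty.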

Backfilling was a few weeks of work. We wrote a script that joined CloudTrail (who created the resource) with the resource's current state, inferred a likely team from the creator's email, and proposed tags. A human reviewed the batch, corrected misses, and applied. About 80% of resources got correctly tagged automatically. The rest needed a human to ask the team.
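
The inference step of a backfill like this can be sketched as follows. This is a hedged illustration rather than the script described above: the `TEAM_BY_EMAIL` directory, the `Proposal` type, and the input shape (resource ID mapped to the creator email pulled from CloudTrail) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup from creator email to team slug; in practice this might
# come from an HR or SSO directory export.
TEAM_BY_EMAIL = {
    "ana@example.com": "payments",
    "raj@example.com": "data-platform",
}


@dataclass
class Proposal:
    resource_id: str
    team: Optional[str]  # None means "no confident guess"
    needs_human: bool


def propose_team(resource_id: str, creator_email: Optional[str]) -> Proposal:
    """Guess the owning team from whoever created the resource."""
    team = TEAM_BY_EMAIL.get((creator_email or "").lower())
    return Proposal(resource_id, team, needs_human=team is None)


def propose_batch(creators: dict[str, Optional[str]]) -> list[Proposal]:
    """creators maps resource ID -> creator email from CloudTrail (or None)."""
    return [propose_team(rid, email) for rid, email in sorted(creators.items())]
```

Anything with `needs_human=True` (unknown creators, CI bots, missing CloudTrail events) goes to the review pile instead of being tagged on a guess.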

The tagging rule we now enforce: a resource without all five tags is treated as billable to a default "untagged" team that the platform team owns. Teams quickly learn that it's better to tag their resources than to have the platform team chase them.
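
As a rough sketch of that rule, assuming billing line items arrive as dictionaries with a `cost` and a `tags` mapping (a hypothetical shape, not the Cost Explorer export format), the rollup might look like this:

```python
from collections import defaultdict

REQUIRED_TAGS = ("Team", "Service", "Env", "Owner", "CostCenter")


def spend_by_team(line_items: list[dict]) -> dict[str, float]:
    """Sum cost per team; items missing any required tag roll up to "untagged"."""
    totals: dict[str, float] = defaultdict(float)
    for item in line_items:
        tags = item.get("tags", {})
        fully_tagged = all(tags.get(key) for key in REQUIRED_TAGS)
        totals[tags["Team"] if fully_tagged else "untagged"] += item["cost"]
    return dict(totals)
```

Note that a resource with a Team tag but a missing Owner still lands in "untagged" here, which matches the all-five-or-nothing rule.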

Step 2: budgets at the right granularity

Single account-level budget = useless. Per-team-per-environment budget = signal.

We created budgets in AWS Budgets:

resource "aws_budgets_budget" "team_env" {
  for_each = local.team_env_pairs  # {payments-prod, payments-staging, data-platform-prod, ...}

  name              = "${each.value.team}-${each.value.env}"
  budget_type       = "COST"
  limit_amount      = each.value.amount
  limit_unit        = "USD"
  time_unit         = "MONTHLY"
  time_period_start = "2024-01-01_00:00"

  cost_filter {
    name = "TagKeyValue"
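    # "$${" is HCL's escape for a literal "${", so it would disable interpolation here.
    # format() builds the "user:<Key>$<value>" strings AWS Budgets expects.
    values = [
      format("user:Team$%s", each.value.team),
      format("user:Env$%s", each.value.env),
    ]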
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = [each.value.team_lead_email]
    subscriber_sns_topic_arns  = [aws_sns_topic.cost_alerts.arn]
  }
}

The budget amounts came from a quarter of historical data plus a 15% growth allowance. They get reviewed quarterly. Each team's lead is the subscriber for their team's budget alerts.
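
One way to read "a quarter of historical data plus a 15% growth allowance" is: average the three monthly totals, then multiply by 1.15. That interpretation is an assumption, as is the function name; a minimal sketch:

```python
def monthly_budget(last_quarter_spend: list[float], growth_allowance: float = 0.15) -> float:
    """Average the months in the baseline quarter, then add the growth allowance."""
    if not last_quarter_spend:
        raise ValueError("need at least one month of historical spend")
    baseline = sum(last_quarter_spend) / len(last_quarter_spend)
    return round(baseline * (1 + growth_allowance), 2)
```

For a team that spent 1,000, 1,100, and 1,200 USD in the baseline quarter, the average is 1,100 and the resulting monthly budget is 1,265 USD.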

Within a month of these going live, we got our first useful alerts: "data-platform-staging is forecasted at 130% of budget." The data team knew immediately why — a backfill job had been left running. They killed it.
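
To make a number like "130% of budget" concrete: AWS Budgets uses its own forecasting model, but a naive linear run-rate projection gives the intuition. The sketch below is that simplified stand-in, not the AWS algorithm, and the function names are illustrative.

```python
import calendar
from datetime import date


def forecast_pct_of_budget(month_to_date: float, budget: float, today: date) -> float:
    """Project month-end spend at the current daily run rate, as a % of budget."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    projected = month_to_date / today.day * days_in_month
    return round(projected / budget * 100, 1)


def should_alert(pct_of_budget: float, threshold: float = 80.0) -> bool:
    """Mirror the GREATER_THAN 80% rule from the budget definition."""
    return pct_of_budget > threshold
```

A team that has spent 650 USD of a 1,000 USD budget by April 15 projects to 1,300 USD for the month, which is 130% and well past the 80% threshold.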

Step 3: the weekly review

This was the single highest-leverage thing we did. It is a 30-minute meeting every Friday, attended by:

A representative from each engineering team
Someone from finance (for context, not to police)
The platform engineer who maintains the Cost Explorer dashboards

The meeting reviews:

Last week's spend by team, vs the prior week
Any week-over-week change greater than 20% (each team explains)
Forecast for the rest of the month vs budget
Any budget alarms that fired in the past week
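
The 20% week-over-week filter from the agenda above can be sketched in a few lines. This assumes per-team weekly totals are already available as dictionaries, and it treats drops as well as increases as worth explaining; both choices are assumptions rather than a description of the real dashboard.

```python
def wow_flags(last_week: dict[str, float], prior_week: dict[str, float],
              threshold: float = 0.20) -> dict[str, float]:
    """Return {team: fractional change} for teams whose spend moved more than threshold."""
    flagged = {}
    for team, current in last_week.items():
        previous = prior_week.get(team, 0.0)
        if previous == 0:
            # New spend has no meaningful percentage; flag it so someone explains it.
            if current > 0:
                flagged[team] = float("inf")
            continue
        change = (current - previous) / previous
        if abs(change) > threshold:
            flagged[team] = round(change, 3)
    return flagged
```

Teams that appear in the result are the ones asked to explain the change in the meeting; everyone else is skipped.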

The meeting is short on purpose. 30 minutes, fixed agenda, no presentations. If a team is over budget, we discuss it for 5 minutes; if more is needed, that team takes it offline.

The structure is what made it work. Before this meeting, cost discussions were vague and happened at the wrong altitude. The weekly cadence kept everyone aware; the team-by-team breakdown made accountability concrete.

What we found, in order of size

The big wins, roughly in order of dollars saved:

Idle dev/staging resources. Two reserved RDS instances in dev that nobody had used in 4 months. ~$300/month each. We started running a weekly "idle resource" report that flags untouched instances; teams either justify or terminate.

Over-provisioned EC2 instances. Several instances were sized "to be safe" three years ago and never resized. Running at 8% CPU. CloudWatch + a script identified them; right-sizing recovered ~$1.5k/month total.
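
A filter of this kind can be very small. The sketch below assumes CPU utilisation samples have already been pulled from CloudWatch into a dictionary keyed by instance ID; the 10% cutoff and the function name are illustrative assumptions, and average CPU alone says nothing about memory or burst needs.

```python
from statistics import mean


def underused(cpu_samples: dict[str, list[float]], max_avg_cpu: float = 10.0) -> list[str]:
    """Return instance IDs whose average CPU % over the sample window is below the cutoff."""
    return sorted(
        instance_id
        for instance_id, samples in cpu_samples.items()
        if samples and mean(samples) < max_avg_cpu
    )
```

Instances with no samples are skipped rather than flagged, since missing metrics are not evidence of low usage.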

Forgotten data transfer. A staging job was reading from S3 in another region. Cross-region data transfer is $0.02/GB; the job was reading ~5 TB/day. ~$3000/month for a job nobody had reviewed in a year. Moved it to same-region S3.
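
The arithmetic behind that figure, assuming decimal terabytes and a 30-day month:

```python
PRICE_PER_GB = 0.02   # cross-region transfer rate quoted above, USD
TB_PER_DAY = 5        # approximate read volume quoted above
GB_PER_TB = 1000      # decimal terabytes (an assumption)
DAYS_PER_MONTH = 30   # round month (an assumption)

monthly_cost = TB_PER_DAY * GB_PER_TB * DAYS_PER_MONTH * PRICE_PER_GB
print(f"${monthly_cost:,.0f}/month")  # prints $3,000/month
```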

EBS snapshots from 2022. Hundreds of orphan snapshots. ~$400/month. A lifecycle policy now expires snapshots older than 90 days.

Forgotten S3 buckets. A bucket from a 2023 experiment containing 12 TB of generated data. ~$280/month for storage of data that nobody had read in over a year. Tagging revealed it; we archived to Glacier.

Total monthly savings from these five categories alone: ~$5,400. None of them were structurally interesting; we just hadn't been looking.

The longer-tail wins

After the easy wins, the medium-sized stuff:

Spot instances for batch jobs. Migrating our batch ETL to spot saved ~25% on that workload. We're conservative on this — only batch jobs that tolerate interruption.
Savings Plans for steady-state compute. Bought after we stabilised. ~10% savings on the predictable portion of our compute.
NAT Gateway architecture. We had per-AZ NAT Gateways processing surprisingly large volumes of cross-AZ traffic. Reducing chattiness between services in different AZs saved ~$500/month.
CloudWatch Logs retention. The default is "never expire." Setting reasonable retention (30-90 days for most log groups, 365 for security-relevant ones) recovered storage cost.

Each of these required a couple of hours of analysis and a Terraform PR. None were dramatic individually; together they added another ~$3k/month.

What we don't do

Per-developer IAM users with their own tags. We use SSO and assumed roles; cost is allocated by the role, not the human.
Detailed showback to product teams. Engineering owns AWS costs in our org; the engineering teams are the cost centers. Showback to product would add political complexity for limited operational value.
Aggressive automated cost-savings actions (like auto-shutdown of idle instances). The behaviour is too easy to get wrong; we let humans make those calls in the weekly review.

What surprised us

The behaviour change happened fastest among teams whose lead was on the budget notification email. Once cost was personal — "your team is at 95% of budget, with 8 days to go" — the team responded. Before that, cost was abstract.

Tagging compliance plateaued at about 95%, never quite hitting 100%. The remaining 5% are mostly auto-created resources from AWS services we haven't fully scripted around. We accept the gap and treat the untagged residue as platform-team cost.

What I'd tell a team starting

Don't try to optimize before you can measure. The budget structure and the tagging are what make optimization possible. Without them, you're chasing anecdotes.

Pick five tags max. We see teams try to enforce ten or more, and adoption tanks. Five is the right size — enough to cover ownership and chargeback, few enough that engineers will remember.

The weekly meeting is the lever. Without it, the dashboards exist but nobody acts on them. With it, every week is a small forcing function for cost awareness.

The order matters: tagging first (so you can measure), budgets second (so you can alert), weekly meeting third (so you can act). Skip any layer and the next one is harder to make work.

Numbers, year-over-year

Headcount: +15%. Revenue: +40%. AWS spend: +3%. Cost per engineer: -10%.
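
The cost-per-engineer figure follows from the other two numbers, since spend per head scales as spend growth divided by headcount growth:

```python
headcount_growth = 0.15  # +15% from the line above
spend_growth = 0.03      # +3% from the line above

cost_per_engineer_change = (1 + spend_growth) / (1 + headcount_growth) - 1
print(f"{cost_per_engineer_change:+.1%}")  # prints -10.4%
```

That is 1.03 / 1.15 ≈ 0.896, which rounds to the -10% quoted.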

The compounding from staying disciplined is real. The first year of doing this was the hardest; the second year it was just "what we do on Fridays."

About Kiril Urbonas