A different angle on AWS cost work: the operational discipline that prevents costs from creeping back up after the initial cleanup.


AWS Cost Optimization: Discipline That Sticks

Most cost-optimization advice focuses on the initial cleanup — find waste, fix it, save money. The reality is that costs creep back up if there's no ongoing discipline. We've cut our AWS bill multiple times over the years; the durable savings came from process, not one-time cleanups. This post is the operational side: the practices that keep the savings sticky.

The pattern of cost growth #

Without discipline, AWS costs grow predictably:

New services get over-provisioned "to be safe"
Reserved capacity expires; gets renewed at higher levels because consumption grew
Old resources accumulate (snapshots, EBS volumes from terminated instances, idle ELBs)
New AWS services get adopted without cost analysis
Engineers default to bigger instance types because tuning is work

Each of these is invisible day-to-day. Over a year, the bill can grow 30-50% even if usage is stable. Cleanups bring it back; without ongoing discipline, the cycle repeats.

Tagging: the foundation #

Every resource has tags. Without tags, attribution is impossible and accountability evaporates.

Required tags on every resource:

team: which team owns this
env: dev / staging / prod
service: which logical service this belongs to
cost-center: budget category

We enforce via:

Tag policies (organization-level): some tags are required for resource creation
Cost allocation tags activated for these in the billing dashboard
A monthly script that flags untagged resources for owners

After the initial enforcement, untagged-resource volume dropped to nearly zero. New resources get tagged at creation; existing ones were tagged in batch during a multi-month cleanup.
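As a rough illustration of the monthly untagged-resource check, here is a minimal Python sketch. The record shape (`id`, `tags`) and the function names are assumptions made for this example, not the actual script; in practice the inventory would come from something like the Resource Groups Tagging API.

```python
# Required tag keys, taken from the list above.
REQUIRED_TAGS = ("team", "env", "service", "cost-center")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank on one resource."""
    return [key for key in required if not (resource_tags.get(key) or "").strip()]


def flag_untagged(resources):
    """Map resource ID -> missing tag keys, only for resources with gaps.

    `resources` is a list of {"id": str, "tags": dict} records
    (a hypothetical shape chosen for this sketch).
    """
    report = {}
    for resource in resources:
        gaps = missing_tags(resource.get("tags", {}))
        if gaps:
            report[resource["id"]] = gaps
    return report


if __name__ == "__main__":
    inventory = [
        {"id": "vol-1", "tags": {"team": "search", "env": "prod",
                                 "service": "indexer", "cost-center": "eng"}},
        {"id": "vol-2", "tags": {"team": "search", "env": ""}},
    ]
    for resource_id, gaps in flag_untagged(inventory).items():
        print(f"{resource_id}: missing {', '.join(gaps)}")
```

A blank tag value is treated the same as a missing tag, since an empty `team` is no more useful for attribution than no `team` at all.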

The cost dashboard, filtered by team, becomes the team's bill — actionable per team, not opaque "AWS bill went up."

Per-team budgets and alerts #

Budget per team, refreshed quarterly. Alerts at 80% of budget. The team owner gets notified; conversation happens.

Budgets are set based on previous quarter actual + planned changes. Not arbitrary "save 10%" targets — concrete budgets tied to expected work.

When a team blows their budget, the conversation isn't punitive. It's "what changed; is it justified; what to do about it." Sometimes the budget is wrong (we didn't account for the new service). Sometimes there's a real anomaly (a misconfigured Lambda, a forgotten test environment, etc.).

The discipline isn't "stay under budget at all costs." It's "you know what you're spending; surprises require explanation."
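The budget rule (previous quarter actual plus planned changes) and the 80% alert can be sketched in a few lines of Python. The function names and status labels are illustrative assumptions; a real setup would more likely use AWS Budgets for the notification itself.

```python
def quarterly_budget(previous_actual, planned_changes):
    """Budget = last quarter's actual spend plus the expected deltas.

    `planned_changes` is a list of signed amounts, e.g. +12000 for a new
    service and -2000 for a planned decommission.
    """
    return previous_actual + sum(planned_changes)


def budget_status(spend_to_date, budget, alert_threshold=0.80):
    """Classify spend against budget: 'ok', 'alert' (>= 80%), or 'over'."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend_to_date / budget
    if ratio >= 1.0:
        return "over"
    if ratio >= alert_threshold:
        return "alert"
    return "ok"


if __name__ == "__main__":
    budget = quarterly_budget(90_000, [12_000, -2_000])
    print(budget, budget_status(85_000, budget))
```

The point of computing the budget from actuals plus planned work is that a breach then says something specific: either the plan was wrong or something unplanned happened.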

Monthly cost review #

A 30-minute meeting every month. Standing agenda:

Total cost vs last month / vs same-month-last-year
Top 5 movers (services where cost changed materially)
Projects in progress that are spending more than expected
Action items from last month — status

Attendees: engineering leadership, finance, anyone whose costs are anomalous this month.

What this prevents: silent drift. A service whose cost grows 50% over three months gets caught at the next monthly review, not at the end of the year.

What this doesn't replace: per-team ownership. The meeting is for cross-team visibility; per-team tracking happens within teams.
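The "top 5 movers" agenda item is easy to compute from two months of per-service cost. The sketch below assumes the costs are already available as plain dictionaries (for example, exported from Cost Explorer); the function name and data shape are hypothetical.

```python
def top_movers(previous, current, n=5):
    """Return the n services with the largest absolute month-over-month change.

    `previous` and `current` map service name -> monthly cost.
    Each result row is (service, change, percent_change); percent_change is
    None for services that had no cost in the previous month.
    """
    rows = []
    for service in set(previous) | set(current):
        before = previous.get(service, 0.0)
        after = current.get(service, 0.0)
        change = after - before
        percent = (change / before * 100) if before else None
        rows.append((service, change, percent))
    # Largest absolute change first; service name breaks ties deterministically.
    rows.sort(key=lambda row: (-abs(row[1]), row[0]))
    return rows[:n]


if __name__ == "__main__":
    last_month = {"ec2": 1000, "s3": 200, "rds": 500}
    this_month = {"ec2": 1500, "s3": 210, "lambda": 300}
    for service, change, percent in top_movers(last_month, this_month):
        pct = "new" if percent is None else f"{percent:+.0f}%"
        print(f"{service:8} {change:+9.2f} ({pct})")
```

Ranking by absolute change rather than percentage keeps small services with noisy percentages from crowding out the line items that actually move the bill.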

Quarterly deep-dive #

Every quarter, one team does a deep cost dive on their services. Goes through:

Right-sizing: are instances appropriately sized?
Spot opportunities: any workload that could move to spot?
Storage tiering: any S3 paths that should be on Glacier?
Reservations: anything that could be reserved?
Zombie cleanup: snapshots, volumes, ELBs nobody uses?

The deep-dive rotates across teams, one team per quarter. Findings go to the following month's cost review.

This catches what monthly reviews don't — subtle things that don't show up as month-over-month spikes but are persistently wasteful.

Pre-deploy cost estimates #

For new infrastructure (new services, significant capacity changes), the PR includes a cost estimate. Standard template:

Estimated monthly cost at expected baseline
Estimated cost at peak / 3x baseline
Reserved-vs-on-demand decision and reasoning
Comparison with similar existing services

Reviewers can push back: "this is 2x what similar services cost; why?" Often there's a good reason. Sometimes there isn't and the resource gets right-sized in code review.

We use Infracost (or similar) to generate estimates from Terraform plans. The diff in the PR comments includes cost projections.
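To make the template concrete, here is a back-of-the-envelope Python sketch of the numbers a reviewer would see. It is not how Infracost works; the hourly rate, the 730-hours-per-month convention, and the function names are assumptions for illustration only.

```python
HOURS_PER_MONTH = 730  # common billing approximation (8760 hours / 12)


def monthly_cost(instance_count, hourly_rate, hours=HOURS_PER_MONTH):
    """On-demand compute cost for a fleet running the whole month."""
    return instance_count * hourly_rate * hours


def estimate(baseline_instances, hourly_rate, peak_multiplier=3,
             similar_service_monthly=None):
    """Fill in the PR template: baseline, peak, and ratio to a similar service."""
    baseline = monthly_cost(baseline_instances, hourly_rate)
    peak = monthly_cost(baseline_instances * peak_multiplier, hourly_rate)
    ratio = baseline / similar_service_monthly if similar_service_monthly else None
    return {"baseline": baseline, "peak": peak, "vs_similar": ratio}


if __name__ == "__main__":
    # 0.10 USD/hour is a placeholder rate, not a real instance price.
    result = estimate(4, 0.10, similar_service_monthly=146.0)
    print(f"baseline ${result['baseline']:.2f}/mo, peak ${result['peak']:.2f}/mo, "
          f"{result['vs_similar']:.1f}x a similar service")
```

A ratio well above 1 is exactly the "this is 2x what similar services cost; why?" prompt for the reviewer.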

Compute Optimizer / Cost Anomaly Detection #

AWS provides free tools:

Compute Optimizer: looks at instance utilization over 14 days and suggests right-sizes. We act on its recommendations quarterly. It's conservative; if it says "downsize," it's almost always right.

Cost Anomaly Detection: ML-based detection of unusual cost spikes. We have it configured per team; alerts go to the team's Slack.

These are free; they should be on by default in every account. We've caught several issues this way:

A service that started consuming 10x more compute after a deploy (memory leak retrying expensive operations)
A new feature launch that drove S3 PUT costs higher than expected (small writes; should batch)
A misconfigured Lambda function that was invoking itself recursively

Each of these would have shown up in the monthly review eventually; Anomaly Detection caught them within hours.

Reservations and Savings Plans #

For baseline workloads that don't churn:

1-year compute Savings Plan covering ~70% of baseline EC2 spend
1-year RDS Reserved Instances for prod databases
Compute Savings Plans (more flexible than EC2 RIs) preferred where applicable

We don't go 3-year because workloads change. 1-year is the right balance.

The discipline: review reservation utilization monthly. Underutilized reservations are wasted commitment; reservations that run near full utilization while on-demand spend remains suggest we should buy more.

For 90%+ utilization: buy more. For 60-90%: keep status quo; review next quarter. For < 60%: don't renew when this expires.
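The thresholds above translate directly into a small decision function. This is a sketch of the rule as stated, with a hypothetical function name; the utilization figure itself would come from the Savings Plans or Reserved Instance utilization reports.

```python
def reservation_action(utilization_pct):
    """Map reservation utilization (0-100) to the review decision."""
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    if utilization_pct >= 90:
        return "buy more"
    if utilization_pct >= 60:
        return "keep; review next quarter"
    return "do not renew"


if __name__ == "__main__":
    for pct in (95, 75, 40):
        print(f"{pct}% -> {reservation_action(pct)}")
```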

Spot for tolerant workloads #

Spot saves 60-90% on compute. Workloads that should be on spot:

Batch jobs (idempotent; retry on interruption)
CI runners (reschedule on interruption)
Stateless web servers (mixed with on-demand for baseline reliability)
Kubernetes worker pools where pods can move

Workloads that shouldn't:

Stateful services (databases, message queues without HA)
Long-running jobs that can't checkpoint
Anything where interruption causes cascading issues

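For Kubernetes specifically, Karpenter manages spot effectively. We default to spot; pods that need on-demand request it explicitly, typically with a node selector on the karpenter.sh/capacity-type label, plus tolerations where the on-demand pool is tainted.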

Storage tiering #

S3 lifecycle policies on every bucket:

Standard for active data (< 30 days)
Standard-IA after 30 days
Glacier Instant Retrieval after 90 days
Glacier Deep Archive after 1 year

Per-bucket tuning based on access patterns. Some buckets keep everything in Standard (the data is always hot); some go to Glacier faster.

The discipline: every new bucket gets a lifecycle policy at creation. We have a Terraform module for "S3 bucket with standard lifecycle"; new buckets use it by default.
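For illustration, the standard tiering schedule can be expressed as an S3 lifecycle configuration. The sketch below builds the dictionary in Python rather than Terraform; it is not the module mentioned above, and the rule ID and function name are made up for this example. The resulting structure follows the shape boto3 accepts in `put_bucket_lifecycle_configuration`.

```python
def standard_lifecycle(ia_days=30, glacier_ir_days=90, deep_archive_days=365):
    """Build a lifecycle configuration for the default tiering schedule."""
    if ia_days < 30:
        # S3 does not transition objects to Standard-IA before 30 days.
        raise ValueError("Standard-IA transition requires at least 30 days")
    if not ia_days < glacier_ir_days < deep_archive_days:
        raise ValueError("transition days must be strictly increasing")
    return {
        "Rules": [
            {
                "ID": "standard-tiering",       # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},       # apply to the whole bucket
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_ir_days, "StorageClass": "GLACIER_IR"},
                    {"Days": deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    }


if __name__ == "__main__":
    import json
    print(json.dumps(standard_lifecycle(), indent=2))
```

Per-bucket tuning then becomes a matter of passing different day counts, or skipping the rule entirely for buckets whose data stays hot.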

Right-sizing on a schedule #

Quarterly: every team's services get a right-sizing review. The process:

Pull CloudWatch metrics for the past 30 days (CPU, memory, network; memory requires the CloudWatch agent on EC2)
Compute p95 utilization
If p95 < 30%: candidate for right-sizing
Test: what would performance look like one size smaller? (Often: fine.)
Right-size in code; deploy; monitor.

Most services right-size cleanly. A few hit issues — usually because peak load is rare but real, or because the headroom is intentional (large async batches periodically). Document the exceptions; revisit next quarter.
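The p95 check at the heart of this process can be sketched as follows. The nearest-rank percentile method and the function names are choices made for this example; the samples are assumed to be utilization percentages already pulled from CloudWatch.

```python
def p95(samples):
    """95th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def is_rightsizing_candidate(utilization_samples, threshold_pct=30.0):
    """True when p95 utilization sits below the threshold (30% by default)."""
    return p95(utilization_samples) < threshold_pct


if __name__ == "__main__":
    # Mostly idle service with one rare spike in 20 samples.
    cpu = [10] * 19 + [90]
    print(p95(cpu), is_rightsizing_candidate(cpu))
```

Note what the example shows: a rare spike above the 95th percentile does not stop a service from being flagged, which is exactly the "peak load is rare but real" exception worth documenting rather than silently downsizing.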

Cleanup automation #

Some cleanup is automated:

EBS snapshots older than retention policy: auto-deleted
Unattached EBS volumes older than 30 days: alert (we don't auto-delete; they sometimes hold data)
ELBs with no targets for 7 days: alert
EC2 instances older than 90 days without recent activity: alert

The alert goes to the owner (via tag); they decide what to do. After 14 days of no response, the platform team escalates.

Automating delete is risky; alert-and-escalate is safer.
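A minimal sketch of the alert-and-escalate flow, assuming each finding records when the condition was first observed and when the owner was alerted. The resource kinds, field names, and return labels are hypothetical; only the 30-day, 7-day, and 14-day thresholds come from the text.

```python
from datetime import date

# Days a condition must persist before the owner is alerted.
ALERT_AFTER_DAYS = {"unattached_volume": 30, "elb_no_targets": 7}
# Days of owner silence before the platform team escalates.
ESCALATE_AFTER_DAYS = 14


def cleanup_action(kind, first_seen, today, alerted_on=None):
    """Decide the next step for one finding. Never deletes anything."""
    if (today - first_seen).days < ALERT_AFTER_DAYS[kind]:
        return "none"
    if alerted_on is None:
        return "alert-owner"
    if (today - alerted_on).days >= ESCALATE_AFTER_DAYS:
        return "escalate"
    return "waiting"


if __name__ == "__main__":
    today = date(2025, 3, 31)
    print(cleanup_action("unattached_volume", date(2025, 2, 1), today))
    print(cleanup_action("elb_no_targets", date(2025, 3, 1), today,
                         alerted_on=date(2025, 3, 10)))
```

The function deliberately has no "delete" outcome: the decision stays with the owner, and the automation only makes sure someone is asked.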

Cost-of-quality tracking #

For services with measurable business outcomes, we compute cost per outcome:

LLM-powered features: cost per successful interaction
Compute-heavy features: cost per processing unit
Storage-heavy features: cost per data unit retained

These metrics are way more useful than raw $/month. They normalize for usage growth, framing cost in business terms.

When cost-per-outcome trends up, that's a signal worth investigating. When it's stable, raw cost growth is most likely explained by usage growth and isn't a problem.
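A small Python sketch of the idea: divide cost by outcomes, then compare the latest month with the first. The 5% tolerance and the function names are assumptions for this example, not values from our setup.

```python
def cost_per_outcome(cost, outcomes):
    """Unit cost, e.g. dollars per successful interaction."""
    if outcomes <= 0:
        raise ValueError("outcomes must be positive")
    return cost / outcomes


def trend(history, tolerance=0.05):
    """Compare the latest unit cost with the earliest one.

    `history` is a list of (cost, outcomes) pairs, oldest first.
    Returns 'up', 'down', or 'stable' (within +/- tolerance).
    """
    first = cost_per_outcome(*history[0])
    last = cost_per_outcome(*history[-1])
    change = (last - first) / first
    if change > tolerance:
        return "up"
    if change < -tolerance:
        return "down"
    return "stable"


if __name__ == "__main__":
    # Raw cost doubled, but so did usage: unit cost is unchanged.
    print(trend([(1000, 10_000), (2000, 20_000)]))
    # Raw cost doubled on flat usage: unit cost doubled.
    print(trend([(1000, 10_000), (2000, 10_000)]))
```

Both example histories show the same raw cost increase; only the second one is a problem, which is the distinction raw $/month cannot make.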

What we don't bother with #

Some cost-optimization activities don't pay off for our shape:

Hand-tuning every Lambda's memory. AWS Lambda Power Tuning gives us most of the value automatically.

Custom cost dashboards beyond the AWS-native + Infracost. The AWS Cost Explorer is good enough for trend analysis; per-team breakdowns from tags work fine.

Committing to deep multi-year reservations. Workloads change too much.

Third-party billing analytics. Tools like Cloudability, Vantage, etc. are nice, but our scale doesn't justify another tool. We use AWS Cost Explorer + custom queries on Cost and Usage Reports.

What I'd tell a team starting #

Tag everything from day one. Without tags, none of the discipline works.

Per-team budgets, not org-wide. Accountability happens at the team level.

Monthly review, quarterly deep-dive. Both cadences matter.

Pre-deploy cost estimates. Catch over-provisioning at the PR review.

Compute Optimizer + Cost Anomaly Detection. Free, valuable, set them up.

Automate cleanup of obvious waste. Snapshots, orphans, idle resources.

Cost per outcome, not just cost. Frame for the business.

The teams I've seen succeed at sustained cost optimization treat it as ongoing discipline, not as a project. The teams that struggle do periodic cleanups, see costs rise back up, and do it again next year. The cleanup work is the easy part — establishing the discipline that prevents recurrence is where the actual value lives.

About Kiril Urbonas
