A different angle on AWS cost work: the operational discipline that prevents costs from creeping back up after the initial cleanup.
Most cost-optimization advice focuses on the initial cleanup — find waste, fix it, save money. The reality is that costs creep back up if there's no ongoing discipline. We've cut our AWS bill multiple times over the years; the durable savings came from process, not one-time cleanups. This post is the operational side: the practices that keep the savings sticky.
Without discipline, AWS costs grow predictably:
Each of these is invisible day-to-day. Over a year, the bill can grow 30-50% even if usage is stable. Cleanups bring it back; without ongoing discipline, the cycle repeats.
Every resource has tags. Without tags, attribution is impossible and accountability evaporates.
Required tags on every resource:
team: which team owns this
env: dev / staging / prod
service: which logical service this belongs to
cost-center: budget category
We enforce via:
After the initial enforcement, untagged-resource volume dropped to nearly zero. New resources are tagged at creation; existing ones were tagged in batch during a multi-month cleanup.
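As a rough illustration of what a tag check could look like, here is a minimal sketch that validates a resource's tags against the four required keys listed above. The function name `tag_violations` and the idea of running it in a CI step or audit script are assumptions for illustration, not a description of the enforcement we actually run.

```python
# Required tag keys, taken from the list above.
REQUIRED_TAGS = ("team", "env", "service", "cost-center")
# Allowed values for the env tag, also from the list above.
ALLOWED_ENVS = {"dev", "staging", "prod"}


def tag_violations(tags: dict[str, str]) -> list[str]:
    """Return a list of problems with a resource's tags; an empty list means compliant.

    Hypothetical helper: `tags` is a plain key/value mapping, however it was fetched.
    """
    problems = [
        f"missing tag: {key}"
        for key in REQUIRED_TAGS
        if not tags.get(key, "").strip()
    ]
    env = tags.get("env", "").strip()
    if env and env not in ALLOWED_ENVS:
        problems.append(f"invalid env: {env}")
    return problems
```

A check like this can fail a pipeline when the returned list is non-empty, or feed a report of untagged resources grouped by account.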
The cost dashboard, filtered by team, becomes the team's bill — actionable per team, not opaque "AWS bill went up."
Budget per team, refreshed quarterly. Alerts at 80% of budget. The team owner gets notified; conversation happens.
Budgets are set based on previous quarter actual + planned changes. Not arbitrary "save 10%" targets — concrete budgets tied to expected work.
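The budget rule is simple enough to write down. The sketch below assumes a budget is the previous quarter's actual spend plus the expected cost of planned changes, with an alert once spend reaches 80% of that figure. The `TeamBudget` class and `should_alert` function are hypothetical names for illustration; in practice the alerting itself would live in whatever budgeting tool is in use.

```python
from dataclasses import dataclass

# Alert once spend reaches 80% of the budget, as described above.
ALERT_THRESHOLD = 0.80


@dataclass(frozen=True)
class TeamBudget:
    team: str
    previous_quarter_actual: float  # what the team actually spent last quarter
    planned_changes: float          # expected cost delta from planned work (may be negative)

    @property
    def amount(self) -> float:
        """Budget for the quarter: last quarter's actual plus planned changes."""
        return self.previous_quarter_actual + self.planned_changes


def should_alert(budget: TeamBudget, spend_to_date: float) -> bool:
    """True once quarter-to-date spend has reached the alert threshold."""
    return spend_to_date >= ALERT_THRESHOLD * budget.amount
```

For example, a team that spent 90,000 last quarter and plans a new service estimated at 10,000 gets a budget of 100,000 and is alerted once spend passes 80,000.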
When a team blows their budget, the conversation isn't punitive. It's "what changed; is it justified; what to do about it." Sometimes the budget is wrong (we didn't account for the new service). Sometimes there's a real anomaly (a misconfigured Lambda, a forgotten test environment, etc.).
The discipline isn't "stay under budget at all costs." It's "you know what you're spending; surprises require explanation."
A 30-minute meeting every month. Standing agenda:
Attendees: engineering leadership, finance, anyone whose costs are anomalous this month.
What this prevents: silent drift. A service whose cost grew 50% over 3 months gets caught at the next monthly meeting, not at the end of the year.
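To make "silent drift" concrete, here is a small sketch that flags services whose cost has grown by some fraction over a trailing window of months. The 3-month window and 50% threshold mirror the example above but are assumptions, not a rule we apply mechanically; the function names are hypothetical, and the input is assumed to be a list of monthly totals per service, oldest first.

```python
from typing import Optional


def growth_over_window(monthly_costs: list[float], months: int = 3) -> Optional[float]:
    """Fractional growth between the cost `months` months ago and the latest month.

    Returns None when there is not enough history or the baseline is zero.
    """
    if len(monthly_costs) <= months:
        return None
    baseline = monthly_costs[-(months + 1)]
    if baseline <= 0:
        return None
    return (monthly_costs[-1] - baseline) / baseline


def drifting_services(
    costs_by_service: dict[str, list[float]],
    months: int = 3,
    threshold: float = 0.5,
) -> list[str]:
    """Names of services whose cost grew by at least `threshold` over the window."""
    flagged = []
    for service, costs in costs_by_service.items():
        growth = growth_over_window(costs, months)
        if growth is not None and growth >= threshold:
            flagged.append(service)
    return sorted(flagged)
```

A list like this is a reasonable starting point for the review agenda: no single month looks alarming, but the trailing-window comparison surfaces the trend.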
What this doesn't replace: per-team ownership. The meeting is for cross-team visibility; per-team tracking happens within teams.
Every quarter, one team does a deep cost dive on their services. Goes through:
The team that just had its quarter does the deep-dive. Findings go to the monthly review the next month.
This catches what monthly reviews don't — subtle things that don't show up as month-over-month spikes but are persistently wasteful.
For new infrastructure (new services, significant capacity changes), the PR includes a cost estimate. Standard template:
Reviewers can push back: "this is 2x what similar services cost; why?" Often there's a good reason. Sometimes there isn't and the resource gets right-sized in code review.
We use Infracost (or similar) to generate estimates from Terraform plans. The diff in the PR comments includes cost projections.
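The reviewer's "2x what similar services cost" question can also be asked automatically. This is a minimal sketch, assuming the PR's monthly estimate and the monthly costs of comparable services are already available as numbers. The function name, the use of the median as the comparison point, and the choice to flag when there is nothing to compare against are all assumptions for illustration, not part of Infracost.

```python
from statistics import median


def needs_cost_justification(
    estimate: float,
    similar_monthly_costs: list[float],
    ratio: float = 2.0,
) -> bool:
    """True when the estimate is at least `ratio` times the median of similar services.

    With no comparable services, return True so the PR author supplies a rationale.
    """
    if not similar_monthly_costs:
        return True
    return estimate >= ratio * median(similar_monthly_costs)
```

A flag here is a prompt for a conversation in review, not a block: there is often a good reason for the difference.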
AWS provides free tools:
Compute Optimizer: analyzes instance utilization over the past 14 days (its default lookback window) and recommends right-sizing changes. We act on its recommendations quarterly. It's conservative; if it says "downsize," it's almost always right.
Cost Anomaly Detection: ML-based detection of unusual cost spikes. We have it configured per team; alerts go to the team's Slack.
These are free; they should be on by default in every account. We've caught several issues this way:
Each of these would have shown up in the monthly review eventually; Anomaly Detection caught them within hours.
For baseline workloads that don't churn:
We don't go 3-year because workloads change. 1-year is the right balance.
The discipline: review reservation utilization monthly. Underutilized reservations are wasted commitment; overutilized means we should buy more.
For 90%+ utilization: buy more. For 60-90%: keep the status quo; review next quarter. For < 60%: don't renew when the reservation expires.
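The utilization rule above maps directly onto a small function. This sketch assumes utilization is expressed as a percentage between 0 and 100 and treats exactly 90% as "buy more"; the function name and the returned strings are hypothetical.

```python
def reservation_action(utilization_pct: float) -> str:
    """Map a reservation's utilization percentage to the action described above."""
    if not 0 <= utilization_pct <= 100:
        raise ValueError(f"utilization must be between 0 and 100, got {utilization_pct}")
    if utilization_pct >= 90:
        return "buy more"
    if utilization_pct >= 60:
        return "keep; review next quarter"
    return "do not renew at expiry"
```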
Spot saves 60-90% on compute. Workloads that should be on spot:
Workloads that shouldn't:
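For Kubernetes specifically, Karpenter manages spot effectively. We default to spot; specific pods that need on-demand request it explicitly, typically with a node selector on the `karpenter.sh/capacity-type` label, plus tolerations where the on-demand nodes are tainted.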
S3 lifecycle policies on every bucket:
Per-bucket tuning based on access patterns. Some buckets keep everything in Standard (the data is always hot); some go to Glacier faster.
The discipline: every new bucket gets a lifecycle policy at creation. We have a Terraform module for "S3 bucket with standard lifecycle"; new buckets use it by default.
Quarterly: every team's services get a right-sizing review. The process:
Most services right-size cleanly. A few hit issues, usually because peak load is rare but real, or because the headroom is intentional (for example, large async batches that run periodically). Document the exceptions; revisit next quarter.
Some cleanup is automated:
The alert goes to the owner (via tag); they decide what to do. After 14 days of no response, the platform team escalates.
Automating delete is risky; alert-and-escalate is safer.
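The alert-and-escalate flow is a small state decision. The sketch below assumes we record the date the owner was alerted and whether they have responded; the function name and the state labels are hypothetical. It deliberately has no "delete" outcome, which matches the point above.

```python
from datetime import date, timedelta

# Escalate to the platform team after 14 days without a response, as described above.
ESCALATION_AFTER = timedelta(days=14)


def next_step(alerted_on: date, owner_responded: bool, today: date) -> str:
    """Decide what happens to a flagged resource; nothing is deleted automatically."""
    if owner_responded:
        return "owner-handling"
    if today - alerted_on >= ESCALATION_AFTER:
        return "escalate-to-platform-team"
    return "waiting-on-owner"
```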
For services with measurable business outcomes, we compute cost per outcome:
These metrics are far more useful than raw $/month. They normalize for usage growth and frame cost in business terms.
When cost per outcome trends up, that's a signal worth investigating. When it's stable, raw cost growth is probably explained by usage growth and is not a problem.
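A minimal sketch of the calculation, assuming each month is recorded as a (cost, outcome count) pair, oldest first. The function names and the 5% tolerance for calling a trend "stable" are assumptions for illustration; the right tolerance depends on how noisy the metric is.

```python
def cost_per_outcome(cost: float, outcomes: int) -> float:
    """Unit cost: total cost divided by the number of business outcomes."""
    if outcomes <= 0:
        raise ValueError("outcomes must be positive")
    return cost / outcomes


def unit_cost_trend(months: list[tuple[float, int]], tolerance: float = 0.05) -> str:
    """Compare unit cost in the first and last month: 'up', 'down', or 'stable'."""
    if len(months) < 2:
        raise ValueError("need at least two months to compare")
    first = cost_per_outcome(*months[0])
    last = cost_per_outcome(*months[-1])
    if first == 0:
        raise ValueError("first month has zero cost; trend is undefined")
    change = (last - first) / first
    if change > tolerance:
        return "up"
    if change < -tolerance:
        return "down"
    return "stable"
```

For example, a bill that goes from 10,000 to 15,000 while outcomes go from 1,000,000 to 1,500,000 is stable per outcome; the same bill growth with flat outcomes is trending up.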
Some cost-optimization activities don't pay off for our shape:
Hand-tuning every Lambda's memory. AWS Lambda Power Tuning gives us most of the value automatically.
Custom cost dashboards beyond AWS-native tooling and Infracost. AWS Cost Explorer is good enough for trend analysis; per-team breakdowns from tags work fine.
Committing to deep multi-year reservations. Workloads change too much.
Third-party billing analytics. Tools like Cloudability and Vantage are nice, but our scale doesn't justify another tool. We use AWS Cost Explorer plus custom queries on Cost and Usage Reports.
Tag everything from day one. Without tags, none of the discipline works.
Per-team budgets, not org-wide. Accountability happens at the team level.
Monthly review, quarterly deep-dive. Both cadences matter.
Pre-deploy cost estimates. Catch over-provisioning at the PR review.
Compute Optimizer + Cost Anomaly Detection. Free, valuable, set them up.
Automate cleanup of obvious waste. Snapshots, orphans, idle resources.
Cost per outcome, not just cost. Frame for the business.
The teams I've seen succeed at sustained cost optimization treat it as ongoing discipline, not as a project. The teams that struggle do periodic cleanups, see costs rise back up, and do it again next year. The cleanup work is the easy part — establishing the discipline that prevents recurrence is where the actual value lives.