A real cost audit uncovered idle load balancers, oversized RDS instances, and forgotten snapshots. Here's what we found and how we fixed each one.
After our AWS bill crossed $18,000/month for a 15-person startup, we did a proper audit. We found $6,200 in monthly waste. Here's every item.
Three ALBs were still running from decommissioned staging environments. Each costs ~$16/month base plus LCU charges.
Fix: We added a Terraform lifecycle check that requires every ALB to be tagged with its owning team and a TTL. A weekly Lambda deletes any ALB that is past its TTL and has zero healthy targets.
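The deletion rule is the part worth getting right, so here is a minimal sketch of it as a pure function. The tag keys `created-at` and `ttl-days` are hypothetical (the tag names are not given above), and the real Lambda would still need to fetch the tags and target health from the ELB API before calling it.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tag keys, chosen for illustration only.
CREATED_TAG = "created-at"  # ISO 8601 timestamp
TTL_TAG = "ttl-days"        # whole number of days


def is_expired_and_idle(tags: dict, healthy_targets: int, now: datetime) -> bool:
    """Return True only if the TTL has passed AND no healthy targets remain."""
    try:
        created = datetime.fromisoformat(tags[CREATED_TAG])
        ttl = timedelta(days=int(tags[TTL_TAG]))
    except (KeyError, ValueError):
        # Missing or malformed tags: never delete automatically.
        return False
    if created.tzinfo is None:
        # Treat naive timestamps as UTC so the comparison below is valid.
        created = created.replace(tzinfo=timezone.utc)
    return healthy_targets == 0 and now > created + ttl


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
stale = {"created-at": "2024-01-01T00:00:00+00:00", "ttl-days": "30"}
print(is_expired_and_idle(stale, 0, now))  # past TTL and idle
print(is_expired_and_idle(stale, 2, now))  # past TTL but still serving traffic
```

Failing closed on bad tags means an untagged ALB is reported by the tagging check rather than silently deleted.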
Our production database was on db.r6g.2xlarge. CloudWatch showed average CPU at 12% and memory at 35%.
Fix: Downsized to db.r6g.large (a quarter of the vCPUs and memory) during a maintenance window. We also set up a CloudWatch alarm for CPU > 70% so we'll know when to scale back up.
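Before a resize like this, a rough projection of utilization onto the smaller class is a useful sanity check. The sketch below scales the observed percentages linearly by instance shape, which is a simplifying assumption: CPU rarely scales perfectly, and database memory is largely cache that the engine sizes to whatever RAM is available.

```python
# vCPU and memory for each class, from the published RDS instance-class tables.
SHAPES = {
    "db.r6g.large": {"vcpu": 2, "mem_gib": 16},
    "db.r6g.xlarge": {"vcpu": 4, "mem_gib": 32},
    "db.r6g.2xlarge": {"vcpu": 8, "mem_gib": 64},
}


def project_utilization(current: str, target: str, cpu_pct: float, mem_pct: float) -> dict:
    """Scale observed utilization linearly from the current class to the target."""
    cur, tgt = SHAPES[current], SHAPES[target]
    return {
        "cpu_pct": cpu_pct * cur["vcpu"] / tgt["vcpu"],
        "mem_pct": mem_pct * cur["mem_gib"] / tgt["mem_gib"],
    }


for target in ("db.r6g.xlarge", "db.r6g.large"):
    print(target, project_utilization("db.r6g.2xlarge", target, cpu_pct=12, mem_pct=35))
```

With the numbers above, projected CPU on db.r6g.large stays under the 70% alarm, but the naive memory projection lands above 100%. That does not mean the resize fails, since caches shrink to fit, but it is a reason to watch FreeableMemory and swap usage after the change, not just CPU.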
Fourteen EBS volumes were sitting in the "available" state, left over from terminated EC2 instances.
Fix: Scripted a check:
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' \
  --output table
For any unattached volume created more than 30 days ago, we took a final snapshot and then deleted the volume.
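The 30-day rule can be sketched as a filter over the JSON that `aws ec2 describe-volumes --output json` returns (without the `--query` projection). This is an illustration, not the script we ran: it measures age from `CreateTime`, since the API does not report when a volume was detached, and it only selects volumes; it does not snapshot or delete anything.

```python
import json
from datetime import datetime, timedelta, timezone


def stale_volumes(describe_volumes_json: str, now: datetime, max_age_days: int = 30) -> list:
    """Return IDs of unattached volumes created more than max_age_days ago."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for vol in json.loads(describe_volumes_json)["Volumes"]:
        # The CLI may emit a trailing "Z"; older Pythons need "+00:00" instead.
        created = datetime.fromisoformat(vol["CreateTime"].replace("Z", "+00:00"))
        if vol["State"] == "available" and created < cutoff:
            stale.append(vol["VolumeId"])
    return stale


# Made-up sample data in the shape of the describe-volumes response.
sample = json.dumps({"Volumes": [
    {"VolumeId": "vol-0aaa", "State": "available", "CreateTime": "2024-01-01T00:00:00+00:00"},
    {"VolumeId": "vol-0bbb", "State": "available", "CreateTime": "2024-05-25T00:00:00+00:00"},
    {"VolumeId": "vol-0ccc", "State": "in-use", "CreateTime": "2023-01-01T00:00:00Z"},
]})
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(stale_volumes(sample, now))
```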
We had 2,400 EBS snapshots going back 3 years. Most were from AMIs we no longer use.
Fix: Implemented Amazon Data Lifecycle Manager with a 90-day retention policy.
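One caveat: as far as we understand it, a Data Lifecycle Manager policy only ages out snapshots that the policy itself created, so a historical backlog needs a one-time cleanup. A minimal sketch of the selection step follows, assuming inputs shaped like the `Snapshots` and `Images` lists from boto3's `describe_snapshots` and `describe_images`. It skips snapshots that still back a registered AMI, since those cannot be deleted until the AMI is deregistered.

```python
from datetime import datetime, timedelta, timezone


def deletable_snapshots(snapshots: list, images: list, now: datetime,
                        retention_days: int = 90) -> list:
    """Return IDs of snapshots older than the retention window and not backing an AMI."""
    in_use = {
        mapping["Ebs"]["SnapshotId"]
        for image in images
        for mapping in image.get("BlockDeviceMappings", [])
        if "SnapshotId" in mapping.get("Ebs", {})
    }
    cutoff = now - timedelta(days=retention_days)
    return [
        snap["SnapshotId"]
        for snap in snapshots
        if snap["StartTime"] < cutoff and snap["SnapshotId"] not in in_use
    ]


# Made-up sample data for illustration.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
snapshots = [
    {"SnapshotId": "snap-old", "StartTime": datetime(2022, 3, 1, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-ami", "StartTime": datetime(2022, 3, 1, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-new", "StartTime": datetime(2024, 5, 1, tzinfo=timezone.utc)},
]
images = [{"ImageId": "ami-live", "BlockDeviceMappings": [{"Ebs": {"SnapshotId": "snap-ami"}}]}]
print(deletable_snapshots(snapshots, images, now))
```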
Our NAT Gateway was processing 800 GB/month. Much of it was S3 traffic from private subnets.
Fix: Added a VPC Gateway Endpoint for S3. Gateway endpoints carry no hourly or data charge, and ours cut NAT traffic by 60%.
resource "aws_vpc_endpoint" "s3" {
  vpc_id          = aws_vpc.main.id
  service_name    = "com.amazonaws.us-east-1.s3"
  route_table_ids = [aws_route_table.private.id]
}
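To put a number on the NAT change, the saving is the traffic removed times the NAT data-processing rate. The sketch below assumes $0.045 per GB, which is the us-east-1 rate as we understand it; check current pricing for your region before relying on it.

```python
# Assumed NAT Gateway data-processing rate (USD per GB, us-east-1).
NAT_PRICE_PER_GB = 0.045


def nat_processing_savings(monthly_gb: float, fraction_removed: float,
                           price_per_gb: float = NAT_PRICE_PER_GB) -> float:
    """Monthly USD saved by moving a fraction of NAT traffic onto a gateway endpoint."""
    return monthly_gb * fraction_removed * price_per_gb


saving = nat_processing_savings(800, 0.60)
print(f"${saving:.2f}/month")
```

At 800 GB/month the saving is modest, but it scales linearly with traffic and the endpoint costs nothing to keep.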
Every Lambda function was set to 1024 MB, our blanket default. AWS Lambda Power Tuning showed most needed only 256 MB.
Fix: Ran Power Tuning on our top 10 functions and right-sized them.
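Lambda bills compute in GB-seconds, so lower memory only saves money if the function does not slow down proportionally. A small sketch of that trade-off follows; the per-GB-second rate is an assumed x86 price for us-east-1, and the durations and invocation count are made-up illustrations, not our measurements.

```python
# Assumed Lambda compute rate (USD per GB-second, x86, us-east-1).
PRICE_PER_GB_SECOND = 0.0000166667


def monthly_compute_cost(memory_mb: int, avg_duration_ms: float, invocations: int,
                         price: float = PRICE_PER_GB_SECOND) -> float:
    """Compute-only monthly cost; excludes the per-request charge and free tier."""
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000) * invocations
    return gb_seconds * price


# Hypothetical function: 5M invocations/month, slightly slower at 256 MB.
before = monthly_compute_cost(1024, 200, 5_000_000)
after = monthly_compute_cost(256, 260, 5_000_000)
print(f"before=${before:.2f} after=${after:.2f}")
```

Going from 1024 MB to 256 MB breaks even only when duration grows 4x, which is why measuring each function with Power Tuning beats lowering memory blindly.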
We were paying on-demand for 4 EC2 instances that had been running for 2 years.
Fix: Purchased 1-year no-upfront reserved instances for predictable workloads.
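A Reserved Instance is billed for every hour of the term whether or not the instance runs, so it helps to check the break-even utilization first. The hourly rates below are hypothetical placeholders, not the prices we paid.

```python
HOURS_PER_YEAR = 8760


def break_even_utilization(on_demand_hourly: float, ri_effective_hourly: float) -> float:
    """Fraction of the year an instance must run for the RI to beat on-demand."""
    return ri_effective_hourly / on_demand_hourly


def annual_savings(on_demand_hourly: float, ri_effective_hourly: float, instances: int) -> float:
    """Yearly USD saved for instances that run around the clock."""
    return (on_demand_hourly - ri_effective_hourly) * HOURS_PER_YEAR * instances


# Hypothetical rates: $0.10/h on-demand vs $0.063/h effective under a 1-year no-upfront RI.
print(f"break-even at {break_even_utilization(0.10, 0.063):.0%} utilization")
print(f"${annual_savings(0.10, 0.063, instances=4):,.2f} saved per year")
```

Instances that have run continuously for two years are far above any realistic break-even point, which is what makes them safe candidates.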
The $6,200/month we saved required about 8 hours of work. That's an annualized return of $74,400 for one day of effort.