We cut our AWS bill by 38% in a quarter. The specific changes that moved the bill, ranked by impact, with what we'd do first.
A while back our AWS bill was creeping up faster than our usage. We did a quarter of focused cost work and brought it down 38%. Most of the savings came from a small number of specific changes; the rest was cumulative small wins. This is the order we'd do them in if we had to start over, with the actual dollar impact for our profile.
Pre-optimization monthly bill:
Total: meaningful 6-figure annual bill. 38% cut → significant annual savings.
In rough order of how often a given change pays off:
We worked roughly in this order. Each change had its own ROI.
We had 200+ EC2 instances. Most were sized for "what we thought peak might be." Real utilization was much lower.
The work:
Specific moves:
m5.xlarge → m6i.large (newer gen, 2x smaller, similar cost-per-perf)
c5.4xlarge → c6i.2xlarge (smaller; CPU was the bottleneck and they had headroom)
r5.2xlarge → r6i.xlarge (RAM bottleneck, smaller fit fine)
Saving: ~$8,000/month. Largest single change.
The trick: right-sizing isn't always "go smaller." Sometimes a smaller-but-newer-generation instance is cheaper AND faster. AWS Compute Optimizer (free tool) suggests these moves automatically.
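As a rough illustration of the first pass, here is a minimal sketch of flagging downsizing candidates from utilization data. The 50% threshold, the `InstanceStats` type, and the sample fleet are all hypothetical; this is not our actual tooling, and Compute Optimizer does a much more thorough job.

```python
from dataclasses import dataclass

# Hypothetical threshold: halving an instance roughly doubles utilization,
# so only flag instances whose peaks stay under half of capacity.
PEAK_THRESHOLD_PCT = 50.0


@dataclass
class InstanceStats:
    name: str
    instance_type: str
    peak_cpu_pct: float  # highest CPU utilization over the lookback window
    peak_mem_pct: float  # highest memory utilization over the same window


def downsizing_candidates(stats, threshold_pct=PEAK_THRESHOLD_PCT):
    """Return instances whose peak CPU and peak memory are both below the threshold."""
    return [
        s for s in stats
        if s.peak_cpu_pct < threshold_pct and s.peak_mem_pct < threshold_pct
    ]


# Made-up sample data, for illustration only.
fleet = [
    InstanceStats("api-1", "m5.xlarge", peak_cpu_pct=22.0, peak_mem_pct=31.0),
    InstanceStats("batch-1", "c5.4xlarge", peak_cpu_pct=48.0, peak_mem_pct=20.0),
    InstanceStats("cache-1", "r5.2xlarge", peak_cpu_pct=15.0, peak_mem_pct=71.0),
]

for s in downsizing_candidates(fleet):
    print(f"{s.name} ({s.instance_type}): consider one size down")
```

Checking both dimensions matters: an instance with idle CPU but high memory use (like `cache-1` here) is not a candidate, even though it looks oversized on a CPU graph.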
EKS nodes were on-demand. Most workloads were idempotent (could survive a node going away). We migrated:
Spot instances are 60-90% cheaper than on-demand depending on the instance type and zone availability.
For Kubernetes specifically, we use Karpenter to manage spot:
Saving: ~$4,500/month. The interruption rate is real (~2 spot interruptions per week on average) but our workloads handle it cleanly.
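"Handle it cleanly" mostly means reacting to SIGTERM, which Kubernetes sends to containers when a pod is evicted from a draining node. A minimal sketch of that pattern follows, assuming a simple job-loop worker; the `Worker` class and its method names are hypothetical, not code from our services.

```python
import signal
import threading


class Worker:
    """Processes jobs one at a time and stops between jobs once SIGTERM arrives."""

    def __init__(self):
        self.stop = threading.Event()
        self.done = []

    def handle_sigterm(self, signum, frame):
        # Only set a flag; the in-flight job finishes before the loop exits.
        self.stop.set()

    def install(self):
        # signal.signal must be called from the main thread of the process.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def run(self, jobs, process):
        """Run jobs until none are left or a stop is requested.

        Returns the jobs that were not started, so the caller can leave
        them on the queue for another pod to pick up.
        """
        pending = list(jobs)
        while pending and not self.stop.is_set():
            job = pending.pop(0)
            process(job)
            self.done.append(job)
        return pending
```

Because the jobs are idempotent, a job that gets cut off anyway (for example, when the grace period runs out) can simply be retried on another node.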
For the workloads that stay on-demand (stateful services, baseline web capacity), reserve them:
We don't go 3-year because our workload changes. 1-year is a reasonable balance between commitment risk and savings.
Savings Plans are flexible across instance types, so they're forgiving if we right-size or migrate workloads during the term. Reserved Instances are tied to specific instance classes.
Saving: ~$3,200/month.
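The reason to commit only on the baseline comes down to break-even arithmetic: a commitment is billed for every hour whether or not the capacity is used. A small sketch, where the 30% discount and the $0.10 hourly rate are hypothetical numbers and not our actual rates:

```python
HOURS_IN_MONTH = 730  # common approximation for an average month


def break_even_utilization(discount):
    """Fraction of hours a commitment must be in use to beat paying on-demand."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def on_demand_cost(hourly_rate, hours_used):
    """On-demand: pay the full rate, but only for hours actually used."""
    return hourly_rate * hours_used


def committed_cost(hourly_rate, discount, hours_in_month=HOURS_IN_MONTH):
    """Commitment: pay the discounted rate for every hour, used or not."""
    return hourly_rate * (1.0 - discount) * hours_in_month


# Hypothetical example: 30% discount on a $0.10/hour instance.
rate, discount = 0.10, 0.30
for hours in (400, 511, 730):
    cheaper = "commit" if committed_cost(rate, discount) < on_demand_cost(rate, hours) else "on-demand"
    print(f"{hours} hours used: {cheaper} is cheaper or equal")
```

Under these assumptions, capacity that runs less than about 70% of the time is cheaper left on-demand, which is why only the always-on baseline belongs in a commitment.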
This one was a surprise. NAT gateway data transfer was costing us ~$2,800/month.
Investigation: a service was downloading a 1.5GB ML model from S3 on every pod startup, and the pods scaled up frequently. Every one of those downloads went through the NAT gateway, which bills per GB processed, so the charges added up fast.
Fix: S3 Gateway VPC Endpoint. Free. Routes S3 traffic directly without going through NAT.
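To see how a per-startup download turns into a real line item, here is a back-of-the-envelope sketch. The 1.5GB model size is from the investigation above; the $0.045/GB NAT data processing rate and the 500 pod starts per day are assumptions for illustration, so check current pricing for your region and your own scaling metrics.

```python
MODEL_SIZE_GB = 1.5                 # size of the model pulled on each pod startup
NAT_PROCESSING_USD_PER_GB = 0.045   # assumed rate; varies by region


def monthly_nat_cost(pod_starts_per_day, days=30,
                     size_gb=MODEL_SIZE_GB,
                     usd_per_gb=NAT_PROCESSING_USD_PER_GB):
    """Estimate the monthly NAT data processing charge from repeated downloads."""
    return pod_starts_per_day * days * size_gb * usd_per_gb


# Hypothetical scaling rate, not a measured number from our cluster.
print(f"${monthly_nat_cost(pod_starts_per_day=500):,.2f} per month")
```

With a gateway endpoint in place, the same traffic no longer passes through the NAT gateway, so this term drops out of the bill.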
We then audited other AWS service traffic patterns:
Saving: ~$2,000/month from these endpoints alone.
This is the change we'd do first next time. The ROI is great and the work is mechanical.
We had ~40TB in S3. Most of it was old.
Lifecycle policies we added:
Saving: ~$600/month. Smaller in absolute terms, but the cost of keeping old data in the wrong storage class grows with every month it is retained.
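For reference, a lifecycle policy is just a small rule document. The sketch below builds one in the shape boto3's `put_bucket_lifecycle_configuration` expects for its `LifecycleConfiguration` argument. The day counts, storage classes, and rule ID are hypothetical defaults for illustration, not the policies we actually applied.

```python
def lifecycle_configuration(ia_after_days=30, glacier_after_days=90,
                            expire_after_days=365, prefix=""):
    """Build a tier-down-then-expire lifecycle rule (all values are illustrative)."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("expected ia_after_days < glacier_after_days < expire_after_days")
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",      # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = lifecycle_configuration(prefix="logs/")
print(config["Rules"][0]["Transitions"])
```

Applying it would look like `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)` with a boto3 S3 client. Expiration deletes data, so it is worth reviewing per bucket before enabling it.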
Things sitting around costing money:
We wrote a script that lists likely-zombies; reviewed manually; deleted what wasn't needed.
Saving: ~$400/month. Small but trivial to find and free to fix.
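A sketch of what one check in such a script might look like, assuming unattached EBS volumes are one of the zombie categories. It works on dictionaries shaped like the `Volumes` entries returned by boto3's `describe_volumes`; the 30-day age cutoff and the sample data are hypothetical, and this is not our actual script.

```python
from datetime import datetime, timedelta, timezone


def unattached_volumes(volumes, now, min_age_days=30):
    """Return volumes that are unattached ("available") and older than the cutoff.

    The age filter avoids flagging volumes that were just created and are
    about to be attached.
    """
    cutoff = now - timedelta(days=min_age_days)
    return [
        v for v in volumes
        if v["State"] == "available" and v["CreateTime"] <= cutoff
    ]


# Made-up sample in the shape of describe_volumes()["Volumes"].
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
sample = [
    {"VolumeId": "vol-aaa", "State": "available", "Size": 500,
     "CreateTime": datetime(2023, 11, 2, tzinfo=timezone.utc)},
    {"VolumeId": "vol-bbb", "State": "in-use", "Size": 100,
     "CreateTime": datetime(2023, 1, 15, tzinfo=timezone.utc)},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 50,
     "CreateTime": datetime(2024, 5, 28, tzinfo=timezone.utc)},
]

for v in unattached_volumes(sample, now):
    print(f"{v['VolumeId']}: {v['Size']} GiB, unattached since at least {v['CreateTime']:%Y-%m-%d}")
```

The output is a list for a human to review, not a delete list, which matches the manual review step described above.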
Same exercise as EC2 but for managed databases:
r5.4xlarge for "headroom"; actual usage suggested r5.xlarge was fine. Downsized.
db.t2 instances → db.t3 or db.t4g (Graviton; cheaper).
io1 (provisioned IOPS) didn't actually need that many IOPS. Switched to gp3 (general purpose, configurable IOPS but cheaper baseline).
Saving: ~$1,200/month.
ElastiCache (Redis) was over-provisioned. We had cache.r5.xlarge × 6 nodes for a workload that fit comfortably on cache.r6g.large × 3 nodes.
Saving: ~$700/month.
Cross-region data transfer adds up. We audited:
Saving: ~$500/month.
A handful of smaller wins:
Total: ~$430/month from cleanup work.
A few things we tried that didn't pay off:
Switching to Graviton (ARM) instances broadly. For our specific workloads (mostly Java and Python), some saw 10-20% performance improvement on Graviton; others saw no difference or slight regressions. Net cost savings were modest, and the migration toil (fixing per-arch container images, dependencies) was real. We did it for new workloads, not as a forced migration.
Aggressive scale-to-zero with autoscaling. For very bursty workloads, scale-to-zero saves money. For workloads with steady-state baseline load, scale-to-zero just causes constant scale-up/down churn with no benefit. We're selective.
Multi-cloud price arbitrage. Tried routing some workloads to GCP for specific services where pricing seemed better. The savings on the workloads were real but small; given the operational overhead of multi-cloud (which we already run for other reasons), they weren't worth chasing for cost alone.
Self-hosting things to save SaaS costs. Replacing Datadog with self-hosted Grafana + Prometheus saved a meaningful amount per month on the SaaS bill, but the engineer time added up to roughly the same. Net-positive, but the gain is the engineer-time-when-it-matters, not the dollars.
After the initial cleanup, the discipline:
Without ongoing discipline, costs creep back up. The 38% cut was the easy part; staying lean is the harder ongoing work.
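Part of a recurring review can be automated. The sketch below compares per-tag spend between two months and flags anything that grew past a threshold. The 10% threshold, the tag names, and the dollar figures are hypothetical; the inputs are assumed to be plain `{tag: usd}` mappings exported from whatever billing report you use.

```python
def cost_drift(previous, current, threshold_pct=10.0):
    """Return {tag: percent_increase} for tags whose spend grew more than threshold_pct.

    Tags with spend this month but no baseline last month are flagged
    with infinity, since any new spend deserves a look.
    """
    flagged = {}
    for tag, now_usd in current.items():
        before_usd = previous.get(tag)
        if not before_usd:
            if now_usd > 0:
                flagged[tag] = float("inf")
            continue
        change_pct = (now_usd - before_usd) / before_usd * 100
        if change_pct > threshold_pct:
            flagged[tag] = round(change_pct, 1)
    return flagged


# Made-up numbers for illustration.
last_month = {"team-a": 1000.0, "team-b": 2000.0}
this_month = {"team-a": 1250.0, "team-b": 2040.0, "team-c": 300.0}

for tag, pct in cost_drift(last_month, this_month).items():
    print(f"{tag}: +{pct}% month over month")
```

This only works if resources are tagged consistently; untagged spend has no owner to follow up with.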
Look at NAT data transfer first. Often the biggest surprise, and for S3 the fix (a gateway VPC endpoint) is free.
Right-sizing tools are free. AWS Compute Optimizer, GCP Recommender, Azure Advisor. Use them.
Spot for stateless workloads. The biggest single lever after right-sizing.
Reserved capacity once you know your baseline. Don't over-commit; 1-year is usually right.
Lifecycle policies on S3 from day one. Costs compound silently without them.
Tag everything. Without tags, attribution is impossible and accountability evaporates.
Monthly cost review. 30 minutes/month catches drift before it becomes a problem.
Cost optimization isn't sexy work but the ROI is high. The 38% cut took ~2 months of focused effort and saves a five-figure amount every month. That's better ROI than most engineering work. The discipline is in keeping the wins; without ongoing attention, the costs come back.