Building visibility into cloud costs that actually drives action. The dashboards we look at, the alerts that fire, and the queries we run.
The AWS bill is too coarse to be useful, and the per-resource view is too detailed to be actionable. The middle layer (meaningful aggregations, trend detection, anomaly alerts) is where cost monitoring becomes operational. This post describes the visibility layer we built: what we monitor, how we monitor it, and what each signal is for.
Three primary sources of cost data:
AWS Cost and Usage Reports (CUR): granular, per-resource, hourly. The most detailed source. Comes as Parquet in an S3 bucket. We query with Athena.
AWS Cost Explorer API: aggregated views, filtered by service / account / tags / etc. Good for daily / monthly summaries. We use it for dashboards.
AWS Budgets API: budget alerts and projections. We use it for thresholds.
Plus we ingest cost data into our own metrics system (Prometheus / Datadog) for cross-referencing with operational metrics. "Service X had a cost spike on day Y; was there a deploy? a traffic spike?"
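For dashboards, a Cost Explorer `GetCostAndUsage` response has to be flattened into per-service totals before it can be charted or pushed to a metrics system. A minimal sketch, assuming a response grouped by the `SERVICE` dimension with the `UnblendedCost` metric; `cost_by_group` is a hypothetical helper and the sample numbers are made up:

```python
from collections import defaultdict


def cost_by_group(response):
    """Sum UnblendedCost per group key across every time period in the response."""
    totals = defaultdict(float)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            key = "/".join(group["Keys"])
            # Cost Explorer returns amounts as strings, so convert before summing.
            totals[key] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(totals)


# Hand-written sample in the shape of a GetCostAndUsage response (illustrative values).
sample_response = {
    "ResultsByTime": [
        {
            "TimePeriod": {"Start": "2026-03-01", "End": "2026-03-02"},
            "Groups": [
                {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
                 "Metrics": {"UnblendedCost": {"Amount": "120.50", "Unit": "USD"}}},
                {"Keys": ["Amazon Simple Storage Service"],
                 "Metrics": {"UnblendedCost": {"Amount": "10.00", "Unit": "USD"}}},
            ],
        },
        {
            "TimePeriod": {"Start": "2026-03-02", "End": "2026-03-03"},
            "Groups": [
                {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
                 "Metrics": {"UnblendedCost": {"Amount": "130.25", "Unit": "USD"}}},
                {"Keys": ["Amazon Simple Storage Service"],
                 "Metrics": {"UnblendedCost": {"Amount": "12.00", "Unit": "USD"}}},
            ],
        },
    ]
}

for service, total in sorted(cost_by_group(sample_response).items()):
    print(f"{service}: ${total:,.2f}")
```

In practice the response would come from the Cost Explorer client rather than a literal dict; the flattening step stays the same.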
A small number of dashboards beats a large number of unused ones.
Shows last 30 days of total cost vs the prior 30. Per service category (EC2, RDS, S3, etc.). Color-coded if the % change is unusual.
Glanced at Monday mornings and end-of-month. If everything is normal, takes 30 seconds. If something is off, the rest of the dashboard helps drill in.
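The comparison behind this view is simple arithmetic. A sketch, assuming per-service totals are already available as dicts; the 15% threshold for "unusual" and the sample figures are assumptions for illustration, not our production values:

```python
def period_over_period(current, previous, flag_above=0.15):
    """Compare two periods of per-service cost.

    Returns rows of (service, current, previous, change, flagged), where change
    is a fraction (0.10 == +10%) or None when there is no prior spend.
    """
    rows = []
    for service in sorted(set(current) | set(previous)):
        now = current.get(service, 0.0)
        before = previous.get(service, 0.0)
        change = (now - before) / before if before else None
        # Brand-new spend has no baseline, so flag it for a human to look at.
        flagged = change is None or abs(change) > flag_above
        rows.append((service, now, before, change, flagged))
    return rows


last_30 = {"EC2": 11500.0, "RDS": 4100.0, "S3": 900.0}
prior_30 = {"EC2": 9000.0, "RDS": 4000.0, "S3": 880.0}

for service, now, before, change, flagged in period_over_period(last_30, prior_30):
    pct = "new" if change is None else f"{change:+.1%}"
    marker = "  <-- check" if flagged else ""
    print(f"{service:4} ${now:>9,.0f} vs ${before:>9,.0f}  {pct}{marker}")
```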
Costs split by team tag. Each team sees their own services' costs.
Used for monthly review meetings. Each team explains anomalies in their column.
Per service: monthly cost over the last 12 months, plus baseline metrics (requests, users, traffic). Shows "is cost growing in line with usage."
If cost is growing 3x faster than usage, something is wrong. Investigate.
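The "3x faster than usage" check can be expressed directly. A minimal sketch, assuming monthly cost and a single usage metric are available as ordered lists; `cost_vs_usage` is a hypothetical helper and the numbers are illustrative:

```python
def growth(series):
    """Fractional growth from the first to the last value in the series."""
    first, last = series[0], series[-1]
    return (last - first) / first


def cost_vs_usage(monthly_cost, monthly_usage, max_ratio=3.0):
    """Return (cost_growth, usage_growth, outpacing) for one service."""
    cost_growth = growth(monthly_cost)
    usage_growth = growth(monthly_usage)
    if usage_growth <= 0:
        # Usage is flat or shrinking: any cost growth is worth a look.
        outpacing = cost_growth > 0
    else:
        outpacing = cost_growth > max_ratio * usage_growth
    return cost_growth, usage_growth, outpacing


# Cost up 90% while requests are up 20%: more than 3x faster, so investigate.
print(cost_vs_usage([1000, 1400, 1900], [1_000_000, 1_100_000, 1_200_000]))
# Cost up 25% while requests are up 20%: roughly in line with usage.
print(cost_vs_usage([1000, 1100, 1250], [1_000_000, 1_100_000, 1_200_000]))
```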
For features with explicit cost tracking (LLM-powered features, compute-heavy products), per-feature cost over time. Plus business metrics (revenue, user engagement) where available.
This is the most useful view for product conversations: "feature X costs $4,200/month and drives $15k of value; feature Y costs $800 and drives $20k." Frames trade-offs.
Rules of thumb for cost alerts:
We use AWS Cost Anomaly Detection for ML-based detection. It catches things rule-based alerts miss (gradual creep, unusual patterns). False positives happen, but less often than I expected: roughly 70% of anomaly alerts point to a real problem.
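To see why a simple rule misses gradual creep, compare a day-over-day threshold with a longer baseline. This is only an illustration with made-up numbers and hypothetical helper names; it is not how AWS Cost Anomaly Detection works internally:

```python
from statistics import mean


def day_over_day_alerts(daily_costs, max_jump=0.20):
    """Indices of days whose cost rose more than max_jump versus the previous day."""
    alerts = []
    for i in range(1, len(daily_costs)):
        prev, cur = daily_costs[i - 1], daily_costs[i]
        if prev > 0 and (cur - prev) / prev > max_jump:
            alerts.append(i)
    return alerts


def baseline_drift(daily_costs, window=7):
    """Fractional change between the mean of the first and last `window` days."""
    first = mean(daily_costs[:window])
    last = mean(daily_costs[-window:])
    return (last - first) / first


# A sudden spike trips the day-over-day rule.
spike = [100.0] * 10 + [500.0]
print(day_over_day_alerts(spike))

# 3% growth per day never trips it, yet cost nearly doubles over 30 days.
creep = [100.0 * 1.03 ** day for day in range(30)]
print(day_over_day_alerts(creep), f"{baseline_drift(creep):.0%}")
```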
Athena queries against the CUR are the workhorse for ad-hoc analysis:
"What did we spend on EC2 in us-east-1 last month, by team?"
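-- EC2 spend in us-east-1 for March 2026, grouped by the team cost allocation tag.
SELECT
    resource_tags_user_team AS team,
    SUM(line_item_unblended_cost) AS cost
FROM cur_data
WHERE
    line_item_product_code = 'AmazonEC2'  -- standard CUR column; adjust if cur_data is a custom view
    AND product_region = 'us-east-1'
    AND year = '2026' AND month = '3'
GROUP BY resource_tags_user_team
ORDER BY cost DESC;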
"Top 20 most expensive resources last month."
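SELECT
    line_item_resource_id AS resource_id,  -- standard CUR column names; adjust if cur_data is a custom view
    line_item_product_code AS product_code,
    resource_tags_user_team AS team,
    SUM(line_item_unblended_cost) AS cost
FROM cur_data
WHERE year = '2026' AND month = '3'
GROUP BY line_item_resource_id, line_item_product_code, resource_tags_user_team
ORDER BY cost DESC
LIMIT 20;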
"Where did NAT gateway data transfer go?" (the perennial mystery cost)
SELECT
    product_region,
    SUM(line_item_unblended_cost) AS cost
FROM cur_data
WHERE
    line_item_usage_type LIKE '%NatGateway%Bytes%'
    AND year = '2026' AND month = '3'
GROUP BY product_region;
We have ~15 of these queries saved as Athena bookmarks for common questions. New questions get added as they come up.
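The queries above hard-code the `year` and `month` partition values as strings, with the month not zero-padded. When a saved query is re-run for a different month, a small helper avoids getting that format wrong. A sketch, assuming the partition columns stay strings as in the queries shown; both function names are hypothetical:

```python
from datetime import date


def cur_partition_filter(day):
    """WHERE-clause fragment selecting the CUR partition for the month containing `day`."""
    # The month is deliberately not zero-padded, matching month = '3' in the saved queries.
    return f"year = '{day.year}' AND month = '{day.month}'"


def previous_month(today):
    """First day of the month before `today`."""
    if today.month == 1:
        return date(today.year - 1, 12, 1)
    return date(today.year, today.month - 1, 1)


print(cur_partition_filter(previous_month(date(2026, 4, 2))))
```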
I've covered tagging in other posts but it's worth restating: cost monitoring lives or dies by tags.
For tags to be useful in cost reports:
Without this, all your cost analysis falls back to "split by service" which doesn't show team accountability.
Data transfer is the cost category that surprises teams most. Worth specific monitoring:
We have a dedicated dashboard for these. Sudden spikes in any of them indicate something to investigate:
Monthly check: what's the utilization of our reservations? Cost Explorer reports this.
We've sometimes underbought (high utilization, lots of on-demand, opportunity to save more) and sometimes overbought (low utilization, paying for unused capacity). The monthly check catches both.
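The underbought/overbought distinction comes down to two ratios: how much of the reservation was used, and how much usage still ran on-demand. A sketch with assumed thresholds (80% utilization, 30% on-demand share) and illustrative hour counts; `reservation_check` is a hypothetical helper, not an AWS API:

```python
def reservation_check(reserved_hours, used_reserved_hours, on_demand_hours,
                      low_utilization=0.80, high_on_demand_share=0.30):
    """Classify a month of reservation usage. `reserved_hours` must be positive."""
    utilization = used_reserved_hours / reserved_hours
    total_used = used_reserved_hours + on_demand_hours
    on_demand_share = on_demand_hours / total_used if total_used else 0.0

    if utilization < low_utilization:
        verdict = "overbought"    # paying for reserved capacity that sat idle
    elif on_demand_share > high_on_demand_share:
        verdict = "underbought"   # reservations are full, lots still runs on-demand
    else:
        verdict = "ok"
    return utilization, on_demand_share, verdict


print(reservation_check(10_000, 9_900, 8_000))  # high utilization, heavy on-demand
print(reservation_check(10_000, 6_000, 500))    # low utilization
```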
For features with measurable business outcomes:
These are calculated by combining cost data with operational metrics:
SELECT
    cost.month,
    cost.cost_usd,
    ops.total_interactions,
    -- NULLIF avoids a division-by-zero error in months with no interactions
    cost.cost_usd / NULLIF(ops.total_interactions, 0) AS cost_per_interaction
FROM monthly_cost cost
JOIN monthly_operations ops
    ON cost.month = ops.month AND cost.feature = ops.feature
WHERE cost.feature = 'support_assistant'
ORDER BY cost.month;
A rising cost per interaction is more meaningful than rising absolute cost. If interactions grew 50% and cost grew with them, that's healthy. If cost per interaction grew, something got more expensive and is worth investigating.
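One way to make that distinction concrete is to split a month-over-month cost change into a volume part and a unit-cost part. A minimal sketch with made-up numbers; `cost_change_breakdown` is a hypothetical helper:

```python
def cost_change_breakdown(prev_cost, prev_interactions, cur_cost, cur_interactions):
    """Split a cost change into the part explained by volume and by unit cost."""
    prev_unit = prev_cost / prev_interactions
    cur_unit = cur_cost / cur_interactions
    # Extra interactions priced at last month's unit cost.
    volume_effect = (cur_interactions - prev_interactions) * prev_unit
    # Change in unit cost applied to this month's interactions.
    unit_cost_effect = (cur_unit - prev_unit) * cur_interactions
    # The two effects always add up to cur_cost - prev_cost.
    return {
        "total_change": cur_cost - prev_cost,
        "volume_effect": volume_effect,
        "unit_cost_effect": unit_cost_effect,
    }


# Interactions grew 50% and cost grew with them: the change is all volume.
print(cost_change_breakdown(4000.0, 100_000, 6000.0, 150_000))
# Same cost increase with flat interactions: the change is all unit cost.
print(cost_change_breakdown(4000.0, 100_000, 6000.0, 100_000))
```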
Monthly, each team gets a one-page report:
Not chargeback (we don't actually charge teams for compute). Showback (visibility into cost). The visibility alone changes behavior.
Specific things cost monitoring caught:
A new feature 100x'd costs overnight. Our chat feature got picked up by a high-volume customer; LLM costs jumped from $50/day to $5,000/day. Anomaly Detection alerted within 4 hours. We added per-customer rate limits.
A misconfigured backup retention policy. Snapshots from a service were kept indefinitely; we'd accumulated 18 months of hourly snapshots without anyone noticing. ~$1,200/month in storage. Cleanup + retention policy fixed it.
A development environment running prod-sized resources. Someone copy-pasted a Terraform module without right-sizing for dev. We caught it in the per-environment dashboard ("dev shouldn't cost more than $X").
A NAT gateway processing 50TB/month because a service was downloading models per-request instead of caching. Saved $2,000/month with VPC endpoints + caching.
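The NAT gateway figure is easy to sanity-check. A back-of-the-envelope sketch, assuming a data processing rate of $0.045 per GB (an assumed list price; check current pricing for your region) and ignoring the gateway's hourly charge:

```python
def nat_processing_cost(tb_per_month, usd_per_gb=0.045):
    """Approximate monthly NAT gateway data processing charge in USD."""
    gb_per_month = tb_per_month * 1024
    return gb_per_month * usd_per_gb


# 50 TB/month through the gateway comes to roughly $2,300/month at the assumed rate,
# in line with the ~$2,000/month saved once the traffic moved off the NAT gateway.
print(f"${nat_processing_cost(50):,.0f}/month")
```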
Some things don't pay off to monitor:
Per-pod cost in Kubernetes. Tools exist (Kubecost, OpenCost). At our scale (mostly one team per namespace), the AWS-level breakdown is sufficient. For multi-tenant clusters, Kubecost would help.
Hour-by-hour cost. Daily granularity is enough; hour-by-hour adds noise without insight.
Per-Lambda invocation cost in real-time. The cost-per-invocation is small; the meaningful signal is at the function level over a day.
Detailed cost forecasting. AWS's built-in forecast is fine; building our own would be reinventing the wheel.
Real numbers:
Total: ~$130/month + engineer time. Compared to the savings (catching anomalies, ongoing optimization), large positive ROI.
Tag everything; activate cost allocation tags. Without this, breakdowns are useless.
Build the per-team breakdown first. Accountability flows from visibility.
Set anomaly detection. Catches what rule-based alerts miss.
Save common Athena queries. Ad-hoc analysis becomes much faster.
Cost per outcome where you can. Better business framing than absolute cost.
Monthly review meeting. Forces ongoing attention.
Cost monitoring is one of those infrastructure pieces where the ROI is unclear until you start; once you have it, you wonder how you operated without it. Anomalies surface quickly; team accountability emerges naturally; conversations about trade-offs become data-driven. The setup cost is real but small relative to the savings it enables.