Building visibility into cloud costs that actually drives action. The dashboards we look at, the alerts that fire, and the queries we run.

Cloud Cost Monitoring: From Bills to Actionable Signals

The AWS bill is too coarse to be useful and the per-resource view is too detailed to be actionable. The middle layer — meaningful aggregations, trend detection, anomaly alerts — is where cost monitoring becomes operational. This post is the visibility layer we built: what we monitor, how, and what each signal does.

The data sources #

Three primary sources of cost data:

AWS Cost and Usage Reports (CUR): granular, per-resource, hourly. The most detailed source. Comes as Parquet in an S3 bucket. We query with Athena.

AWS Cost Explorer API: aggregated views, filtered by service / account / tags / etc. Good for daily / monthly summaries. We use it for dashboards.

AWS Budgets API: budget alerts and projections. We use it for thresholds.

Plus we ingest cost data into our own metrics system (Prometheus / Datadog) for cross-referencing with operational metrics. "Service X had a cost spike on day Y; was there a deploy? a traffic spike?"
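Before Cost Explorer data can feed a dashboard or a metrics system, the nested API response has to be flattened into rows. A minimal sketch, assuming a `GetCostAndUsage`-style response grouped by one dimension with the `UnblendedCost` metric; the function name `flatten_cost_response` and the sample numbers are illustrative, not taken from the pipeline described here:

```python
from typing import Any


def flatten_cost_response(response: dict[str, Any], metric: str = "UnblendedCost") -> list[dict[str, Any]]:
    """Flatten a GetCostAndUsage-style response into one row per (period, group)."""
    rows = []
    for period in response.get("ResultsByTime", []):
        start = period["TimePeriod"]["Start"]
        for group in period.get("Groups", []):
            rows.append({
                "date": start,
                "key": "/".join(group["Keys"]),
                # Cost Explorer returns amounts as strings.
                "cost_usd": float(group["Metrics"][metric]["Amount"]),
            })
    return rows


# Hand-written sample in the response shape; the numbers are made up.
sample = {
    "ResultsByTime": [
        {
            "TimePeriod": {"Start": "2026-03-01", "End": "2026-03-02"},
            "Groups": [
                {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
                 "Metrics": {"UnblendedCost": {"Amount": "120.50", "Unit": "USD"}}},
                {"Keys": ["Amazon Simple Storage Service"],
                 "Metrics": {"UnblendedCost": {"Amount": "14.25", "Unit": "USD"}}},
            ],
        }
    ]
}

rows = flatten_cost_response(sample)
for row in rows:
    print(row)
```

In practice the `response` dict would come from boto3's `ce` client (`get_cost_and_usage`); the sample is hand-written so the sketch runs without AWS credentials.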

The dashboards we actually look at #

A small number of dashboards beats a large number of unused ones.

The "is anything weird" dashboard #

Shows last 30 days of total cost vs the prior 30. Per service category (EC2, RDS, S3, etc.). Color-coded if the % change is unusual.

Glanced at Monday mornings and end-of-month. If everything is normal, takes 30 seconds. If something is off, the rest of the dashboard helps drill in.
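The comparison behind this dashboard is simple enough to sketch. The version below assumes per-category totals for the two 30-day windows are already available; the `warn_pct` and `alert_pct` thresholds and the sample figures are placeholders, since the post does not say what counts as "unusual":

```python
def compare_periods(current, prior, warn_pct=15.0, alert_pct=30.0):
    """Return per-category % change and a colour for the dashboard.

    warn_pct and alert_pct are illustrative thresholds, not values from the post.
    """
    rows = []
    for category in sorted(set(current) | set(prior)):
        now = current.get(category, 0.0)
        before = prior.get(category, 0.0)
        if before == 0:
            # No baseline to compare against: any new spend is worth a look.
            change_pct = None
            status = "red" if now > 0 else "green"
        else:
            change_pct = (now - before) * 100 / before
            size = abs(change_pct)
            if size >= alert_pct:
                status = "red"
            elif size >= warn_pct:
                status = "yellow"
            else:
                status = "green"
        rows.append({"category": category, "change_pct": change_pct, "status": status})
    return rows


# Made-up totals for the last 30 days and the 30 days before that.
last_30_days = {"EC2": 14000.0, "RDS": 4100.0, "S3": 900.0}
prior_30_days = {"EC2": 10000.0, "RDS": 4000.0, "S3": 1100.0}
report = {row["category"]: row for row in compare_periods(last_30_days, prior_30_days)}
for category, row in report.items():
    print(category, row["change_pct"], row["status"])
```

Decreases are colour-coded as well as increases, on the assumption that a sharp drop can be just as unexpected (a service that stopped running, for example).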

The per-team breakdown #

Costs split by team tag. Each team sees their own services' costs.

Used for monthly review meetings. Each team explains anomalies in their column.

The per-service trend #

Per service: monthly cost over the last 12 months, plus baseline metrics (requests, users, traffic). Shows "is cost growing in line with usage."

If cost is growing 3x faster than usage, something is wrong. Investigate.
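One way to make the "3x faster than usage" check concrete is to compare growth rates over the same window. A sketch under that assumption; the function name, the handling of flat usage, and the numbers are illustrative:

```python
def cost_vs_usage(cost_start, cost_end, usage_start, usage_end, max_ratio=3.0):
    """Compare cost growth with usage growth over the same window.

    Returns (ratio, investigate). ratio is None when usage did not grow.
    """
    cost_growth = (cost_end - cost_start) / cost_start
    usage_growth = (usage_end - usage_start) / usage_start
    if usage_growth <= 0:
        # Usage flat or shrinking: any cost growth is out of line.
        return None, cost_growth > 0
    ratio = cost_growth / usage_growth
    return ratio, ratio >= max_ratio


# Cost +25% against requests +20%: growing roughly in line with usage.
healthy = cost_vs_usage(10_000, 12_500, 1_000_000, 1_200_000)
# Cost +90% against requests +20%: about 4.5x faster than usage.
suspicious = cost_vs_usage(10_000, 19_000, 1_000_000, 1_200_000)
# Cost +20% with no usage growth at all.
flat_usage = cost_vs_usage(10_000, 12_000, 1_000_000, 1_000_000)
print(healthy, suspicious, flat_usage)
```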

Cost by feature #

For features with explicit cost tracking (LLM-powered features, compute-heavy products), per-feature cost over time. Plus business metrics (revenue, user engagement) where available.

This is the most useful view for product conversations: "feature X costs $4,200/month and drives $15k of value; feature Y costs $800 and drives $20k." Frames trade-offs.

Alerts that fire #

Rules of thumb for cost alerts:

Daily total cost > 130% of 30-day rolling average → page (this is the cost-runaway case, such as a bad config burning money fast)
Monthly cost projected to exceed budget by > 20% → ticket
Per-team monthly cost > 110% of last month → notify team
Specific cost categories (NAT data transfer, etc.) growing > 50% week-over-week → notify platform team
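The four rules above can be sketched as a single evaluation function. This assumes the inputs (daily history, month projection, per-team and per-category totals) have already been pulled from the cost data; the function name, action labels, and sample figures are hypothetical:

```python
from statistics import mean


def evaluate_cost_alerts(daily_history, today_cost, projected_month, budget,
                         team_costs, category_costs):
    """Apply the four rules of thumb and return (action, message) pairs.

    team_costs maps team -> (this_month, last_month).
    category_costs maps category -> (this_week, last_week).
    """
    alerts = []

    baseline = mean(daily_history[-30:])  # 30-day rolling average
    if today_cost > 1.30 * baseline:
        alerts.append(("page", f"daily cost {today_cost:.0f} > 130% of average {baseline:.0f}"))

    if projected_month > 1.20 * budget:
        alerts.append(("ticket", f"projection {projected_month:.0f} exceeds budget {budget:.0f} by > 20%"))

    for team, (this_month, last_month) in sorted(team_costs.items()):
        if this_month > 1.10 * last_month:
            alerts.append(("notify-team", f"{team}: {this_month:.0f} > 110% of {last_month:.0f}"))

    for category, (this_week, last_week) in sorted(category_costs.items()):
        if this_week > 1.50 * last_week:
            alerts.append(("notify-platform", f"{category}: {this_week:.0f} > 150% of {last_week:.0f}"))

    return alerts


# Made-up inputs: a daily spike, a projection within budget tolerance,
# one team over its threshold, and a NAT data transfer jump.
alerts = evaluate_cost_alerts(
    daily_history=[1000.0] * 30,
    today_cost=1500.0,
    projected_month=33000.0,
    budget=30000.0,
    team_costs={"search": (5800.0, 5000.0), "payments": (4100.0, 4000.0)},
    category_costs={"nat-data-transfer": (900.0, 500.0)},
)
actions = [action for action, _ in alerts]
for action, message in alerts:
    print(action, "-", message)
```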

We use AWS Cost Anomaly Detection for the ML-based detection. It catches things rule-based alerts miss (gradual creep, unusual patterns). False positives happen, but less often than I expected: roughly 70% of anomaly alerts point to a real issue.

Specific queries we run #

Athena queries against the CUR are the workhorse for ad-hoc analysis:

"What did we spend on EC2 in us-east-1 last month, by team?"

sql.sql

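-- EC2 spend in us-east-1 for March 2026, grouped by team tag.
-- line_item_product_code is the standard CUR column for the service code.
SELECT
  resource_tags_user_team AS team,
  SUM(line_item_unblended_cost) AS cost
FROM cur_data
WHERE
  line_item_product_code = 'AmazonEC2'
  AND product_region = 'us-east-1'
  AND year = '2026' AND month = '3'  -- partition columns are strings
GROUP BY resource_tags_user_team
ORDER BY cost DESC;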

"Top 20 most expensive resources last month."

sql.sql

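-- Top 20 resources by unblended cost for March 2026.
SELECT
  line_item_resource_id AS resource_id,
  line_item_product_code AS product_code,
  resource_tags_user_team AS team,
  SUM(line_item_unblended_cost) AS cost
FROM cur_data
WHERE year = '2026' AND month = '3'
  AND line_item_resource_id <> ''  -- skip line items with no resource (tax, support, etc.)
GROUP BY line_item_resource_id, line_item_product_code, resource_tags_user_team
ORDER BY cost DESC
LIMIT 20;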

"Where did NAT gateway data transfer go?" (the perennial mystery cost)

sql.sql

-- NAT gateway data-processing cost by region for March 2026.
SELECT
  product_region,
  SUM(line_item_unblended_cost) AS cost
FROM cur_data
WHERE
  line_item_usage_type LIKE '%NatGateway%Bytes%'
  AND year = '2026' AND month = '3'
GROUP BY product_region
ORDER BY cost DESC;

We have ~15 of these queries saved as Athena bookmarks for common questions. New questions get added as they come up.

Tagging discipline (again) #

I've covered tagging in other posts but it's worth restating: cost monitoring lives or dies by tags.

For tags to be useful in cost reports:

Activate them as cost allocation tags (one-time AWS Console step)
Enforce required tags on all resources via IAM, SCPs, or Tag Policies
Backfill tags on existing untagged resources (one-time project)
Audit periodically for tag drift

Without this, all your cost analysis falls back to "split by service" which doesn't show team accountability.
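The periodic audit for tag drift can start as a small script over a resource inventory. A sketch, assuming the inventory has already been exported as a list of resources with their tags; only the `team` tag comes from this post, while the other required tags, the resource IDs, and the function name are hypothetical:

```python
# Hypothetical tag policy; only "team" is mentioned in the post.
REQUIRED_TAGS = frozenset({"team", "service", "environment"})


def audit_tags(resources, required=REQUIRED_TAGS):
    """Return resources missing any required tag (empty values count as missing)."""
    findings = []
    for resource in resources:
        present = {key for key, value in resource.get("tags", {}).items() if value.strip()}
        missing = sorted(required - present)
        if missing:
            findings.append({"id": resource["id"], "missing": missing})
    return findings


# Made-up inventory: one fully tagged instance, one volume with drift,
# one NAT gateway with no tags at all.
inventory = [
    {"id": "i-0aaa", "tags": {"team": "search", "service": "indexer", "environment": "prod"}},
    {"id": "vol-0bbb", "tags": {"team": "search", "service": ""}},
    {"id": "nat-0ccc", "tags": {}},
]
drift = audit_tags(inventory)
for finding in drift:
    print(finding["id"], "is missing", ", ".join(finding["missing"]))
```

Tag keys are compared exactly as written, so `Team` and `team` count as different keys here, which matches how AWS treats them.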

Special focus: data transfer #

Data transfer is the cost category that surprises teams most. Worth specific monitoring:

NAT gateway egress
Cross-AZ data transfer (some is free, some isn't, depending on service)
Cross-region data transfer
Data transfer to the internet (egress from user-facing services)

We have a dedicated dashboard for these. Sudden spikes in any of them indicate something to investigate:

A pod started downloading a large file from outside its region
A new feature is more chatty cross-AZ than expected
Egress to the internet grew (could be legitimate growth or a data exfiltration concern)

Reservation / Savings Plan utilization #

Monthly check: what's the utilization of our reservations and Savings Plans? Cost Explorer shows this in its utilization reports.

< 80% utilization: we're paying for capacity we're not using; consider letting some commitments expire
80-95%: healthy, leave alone
95-100% with on-demand spillover: consider buying more reservations

We've sometimes underbought (high utilization, lots of on-demand, opportunity to save more) and sometimes overbought (low utilization, paying for unused capacity). The monthly check catches both.
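The monthly check reduces to three bands, which can be encoded directly. A sketch; treating 95-100% utilization without on-demand spillover as healthy is an assumption, since the bands above only cover the spillover case:

```python
def reservation_recommendation(utilization_pct, has_on_demand_spillover):
    """Map reservation / Savings Plan utilization to the monthly-check outcome."""
    if not 0 <= utilization_pct <= 100:
        raise ValueError("utilization_pct must be between 0 and 100")
    if utilization_pct < 80:
        return "overbought: consider letting some commitments expire"
    if utilization_pct >= 95 and has_on_demand_spillover:
        return "underbought: consider buying more coverage"
    # 80-95%, or fully used with nothing spilling to on-demand (assumed healthy).
    return "healthy: leave alone"


for pct, spillover in [(72, False), (88, True), (98, True), (98, False)]:
    print(pct, spillover, "->", reservation_recommendation(pct, spillover))
```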

Cost per outcome #

For features with measurable business outcomes:

LLM features: cost per successful interaction
Compute features: cost per processing unit
Search features: cost per query

These are calculated by combining cost data with operational metrics:

sql.sql

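-- Monthly cost per interaction for one feature.
SELECT
  cost.month,
  cost.cost_usd,
  ops.total_interactions,
  -- NULLIF yields NULL instead of a division error for zero-interaction months
  cost.cost_usd / NULLIF(ops.total_interactions, 0) AS cost_per_interaction
FROM monthly_cost cost
JOIN monthly_operations ops
  ON cost.month = ops.month AND cost.feature = ops.feature
WHERE cost.feature = 'support_assistant'
ORDER BY cost.month;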

An upward trend in cost per interaction is more meaningful than an upward trend in absolute cost. If interactions grew 50% and cost grew with them, that's healthy. If cost per interaction itself grew, something got more expensive; investigate.
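A small worked example makes the distinction between the two cases explicit. The sketch below assumes monthly cost and interaction counts are already joined; the 10% tolerance, the function name, and the figures are illustrative:

```python
def unit_cost_trend(months, tolerance_pct=10.0):
    """months: list of (label, cost_usd, interactions), oldest first.

    Flags a month when cost per interaction rose more than tolerance_pct
    over the previous month. tolerance_pct is an illustrative threshold.
    """
    results = []
    previous = None
    for label, cost_usd, interactions in months:
        unit_cost = cost_usd / interactions if interactions else None
        investigate = (
            previous is not None
            and unit_cost is not None
            and unit_cost > previous * (1 + tolerance_pct / 100)
        )
        results.append({"month": label, "unit_cost": unit_cost, "investigate": investigate})
        if unit_cost is not None:
            previous = unit_cost
    return results


history = [
    ("2026-01", 2000.0, 100_000),
    ("2026-02", 3000.0, 150_000),  # cost +50%, interactions +50%: unit cost flat
    ("2026-03", 4500.0, 150_000),  # cost +50%, interactions flat: unit cost up
]
trend = unit_cost_trend(history)
for row in trend:
    print(row)
```

February and March both show a 50% jump in absolute cost, but only March is flagged, because only there did the cost of a single interaction change.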

Showback to teams #

Monthly, each team gets a one-page report:

Total cost this month
Cost change vs last month (with % and $ change)
Top 5 services by cost in this team
Specific anomalies flagged
Recommendations from the platform team (e.g., "consider right-sizing X based on observed utilization")

Not chargeback (we don't actually charge teams for compute). Showback (visibility into cost). The visibility alone changes behavior.
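The one-page report is mostly arithmetic over per-service totals plus a few hand-written notes. A sketch of how it could be rendered as plain text; the function name, layout, team and service names, and numbers are all hypothetical:

```python
def build_showback(team, this_month, last_month, anomalies=(), recommendations=(), top_n=5):
    """Render a plain-text showback report from per-service costs (USD)."""
    total_now = sum(this_month.values())
    total_before = sum(last_month.values())
    delta = total_now - total_before
    pct = f"{delta * 100 / total_before:+.1f}%" if total_before else "n/a"
    top = sorted(this_month.items(), key=lambda item: item[1], reverse=True)[:top_n]

    lines = [
        f"Showback report: {team}",
        f"Total cost this month: ${total_now:,.2f}",
        f"Change vs last month: {delta:+,.2f} USD ({pct})",
        f"Top {len(top)} services by cost:",
    ]
    lines += [f"  {service}: ${cost:,.2f}" for service, cost in top]
    lines += [f"Anomaly: {note}" for note in anomalies]
    lines += [f"Recommendation: {note}" for note in recommendations]
    return "\n".join(lines)


report = build_showback(
    "search",
    this_month={"indexer": 6000.0, "query-api": 4000.0, "batch": 2500.0},
    last_month={"indexer": 5000.0, "query-api": 4000.0, "batch": 1000.0},
    anomalies=["batch cost up 150% month over month"],
    recommendations=["consider right-sizing batch based on observed utilization"],
)
print(report)
```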

Cost incidents we've had #

Specific things cost monitoring caught:

A new feature 100x'd costs overnight. Our chat feature got picked up by a high-volume customer; LLM costs jumped from $50/day to $5,000/day. Anomaly Detection alerted within 4 hours. We added per-customer rate limits.

A misconfigured backup retention policy. Snapshots from a service were kept indefinitely; we'd accumulated 18 months of hourly snapshots without anyone noticing. ~$1,200/month in storage. Cleanup + retention policy fixed it.

A development environment running prod-sized resources. Someone copy-pasted a Terraform module without right-sizing for dev. We caught it in the per-environment dashboard ("dev shouldn't cost more than $X").

A NAT gateway processing 50TB/month because a service was downloading models per-request instead of caching. Saved $2,000/month with VPC endpoints + caching.

What we don't track #

Some things aren't worth monitoring:

Per-pod cost in Kubernetes. Tools exist (Kubecost, OpenCost). For our scale (mostly one team per namespace), the AWS-level breakdown is sufficient. For multi-tenant clusters, Kubecost would help.

Hour-by-hour cost. Daily granularity is enough; hour-by-hour adds noise without insight.

Per-Lambda invocation cost in real-time. The cost-per-invocation is small; the meaningful signal is at the function level over a day.

Detailed cost forecasting. AWS's built-in forecast is fine; building our own would be wheel-reinvention.

Cost of cost monitoring #

Real numbers:

Athena queries against CUR: ~$30/month
Cost data ingestion to Datadog: ~$100/month
Engineer time on dashboards / analysis: ~3 hours/week

Total: ~$130/month + engineer time. Compared to the savings (catching anomalies, ongoing optimization), large positive ROI.

What I'd tell a team starting #

Tag everything; activate cost allocation tags. Without this, breakdowns are useless.

Build the per-team breakdown first. Accountability flows from visibility.

Set up anomaly detection. It catches what rule-based alerts miss.

Save common Athena queries. Ad-hoc analysis becomes much faster.

Cost per outcome where you can. Better business framing than absolute cost.

Monthly review meeting. Forces ongoing attention.

Cost monitoring is one of those infrastructure pieces where the ROI is unclear until you start; once you have it, you wonder how you operated without it. Anomalies surface quickly; team accountability emerges naturally; conversations about trade-offs become data-driven. The setup cost is real but small relative to the savings it enables.


About Kiril Urbonas