A working Prometheus stack for a 40-node cluster: what we deploy, what we tune, and what we wish we'd known about cardinality two years ago.

Setting Up Infrastructure Monitoring with Prometheus

We run Prometheus across a Kubernetes cluster with ~40 nodes and ~600 services. The setup has been stable for about two years. Most Prometheus tutorials get you to "metrics are showing up" but don't cover the operational realities that bite once you scale beyond a single node. This is the production-shaped guide.

The components, briefly

Our stack:

Prometheus (server) — scrapes metrics, stores TSDB
Alertmanager — receives alerts, dedupes, routes to Slack/PagerDuty
Grafana — dashboards
node_exporter — host-level metrics (one per node, DaemonSet)
kube-state-metrics — Kubernetes object state
Various exporters for specific services (postgres_exporter, redis_exporter, blackbox_exporter)

We deploy it via the kube-prometheus-stack Helm chart. It bundles all the above with sensible defaults. Highly recommended starting point — much less assembly than wiring it up manually.

Storage: not what defaults give you

Prometheus's default storage settings (--storage.tsdb.retention.time=15d) are fine for exploration. For production:

We retain 30 days locally on each Prometheus
For longer-term storage we use Thanos, which uploads compacted blocks to S3
Total local Prometheus storage: ~80GB across 40 nodes' worth of metrics
S3 storage in Thanos: ~600GB for 18 months of retention

The local 30 days is sufficient for almost all queries (incident review, recent trends). Thanos is for the rare "what did this metric look like 6 months ago?" question.

We started without Thanos and regretted it the first time we needed historical data and didn't have it. Add it from the start.
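As a rough sketch, the local retention settings could be expressed in kube-prometheus-stack Helm values like this. The 30d window and the 80GB figure come from this article; the storage class name and the 100Gi volume size are placeholder assumptions to adapt, and the Thanos sidecar and its object-storage secret are deliberately left out.

```yaml
# values.yaml for kube-prometheus-stack (sketch; adapt names and sizes)
prometheus:
  prometheusSpec:
    retention: 30d            # local retention window described above
    retentionSize: 80GB       # size cap so a cardinality spike cannot fill the disk
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3          # assumption: replace with your storage class
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi             # assumption: leave headroom above retentionSize
```

Keeping the volume somewhat larger than retentionSize leaves room for the WAL and for compaction, which temporarily needs extra space.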

Scrape config: what to scrape, how often

The default kube-prometheus-stack config scrapes everything Kubernetes-aware (pods and services carrying prometheus.io annotations) every 30 seconds.

Tuning we did:

Scrape interval per workload. High-cardinality, high-importance services (the API gateway, the database) we scrape every 15s. Low-importance (cron jobs, batch workers) we scrape every 60s. The 30s default is fine for most things.
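With the Prometheus Operator that kube-prometheus-stack installs, a per-workload interval is usually set on a ServiceMonitor. The following is a minimal sketch, not our actual manifest: the names, namespaces, and labels are hypothetical, and the release label must match whatever your chart's serviceMonitorSelector expects.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-gateway                   # hypothetical name
  namespace: monitoring               # hypothetical namespace
  labels:
    release: kube-prometheus-stack    # assumption: must match the chart's serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames: ["production"]        # hypothetical namespace of the Service
  selector:
    matchLabels:
      app: api-gateway                # hypothetical Service label
  endpoints:
    - port: metrics                   # name of the Service port exposing metrics
      path: /metrics
      interval: 15s                   # overrides the 30s default for this workload
      scrapeTimeout: 10s              # must not exceed the interval
```

A batch workload would use the same shape with interval: 60s.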

Drop noisy series at scrape time. Many exporters emit metrics we don't use. We use metric_relabel_configs to drop them at scrape time, before they hit storage. Example: dropping the go_memstats_* and go_gc_* series from services where we don't care about Go runtime details.

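```yaml
metric_relabel_configs:
  # Applied after the scrape and before ingestion: matching series are never stored.
  - source_labels: [__name__]
    # Relabel regexes are fully anchored, so this matches go_memstats_* and go_gc_* only.
    regex: 'go_(memstats|gc).*'
    action: drop
```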

This cut our metrics volume by ~30% with no loss of useful data.

Honor __meta_kubernetes_pod_annotation_prometheus_io_scrape as opt-in for application metrics. Default-on scraping creates noise from internal services that emit useless metrics; we made it opt-in via annotation.
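One way to implement the opt-in is the widely used annotation-based pod scrape job, sketched below. The job name is an assumption, and the prometheus.io/path and prometheus.io/port annotations are a community convention rather than something Prometheus itself defines. In kube-prometheus-stack a job like this would typically be supplied through prometheus.prometheusSpec.additionalScrapeConfigs.

```yaml
scrape_configs:
  - job_name: kubernetes-pods-opt-in     # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"; everything else is skipped.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Optional prometheus.io/path annotation overrides the default /metrics path.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: '(.+)'
        target_label: __metrics_path__
      # Optional prometheus.io/port annotation selects the port to scrape.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # Carry namespace and pod through as ordinary labels for querying.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

Because the keep rule runs at discovery time, pods without the annotation never become targets, so they cost nothing at scrape or storage time.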

The cardinality problem

Cardinality = the number of unique label combinations per metric. A metric http_requests_total{method, path, status, ...} with 5 methods × 100 paths × 5 statuses = 2,500 series. Add customer_id as a label with 10,000 customers and you have 25M series. That kills Prometheus.

We learned this the hard way. Specific cardinality bombs we've defused:

A custom metric with request_id as a label. Every request had a unique ID, so every request created a new series; the churn was so fast Prometheus couldn't keep up. Removed the label.
An exporter that put pod_name on every metric. Pod names rotate (every deploy creates new pods). The label values accumulated. We rewrote the exporter to use deployment_name instead.
HTTP status codes including custom ones (499, 498, etc.) for niche framework reasons. We aliased these to standard codes.

Detection: topk(20, count by (__name__) ({__name__=~".+"})) shows your top-20 highest-cardinality metrics. We run this monthly and have alerts on any metric exceeding 100k series.
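The 100k-series alert could look roughly like the rule below. This is a sketch under assumptions: the group name, alert name, evaluation interval, and for duration are illustrative. The label_replace call copies the metric name into an ordinary label (metric_name, a name chosen here) so that it is available on the resulting alert.

```yaml
groups:
  - name: cardinality_guard              # hypothetical group name
    interval: 5m                         # the query touches every series, so evaluate sparingly
    rules:
      - alert: MetricCardinalityHigh     # hypothetical alert name
        # Count series per metric name and keep only metrics above 100k series.
        expr: |
          label_replace(
            count by (__name__) ({__name__=~".+"}),
            "metric_name", "$1", "__name__", "(.+)"
          ) > 100000
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "{{ $labels.metric_name }} has {{ $value }} active series"
```

Note that this query is itself expensive on a large server, which is why the sketch uses a longer evaluation interval than the recording rules elsewhere in this article.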

Recording rules: pre-aggregate everything you query often

Recording rules let you compute aggregations and store them as new metrics. They run continuously and the result is queryable like any metric.

Why they matter: histogram_quantile(0.99, sum(rate(http_request_duration_bucket[5m])) by (le, service)) is expensive to compute every time. As a recording rule, the computation happens once per evaluation interval (every 30s) and the result is fast to query.

Our heaviest dashboards used to take 30+ seconds to load. After adding recording rules for the common queries, < 2 seconds.

```yaml
groups:
  - name: http_recording_rules
    interval: 30s  # evaluate every 30s
    rules:
      # p99 latency per service, precomputed so dashboards read a single series
      - record: service:http_request_p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_bucket[5m])) by (le, service))
      # fraction of requests answered with a 5xx, per service
      - record: service:http_error_rate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)
```

Rule of thumb: if a query takes more than 1 second to load, it's a candidate for a recording rule.

Alerting: less than you'd think

We have ~50 alert rules total. The temptation is to alert on every metric; the reality is most alerts you'd write are noise.

Our alert categories:

Symptom-based (alerts on what users see): error rate up, latency up, success rate down.
Cause-based for known issues (alerts on early indicators): memory pressure, disk full, replication lag. These get a runbook entry.
Health checks (something is unreachable): Prometheus itself, Alertmanager, etc.

What we don't alert on:

CPU usage above X% (CPU spikes happen; if user impact follows, the symptom alert fires)
"Pod restarted" (deploys cause restarts; not actionable on its own)
Disk usage on caches and ephemeral storage (filling up and being cleaned is normal)

Each alert has:

A clear name (HighAPILatency, not Alert42)
A severity (page, ticket, or info)
A runbook URL
A "for" duration (don't fire on transient spikes)

Example:

```yaml
- alert: HighAPILatency
  expr: service:http_request_p99{service="api"} > 1.0
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "API p99 latency above 1s for 5m"
    runbook: https://wiki.internal/runbooks/api-latency
```

Alertmanager: routing and silencing

Alertmanager handles what to do with fired alerts:

Group similar alerts (10 pods firing the same alert → 1 notification)
Route by severity (page → PagerDuty; ticket → Slack)
Silence during maintenance windows
Inhibit (e.g., "Cluster down" inhibits "API down" — no need to page twice)

Our config has ~5 routes. The most useful piece: inhibit_rules that prevent cascading alert noise. When the database is down, we don't want every dependent service's alert also paging.
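A trimmed-down sketch of that shape of config follows. Receiver names, the DatabaseDown alert name, the depends_on label, the cluster label, and all timing values are assumptions for illustration; the keys and webhook URL are placeholders.

```yaml
route:
  receiver: slack-tickets                # default receiver when no child route matches
  group_by: ["alertname", "service"]     # 10 pods firing the same alert -> 1 notification
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers: ['severity="page"']
      receiver: pagerduty-oncall
    - matchers: ['severity="ticket"']
      receiver: slack-tickets

inhibit_rules:
  # While DatabaseDown fires, mute alerts from services that declare a dependency
  # on the database (depends_on is a hypothetical label set in the alert rules).
  - source_matchers: ['alertname="DatabaseDown"']
    target_matchers: ['depends_on="database"']
    equal: ["cluster"]                   # only inhibit within the same cluster

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"   # placeholder
  - name: slack-tickets
    slack_configs:
      - api_url: "<slack-webhook-url>"               # placeholder
        channel: "#alerts"
```

Scoping the inhibit rule with a dedicated label keeps it from silencing unrelated pages that happen to fire during a database outage.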

Grafana: dashboards as code

We have ~80 dashboards. Three rules that keep them manageable:

Dashboard JSON in Git. Edits in the Grafana UI export to JSON, get committed. We use grafana-operator to apply them from K8s ConfigMaps.
One dashboard per service. Per-service dashboards have similar layout (RED metrics — Rate, Errors, Duration — plus service-specific). Copy-paste from a template, customize the queries.
A handful of overview dashboards. The cluster overview, the SLO dashboard, the on-call dashboard. These get used more than the per-service ones.

We resist the temptation to build elaborate dashboards. A dashboard that takes 30 seconds to load won't be checked.

SLO dashboards specifically

For services with defined SLOs (success rate, latency), we have an SLO dashboard with:

Current SLI value (last 5 min)
Error budget remaining (last 30 days)
Burn rate (how fast we're consuming the budget)
Trend chart (last 7 days)

The error budget calculation:

```
error_budget_remaining = 1 - (1 - SLI) / (1 - SLO_target)
```

E.g., if the SLO target is 99.9% and the SLI measured over the 30-day window is 99.95%, we've consumed half the error budget (0.05% of the 0.1% allowed), so 50% remains. When the budget reaches 0, we stop pushing risky changes until it recovers.
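The formula above can be precomputed with recording rules so the SLO dashboard reads ready-made series. This is a sketch assuming a 99.9% availability target and the http_requests_total metric used earlier; the group and rule names are illustrative.

```yaml
groups:
  - name: slo_availability               # hypothetical group name
    interval: 5m                         # 30d ranges are expensive; no need to evaluate often
    rules:
      # Fraction of failed requests over the 30-day budget window (this is 1 - SLI).
      - record: service:http_error_ratio:30d
        expr: |
          sum by (service) (increase(http_requests_total{status=~"5.."}[30d]))
            / sum by (service) (increase(http_requests_total[30d]))
      # error_budget_remaining = 1 - (1 - SLI) / (1 - SLO_target), with SLO_target = 0.999
      - record: service:error_budget_remaining:30d
        expr: 1 - service:http_error_ratio:30d / (1 - 0.999)
      # Burn rate over the last hour: 1 means the budget is spent exactly at the sustainable pace.
      - record: service:error_budget_burn_rate:1h
        expr: |
          (
            sum by (service) (rate(http_requests_total{status=~"5.."}[1h]))
              / sum by (service) (rate(http_requests_total[1h]))
          ) / (1 - 0.999)
```

With a 30-day error ratio of 0.0005, the second rule yields 1 - 0.0005 / 0.001 = 0.5, matching the worked example. A service that has never returned a 5xx produces no error-ratio series at all, so dashboards should treat a missing value as a full budget.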

What broke at scale

Specific things we hit:

Prometheus pod OOMing during heavy queries. A user's exploratory query that selected 10M series tipped Prometheus over. We added query-time protections (--query.max-samples=50000000; note that this is also the upstream default, so lower it if large queries still exhaust memory) and an alert when memory was near limit.

Disk full when retention wasn't bounded by size. Time-based retention alone puts no cap on disk usage: a bad scrape config briefly exploded cardinality and filled the disk, which corrupted the WAL. Now we have a hard disk-usage alert and --storage.tsdb.retention.size=80GB as a guard.

Federation latency at scale. We had a hub-and-spoke Prometheus pattern. The hub couldn't keep up with the spokes. We replaced this with Thanos, which queries multiple Prometheus servers without the federation pattern.

Alertmanager flapping during partial network failures. When connectivity to PagerDuty hiccupped, alerts queued, then flooded when connectivity came back. We added repeat_interval tuning and PD's own dedup keys.
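The hard disk-usage alert mentioned above might be sketched as follows. It relies on the kubelet's volume metrics and assumes the Prometheus data volume's PVC name starts with "prometheus-"; the group name, alert name, threshold, and for duration are illustrative.

```yaml
groups:
  - name: prometheus_self_monitoring     # hypothetical group name
    rules:
      - alert: PrometheusDiskAlmostFull  # hypothetical alert name
        # Fires when less than 15% of the Prometheus data volume is free.
        expr: |
          kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"prometheus-.*"}
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"prometheus-.*"}
            < 0.15
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Prometheus volume {{ $labels.persistentvolumeclaim }} is over 85% full"
```

Since this rule is evaluated by the very Prometheus it protects, it is worth pairing with an external health check so a full disk cannot silence its own alert.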

Cost

Self-hosted on existing K8s nodes:

Prometheus pod: ~12GB RAM, 4 vCPU
Thanos sidecar/store/compactor: ~6GB RAM
node_exporter, kube-state-metrics, exporters: ~2GB total
S3 for Thanos: ~$15/month for 600GB

Compare that to managed alternatives (Datadog, Grafana Cloud), which would run us thousands of dollars per month at our metric volume. The self-hosted setup pays for itself many times over, but the team time was real (~3 weeks initial setup, ~1 hour/week ongoing).

What I'd tell a team starting

Use kube-prometheus-stack, don't assemble manually. It's the working baseline. Customize from there.

Watch cardinality from day one. Set up alerts on top-cardinality metrics. The cleanup project later is much harder than prevention.

Add Thanos before you need historical data. Adding it later works, but your history only begins on the day you add it.

Recording rules for any query you run often. Cheap performance win.

Alert on symptoms first. What does the user feel? Alert on that. Keep cause-based alerts to the few known issues that have a runbook; left unchecked, they proliferate and become noise.

Dashboards in Git. UI-edited dashboards drift, get duplicated, become unmaintainable.

Don't try to monitor everything. Pick the SLIs that matter. Most metrics are debugging aids; only a handful are alert-worthy.

A working Prometheus stack is one of the highest-leverage investments a team can make. Once it's running well, debugging production becomes faster, alerting becomes calibrated, and the conversation about reliability becomes data-driven instead of opinion-based. The setup is a few weeks; the payoff is years.


About Kiril Urbonas