A working Prometheus stack for a 40-node cluster: what we deploy, what we tune, and what we wish we'd known about cardinality two years ago.
We run Prometheus across a Kubernetes cluster with ~40 nodes and ~600 services. The setup has been stable for about two years. Most Prometheus tutorials get you to "metrics are showing up" but don't cover the operational realities that bite once you scale beyond a single node. This is the production-shaped guide.
Our stack:
We deploy it via the kube-prometheus-stack Helm chart. It bundles all of the above with sensible defaults. It is a highly recommended starting point, with much less assembly than wiring everything up manually.
Prometheus's default storage settings (--storage.tsdb.retention.time=15d) are fine for exploration. For production:
The local 30 days is sufficient for almost all queries (incident review, recent trends). Thanos is for the rare "what did this metric look like 6 months ago?" question.
We started without Thanos and regretted it the first time we needed historical data and didn't have it. Add it from the start.
The default kube-prometheus-stack config scrapes everything Kubernetes-aware (pods and services carrying Prometheus annotations) every 30 seconds.
Tuning we did:
Scrape interval per workload. High-cardinality, high-importance services (the API gateway, the database) we scrape every 15s. Low-importance (cron jobs, batch workers) we scrape every 60s. The 30s default is fine for most things.
Drop noisy series at scrape time. Many exporters emit metrics we don't use. We use metric_relabel_configs to drop them at scrape time, before they hit storage. Example: dropping the go_memstats_* and go_gc_* metrics from services where we don't care about Go runtime details.
metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'go_(memstats|gc).*'
    action: drop
This cut our metrics volume by ~30% with no loss of useful data.
Honor __meta_kubernetes_pod_annotation_prometheus_io_scrape as opt-in for application metrics. Default-on scraping creates noise from internal services that emit useless metrics; we made it opt-in via annotation.
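The per-workload interval and the opt-in annotation can be sketched as plain Prometheus scrape configuration. This is a minimal sketch, not our exact config: the job names and the `app` pod label are hypothetical, and with kube-prometheus-stack you would typically supply something like this as additional scrape configs or express it through ServiceMonitor/PodMonitor objects instead.

```yaml
scrape_configs:
  # Opt-in application metrics: keep only pods annotated
  # prometheus.io/scrape: "true"; everything else is dropped
  # before any scrape happens.
  - job_name: app-pods              # hypothetical job name
    scrape_interval: 30s            # the default cadence
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        regex: "true"
        action: keep

  # A high-importance workload scraped more often.
  - job_name: api-gateway           # hypothetical job name
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]   # assumes an `app` pod label
        regex: api-gateway
        action: keep
```

One caveat with this shape: a pod that matches both jobs is scraped twice, so in practice the selectors need to be mutually exclusive.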
Cardinality = the number of unique label combinations per metric. A metric http_requests_total{method, path, status, ...} with 5 methods × 100 paths × 5 statuses = 2,500 series. Add customer_id as a label with 10,000 customers and you have 25M series. That kills Prometheus.
We learned this the hard way. Specific cardinality bombs we've defused:
request_id as a label. Each request had a unique ID; series churn was so fast Prometheus couldn't keep up. Removed the label.
pod_name on every metric. Pod names rotate (every deploy creates new pods). The label values accumulated. We rewrote the exporter to use deployment_name instead.
Non-standard HTTP status codes (499, 498, etc.) emitted for niche framework reasons. We aliased these to standard codes.
Detection: topk(20, count by (__name__) ({__name__=~".+"})) shows your top-20 highest-cardinality metrics. We run this monthly and have alerts on any metric exceeding 100k series.
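The 100k-series alert can be sketched as an alerting rule built on the same detection query. This is an illustration under assumptions: the group name, alert name, severity value, and the `metric_name` label are hypothetical. The label_replace step copies the metric name into a regular label, because Prometheus strips `__name__` from alert labels and the per-metric results would otherwise collide.

```yaml
groups:
  - name: cardinality               # hypothetical group name
    rules:
      - alert: HighCardinalityMetric    # hypothetical alert name
        # Count series per metric name, copy the name into
        # `metric_name`, and fire for any metric above 100k series.
        expr: |
          label_replace(
            count by (__name__) ({__name__=~".+"}),
            "metric_name", "$1", "__name__", "(.+)"
          ) > 100000
        for: 15m
        labels:
          severity: ticket          # hypothetical severity value
        annotations:
          summary: "Metric {{ $labels.metric_name }} has more than 100k series"
```

The query touches every series in the head block, so on a large server it is worth evaluating it on a long interval rather than every 30 seconds.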
Recording rules let you compute aggregations and store them as new metrics. They run continuously and the result is queryable like any metric.
Why they matter: histogram_quantile(0.99, sum(rate(http_request_duration_bucket[5m])) by (le, service)) is expensive to compute every time. As a recording rule, the computation happens once per evaluation interval (every 30s) and the result is fast to query.
Our heaviest dashboards used to take 30+ seconds to load. After adding recording rules for the common queries, they load in under 2 seconds.
groups:
  - name: http_recording_rules
    interval: 30s
    rules:
      - record: service:http_request_p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_bucket[5m])) by (le, service))
      - record: service:http_error_rate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)
Rule of thumb: if a query takes more than 1 second to load, it's a candidate for a recording rule.
We have ~50 alert rules total. The temptation is to alert on every metric; the reality is most alerts you'd write are noise.
Our alert categories:
What we don't alert on:
Each alert has:
A descriptive name (HighAPILatency, not Alert42)
Example:
- alert: HighAPILatency
  expr: service:http_request_p99{service="api"} > 1.0
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "API p99 latency above 1s for 5m"
    runbook: https://wiki.internal/runbooks/api-latency
Alertmanager handles what to do with fired alerts:
Our config has ~5 routes. The most useful piece: inhibit_rules that prevent cascading alert noise. When the database is down, we don't want every dependent service's alert also paging.
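An inhibit rule for the database case might look like the sketch below. The alert name `DatabaseDown` and the `depends_on` and `cluster` labels are hypothetical; the rule only works if your alerts actually carry labels like these. It uses the `source_matchers`/`target_matchers` form from recent Alertmanager releases.

```yaml
inhibit_rules:
  # While DatabaseDown is firing, mute paging alerts from services
  # that declare a dependency on the database in the same cluster.
  - source_matchers:
      - alertname = "DatabaseDown"    # hypothetical alert name
    target_matchers:
      - severity = "page"
      - depends_on = "database"       # hypothetical label on dependent alerts
    equal: ["cluster"]                # source and target must share this label value
```

The `equal` list matters: without it, a database outage in one cluster would silence dependent alerts everywhere.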
We have ~80 dashboards. Three rules that keep them manageable:
Dashboards live in Git, and we use grafana-operator to apply them from K8s ConfigMaps.
We resist the temptation to build elaborate dashboards. A dashboard that takes 30 seconds to load won't be checked.
For services with defined SLOs (success rate, latency), we have an SLO dashboard with:
The error budget calculation:
error_budget_remaining = 1 - (1 - SLI) / (1 - SLO_target)
E.g., if SLO target is 99.9% and current SLI is 99.95%, we've consumed half the error budget (0.05/0.1). When budget reaches 0, we stop pushing risky changes until it recovers.
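The same calculation can be expressed as recording rules. This is a sketch under assumptions, not our exact rules: it treats the SLI as the fraction of non-5xx requests, uses a 30-day window and a 99.9% target, and the group and rule names are hypothetical.

```yaml
groups:
  - name: slo_error_budget          # hypothetical group name
    interval: 5m
    rules:
      # SLI over the SLO window: fraction of requests that were not 5xx.
      - record: service:http_sli_success_ratio_30d
        expr: |
          1 - (
            sum(rate(http_requests_total{status=~"5.."}[30d])) by (service)
            /
            sum(rate(http_requests_total[30d])) by (service)
          )
      # error_budget_remaining = 1 - (1 - SLI) / (1 - SLO_target)
      - record: service:http_error_budget_remaining
        expr: 1 - (1 - service:http_sli_success_ratio_30d) / (1 - 0.999)
```

With an SLI of 0.9995 the second rule evaluates to 1 - 0.0005 / 0.001 = 0.5, matching the example above. Two caveats: a service with no 5xx series at all produces no result rather than a full budget, and a 30-day range query is expensive, which is why the sketch evaluates it every 5 minutes instead of every 30 seconds.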
Specific things we hit:
Prometheus pod OOMing during heavy queries. A user's exploratory query that selected 10M series tipped Prometheus over. We added query-time protections (--query.max-samples=50000000) and an alert when memory was near limit.
Disk full when retention wasn't bounded. A bad scrape config briefly exploded cardinality, filling disk, which corrupted the WAL. Now we have a hard disk-usage alert and --storage.tsdb.retention.size=80GB as a guard.
Federation latency at scale. We had a hub-and-spoke Prometheus pattern. The hub couldn't keep up with the spokes. We replaced this with Thanos, which queries multiple Prometheus servers without the federation pattern.
Alertmanager flapping during partial network failures. When connectivity to PagerDuty hiccupped, alerts queued, then flooded when connectivity came back. We added repeat_interval tuning and PD's own dedup keys.
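For the first two items, the guards map onto a few kube-prometheus-stack Helm values. This is a sketch of where the settings live, assuming the chart's `prometheus.prometheusSpec` block; the operator translates these fields into the corresponding Prometheus flags.

```yaml
prometheus:
  prometheusSpec:
    # Time-based retention for local storage.
    retention: 30d
    # Size-based guard; the oldest blocks are deleted first once the limit is hit.
    retentionSize: 80GB
    query:
      # Upper bound on samples a single query may load into memory.
      maxSamples: 50000000
```

The size limit should sit comfortably below the volume's capacity, since the WAL and in-progress compaction need headroom on the same disk.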
Self-hosted on existing K8s nodes:
Compare to managed alternatives (Datadog, Grafana Cloud) which would run us thousands per month at our metric volume. The self-hosted setup pays for itself many times over but the team time was real (~3 weeks initial setup, ~1 hour/week ongoing).
Use kube-prometheus-stack, don't assemble manually. It's the working baseline. Customize from there.
Watch cardinality from day one. Set up alerts on top-cardinality metrics. The cleanup project later is much harder than prevention.
Add Thanos before you need historical data. Adding it later works, but your history only starts on the day you add it.
Recording rules for any query you run often. Cheap performance win.
Alert on symptoms, not causes. What does the user feel? Alert on that. Cause-based alerts proliferate and become noise.
Dashboards in Git. UI-edited dashboards drift, get duplicated, become unmaintainable.
Don't try to monitor everything. Pick the SLIs that matter. Most metrics are debugging aids; only a handful are alert-worthy.
A working Prometheus stack is one of the highest-leverage investments a team can make. Once it's running well, debugging production becomes faster, alerting becomes calibrated, and the conversation about reliability becomes data-driven instead of opinion-based. The setup is a few weeks; the payoff is years.