Defining monitoring as code: dashboards, alerts, and SLOs in Git. The patterns that survived the migration from clicked-together monitoring.
About two years ago our monitoring lived in a mix of Datadog UI clicks, Grafana dashboards drifted from anyone's memory, and PagerDuty configs nobody could explain. We migrated everything to "monitoring as code" — dashboards, alerts, and SLOs in Git. This post is the working version after the dust settled.
Specific pain points that drove the change:
Drift. Two engineers click-fixed a dashboard slightly differently. Now there are two slightly-different copies. Nobody is sure which is canonical.
No code review. A new alert was created by clicking around. It was misconfigured. It paged at 3 AM for nothing. Nobody had reviewed it before it went live.
No history. "Why is this alert configured this way?" "Who knows; whoever set it up is no longer on the team."
Replicating across environments. A new staging environment needs the same dashboards as prod. "Just click through and re-create them" is slow and error-prone.
Restoring after corruption. A botched Datadog API call deleted a bunch of monitors. Restoring them was hours of work.
All of these go away when monitoring lives in Git.
Categories:
Dashboards (Grafana, Datadog): JSON definitions in repo. Synced via operator (Grafana operator) or CD pipeline (Datadog API on merge).
Alerts / Monitors: YAML or JSON definitions. Same sync pattern.
SLOs: separate definition files. Each SLO has a target, a SLI, an alerting policy, and an error budget.
Notification policies: who gets paged for what. Routes, escalations, oncall schedules (where the tool allows).
Recording rules (Prometheus): definitions in YAML, applied to Prometheus.
Custom queries / saved views: useful queries we've built up live in repo as runbook content.
What we did NOT put in Git:
For our stack:
Grafana: dashboards as JSON, applied via the Grafana operator (reads ConfigMaps). Alerts via grafana-operator's CRDs.
Datadog: dashboards and monitors managed via Terraform (Datadog provider). Could use Datadog's CLI or API directly; Terraform fits our existing IaC tooling.
PagerDuty: schedules, escalation policies, services managed via Terraform.
Prometheus: alert rules as YAML, applied via the kube-prometheus-stack chart. Recording rules same way.
The pattern: each tool has some way to apply config from files. We use whatever fits — Terraform for SaaS APIs, K8s operators for in-cluster tools, plain CD pipelines for the rest.
Our monitoring repo:
monitoring/
dashboards/
grafana/
service-overview.json
database-health.json
...
datadog/
api-performance.json
...
alerts/
grafana/
...
datadog/
...
slos/
api-availability.yaml
checkout-latency.yaml
pagerduty/
services.tf
schedules.tf
Each file is reviewable on its own. PRs touching monitoring go through the same code review as anything else.
A typical monitoring change:
The biggest cultural change: monitoring changes go through code review. Engineers got used to clicking through monitoring tools; now they edit YAML and wait for a review. Friction is real but the quality benefit is clear.
Across ~80 services, we wanted dashboards to look similar so engineers could orient quickly. Standard dashboard structure:
We have a Grafana dashboard template. Adding a new service: copy the template, change the data source filter to the new service. ~10 minutes vs an afternoon for hand-built dashboards.
For Datadog, similar — a dashboard template that team owners copy and customize.
Alerts as code makes alert hygiene visible. We have rules:
Every alert has an owner. A team or person; not "platform." Documented in the alert metadata.
Every alert has a runbook. A URL pointing to documentation. The runbook explains what the alert means, what to investigate, and what to do.
Every alert has a severity. Page (wakes someone up), ticket (work item, daytime response), info (FYI in chat).
Every alert has a "for" duration. Don't fire on a single bad data point.
We have a CI check that rejects alert PRs missing any of these. Without enforcement, the discipline drifts.
Quarterly review:
We've gotten our alert volume down to a sustainable level via these reviews. Without them, alerts accumulate and become noise.
For services with defined SLOs, we have YAML definitions:
name: api-availability
service: api
sli:
type: availability
query: 'sum(rate(http_requests_total{service="api",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="api"}[5m]))'
target: 0.999 # 99.9% availability
window: 30d
alerting:
burn_rate_fast: { window: 1h, threshold: 14.4 } # 2% burn in 1h
burn_rate_slow: { window: 6h, threshold: 6.0 } # 5% burn in 6h
A small operator reads these and creates the corresponding Prometheus rules and Grafana dashboards. The YAML is the source; everything else is generated.
When the SLO target changes (e.g., we tighten from 99.5% to 99.9%), edit the YAML. The downstream artifacts regenerate.
The migration took ~3 months. Approach:
The "lock UI" step was important. Without it, engineers continued to click-edit the migrated artifacts; drift returned. With UI write access removed (or at least restricted to admins for emergencies), the discipline held.
Specific issues during the migration:
Dashboard exports had volatile fields. Datadog dashboard JSON includes generated fields (timestamps, internal IDs). Exporting and committing these caused git diffs on every save even when nothing meaningful changed. We wrote a normalizer that strips volatile fields before commit.
Auto-generated names conflicting. Tools generate IDs from names; renaming a dashboard would create a duplicate. We added pre-merge checks for ID conflicts.
Apply pipelines failed silently. Initial CD pipeline reported "applied" even when the API rejected the change. We added explicit verification that the apply took effect.
Backward-incompatible Datadog API changes. Once or twice, Datadog changed the API and our committed configs broke. Pinned to specific Terraform provider versions and updated deliberately.
A few corners we haven't tackled:
On-call schedule rotations as code. PagerDuty schedules are partially in code, but actual rotation (who's on call this week) is still managed in the UI. The complexity of "fairness across timezones, vacation, etc." doesn't fit cleanly in static config.
Notebook-style operational analyses. When someone investigates an incident in a Datadog notebook, that's not in Git. We're OK with this — it's exploratory work.
Auto-generated dashboards. Some teams have asked for "spin up a dashboard automatically when a new service deploys." We haven't built this; teams write their own dashboards using the template.
Operational cost:
Compared to the operational cost of drift, broken dashboards, and noisy alerts: large positive ROI.
Migrate one tool at a time. Don't try to convert everything at once.
Lock UI write access after migration. Without this, drift returns.
Standardize templates. Dashboards that look similar across services are easier to navigate.
Every alert has owner, runbook, severity, duration. Enforce in CI.
Review alerts quarterly. Without active pruning, noise accumulates.
SLOs as YAML; downstream artifacts generated. One source of truth.
Don't put ad-hoc work in Git. Investigation dashboards stay in the UI; production monitoring goes in Git.
Monitoring as code isn't glamorous infrastructure work, but the day-to-day quality-of-life improvements are real. Engineers stop fighting drift; new services get good dashboards instantly; alerts are reviewed before firing. Like most discipline-driven improvements, the value compounds — six months in, the system is much better than the click-driven version, and you don't really notice anymore. That's the goal.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
Embed cost ownership in engineering: tags, budgets, and showback.
Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.
Explore more articles in this category
Backups are easy. Restores are hard. The quarterly drill we run, what's failed during it, and the discipline that makes "we have backups" actually mean something.
Replication is the foundation of database HA. What we monitor, how we practice failover, and the gotchas that show up only when you actually fail over.
Why Postgres connection limits bite at unexpected times, the pooling layer we put in front, and the pool-mode tradeoffs we learned the hard way.
Evergreen posts worth revisiting.