We track the four DORA metrics plus a handful of others. This post covers the trade-off between what's measurable and what's meaningful, and how we use the numbers.

DevOps Metrics That Matter

The DORA metrics (deployment frequency, lead time, change failure rate, MTTR) are widely cited and rarely well-implemented. We've tracked them for about three years across our engineering org. The metrics are useful when you use them as diagnostics and misleading when you treat them as targets. This post covers what we measure, how we use the numbers, and the metrics conversation we've had to have repeatedly.

The four DORA metrics, briefly

For reference:

Deployment frequency: how often you ship to production
Lead time for changes: time from commit to production
Change failure rate: percentage of deploys that cause incidents
Mean time to recovery (MTTR): how long incidents last

The original DORA research correlates these with org performance. Faster, more reliable shipping → better business outcomes. The relationship is real; the implementation matters.

What we actually measure

Our metric set, with target ranges:

Metric	Target	Current
Deployment frequency (per team)	Daily	Daily for 80% of teams
Lead time (commit to prod)	< 1 day	~14 hours median
Change failure rate	< 15%	~9%
MTTR	< 1 hour	~45 min median
Code review time	< 1 day	~6 hours median
Test suite duration	< 15 min	~12 min
Production p99 latency	< 500ms	varies per service
Error budget burn rate	< 1.5x	varies

The DORA four are the headline numbers; the others are leading indicators or operational signals.
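The error budget burn rate in the table is the least self-explanatory of these, so here is a worked sketch of the usual definition: the observed error rate divided by the error rate the SLO allows. The function name, the 99.9% SLO, and the request counts are illustrative assumptions, not figures from our services.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget is being spent exactly as fast as the SLO permits;
    above 1.0 means it will run out before the SLO window ends.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic, so no budget spent
    observed_error_rate = bad_events / total_events
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate


# Hypothetical service with a 99.9% availability SLO.
rate = burn_rate(bad_events=12, total_events=10_000, slo_target=0.999)
print(f"burn rate: {rate:.2f}x, under the 1.5x threshold: {rate < 1.5}")
```

With these made-up numbers the service is spending budget about 1.2 times faster than the SLO allows, which is still under the 1.5x threshold in the table.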

How we collect them

Most metrics derive from existing systems:

Deploy frequency / lead time: from CI/CD logs (we use GitHub Actions + Argo CD). We have a small service that watches deployment events and computes per-team metrics.
Change failure rate: incidents tagged with the deploy that caused them. Manual tagging by the on-call during incident resolution.
MTTR: from PagerDuty incident open/close timestamps.
Code review time: GitHub PR creation → first review.
Test suite duration: CI logs.
Production metrics: Prometheus / Datadog.

The whole system that produces the metrics: ~3,000 lines of code + a Postgres table + a dashboard. Not glamorous, but reliable.
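To make the arithmetic concrete, here is a minimal sketch of computing the four DORA metrics for one team from deploy and incident records. It is not our actual service: the `Deploy` and `Incident` record shapes, the `dora_summary` function, and the sample data are all hypothetical, and it simplifies by treating each deploy as carrying a single commit timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    commit_at: datetime      # when the change was committed
    deployed_at: datetime    # when it reached production
    caused_incident: bool    # tagged by the on-call during incident resolution


@dataclass
class Incident:
    opened_at: datetime
    resolved_at: datetime


def dora_summary(
    deploys: list[Deploy], incidents: list[Incident], window_days: int
) -> dict[str, Optional[float]]:
    """Compute the four DORA metrics for one team over a fixed window."""
    lead_hours = [
        (d.deployed_at - d.commit_at).total_seconds() / 3600 for d in deploys
    ]
    recovery_minutes = [
        (i.resolved_at - i.opened_at).total_seconds() / 60 for i in incidents
    ]
    failed = sum(1 for d in deploys if d.caused_incident)
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_hours) if lead_hours else None,
        "change_failure_rate": failed / len(deploys) if deploys else None,
        "median_recovery_minutes": (
            median(recovery_minutes) if recovery_minutes else None
        ),
    }


# Hypothetical two-day window for one team.
t0 = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deploy(t0, t0 + timedelta(hours=10), False),
    Deploy(t0, t0 + timedelta(hours=14), False),
    Deploy(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=14), True),
    Deploy(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=18), False),
]
incidents = [
    Incident(t0, t0 + timedelta(minutes=30)),
    Incident(t0, t0 + timedelta(minutes=60)),
]
summary = dora_summary(deploys, incidents, window_days=2)
print(summary)
```

The sketch reports medians rather than means because a single slow deploy or long incident would otherwise dominate the number.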

What the metrics actually tell you

The metrics are diagnostics, not goals. Reading them:

Low deployment frequency might mean: heavyweight deploy process, fear of breaking things, large batch sizes, inefficient code review. Each has different fixes.

Long lead time might mean: slow CI, painful code review, environment setup overhead, manual approvals.

High change failure rate might mean: insufficient testing, lack of canary deploys, complex changes, cultural pressure to ship fast at the expense of quality.

Long MTTR might mean: poor observability, missing runbooks, on-call doesn't know the system, complex rollback path.

When a metric trends the wrong way, the diagnostic question is "what specifically is causing this?", not "the metric is bad; ship faster."

What we DON'T do with metrics

The mistakes that ruin DORA metrics:

Don't use them in performance reviews. "Sara's team has lower deployment frequency than David's team" → Sara's team starts shipping smaller PRs purely to game the metric, and quality drops. The metric becomes useless.

Don't set arbitrary targets across teams. Different teams have different shapes of work. A platform team that ships once a week might be perfectly healthy; a feature team that ships once a week probably isn't. Compare to baseline, not to other teams.

Don't optimize the metric directly. "We need to improve deployment frequency" → batches get smaller, but if the bottleneck was code review (not deploy mechanics), nothing has actually improved. Diagnose the bottleneck; fix that.

Don't ignore quality metrics. Deployment frequency without change failure rate is incomplete. Fast and broken is worse than slow and stable.

The conversation we've had repeatedly

When someone (usually leadership) sees the metrics, the first reaction is often "let's improve [metric] by X% next quarter." This is the wrong framing.

The right framing: "We see [metric] is at [value]. What's blocking it from being better? Are those blockers worth removing?"

Sometimes the answer is: yes, these blockers are pure friction; let's remove them. We've cut lead time from 5 days to 14 hours over two years by removing specific blockers (slow CI, painful manual approval steps, etc.).

Sometimes the answer is: the blockers are there for a reason. A regulated payment service has slower lead time because of review requirements; that's correct, not a bug.

The metric tells you "where to look"; it doesn't tell you "what to fix."

Per-team vs aggregate

We compute metrics per team and as an org-wide rollup.

Per-team is more useful diagnostically. The org-wide rollup shows overall trends but can hide signal from individual teams.
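A small sketch of how a rollup hides signal, using invented team names and lead times (in hours). The helper `org_and_team_medians` is hypothetical.

```python
from statistics import median


def org_and_team_medians(
    samples_by_team: dict[str, list[float]]
) -> tuple[float, dict[str, float]]:
    """Return the pooled org-wide median and each team's own median."""
    pooled = [value for samples in samples_by_team.values() for value in samples]
    per_team = {team: median(samples) for team, samples in samples_by_team.items()}
    return median(pooled), per_team


# Hypothetical lead times in hours; "billing" is the struggling team.
lead_time_hours = {
    "checkout": [10, 12, 11],
    "search": [9, 13, 12],
    "billing": [70, 96, 80],
}
org_median, team_medians = org_and_team_medians(lead_time_hours)
print(org_median, team_medians)
```

The pooled median looks healthy because two of the three teams are fast; only the per-team view shows that one team is several times slower than the others.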

Per-team has its own pitfall: teams with very different work shapes (a research team, a customer-facing team, a platform team) shouldn't be compared on the same metrics with the same targets.

We have separate "tiers" of teams:

Customer-facing product teams: aggressive targets (daily deploy, sub-day lead time)
Platform / infrastructure teams: weekly deploy is fine; reliability metrics are primary
Data / ML teams: deploy frequency is less meaningful; their metrics are about model freshness and quality
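One way to encode tiers like these is as data, so a dashboard applies different expectations per tier instead of one global target. The sketch below is an assumption-heavy illustration: the `TierPolicy` shape, the tier keys, and the mapping of "reliability metrics" to change failure rate and MTTR are ours for the example, not a description of a real config.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierPolicy:
    # None means deploy cadence is not used as a target for this tier.
    deploy_cadence_days: Optional[int]
    # None means no lead-time target is set for this tier.
    max_lead_time_hours: Optional[int]
    primary_metrics: tuple[str, ...]


# Hypothetical encoding of the three tiers described above.
TIERS = {
    "product": TierPolicy(1, 24, ("deployment_frequency", "lead_time")),
    "platform": TierPolicy(7, None, ("change_failure_rate", "mttr")),
    "data_ml": TierPolicy(None, None, ("model_freshness", "model_quality")),
}


def deploy_cadence_ok(tier: str, days_since_last_deploy: float) -> Optional[bool]:
    """Check deploy cadence against the tier's own expectation.

    Returns None when the tier does not track deploy cadence at all.
    """
    policy = TIERS[tier]
    if policy.deploy_cadence_days is None:
        return None
    return days_since_last_deploy <= policy.deploy_cadence_days


print(deploy_cadence_ok("platform", 5))  # five days is fine for a weekly tier
print(deploy_cadence_ok("product", 5))   # but not for a daily tier
print(deploy_cadence_ok("data_ml", 30))  # not evaluated for this tier
```

Returning `None` rather than `True` for untracked tiers keeps "not measured" distinct from "healthy" on a dashboard.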

Beyond DORA: SPACE and others

DORA is one framework; there are others:

SPACE (Satisfaction and well-being, Performance, Activity, Communication and collaboration, Efficiency and flow): broader, includes developer experience.

Engineer satisfaction survey (we run quarterly): captures things metrics can't (frustration, blockers, feeling of progress).

Code review quality: not just speed but rigor. We sample reviewed PRs occasionally to check that reviews aren't rubber-stamps.

We use DORA as the primary numerical headline because it's simple and well-understood. SPACE-style measurement and surveys add the qualitative side.

Specific changes we made based on metrics

Examples of metric-driven improvements:

Lead time was 5 days. Investigation: code review averaged 2.5 days. Fix: dedicated review time blocks, automated reviewers for small PRs, escalation if a PR sat un-reviewed > 24 hours. Lead time dropped to ~14 hours.

Deployment frequency stalled. Investigation: deploys took 45 min and required manual approvals at 4 stages. Fix: trimmed approvals to 2 (one technical, one business for sensitive changes), parallelized deploy steps. Deploy frequency increased; team morale improved (fewer "waiting on a deploy" frustrations).

Change failure rate spiked one quarter. Investigation: a new architecture pattern caused a class of bugs we hadn't seen before. Fix: added integration tests for the pattern, training on the new failure modes. Change failure rate dropped back to baseline within 2 sprints.

MTTR was bimodal — most incidents resolved in 15 min, a few took 8+ hours. Investigation: the long ones lacked runbooks for specific subsystems. Fix: runbook gap analysis, dedicated work to fill gaps. Long-tail MTTR improved.

The metrics didn't fix anything by themselves. They surfaced where to look.
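The bimodal MTTR case is worth a quick illustration, because a single average hides it. The sketch below uses invented incident durations and a hypothetical two-hour cutoff for "long" incidents; `recovery_profile` is not a function from our tooling.

```python
from statistics import mean, median
from typing import Optional


def recovery_profile(
    durations_min: list[float], long_threshold_min: float = 120
) -> dict[str, Optional[float]]:
    """Summarise recovery times and separate out the long tail."""
    if not durations_min:
        raise ValueError("need at least one incident duration")
    long_tail = [d for d in durations_min if d > long_threshold_min]
    return {
        "mean": mean(durations_min),
        "median": median(durations_min),
        "long_tail_count": len(long_tail),
        "long_tail_share": len(long_tail) / len(durations_min),
        "long_tail_median": median(long_tail) if long_tail else None,
    }


# Hypothetical quarter: 18 quick incidents and 2 that took 8+ hours.
durations = [15] * 18 + [480, 540]
profile = recovery_profile(durations)
print(profile)
```

Here the mean is about an hour and the median is 15 minutes, and neither number describes a real incident well. Counting and inspecting the long tail separately is what points at the missing runbooks.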

What metrics don't capture

Things the dashboard misses:

Quality of decisions: a team can ship daily and ship the wrong thing. Deployment frequency doesn't capture business impact.

Tech debt accrual: a team can hit all velocity metrics while building unmaintainable code. The debt eventually catches up.

Engineer well-being: the metrics can be green while the team is burning out.

Innovation: experimental work and research often don't produce shippable code on a regular cadence. Metric-targeting can discourage exploration.

These gaps are why the metrics aren't a complete picture. They're a useful piece of it, not the whole.

What I'd tell a team starting

Start with the four DORA metrics. They're well-understood and easy to compute.

Collect them for at least a quarter before acting. Trend matters more than a snapshot.

Use metrics for diagnosis, not as performance targets. "Where's the bottleneck?" not "improve this number."

Don't compare teams with different work shapes. Per-team baselines, not relative rankings.

Pair quantitative metrics with qualitative signals (surveys, retros). The numbers can hide problems people see directly.

Resist the "let's set a 20% improvement target" framing. Identify specific blockers; remove them; the metric improves as a side effect.

DevOps metrics are valuable when they point at things you can change. They're harmful when they become the goal itself. The discipline is in keeping the focus on what's slowing teams down — and trusting that fixing those things will move the metrics naturally.

About Kiril Urbonas