Error budgets turn "how reliable should we be?" into a number both product and SRE can spend. Here is how we set, alert on, and enforce them.

SRE Error Budgets in Practice: Shipping Fast Without Burning Reliability

Every team eventually has the same argument. Product wants to ship. Ops wants stability. The debate runs on gut feeling, and whoever is louder in the meeting wins. Error budgets end that argument by putting a number on the table that both sides have to respect.

The idea is small once it clicks. You decide how reliable a service needs to be, not how reliable it can possibly be. The gap between that target and 100% is your budget to spend on shipping. When you have budget left, you ship. When you have burned through it, you stop and fix things. No more arguing from vibes.

We have run this on services doing a few hundred requests a second and on ones doing tens of thousands. The mechanics are the same. What changes is how much discipline the team has to enforce the policy when the budget runs dry, and that is the part most write-ups skip.

SLI, SLO, error budget: three words people fuse #

Three terms, and people fuse them constantly, so let me pin them down.

An SLI, service level indicator, is a measurement of how the service is doing. It is a ratio of good events to total events. "Fraction of HTTP requests that returned a non-5xx status in under 300ms." That is an SLI. Notice it is a number you can compute from data you already have.

An SLO, service level objective, is the target you hold that SLI to over a window. "99.9% of requests are good over a rolling 30 days." That is an SLO.

The error budget is what falls out of the SLO by subtraction. If your objective is 99.9% good, then 0.1% bad is allowed. That 0.1% is the budget. It is not failure. It is the allowance you get to spend on risk: deploys, migrations, config changes, the occasional bad Tuesday.

Do the arithmetic once and it stops feeling abstract. A 30-day month has 43,200 minutes. At 99.9%, 0.1% of that is about 43 minutes. So a 99.9% availability SLO buys you about 43 minutes of downtime a month. Move to 99.95% and you have roughly 22 minutes. Push to 99.99% and you are down to about 4 minutes, which is too short for a human to get paged, diagnose, and fix anything, so recovery has to be automated. Each extra nine cuts the budget tenfold and costs far more effort than the last, so choose the target that matches what users actually need, not the most impressive one.
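The same arithmetic as a small Python sketch, in case you want to plug in your own objective or window. The function name `allowed_downtime_minutes` is ours for illustration, and it assumes the simplest model: the service is either fully up or fully down.

```python
def allowed_downtime_minutes(objective: float, window_days: int = 30) -> float:
    """Minutes of total outage that an availability objective permits."""
    if not 0.0 < objective < 1.0:
        raise ValueError("objective must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60  # 43,200 for a 30-day window
    return (1.0 - objective) * window_minutes


for objective in (0.999, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(objective)
    print(f"{objective:.2%} -> {minutes:.1f} minutes per 30 days")
```

For the three objectives above this works out to roughly 43.2, 21.6, and 4.3 minutes.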

Pick SLIs users can feel #

The most common failure is measuring what is easy instead of what matters. CPU is easy. Nobody files a ticket because your CPU hit 80%. They file a ticket because the page took nine seconds or the checkout button did nothing.

Good SLIs sit at the boundary where the user meets the system. Two carry most of the weight:

Availability: of the requests users made, what fraction succeeded? Define "succeeded" carefully. A 500 is a failure. A 400 from a client sending garbage usually is not your fault, so exclude it. A slow-but-eventually-correct response might count as a failure too, depending on the service.
Latency: what fraction of requests came back fast enough? Always express this as a threshold, never an average. "95% of requests under 300ms" is a claim you can act on. "Average latency 210ms" hides the tail where the pain lives. One request at 12 seconds and ninety-nine at 50ms average out to about 170ms, which looks great on a dashboard and still means one user is furious.
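To make the average-versus-threshold point concrete, here is a minimal Python sketch using the example from the latency item: ninety-nine requests at 50ms and one at 12 seconds. The helper `latency_sli` and the 300ms default are illustrative assumptions, not part of any library.

```python
from statistics import mean


def latency_sli(latencies_ms, threshold_ms=300.0):
    """Fraction of requests that finished under the threshold."""
    if not latencies_ms:
        raise ValueError("no requests in the window")
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


# Ninety-nine quick requests and one that took twelve seconds.
latencies = [50.0] * 99 + [12_000.0]

print(mean(latencies))         # 169.5, looks healthy
print(latency_sli(latencies))  # 0.99, shows that 1% of users waited too long
```

The mean stays comfortably under 300ms, while the threshold SLI makes the slow request visible as a missed percent.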

Measure at the load balancer or gateway if you can, because that is closest to what the user saw. Measuring inside the app misses the requests that never made it in.

Here is an availability SLI written as a Prometheus recording rule, so the ratio is computed once and reused everywhere:

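```yaml
groups:
  - name: checkout-sli
    rules:
      # Good = any non-5xx response; total = every request to the checkout job.
      # The burn-rate alerts later on this page also need this ratio over
      # 30m, 1h, and 6h: repeat the rule with the matching range and name suffix.
      - record: job:checkout_requests:good_ratio_5m
        expr: |
          sum(rate(http_requests_total{job="checkout",code!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="checkout"}[5m]))
```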

From target to budget #

Setting the target is a business decision wearing an engineering costume. Ask what a user tolerates before they leave or complain, then set the SLO a little tighter than that. For most internal and mid-tier services, 99.9% is a sane starting point. Payment paths might want 99.95%. A batch reporting job might be fine at 99%, and pretending otherwise just wastes on-call sleep.

Write the SLO down somewhere the team reads. We keep ours as a small YAML block next to the service, version-controlled, so a change to the target is a reviewed pull request rather than a hallway agreement:

```yaml
service: checkout
slo:
  objective: 0.999          # 99.9% of requests good
  window: 30d               # rolling
  indicator: availability   # non-5xx responses
budget:
  # 0.1% of requests over 30 days
  policy: freeze-on-exhaust
```
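For a request-based SLO like this one, the budget is easiest to reason about as a count of bad requests. A rough Python sketch, assuming you can pull total and failed request counts for the window from your metrics store; `budget_report` and the traffic numbers are made up for illustration.

```python
def budget_report(objective: float, total_requests: int, bad_requests: int) -> dict:
    """Summarize how much of a request-based error budget has been spent."""
    if not 0.0 < objective < 1.0:
        raise ValueError("objective must be a fraction between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_bad = (1.0 - objective) * total_requests
    return {
        "allowed_bad": allowed_bad,                    # budget, in requests
        "remaining_bad": allowed_bad - bad_requests,   # negative once exhausted
        "consumed_fraction": bad_requests / allowed_bad,
    }


# Hypothetical month: 10 million requests, 4,000 of them failed.
report = budget_report(objective=0.999, total_requests=10_000_000, bad_requests=4_000)
print(report)
```

With those numbers the budget is about 10,000 bad requests and roughly 40% of it is spent. A `consumed_fraction` of 1.0 or more is the point where a freeze-on-exhaust policy would kick in.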

Alert on burn rate, not on thresholds #

This is where most monitoring setups go wrong, and it is worth getting right. The naive alert is "page if error rate > 1% for 5 minutes." It fires on every minor blip that self-heals, the team learns to ignore it, and the one time it matters nobody moves. Alert fatigue is not a personality flaw, it is a design flaw.

Burn rate reframes the question. Instead of "is the error rate high right now?" you ask "are we spending the budget faster than the window can afford?" A burn rate of 1 means you will exactly exhaust the budget at the end of the window. A burn rate of 14.4 means you are torching a 30-day budget in about two days.
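Burn rate is just the observed error ratio divided by the budget, and the time to exhaustion falls out of it. A minimal Python sketch; the function names are ours, and it assumes the burn rate stays constant for the whole window.

```python
def burn_rate(error_ratio: float, objective: float) -> float:
    """How many times faster than 'exactly sustainable' the budget is being spent."""
    budget = 1.0 - objective
    if budget <= 0.0:
        raise ValueError("objective must be below 1.0")
    return error_ratio / budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole window's budget is gone at a constant burn rate."""
    if rate <= 0.0:
        raise ValueError("burn rate must be positive")
    return window_days * 24 / rate


def budget_spent_fraction(rate: float, elapsed_hours: float, window_days: int = 30) -> float:
    """Share of the window's budget consumed after burning at `rate` for some hours."""
    return rate * elapsed_hours / (window_days * 24)


rate = burn_rate(error_ratio=0.0144, objective=0.999)   # about 14.4
print(hours_to_exhaustion(rate))                        # about 50 hours, roughly two days
print(budget_spent_fraction(rate, elapsed_hours=1))     # about 0.02 of the budget per hour
```

So one hour at 14.4x costs about 2% of a 30-day budget, which is why that rate is worth a page.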

The pattern that works is multi-window, multi-burn-rate. You run two conditions at once:

A fast burn alert: a high burn rate (say 14.4x) over a short window (an hour, plus a five-minute check to confirm it is still happening). This pages a human. Something is on fire now.
A slow burn alert: a lower burn rate (around 3x) over a long window (six hours). This opens a ticket, not a page. You are leaking budget steadily and should look before the weekend.

Requiring a short confirmation window alongside the long one kills the flapping. The alert only fires if the burn is real over the long window and still happening in the last few minutes.

```yaml
groups:
  - name: checkout-slo-burn
    rules:
      # These expressions assume recording rules named good_ratio_5m, _30m,
      # _1h, and _6h exist, each defined like the 5m availability rule but
      # with the matching range.

      # Fast burn: 14.4x over 1h AND 5m, page immediately
      - alert: CheckoutErrorBudgetFastBurn
        expr: |
          (1 - job:checkout_requests:good_ratio_1h) > (14.4 * 0.001)
          and
          (1 - job:checkout_requests:good_ratio_5m) > (14.4 * 0.001)
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "Checkout burning error budget 14.4x (fast)"

      # Slow burn: 3x over 6h AND 30m, ticket not a page
      - alert: CheckoutErrorBudgetSlowBurn
        expr: |
          (1 - job:checkout_requests:good_ratio_6h) > (3 * 0.001)
          and
          (1 - job:checkout_requests:good_ratio_30m) > (3 * 0.001)
        for: 15m
        labels: { severity: ticket }
        annotations:
          summary: "Checkout burning error budget 3x (slow)"
```

The 0.001 in there is your budget: 1 minus the 0.999 objective. Change the SLO and that is the only value you need to update, but it appears in every expression, so update all of them together. If you want the longer version of why this beats static thresholds, we wrote it up separately in burn-rate alerting and the SLO discipline that prevents alert fatigue.
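If it helps to see the logic outside PromQL, here is a small Python sketch of the same multi-window check with the thresholds derived from the objective, so there really is one number to change. `BurnAlert`, `FAST`, and `SLOW` are illustrative names, and the sketch ignores the `for:` hold duration that Prometheus applies on top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnAlert:
    name: str
    burn_rate: float
    long_window: str
    short_window: str

    def threshold(self, objective: float) -> float:
        """Error ratio above which this alert's burn rate is exceeded."""
        return self.burn_rate * (1.0 - objective)

    def fires(self, objective: float, long_error_ratio: float, short_error_ratio: float) -> bool:
        """True only if both the long and the short window exceed the threshold."""
        limit = self.threshold(objective)
        return long_error_ratio > limit and short_error_ratio > limit


FAST = BurnAlert("fast", burn_rate=14.4, long_window="1h", short_window="5m")
SLOW = BurnAlert("slow", burn_rate=3.0, long_window="6h", short_window="30m")

OBJECTIVE = 0.999  # the single value to change when the SLO changes

# 2% errors over the last hour and still 3% in the last five minutes: page.
print(FAST.fires(OBJECTIVE, long_error_ratio=0.02, short_error_ratio=0.03))
# Same hour, but the last five minutes are clean: the incident is over, stay quiet.
print(FAST.fires(OBJECTIVE, long_error_ratio=0.02, short_error_ratio=0.0))
```

The second call is the case the short window exists for: the long window still remembers the incident, but nothing is burning now, so nobody gets paged.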

The budget policy is the whole point #

An error budget with no consequence is a dashboard nobody looks at. The policy is the teeth. Agree it in advance, in writing, while everyone is calm, because you will not negotiate it fairly at 2am with the budget on fire.

Ours reads roughly like this. While budget remains, product owns the pace and ships freely; risky changes are fine because that is what the budget is for. When the budget is exhausted, non-essential feature deploys freeze. The team shifts to reliability work, starting with whatever caused the burn, until the rolling window recovers enough headroom. Security fixes and rollbacks are always exempt.
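A policy like that is simple enough to encode as a deploy gate. This is a sketch under our own assumptions, not a description of any particular CI system: the `Change` categories and `deploy_allowed` are hypothetical, and `budget_remaining_fraction` is whatever your SLO tooling reports for the rolling window.

```python
from enum import Enum


class Change(Enum):
    FEATURE = "feature"            # non-essential feature deploy
    SECURITY_FIX = "security_fix"
    ROLLBACK = "rollback"


# Exempt from the freeze regardless of budget.
ALWAYS_ALLOWED = {Change.SECURITY_FIX, Change.ROLLBACK}


def deploy_allowed(change: Change, budget_remaining_fraction: float) -> bool:
    """Apply a freeze-on-exhaust policy to a proposed deploy."""
    if change in ALWAYS_ALLOWED:
        return True
    # Feature work ships only while some budget is left in the window.
    return budget_remaining_fraction > 0.0


print(deploy_allowed(Change.FEATURE, budget_remaining_fraction=0.35))   # True
print(deploy_allowed(Change.FEATURE, budget_remaining_fraction=0.0))    # False
print(deploy_allowed(Change.ROLLBACK, budget_remaining_fraction=-0.2))  # True
```

Putting the check in the pipeline means the freeze does not depend on someone remembering to announce it.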

The freeze is not punishment. It is the system telling you that you have been spending faster than you can sustain, and the honest response is to slow down and pay it back. The first time a team actually honors a freeze is uncomfortable. Every time after, the arguments get shorter because everyone knows the rule is real.

This is what makes error budgets a shared decision tool instead of an SRE stick. Product is not asking permission to ship. They are watching the same number. When it is healthy, nobody needs a meeting. When it is not, the conversation is about which reliability work clears the fastest, not about whether to care.

Mistakes we keep watching people make #

Too many SLOs. A team defines an SLO for every endpoint and every dependency, forty of them, and now nobody can say if the service is healthy. Start with two or three that map to user journeys. Fewer, meaningful SLOs beat a wall of them.
SLOs nobody enforces. The budget runs out, everyone shrugs, shipping continues. Within a quarter the SLO is decoration. If you are not willing to freeze, do not claim to have an error budget, because all you have is a metric.
Alerting on causes instead of symptoms. Paging on high CPU, low disk, or a restarted pod means paging on things that may not affect users at all, and missing the ones that do. Page on the symptom the user feels, failed requests or slow requests, and let causes be what you investigate after the page, not what triggers it. We went deeper on this in alerting on symptoms, not causes.
Chasing nines nobody asked for. Someone sets 99.99% because it sounds serious, and now the team burns weeks engineering its way down to four minutes of monthly downtime for a service whose users would not notice ten. Match the target to the need.

The call we'd make #

Start with one service, one or two SLIs users actually feel, and a 99.9% objective you can defend. Wire up multi-window burn-rate alerts and delete the static threshold ones. Then write the budget policy down and, the first time the budget runs out, honor the freeze even though it is inconvenient. That first honored freeze is what converts an error budget from a slide into a habit. Everything else is refinement.

About Kiril Urbonas
