We wrote pretty postmortems for two years and kept hitting the same incidents. Here's what changed when we started writing ugly ones.

Incident Postmortems That Actually Prevent Repeat Failures

For about two years our postmortems were beautifully written. Each one had a Five Whys section, a clean timeline, an action items table, and a thoughtful blameless tone. We linked them in the engineering wiki. We presented them at the quarterly review. They were really good documents.

We also kept having the same kinds of incidents. The same database failover would surprise us once a quarter. The same kind of cache stampede would happen during the same kind of deploy. The same auth provider would have a region-pair outage and we'd page the same engineer.

The pretty postmortems were not stopping incidents. We changed how we wrote them, and the rate of repeats finally started dropping. The new ones are uglier and longer. They work better.

What was broken about the old format

Two things, mostly.

First, the action items table was a wishlist. It was full of "investigate X," "consider doing Y," "should evaluate Z." Items lived in Jira forever and got closed in end-of-quarter clean-up sweeps. Nobody felt accountable for any single item; they were team-wide aspirations.

Second, the Five Whys converged on tidy single-cause explanations. "Why did the cache stampede? Because cache invalidation isn't atomic. Why isn't it atomic? Because we use Redis pub/sub. Why...?" By the time you got to the fifth Why you were at "we chose Redis in 2019" and you couldn't really act on it.

Real production failures aren't shaped like a single tree of causes. They look more like the Swiss cheese model: several layers of defence, each with holes, and an incident only gets through when the holes line up. The Five Whys methodology forced us to pretend failures were tree-shaped, which made the action items shallower than they should have been.

The new shape

Our postmortem template is now ugly on purpose. It looks like this:

# Incident YYYY-MM-DD: [short customer-visible title]

## Customer impact
[How many customers, for how long, how did it manifest, how did they tell us]

## Detection
[What alerted us, how long after the start of the incident, what did the alert say]

## Timeline
[Each line is a UTC timestamp + a fact + a person. No commentary in the timeline itself]

## What broke (technical)
[The actual technical chain. Multiple contributing causes are expected]

## What helped
[Things that reduced impact or sped up recovery — credit to the system, not blame]

## What hurt
[Things that prolonged or worsened the incident — facts, no judgement]

## Repeat-likelihood checklist
[Specific questions, answered]

## Action items
[Each with owner, due date, exit criteria]

## Sign-off
[A list of names. Not "approved" — read]

The two sections that did most of the work are the Repeat-likelihood checklist and the Action items format. I'll go into both.
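
One way to stop a section from quietly disappearing is to lint the document before review. The sketch below is a hypothetical helper, not tooling described in this post: the name `missing_sections` is invented, and it assumes the postmortem is stored as Markdown with the second-level headings from the template above.

```python
import re

# Section headings taken from the template above.
REQUIRED_SECTIONS = [
    "Customer impact",
    "Detection",
    "Timeline",
    "What broke (technical)",
    "What helped",
    "What hurt",
    "Repeat-likelihood checklist",
    "Action items",
    "Sign-off",
]


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required sections that have no '## ' heading in the text."""
    found = {
        match.group(1).strip()
        for match in re.finditer(
            r"^##[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE
        )
    }
    return [name for name in REQUIRED_SECTIONS if name not in found]
```

A check like this only proves the headings exist. Whether the content under them is any good is still the reviewer's job.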

The repeat-likelihood checklist

When we went back through our previous postmortems, we noticed that the same shapes of incident kept recurring. We made a checklist that explicitly asks about each of the recurring patterns. Every postmortem now has to answer all of them in writing:

Was there a deploy in the 60 minutes before this fired? Whose? Was it rolled back?
Could this have been caught by a different alert, set differently? Specifically what setting?
Did any runbook step actually apply? Which one? Did it work?
Did this incident depend on a single human being available? Who? What if they hadn't been?
Was any information from a previous postmortem relevant? Which one? Did anyone re-read it?
Could a feature flag, kill switch, or config toggle have stopped this faster? Did we use one? Why not?

These questions are the same every time. They feel mechanical. They are. The point is to stop the postmortem author from drifting into novel theorising and force them to compare the current incident against the patterns we've already seen.
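
Because the questions never change, they are easy to enforce mechanically. Here is a minimal sketch, assuming the answers are collected as a mapping from a short key to the written answer. The keys, the `unanswered` helper, and the list of placeholder strings treated as non-answers are all hypothetical and only for illustration.

```python
# Short keys for the six checklist questions above (keys are illustrative).
CHECKLIST = {
    "recent_deploy": "Was there a deploy in the 60 minutes before this fired? Whose? Was it rolled back?",
    "better_alert": "Could this have been caught by a different alert, set differently? Specifically what setting?",
    "runbook": "Did any runbook step actually apply? Which one? Did it work?",
    "single_human": "Did this incident depend on a single human being available? Who? What if they hadn't been?",
    "previous_postmortem": "Was any information from a previous postmortem relevant? Which one? Did anyone re-read it?",
    "kill_switch": "Could a feature flag, kill switch, or config toggle have stopped this faster? Did we use one? Why not?",
}

# Assumption: these placeholders do not count as answering a question.
NON_ANSWERS = {"", "n/a", "na", "tbd", "todo", "?"}


def unanswered(answers: dict[str, str]) -> list[str]:
    """Return the checklist keys that are missing or hold only a placeholder."""
    return [
        key
        for key in CHECKLIST
        if answers.get(key, "").strip().lower() in NON_ANSWERS
    ]
```

A postmortem would then be ready for review only when `unanswered(...)` returns an empty list.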

Two recent examples of how this caught repeats:

We had an outage caused by a config push to a feature flag service. The postmortem author answered "Was this caused by a config push?" honestly: yes. Then the next question — "Have we had an incident from a config push before?" — yes, two of them in the last year. The action items shifted from "improve our alerting on this service" to "config pushes need the same canary-and-rollback discipline as code pushes." We built that. Repeats stopped.
A network partition caused a service to fail open in a way that processed transactions twice. The author noted: "The duplicate-handling test we wrote after Incident 2024-08-14 didn't run because of an unrelated CI flake the previous night." We hadn't been re-running flaked tests; the gap had been there for weeks. Two action items: re-run all flaked tests automatically, and add a daily monitor for the test in question.

The checklist also surfaced things that had already been action items in previous postmortems and were never finished. So we added a question: "Is there an open action item from a previous postmortem that, if completed, would have prevented this?" That has been the most uncomfortable question to answer, and the most useful.

Action items with exit criteria

Old format:

Investigate adding circuit breaker to payments service. (Owner: payments team)

New format:

Add a circuit breaker to the payments service that opens when the upstream provider's error rate is > 5% for 30 seconds, half-opens after 60 seconds, and closes on 5 consecutive successes. Owner: Sara. Due: 2026-02-14. Exit criteria: PR merged, chaos test simulating upstream failure shows the breaker activates within 35 seconds, dashboard payments-circuit-breaker shows the metric.

The exit criteria are the lever. "Done" used to mean "the Jira ticket is closed." Now it means a reviewer can verify that a specific observable thing exists. We close action items in pairs: the engineer who did the work, and a reviewer who confirms exit criteria are met.
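
The new-format item is specific enough to sketch the breaker itself. What follows is an illustration, not the payments team's implementation. It reads "error rate > 5% for 30 seconds" as the error rate over a sliding 30-second window, and it assumes any failure while half-open re-opens the breaker. Both readings are assumptions, as are the class and method names. A production version would also want a minimum request volume before tripping.

```python
import time
from collections import deque


class CircuitBreaker:
    """Closed -> open -> half-open -> closed, using the thresholds above."""

    def __init__(self, error_rate=0.05, window_s=30.0, open_s=60.0,
                 successes_to_close=5, clock=time.monotonic):
        self.error_rate = error_rate
        self.window_s = window_s
        self.open_s = open_s
        self.successes_to_close = successes_to_close
        self.clock = clock            # injectable so tests can control time
        self.state = "closed"
        self.calls = deque()          # (timestamp, succeeded) inside the window
        self.opened_at = None
        self.half_open_successes = 0

    def allow_request(self) -> bool:
        """Return False while open; move to half-open once open_s has elapsed."""
        if self.state == "open" and self.clock() - self.opened_at >= self.open_s:
            self.state = "half_open"
            self.half_open_successes = 0
        return self.state != "open"

    def record(self, succeeded: bool) -> None:
        """Report the outcome of a call that allow_request() let through."""
        now = self.clock()
        if self.state == "open":
            return                    # late result from before the breaker opened
        if self.state == "half_open":
            if not succeeded:
                self._open(now)       # assumption: one failure re-opens
                return
            self.half_open_successes += 1
            if self.half_open_successes >= self.successes_to_close:
                self.state = "closed"
                self.calls.clear()
            return
        self.calls.append((now, succeeded))
        while now - self.calls[0][0] > self.window_s:
            self.calls.popleft()      # drop samples older than the window
        failures = sum(1 for _, ok in self.calls if not ok)
        if failures / len(self.calls) > self.error_rate:
            self._open(now)

    def _open(self, now: float) -> None:
        self.state = "open"
        self.opened_at = now
        self.calls.clear()
```

The injectable `clock` is what makes an exit criterion like "activates within 35 seconds" checkable in a unit test as well as in a chaos test.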

We also made the cadence stricter. Every action item has a real due date, max 6 weeks out for severity-1 incidents. Action items past due get escalated to the team's manager. We had to do this twice before behaviour changed; we haven't had to recently.
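
The rules in this section are simple enough to express as data plus a few checks. This is a sketch under assumptions, not a description of our Jira setup: the `ActionItem` class and its field names are hypothetical, and it assumes the reviewer must be a different person from the engineer who did the work.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

MAX_SEV1_WINDOW = timedelta(weeks=6)  # "max 6 weeks out for severity-1 incidents"


@dataclass
class ActionItem:
    description: str
    owner: str                          # a named person
    due: date
    exit_criteria: list[str]
    done_by: Optional[str] = None       # engineer who did the work
    verified_by: Optional[str] = None   # reviewer who confirmed the exit criteria

    def problems(self, opened: date, severity: int) -> list[str]:
        """Return reasons this item should be rejected when it is written."""
        issues = []
        if not self.owner.strip():
            issues.append("no owner")
        if not self.exit_criteria:
            issues.append("no exit criteria")
        if severity == 1 and self.due - opened > MAX_SEV1_WINDOW:
            issues.append("due date more than 6 weeks out")
        return issues

    def is_closed(self) -> bool:
        """Closed only when the engineer and a different reviewer both signed."""
        return bool(
            self.done_by and self.verified_by and self.done_by != self.verified_by
        )

    def needs_escalation(self, today: date) -> bool:
        """Past due and not closed: escalate to the team's manager."""
        return today > self.due and not self.is_closed()
```

Running `needs_escalation` over all open items on a schedule would give roughly the escalation list described above.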

What we ditched

The Five Whys section. We replaced it with "What broke (technical)" and "What hurt." The shift to a swiss-cheese framing has been more useful for the engineering team.
The "blameless" preamble. We are still blameless in practice, but we noticed that announcing it up front made people defensive. Removing the framing and just acting blameless worked better.
Estimated $ impact. We tried to compute this for a while; the numbers were always disputable, and the conversation got dragged into accounting. We replaced it with "users affected, duration, severity tier" — facts we agree on.

What we kept and tightened

Customer impact at the top. Always first. Forces the author to think about who was hurt before getting into the engineering autopsy.
Detection time. The metric we track most aggressively. If we can't detect, we can't respond. We've cut median detection time from 12 minutes to 3 over the year, mostly by improving the alerts that came out of postmortem action items.
Timeline as facts only. No "we believe" or "I think." Facts and timestamps. Interpretation goes elsewhere in the doc.

Cadence

Every severity-1 has a postmortem within 5 business days, presented to the wider engineering team within 10. Severity-2s get a 1-page version. Severity-3s get a Jira note and no full postmortem unless the same thing happens twice in 90 days, in which case it gets promoted to a real one.

The "promote on repeat" rule has caught two infestations of low-severity issues that had been flying under the radar.
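
The promotion rule itself is a one-liner once you have the dates. This sketch assumes you already have some way of deciding that two severity-3 notes describe "the same thing" (for example a shared label on the Jira note, which is a hypothetical convention). The function name is also invented.

```python
from datetime import date, timedelta

REPEAT_WINDOW = timedelta(days=90)


def should_promote(occurrences: list[date]) -> bool:
    """True if any two occurrences of the same sev-3 issue fall within 90 days."""
    ordered = sorted(occurrences)
    # Only neighbours need comparing: the closest pair is always adjacent.
    return any(
        later - earlier <= REPEAT_WINDOW
        for earlier, later in zip(ordered, ordered[1:])
    )
```

For example, `should_promote([date(2025, 1, 1), date(2025, 3, 1)])` is true because the two dates are 59 days apart.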

The numbers that mattered to us

We track three things, all backed by the postmortem trail:

Repeat rate: percentage of severity-1 incidents in a quarter that share a root cause with one in the prior 12 months. Was 35-40% with the old format. Last two quarters: 12% and 9%.
Time-to-action-complete: median days from postmortem published to all action items closed-with-criteria. Was around 60 days. Now ~21.
Action items per postmortem: not a goal, but worth watching. Holding steady at 4-6 per severity-1; we don't add filler items.
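
The first two numbers can be computed straight from the postmortem trail. This is a sketch of the arithmetic only. It assumes each postmortem records a `root_cause` label (a hypothetical field; agreeing on that label is the hard part), and it treats "prior 12 months" as 365 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from statistics import median


@dataclass
class Incident:
    started: date
    root_cause: str  # hypothetical label assigned during the postmortem


def repeat_rate(quarter: list[Incident], history: list[Incident]) -> float:
    """Percent of incidents in `quarter` whose root cause also appears in an
    earlier incident from the preceding 365 days."""
    if not quarter:
        return 0.0

    def is_repeat(incident: Incident) -> bool:
        return any(
            earlier.root_cause == incident.root_cause
            and timedelta(0) < incident.started - earlier.started <= timedelta(days=365)
            for earlier in history
        )

    return 100.0 * sum(is_repeat(i) for i in quarter) / len(quarter)


def median_days_to_complete(postmortems: list[tuple[date, date]]) -> float:
    """Median days between (published, all action items closed) date pairs."""
    return median((closed - published).days for published, closed in postmortems)
```

`history` should include the quarter's own incidents, so that a repeat inside the same quarter is counted too.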

What this isn't

This isn't a process that scales well to teams that don't already have working incident response. If your detection is broken, no postmortem template will fix it. If your team doesn't have time to do the work after the document is written, the document doesn't matter.

It also isn't a substitute for engineering investment in reliability. The postmortems surface the right problems; the team has to choose to fix them. When budget gets tight and reliability work gets deferred, the postmortems start documenting the same thing repeatedly. That's a leadership signal, not a process problem.

What I'd tell a new team

Pick the three patterns you keep repeating. Write them down. Add specific questions about those patterns to your postmortem template. Force every postmortem to answer them. The first month is awkward; by month three the template starts catching repeats before they make it to the action items.

The pretty postmortem template is a trap. The one that helps is ugly and a bit annoying. That's how you know it's doing work.

About Kiril Urbonas