We wrote pretty postmortems for two years and kept hitting the same incidents. Here's what changed when we started writing ugly ones.
For about two years our postmortems were beautifully written. Each one had a Five Whys section, a clean timeline, an action items table, and a thoughtful blameless tone. We linked them in the engineering wiki. We presented them at the quarterly review. They were really good documents.
We also kept having the same kinds of incidents. The same database failover would surprise us once a quarter. The same kind of cache stampede would happen during the same kind of deploy. The same auth provider would have a region-pair outage and we'd page the same engineer.
The pretty postmortems were not stopping incidents. We changed how we wrote them, and the rate of repeats finally started dropping. The new ones are uglier and longer. They work better.
Two things were wrong with the pretty ones, mostly.
First, the action items table was a wishlist. It was full of "investigate X," "consider doing Y," "should evaluate Z." Items lived in Jira forever and got closed at end-of-quarter clean-up sweeps. Nobody felt accountable to any single item; they were team-wide aspirations.
Second, the Five Whys converged on tidy single-cause explanations. "Why did the cache stampede? Because cache invalidation isn't atomic. Why isn't it atomic? Because we use Redis pub/sub. Why...?" By the time you got to the fifth Why you were at "we chose Redis in 2019" and you couldn't really act on it.
Real production failures aren't shaped like a single tree of causes. They're shaped like the Swiss cheese model: several layers of defence, each with holes, and the incident only happens when the holes line up. The Five Whys methodology forced us to pretend failures were tree-shaped, which made the action items shallower than they should have been.
Our postmortem template is now ugly on purpose. It looks like this:
# Incident YYYY-MM-DD: [short customer-visible title]
## Customer impact
[How many customers, for how long, how did it manifest, how did they tell us]
## Detection
[What alerted us, how long after the start of the incident, what did the alert say]
## Timeline
[Each line is a UTC timestamp + a fact + a person. No commentary in the timeline itself]
## What broke (technical)
[The actual technical chain. Multiple contributing causes are expected]
## What helped
[Things that reduced impact or sped up recovery — credit to the system, not blame]
## What hurt
[Things that prolonged or worsened the incident — facts, no judgement]
## Repeat-likelihood checklist
[Specific questions, answered]
## Action items
[Each with owner, due date, exit criteria]
## Sign-off
[A list of names. Not "approved" — read]
The two sections that did most of the work are the Repeat-likelihood checklist and the Action items format. I'll go into both.
When we spent time studying our previous postmortems, we noticed that the same shapes of incident kept recurring. We made a checklist that explicitly asks about each of the recurring patterns. Every postmortem now has to answer all of them in writing:
These questions are the same every time. They feel mechanical. They are. The point is to stop the postmortem author from drifting into novel theorising and force them to compare the current incident against the patterns we've already seen.
Two recent examples of how this caught repeats:
We had an outage caused by a config push to a feature flag service. The postmortem author answered "Was this caused by a config push?" honestly: yes. Then the next question — "Have we had an incident from a config push before?" — yes, two of them in the last year. The action items shifted from "improve our alerting on this service" to "config pushes need the same canary-and-rollback discipline as code pushes." We built that. Repeats stopped.
A network partition caused a service to fail open in a way that processed transactions twice. The author noted: "The duplicate-handling test we wrote after Incident 2024-08-14 didn't run because of an unrelated CI flake the previous night." We hadn't been re-running flaked tests; the gap had been there for weeks. Two action items: re-run all flaked tests automatically, and add a daily monitor for the test in question.
The checklist also spotted things that had ALREADY been action items in previous postmortems and never finished. We added "is there an open action item from a previous postmortem that, if completed, would have prevented this?" That's been the most uncomfortable question to answer, and the most useful.
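The "answer all of them in writing" rule is easy to enforce mechanically. Here is a minimal sketch, assuming the checklist is a fixed list of question strings and the postmortem's answers arrive as a dictionary. Only the three questions quoted above come from this article; the `CHECKLIST` and `unanswered` names are hypothetical, and a real checklist would carry your own recurring patterns.

```python
# Hypothetical checklist. Only these three questions are quoted in the article;
# a real list would include one question per recurring incident pattern.
CHECKLIST = (
    "Was this caused by a config push?",
    "Have we had an incident from a config push before?",
    "Is there an open action item from a previous postmortem that, "
    "if completed, would have prevented this?",
)


def unanswered(answers: dict[str, str]) -> list[str]:
    """Return the checklist questions that have no written answer.

    A missing key and a blank string both count as unanswered.
    """
    return [q for q in CHECKLIST if not answers.get(q, "").strip()]


if __name__ == "__main__":
    draft = {CHECKLIST[0]: "Yes, a flag-service config push.", CHECKLIST[1]: "  "}
    for question in unanswered(draft):
        print("Unanswered:", question)
```

A check like this could run before a postmortem is accepted for review, so a skipped question blocks the document rather than relying on the author to notice.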
Old format:
New format:
payments-circuit-breaker shows the metric.
The exit criteria are the lever. "Done" used to mean "the Jira ticket is closed." Now it means a reviewer can verify that a specific observable thing exists. We close action items in pairs: the engineer who did the work, and a reviewer who confirms exit criteria are met.
We also made the cadence stricter. Every action item has a real due date, max 6 weeks out for severity-1 incidents. Action items past due get escalated to the team's manager. We had to do this twice before behaviour changed; we haven't had to recently.
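The rules above (owner, due date, exit criteria, paired close, six-week cap) are concrete enough to check in code. The following is a rough sketch, not the tooling described in this post. The `ActionItem` class and its field names are hypothetical, the six-week window is assumed to be measured from the incident date, and the wishlist prefixes are taken from the phrasing quoted in the old action items table.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# From the article: severity-1 action items are due at most 6 weeks out.
# Assumption: the window is measured from the incident date.
MAX_SEV1_WINDOW = timedelta(weeks=6)

# Phrasing the old wishlist-style table was full of.
WISHLIST_PREFIXES = ("investigate", "consider", "should evaluate")


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    exit_criteria: str
    done_by: Optional[str] = None      # engineer who did the work
    verified_by: Optional[str] = None  # reviewer who confirmed the exit criteria

    def problems(self, incident_date: date, severity: int) -> list[str]:
        """List the ways this item falls short of the format."""
        issues = []
        if not self.owner.strip():
            issues.append("no owner")
        if not self.exit_criteria.strip():
            issues.append("no exit criteria")
        if self.title.strip().lower().startswith(WISHLIST_PREFIXES):
            issues.append("wishlist phrasing")
        if severity == 1 and self.due > incident_date + MAX_SEV1_WINDOW:
            issues.append("due date more than 6 weeks out")
        return issues

    def is_closed(self) -> bool:
        """Closed in pairs: a doer plus a different person as reviewer."""
        return bool(
            self.done_by and self.verified_by and self.done_by != self.verified_by
        )

    def is_overdue(self, today: date) -> bool:
        """Open past its due date, so a candidate for escalation."""
        return not self.is_closed() and today > self.due
```

Treating "closed" as a property of two named people, rather than a ticket status, is the design choice that matters here: an item cannot be closed by the person who did the work alone.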
Every severity-1 has a postmortem within 5 business days, presented to the wider engineering team within 10. Severity-2s get a 1-page version. Severity-3s get a Jira note and no full postmortem unless the same thing happens twice in 90 days, in which case it gets promoted to a real one.
The "promote on repeat" rule has caught two infestations of low-severity issues that had been flying under the radar.
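The promotion rule reduces to a small date comparison. Here is a sketch, assuming each severity-3 Jira note is tagged with some pattern identifier so that "the same thing" can be grouped, and that the input is the list of dates on which one such pattern occurred. The function name is hypothetical, and deciding what counts as the same thing is left to whoever triages.

```python
from datetime import date, timedelta

# From the article: the same thing happening twice in 90 days triggers promotion.
REPEAT_WINDOW = timedelta(days=90)


def should_promote(occurrences: list[date]) -> bool:
    """True if any two occurrences of one pattern fall within 90 days of each other.

    Only adjacent dates need comparing once the list is sorted: if no
    neighbouring pair is within the window, no other pair can be.
    """
    ordered = sorted(occurrences)
    return any(
        later - earlier <= REPEAT_WINDOW
        for earlier, later in zip(ordered, ordered[1:])
    )


if __name__ == "__main__":
    print(should_promote([date(2025, 1, 1), date(2025, 3, 1)]))  # 59 days apart
    print(should_promote([date(2025, 1, 1), date(2025, 6, 1)]))  # 151 days apart
```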
We track three things, all backed by the postmortem trail:
This isn't a process that scales well to teams that don't already have working incident response. If your detection is broken, no postmortem template will fix it. If your team doesn't have time to do the work after the document is written, the document doesn't matter.
It also isn't a substitute for engineering investment in reliability. The postmortems surface the right problems; the team has to choose to fix them. When budget gets tight and reliability work gets deferred, the postmortems start documenting the same thing repeatedly. That's a leadership signal, not a process problem.
Pick the three patterns you keep repeating. Write them down. Add specific questions about those patterns to your postmortem template. Force every postmortem to answer them. The first month is awkward; by month three the template starts catching repeats before they make it to the action items.
The pretty postmortem template is a trap. The one that helps is ugly and a bit annoying. That's how you know it's doing work.