Wikis rot. We moved every operational doc into the repo it describes. Six months in, the docs are mostly correct because the only people who can update them are the ones who change the system.
Two years ago we had documentation in four places: Confluence (the largest and most stale), an "engineering-handbook" GitHub repo (kept up by one person who had recently left), individual repo READMEs (varying quality), and Slack pinned messages (ironically, sometimes the most accurate). The result was that engineers looking for "how do I do X" usually couldn't find the right doc, or found several conflicting ones.
This post is about how we moved infrastructure documentation next to the code that creates the infrastructure, and why the people who change the system now have to update the docs as part of the same change.
Three places, by category:
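- Service-specific docs: in each service's own repo, at docs/runbook.md.
- Cross-service infrastructure docs: in the infra repo under docs/.
- Everything else engineering-related: the engineering-handbook repo, also in markdown.

That's it. Confluence still exists for non-engineering content (HR, legal, meeting notes), but nothing engineering-relevant lives there anymore.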
Documentation and the system it documents drift if they're maintained by different people in different tools. The closer the doc is to the code, the smaller the gap. If "update the runbook" is the next line in the same PR that changes the system, the runbook stays correct. If it's a separate Confluence edit that needs to happen tomorrow, it doesn't.
So: the doc lives in the repo, gets updated in the same PR as the code change, and shows up in code review.
Every service repo has a docs/runbook.md with a fixed structure:
# Runbook: <service-name>
## What this service does
One-paragraph summary. New on-call should understand within 30 seconds.
## Owner
Team: <team-name>. Slack: <#channel>. PagerDuty rotation: <link>.
## Dependencies
- Depends on: <list of upstream services + their critical paths>
- Depended on by: <list of downstream services>
- External dependencies: <list of third-party APIs, with status pages>
## Alerts and what to do
### <Alert name 1>
**What it means**: <plain-English explanation>
**Customer impact**: <when this fires, what does the user see>
**First-5-minute steps**:
1. <specific command or link>
2. <specific command or link>
**Common causes**: <last 6 months of incidents>
### <Alert name 2>
... etc
## Common operations
- How to deploy: <link to deploy.sh + notes>
- How to rollback: <specific steps>
- How to scale up/down: <command>
- How to enable verbose logging: <command>
## Architecture notes
Anything non-obvious about the design that on-call would need to know.
## Known gotchas
The footguns. The "if you do X, also do Y" rules.
The structure is fixed across all services so on-call engineers know where to look. The "Alerts and what to do" section is the most-read part during incidents; we put it near the top.
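Because the section headings are fixed, a runbook's structure can be checked mechanically. The sketch below is an illustration, not something the post describes running: it assumes the `## ` headings from the template above are the required set, and the function name `missing_sections` is hypothetical.

```python
import re

# Required "## " headings, taken from the runbook template above.
REQUIRED_SECTIONS = [
    "What this service does",
    "Owner",
    "Dependencies",
    "Alerts and what to do",
    "Common operations",
    "Architecture notes",
    "Known gotchas",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required second-level headings absent from a runbook."""
    # Match lines that start with exactly "## "; "### " alert headings do not match.
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^## (.+)$", runbook_text, re.MULTILINE)
    }
    return [section for section in REQUIRED_SECTIONS if section not in found]
```

Calling `missing_sections(Path("docs/runbook.md").read_text())` in a repo would list any sections a runbook has dropped, which makes a template drift visible before on-call goes looking for a section that isn't there.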
Three forces keep the runbooks current. Each is modest on its own, but they compound:
PR template asks. When you submit a PR that touches app/ or Dockerfile or infrastructure/, the PR template includes a checkbox: "I have updated docs/runbook.md if applicable." It's not auto-enforced, but the box is there.
Quarterly drill. Once a quarter, the on-call rotation runs through every alert in every runbook and asks: "If this fired right now, would the runbook lead me to the right action?" Anything that doesn't gets a Jira ticket against the owning team.
Last-modified shaming. We have a tiny script that runs weekly and lists every runbook that hasn't been touched in 90+ days. It posts to a Slack channel. There's no consequence beyond visibility — but visibility is enough to nudge the team to either update or explicitly mark "this runbook is stable, no changes needed" with a date.
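The staleness report in the last item can be sketched in a few lines. This is a guess at the shape of such a script, not the actual one: it assumes "touched" means the last git commit to docs/runbook.md, and the names `last_commit_time` and `stale_runbooks` are hypothetical. Posting to Slack and honoring the "stable, no changes needed" marker are left out.

```python
import subprocess
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

STALE_AFTER = timedelta(days=90)  # threshold from the post: 90+ days untouched


def last_commit_time(repo: Path, rel_path: str = "docs/runbook.md") -> Optional[datetime]:
    """Committer time of the last commit touching rel_path, or None if never committed."""
    out = subprocess.run(
        ["git", "-C", str(repo), "log", "-1", "--format=%ct", "--", rel_path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return datetime.fromtimestamp(int(out), tz=timezone.utc) if out else None


def stale_runbooks(last_touched: dict[str, datetime], now: datetime) -> list[tuple[str, int]]:
    """Return (service, days since last change) for stale runbooks, stalest first."""
    stale = [
        (service, (now - touched).days)
        for service, touched in last_touched.items()
        if now - touched >= STALE_AFTER
    ]
    return sorted(stale, key=lambda item: item[1], reverse=True)
```

Keeping the date comparison separate from the git call means the threshold logic can be tested without a repository, and the weekly job only has to build the `last_touched` mapping and format the result.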
Things that span multiple services live in infra/docs/:
- network-design.md: VPC layout, subnet purposes, security group conventions
- iam-model.md: how roles are structured, who has what
- observability.md: metrics naming conventions, log formats, dashboard ownership
- incident-response.md: severity definitions, communication templates, escalation paths
- cloud-account-layout.md: which account is which, what lives where

These are the docs that a new engineer reads in their first week. They change less frequently than service runbooks but matter more for the big picture.
We use mkdocs to render them as a browsable site (deployed to a static hosting bucket gated behind SSO). The same markdown source serves both GitHub viewing and the rendered site. People read whichever is convenient.
The shift from "docs are someone else's job" to "docs are part of the change" took about two months. The first month everyone said "yes I'll update the docs" and didn't. The second month, code reviewers started leaving comments like "you didn't update the runbook for the new alert" and PRs sat until the docs were updated. By month three, it was internalized.
The forcing function was code review. Without reviewers asking, the docs would have drifted again. With reviewers asking, the doc updates landed in the same PR and the gap stayed small.
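The question reviewers ask can be written down as a small check. The post describes this as a human review habit, not an automated gate, so the sketch below is only an illustration. It assumes the list of changed file paths in a PR is available as repo-relative strings, and the function name `needs_runbook_reminder` is hypothetical. The watched paths are the ones named in the PR template.

```python
from pathlib import PurePosixPath

WATCHED_DIRS = ("app", "infrastructure")  # directories named in the PR template
WATCHED_FILES = ("Dockerfile",)           # top-level files named in the PR template
RUNBOOK = "docs/runbook.md"


def needs_runbook_reminder(changed_files: list[str]) -> bool:
    """True when a PR touches the system but leaves the runbook untouched."""
    if RUNBOOK in changed_files:
        return False  # the runbook was updated in the same PR
    for changed in changed_files:
        parts = PurePosixPath(changed).parts
        if parts and (parts[0] in WATCHED_DIRS or changed in WATCHED_FILES):
            return True
    return False
```

A True result is a prompt for the reviewer to ask, not proof that the runbook is wrong: plenty of changes under app/ need no runbook update, which is why the checkbox says "if applicable".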
A few metrics, more for amusement than enforcement:
These aren't strict KPIs but they're the trend we wanted.
Teams starting with this approach tend to hit a few failure patterns:
Documenting too much. A runbook that describes every internal data structure of a service is unreadable during an incident. The runbook is for operations, not architecture. Keep it short.
Documenting too little. "See the source code" isn't a runbook. The runbook should let an on-call engineer who has never looked at the service take useful action.
Letting one person own all the runbook updates. That person becomes a bottleneck and burns out. Distributing the responsibility (one team per service) is essential.
Using fancy doc generators that nobody on the team understands. mkdocs is fine. Sphinx is fine. Pick something the team already knows. Don't introduce a new tool just for documentation.
This isn't a replacement for tribal knowledge transfer. The runbook tells you what to do when an alert fires; pairing with a senior engineer tells you why. Both are needed.
This isn't a substitute for proper incident response training. The runbook helps during the moment; the training is what builds the engineer's ability to handle the situation when the runbook isn't quite right.
This isn't going to fix a culture where documentation is genuinely undervalued. If the team doesn't see writing docs as part of the work, no tooling fixes that. The shift from Confluence to git-tracked markdown helped us, but the deeper shift was treating docs as code: review, version, maintain.
Pick the worst-documented service. Put a runbook template in its repo. Have one engineer fill it out — pairing with the team that owns the service. The first runbook is the slowest. Subsequent ones get easier because the template is established.
Don't try to migrate everything from your wiki. The wiki has 90% bit-rot anyway. Migrate the 10% that's still correct, and let the rest die. The new docs are written from current truth, not from a stale source.
Tie doc updates to PR review for the corresponding service. Make it cultural. The technical part of this is trivial; the cultural part is everything.
The first time on-call thanks you for the runbook saving them at 3am — that's when you know it stuck.