Platform teams own the systems that EVERY service depends on. This is our incident response playbook for when the foundation cracks.
If you're on a platform team, your incidents have a different shape than service-team incidents. When a service team's incident hits, one team's product degrades. When a platform team's incident hits, every team's product degrades — sometimes simultaneously. The blast radius is the entire engineering org.
This guide is the playbook our platform team uses for that class of incident. Six months in, our mean-time-to-recovery on platform-level incidents has dropped from ~38 minutes to ~12. Below is what changed.
The systems we're responsible for: shared Kubernetes clusters, the artifact registry, the CI/CD platform itself, internal DNS, the secrets manager, the company-wide observability stack, IAM/SSO. When any of these degrade, multiple service teams can't ship, can't observe, or can't authenticate.
Platform incidents have three properties that make them harder:
The playbook below is tuned for these.
The instant a platform incident is suspected, the on-call engineer:
INC-2026-04-26-001 acknowledged. Investigating <symptom>. Updates in this thread.
The pinned message is the canonical source for the rest of the incident. Service teams report into the thread; the platform team responds in the thread. Anyone posting about the incident elsewhere is asked to move the conversation into the thread.
This single change, channeling all incident communication into one thread, probably saved us the most time. Before, the platform team would have five separate Slack threads going, each with different teams asking the same questions, and the actual debugging was happening between Zoom and the threads.
Within the first 5 minutes, three roles are claimed:
For small incidents this collapses to one or two people. For anything spanning more than 30 minutes or affecting more than 2 service teams, all three roles are claimed explicitly.
The claiming is important. Otherwise the IC role implicitly falls to whoever spoke first, and the engineer doing the technical work also gets stuck answering "any update?" questions every two minutes.
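The escalation rule above is simple enough to encode, for example in a chat bot that nudges the on-call engineer. This is a minimal sketch: the thresholds come from the text, while the function and constant names are hypothetical.

```python
# Thresholds from the playbook text; names are illustrative assumptions.
EXPLICIT_ROLES_AFTER_MINUTES = 30
EXPLICIT_ROLES_ABOVE_TEAMS = 2


def needs_explicit_roles(elapsed_minutes: float, affected_service_teams: int) -> bool:
    """Return True when all three roles should be claimed explicitly."""
    return (
        elapsed_minutes > EXPLICIT_ROLES_AFTER_MINUTES
        or affected_service_teams > EXPLICIT_ROLES_ABOVE_TEAMS
    )
```

A bot could call this on every status update and post a reminder the first time it returns True.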
Platform incidents often have an "is the cause X or Y?" question early on. Our dependency map lives in infra/docs/incident-response.md and shows the upstream/downstream relationships of every platform component:
[ Vault ]
↑
├─ [ K8s clusters ]
↑ ↑
│ ├─ [ Service workloads ]
│ └─ [ Argo CD ]
↑
└─ [ CI deploys (signed image verification) ]
[ Internal DNS ]
↑
├─ [ Vault discovery ]
├─ [ K8s API endpoints ]
└─ [ Service-to-service calls ]
When CI starts failing AND service workloads start failing AND it's not Argo CD's release window, Vault is a strong suspect. The map turns "what could be wrong" from a search through unknowns into a structured triage.
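The triage the map enables can be sketched as a graph query: collect everything upstream of each failing component and intersect the results. The edges below are one reading of the map, not a copy of the real file, and the component names are shortened for illustration.

```python
from collections import deque

# "component -> what it depends on". Assumed edges, read off the map above.
DEPENDS_ON = {
    "service-workloads": ["k8s-clusters"],
    "argo-cd": ["k8s-clusters"],
    "k8s-clusters": ["vault", "internal-dns"],
    "ci-deploys": ["vault"],
    "vault": ["internal-dns"],
    "internal-dns": [],
}


def upstream(component):
    """Return every component that `component` depends on, transitively."""
    seen = set()
    queue = deque(DEPENDS_ON.get(component, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(DEPENDS_ON.get(node, []))
    return seen


def common_suspects(failing):
    """Return the upstream components shared by all failing components."""
    upstream_sets = [upstream(c) for c in failing]
    return set.intersection(*upstream_sets) if upstream_sets else set()
```

With these edges, `common_suspects(["ci-deploys", "service-workloads"])` narrows the search to Vault and internal DNS, which matches the reasoning in the text: check those two before anything else.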
The hardest discipline. Service teams go quiet when they don't get updates, then panic-message later. Steady updates — even "we don't know yet, will update in 10 minutes" — are better than silence.
We post updates every 15 minutes during an active incident. The template is:
[15:42 UTC] STATUS UPDATE — INC-2026-04-26-001
Symptom: Vault returning 503 on auth endpoint
Impact: New pod deploys are failing across all clusters; existing pods unaffected
Investigation: Examining Vault HCP control plane status
Next update: 15:57 UTC or sooner if state changes
Even when there's no progress, the update goes out. It tells service teams: "we know, we're on it, you don't need to chase us."
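Generating the update from a function keeps the format stable and removes the arithmetic on the next-update time. The following is a minimal sketch, not our actual tooling: the function name and parameters are hypothetical, and it uses a plain hyphen as the header separator.

```python
from datetime import datetime, timedelta, timezone

# Cadence from the text: one update every 15 minutes.
UPDATE_INTERVAL = timedelta(minutes=15)


def format_status_update(incident_id, now, symptom, impact, investigation):
    """Render one status update; `now` is expected to be a UTC datetime."""
    next_update = now + UPDATE_INTERVAL
    return "\n".join([
        f"[{now:%H:%M} UTC] STATUS UPDATE - {incident_id}",
        f"Symptom: {symptom}",
        f"Impact: {impact}",
        f"Investigation: {investigation}",
        f"Next update: {next_update:%H:%M} UTC or sooner if state changes",
    ])


if __name__ == "__main__":
    print(format_status_update(
        "INC-2026-04-26-001",
        datetime(2026, 4, 26, 15, 42, tzinfo=timezone.utc),
        "Vault returning 503 on auth endpoint",
        "New pod deploys are failing across all clusters; existing pods unaffected",
        "Examining Vault HCP control plane status",
    ))
```

Passing `now` in explicitly, rather than reading the clock inside the function, makes the output easy to check in a drill.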
The actual technical work is, of course, technical. A few patterns we've noticed help:
When the immediate symptom is resolved:
We don't declare the all-clear at the moment of fix. We wait for 60 minutes of clean metrics. About once a quarter, an aftershock during this window exposes a fix that was only partial.
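The observation window can be expressed as a small check. This sketch assumes the clock restarts whenever an anomaly shows up after the fix; the text does not say whether it does, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

# Window length from the text.
OBSERVATION_WINDOW = timedelta(minutes=60)


def can_declare_all_clear(fix_time, now, anomaly_times):
    """Return True once metrics have been clean for the full window.

    `anomaly_times` holds timestamps at which metrics were not clean.
    Assumption: any anomaly after the fix restarts the 60-minute clock.
    """
    after_fix = [t for t in anomaly_times if t >= fix_time]
    clean_since = max([fix_time] + after_fix)
    return now - clean_since >= OBSERVATION_WINDOW
```

For a fix at 16:00 with an aftershock at 16:20, the earliest all-clear under this assumption is 17:20, not 17:00.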
The postmortem is the longer-form artifact (we have a separate post on the postmortem template). For platform incidents, two specific things go in:
The communication review is where we've improved most. Each postmortem catches something — "we forgot to update statuspage during the recovery phase" / "the dependency map was missing the link between X and Y" / "the on-call engineer's first instinct was to fix instead of communicate; we lost 12 minutes of comms." Each issue gets a small action item.
A few patterns we've consciously dropped:
Updating individual teams in DM. The first 30 minutes of a big incident used to be 90% communication and 10% fixing. Channeling to a single thread reverses that ratio.
Letting the SME also be the IC. The SME is busy. The IC needs to drive comms, decisions, and external coordination. One person can't do both well during a real incident.
Skipping read-only diagnosis. Twice in the past, we've made an incident worse by acting on a misdiagnosis. Read-only first is now muscle memory.
Declaring resolved at the moment of fix. Aftershocks happen. The 60-minute observation window prevents us from declaring victory prematurely.
Three numbers per incident, tracked over time:
MTTR gets the attention. The "time to communication" metric is the one we tightened most aggressively: service teams care about being told what's happening as much as about the fix itself.
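Two of these numbers can be computed directly from incident timestamps. The sketch below assumes each incident records when it was detected, when the first update was posted, and when it was resolved; the field names and the exact start and end points of each metric are assumptions, not our tracked definitions.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentTimes:
    detected: datetime      # when the incident was first suspected
    first_update: datetime  # when the first message reached service teams
    resolved: datetime      # when the symptom was fixed


def _minutes(delta):
    return delta.total_seconds() / 60


def summarize(incidents):
    """Return mean time to communication and mean time to recovery, in minutes."""
    return {
        "mean_time_to_communication_min": mean(
            [_minutes(i.first_update - i.detected) for i in incidents]
        ),
        "mean_time_to_recovery_min": mean(
            [_minutes(i.resolved - i.detected) for i in incidents]
        ),
    }
```

Tracking the two side by side makes it visible when recovery improves but communication lags, which is the gap the communication review is meant to catch.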
Most platform incidents are caused by us. Not by AWS, not by upstream providers, not by the universe. Almost all of them are deploys we did, configuration changes we made, or capacity planning we got wrong. The dependency map is helpful for diagnosing external issues; for internal ones, "what did we change in the last hour" is usually the answer.
Service teams are kinder than we expected. When platform-level incidents happen, the natural fear is that service teams will be hostile. They're not — they're stressed because their thing is broken, but a steady stream of updates and a clear "we're on it" calms them. The bad case is silence, not transparency.
Practice helps a lot. We do a quarterly platform-incident drill (a synthetic outage of one component, full incident response triggered, no production impact). The drill participants always learn something — usually about runbook gaps. The drills have probably done as much for our MTTR as any tooling change.
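The "what did we change in the last hour" question from the first lesson above is easy to automate if deploys and config changes land in one log. A minimal sketch, assuming each change is a dict with an `at` timestamp and a free-form `what` description; both field names are hypothetical.

```python
from datetime import datetime, timedelta


def recent_changes(changes, incident_start, lookback=timedelta(hours=1)):
    """Return changes made within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c["at"] <= incident_start]
    return sorted(hits, key=lambda c: c["at"], reverse=True)
```

Listing the newest change first puts the most likely culprit at the top of the incident thread.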
Pick the three platform components most likely to fail and write their incident playbooks first. Don't try to write playbooks for everything.
Train the IC role separately from the SME role. They use different parts of the brain. The best ICs we've had were not necessarily our most technical engineers — they were the ones who could keep their head clear while six things were going wrong.
Run drills. The first drill is awkward and reveals everything wrong. The second is much smoother. By the fourth, you've internalized the patterns.
The single highest-leverage improvement is the discipline of channeling everything into one thread and posting updates every 15 minutes. It feels mechanical; it transforms how the rest of the incident goes.
Platform incidents are uncomfortable because the blast radius is large and the visibility is intense. The reflex is to put your head down and fix. That reflex is wrong: a fix without communication generates more pain than communication without a fix. Once a platform team internalizes that, the incidents get less stressful even if they're still hard.