Platform teams own the systems that EVERY service depends on. This is our incident response playbook for when the foundation cracks.

Incident Response for Platform Teams: A Practical Guide

If you're on a platform team, your incidents have a different shape than service-team incidents. When a service team's incident hits, one team's product degrades. When a platform team's incident hits, every team's product degrades — sometimes simultaneously. The blast radius is the entire engineering org.

This guide is the playbook our platform team uses for that class of incident. Six months in, our mean-time-to-recovery on platform-level incidents has dropped from ~38 minutes to ~12. Below is what changed.

What "platform-level" means here

The systems we're responsible for: shared Kubernetes clusters, the artifact registry, the CI/CD platform itself, internal DNS, the secrets manager, the company-wide observability stack, IAM/SSO. When any of these degrade, multiple service teams can't ship, can't observe, or can't authenticate.

Platform incidents have three properties that make them harder:

Many simultaneous reporters. Slack lights up across teams within minutes.
Shared dependencies. Fixing one symptom often requires understanding what's breaking in three other systems.
High coordination cost. Every service team wants updates; the platform team is busy fixing the underlying issue.

The playbook below is tuned for these.

Step 1: Acknowledge and gate the noise

The instant a platform incident is suspected, the on-call engineer:

Acknowledges the alert in PagerDuty
Posts a single message in the platform-incidents Slack channel: INC-2026-04-26-001 acknowledged. Investigating <symptom>. Updates in this thread.
Pins that message
Updates statuspage.io with a placeholder ("investigating") within 5 minutes

The pinned message is the canonical source for the rest of the incident. Service teams report into the thread; the platform team responds in the thread. Anyone discussing the incident in another channel is asked to move to the thread.
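Because the acknowledgment message has a fixed shape, it can be generated rather than typed under pressure. The following is a minimal sketch in Python, assuming the ID format is INC-<date>-<three-digit sequence> as in the example message above. The function names and the sequence argument are hypothetical, chosen for illustration.

```python
from datetime import date


def incident_id(day: date, sequence: int) -> str:
    """Build an ID like INC-2026-04-26-001 (format inferred from the example)."""
    return f"INC-{day.isoformat()}-{sequence:03d}"


def ack_message(day: date, sequence: int, symptom: str) -> str:
    """Render the single acknowledgment message that gets pinned."""
    return (
        f"{incident_id(day, sequence)} acknowledged. "
        f"Investigating {symptom}. Updates in this thread."
    )


print(ack_message(date(2026, 4, 26), 1, "Vault 503s on auth"))
```

Generating the text keeps the first message identical in shape across incidents, which makes the pinned message easy to spot.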

This single change, channeling all incident communication into one thread, probably saved us the most time. Before, the platform team would have five separate Slack threads going, each with a different team asking the same questions, while the actual debugging was split between Zoom and those threads.

Step 2: Form the team explicitly

Within the first 5 minutes, three roles are claimed:

Incident commander (IC): drives the incident, decides actions, owns external communication. Often NOT the engineer doing the technical work.
Subject matter expert (SME): the person actually fixing the thing. Could rotate as the incident's root cause becomes clearer.
Communications lead: handles statuspage updates, Slack response, and cross-team comms. Frees the IC and SME to focus.

For small incidents this collapses to one or two people. For anything spanning more than 30 minutes or affecting more than 2 service teams, all three roles are claimed explicitly.

The claiming is important. Otherwise the IC role implicitly falls to whoever spoke first, and the engineer doing the technical work also gets stuck answering "any update?" questions every two minutes.
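The threshold for claiming all three roles, and the idea of an explicit claim, can be written down in a few lines. This is a sketch under stated assumptions: the role labels, the RoleBoard class, and the function name are hypothetical; only the 30-minute and 2-team thresholds come from the text above.

```python
from __future__ import annotations

from dataclasses import dataclass, field

ROLES = ("IC", "SME", "Comms")  # hypothetical short labels for the three roles


def needs_all_roles(duration_minutes: int, affected_teams: int) -> bool:
    """More than 30 minutes, or more than 2 service teams affected."""
    return duration_minutes > 30 or affected_teams > 2


@dataclass
class RoleBoard:
    """Tracks who has explicitly claimed which role."""
    claimed: dict[str, str] = field(default_factory=dict)

    def claim(self, role: str, person: str) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.claimed[role] = person

    def unclaimed(self) -> list[str]:
        return [role for role in ROLES if role not in self.claimed]


board = RoleBoard()
board.claim("IC", "alex")  # "alex" is a placeholder name
if needs_all_roles(duration_minutes=45, affected_teams=1):
    print("Still unclaimed:", board.unclaimed())
```

The point of the unclaimed list is that a gap is visible to everyone, instead of a role silently defaulting to whoever spoke first.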

Step 3: Triage with the dependency map

Platform incidents often have an "is the cause X or Y?" question early on. Our dependency map lives in infra/docs/incident-response.md and shows the upstream/downstream relationships of every platform component:

[ Vault ]
    ↑
    ├─ [ K8s clusters ]
    │       ↑
    │       ├─ [ Service workloads ]
    │       └─ [ Argo CD ]
    │
    └─ [ CI deploys (signed image verification) ]

[ Internal DNS ]
    ↑
    ├─ [ Vault discovery ]
    ├─ [ K8s API endpoints ]
    └─ [ Service-to-service calls ]

When CI starts failing and service workloads start failing, and it's not Argo CD's release window, Vault is a strong suspect. The map turns "what could be wrong" from a search through unknowns into a structured triage.
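The same reasoning can be done mechanically: take everything upstream of each failing component and intersect the sets. Below is a minimal sketch, assuming the arrows in the map point from a dependent component to the thing it depends on. The component names are copied from the map; the dictionary layout and function names are hypothetical.

```python
from __future__ import annotations

# Each component maps to the components it depends on directly.
DEPENDS_ON: dict[str, list[str]] = {
    "K8s clusters": ["Vault"],
    "Service workloads": ["K8s clusters"],
    "Argo CD": ["K8s clusters"],
    "CI deploys": ["Vault"],
    "Vault discovery": ["Internal DNS"],
    "K8s API endpoints": ["Internal DNS"],
    "Service-to-service calls": ["Internal DNS"],
}


def upstream(component: str) -> set[str]:
    """All direct and transitive dependencies of a component."""
    seen: set[str] = set()
    stack = list(DEPENDS_ON.get(component, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(DEPENDS_ON.get(node, []))
    return seen


def common_suspects(failing: list[str]) -> set[str]:
    """Components that sit upstream of every failing component."""
    if not failing:
        return set()
    suspects = upstream(failing[0])
    for component in failing[1:]:
        suspects &= upstream(component)
    return suspects


print(common_suspects(["CI deploys", "Service workloads"]))
```

With CI deploys and service workloads both failing, the only shared upstream component in this graph is Vault, which matches the manual reading of the map. An empty result suggests the failures are unrelated or the map is missing a link.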

Step 4: Communicate proactively, even when you don't know

The hardest discipline. Service teams go quiet when they don't get updates, then panic-message later. Steady updates — even "we don't know yet, will update in 10 minutes" — are better than silence.

We post updates every 15 minutes during an active incident. The template is:

[15:42 UTC] STATUS UPDATE — INC-2026-04-26-001

Symptom: Vault returning 503 on auth endpoint
Impact: New pod deploys are failing across all clusters; existing pods unaffected
Investigation: Examining Vault HCP control plane status
Next update: 15:57 UTC or sooner if state changes

Even when there's no progress, the update goes out. It tells service teams: "we know, we're on it, you don't need to chase us."
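Rendering the template from a few fields removes one more thing to think about mid-incident, and the "next update" time falls out of the 15-minute cadence. The sketch below assumes timestamps are timezone-aware UTC datetimes; the function and parameter names are hypothetical, and the header separator is a plain ASCII hyphen.

```python
from datetime import datetime, timedelta, timezone

UPDATE_INTERVAL = timedelta(minutes=15)  # cadence from the text above


def status_update(now: datetime, incident: str, symptom: str,
                  impact: str, investigation: str) -> str:
    """Render one status update; `now` is expected to be in UTC."""
    next_update = now + UPDATE_INTERVAL
    return "\n".join([
        f"[{now:%H:%M} UTC] STATUS UPDATE - {incident}",
        "",
        f"Symptom: {symptom}",
        f"Impact: {impact}",
        f"Investigation: {investigation}",
        f"Next update: {next_update:%H:%M} UTC or sooner if state changes",
    ])


print(status_update(
    datetime(2026, 4, 26, 15, 42, tzinfo=timezone.utc),
    "INC-2026-04-26-001",
    "Vault returning 503 on auth endpoint",
    "New pod deploys are failing across all clusters; existing pods unaffected",
    "Examining Vault HCP control plane status",
))
```

Committing to a concrete next-update time in every message is what lets service teams stop chasing.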

Step 5: The fixing itself

The actual technical work is, of course, technical. A few patterns we've noticed help:

One-engineer-at-the-keyboard. Multiple engineers running commands creates conflicts. The SME is at the keyboard; others suggest, but the SME runs the commands. This is faster than it sounds.
Read-only first. Before making any change, the SME runs read-only commands to confirm the diagnosis. If the diagnosis is wrong, a write action can make things worse.
Announce write actions. "I'm about to restart the Vault primary, expect 30s of full downtime" — gives the IC and comms a chance to alert service teams before disruption.
Reverse if uncertain. If a remediation hasn't helped in 5 minutes, reverse it before trying the next thing. Stacked uncertain fixes are how incidents get worse.
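The "reverse if uncertain" rule is easy to state as a check over a log of applied remediations. This is only an illustrative sketch: the Remediation record and its fields are hypothetical, and whether a change "helped" is a judgment the SME makes from metrics.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

REVERT_AFTER = timedelta(minutes=5)  # threshold from the rule above


@dataclass
class Remediation:
    description: str
    applied_at: datetime
    helped: bool = False  # set by the SME once metrics confirm improvement


def should_reverse(action: Remediation, now: datetime) -> bool:
    """True when a change has had 5 minutes and has not helped."""
    return not action.helped and now - action.applied_at >= REVERT_AFTER


restart = Remediation("restart Vault primary", datetime(2026, 4, 26, 15, 50))
if should_reverse(restart, datetime(2026, 4, 26, 15, 56)):
    print(f"Reverse before trying the next thing: {restart.description}")
```

Keeping the log explicit also means there is never more than one unproven change in place at a time.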

Step 6: Recovery and the all-clear

When the immediate symptom is resolved:

SME confirms via metrics that the issue is gone (e.g., "auth success rate back to baseline")
IC posts an "issue resolved" update in the thread
Comms updates statuspage to "resolved"
The thread stays open for 1 hour to catch any aftershocks

We don't declare the all-clear at the moment of fix. We wait for 60 minutes of clean metrics. About once a quarter, an aftershock during this window reveals that a fix was only partial.
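The observation window amounts to two conditions: enough time has passed since the fix, and every sample since then is clean. Below is a minimal sketch, assuming the metric is a success rate where higher is better and that samples arrive as (timestamp, value) pairs. The function name, sample shape, and the baseline value in the usage example are all hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta

OBSERVATION_WINDOW = timedelta(minutes=60)  # from the text above


def all_clear(fixed_at: datetime, now: datetime,
              samples: list[tuple[datetime, float]], baseline: float) -> bool:
    """True only after a full window in which every sample met the baseline."""
    if now - fixed_at < OBSERVATION_WINDOW:
        return False
    since_fix = [value for ts, value in samples if ts >= fixed_at]
    # No samples at all is treated as "not clean": missing data is not evidence.
    return bool(since_fix) and all(value >= baseline for value in since_fix)


fixed = datetime(2026, 4, 26, 16, 0)
samples = [
    (datetime(2026, 4, 26, 16, 10), 0.999),
    (datetime(2026, 4, 26, 16, 40), 0.998),
]
print(all_clear(fixed, datetime(2026, 4, 26, 17, 0), samples, baseline=0.995))
```

Any dip below baseline inside the window keeps the incident open, which is how a partial fix gets caught before the all-clear goes out.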

Step 7: Postmortem (within 5 business days)

The postmortem is the longer-form artifact (we have a separate post on the postmortem template). For platform incidents, two specific things go in:

Cross-team impact: which service teams were affected, for how long, what they saw.
Communication review: what worked in our cross-team comms, what didn't, what to change for next time.

The communication review is where we've improved most. Each postmortem catches something — "we forgot to update statuspage during the recovery phase" / "the dependency map was missing the link between X and Y" / "the on-call engineer's first instinct was to fix instead of communicate; we lost 12 minutes of comms." Each issue gets a small action item.

What we got wrong before

A few patterns we've consciously dropped:

Updating individual teams in DM. The first 30 minutes of a big incident used to be 90% communication and 10% fixing. Channeling to a single thread reverses that ratio.

Letting the SME also be the IC. The SME is busy. The IC needs to drive comms, decisions, and external coordination. One person can't do both well during a real incident.

Skipping read-only diagnosis. Twice in the past, we've made an incident worse by acting on a misdiagnosis. Read-only first is now muscle memory.

Declaring resolved at the moment of fix. Aftershocks happen. The 60-minute observation window prevents us from declaring victory prematurely.

What we measure

Three numbers per incident, tracked over time:

Time to acknowledge: alert fires → on-call posts in channel
Time to resolution: alert fires → "all clear" announcement
Time to communication: alert fires → first statuspage update

The MTTR metric (time to resolution) gets the attention. The "time to communication" metric is what we tightened most aggressively — service teams care about being told what's happening as much as the fix itself.
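All three numbers are differences against the same starting point, the moment the alert fired. Below is a sketch of the calculation, assuming each incident records four timestamps. The IncidentTimeline fields are hypothetical names, and the times in the usage example are illustrative values, not real incident data.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass
class IncidentTimeline:
    alert_fired: datetime
    channel_post: datetime             # on-call posts in channel
    first_statuspage_update: datetime
    all_clear: datetime                # "all clear" announcement


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def incident_metrics(t: IncidentTimeline) -> dict[str, float]:
    """The three per-incident numbers, in minutes from the alert."""
    return {
        "time_to_acknowledge": minutes_between(t.alert_fired, t.channel_post),
        "time_to_communication": minutes_between(t.alert_fired, t.first_statuspage_update),
        "time_to_resolution": minutes_between(t.alert_fired, t.all_clear),
    }


timeline = IncidentTimeline(
    alert_fired=datetime(2026, 4, 26, 15, 40),
    channel_post=datetime(2026, 4, 26, 15, 42),
    first_statuspage_update=datetime(2026, 4, 26, 15, 44),
    all_clear=datetime(2026, 4, 26, 16, 52),
)
print(incident_metrics(timeline))
```

Measuring everything from the alert, rather than from acknowledgment, keeps slow acknowledgment from hiding inside the other two numbers.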

What surprised us

Most platform incidents are caused by us. Not by AWS, not by upstream providers, not by the universe. Almost all of them are deploys we did, configuration changes we made, or capacity planning we got wrong. The dependency map is helpful for diagnosing external issues; for internal ones, "what did we change in the last hour" is usually the answer.
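"What did we change in the last hour" is a filter over a change log. The sketch below assumes deploys and configuration changes are available as timestamped records; the Change record, its fields, and the sample entries are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    at: datetime
    component: str
    description: str


def recent_changes(changes: list[Change], incident_start: datetime,
                   lookback: timedelta = timedelta(hours=1)) -> list[Change]:
    """Changes made in the lookback window before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [c for c in changes if window_start <= c.at <= incident_start]
    return sorted(hits, key=lambda c: c.at, reverse=True)


log = [
    Change(datetime(2026, 4, 26, 13, 30), "Internal DNS", "zone update"),
    Change(datetime(2026, 4, 26, 14, 10), "K8s clusters", "node pool resize"),
    Change(datetime(2026, 4, 26, 14, 50), "Vault", "config change"),
]
for change in recent_changes(log, datetime(2026, 4, 26, 15, 0)):
    print(change.at.time(), change.component, change.description)
```

Sorting newest first puts the most likely culprit at the top of the list.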

Service teams are kinder than we expected. When platform-level incidents happen, the natural fear is that service teams will be hostile. They're not — they're stressed because their thing is broken, but a steady stream of updates and a clear "we're on it" calms them. The bad case is silence, not transparency.

Practice helps a lot. We do a quarterly platform-incident drill (a synthetic outage of one component, full incident response triggered, no production impact). The drill participants always learn something — usually about runbook gaps. The drills have probably done as much for our MTTR as any tooling change.

What I'd tell a platform team starting on this

Pick the three platform components most likely to fail and write their incident playbooks first. Don't try to write playbooks for everything.

Train the IC role separately from the SME role. They use different parts of the brain. The best ICs we've had were not necessarily our most technical engineers — they were the ones who could keep their head clear while six things were going wrong.

Run drills. The first drill is awkward and reveals everything wrong. The second is much smoother. By the fourth, you've internalized the patterns.

The single highest-leverage improvement is the discipline of channeling everything into one thread and posting updates every 15 minutes. It feels mechanical; it transforms how the rest of the incident goes.

Closing thought

Platform incidents are uncomfortable because the blast radius is large and the visibility is intense. The reflex is to put your head down and fix. That reflex is wrong: a fix without communication generates more pain than communication without a fix. Once a platform team internalizes that, the incidents get less stressful even if they're still hard.

About Kiril Urbonas