AI agents for incident triage sound great in demos. We've run them in production. Here are the patterns that earn their keep, the ones that backfire, and where humans still beat agents.
For most of 2025 every vendor pitch involved AI agents that "triage your incidents" or "auto-remediate." We've been running this in production for about a year now — agents that can read logs, query metrics, summarize incident state, and (for a narrow set of cases) take remediation actions. The results are messier than the demos.
This post covers where it works, where it doesn't, and the patterns we've kept.
The hype version: "AI agents handle incidents automatically." The reality, after a year of building:
The shape of agentic ops we run reflects this: agents are read-mostly assistants that compress a lot of investigation work into seconds. Action-taking is gated, narrow, and reversible.
Initial context-gathering when a page fires. Within 30 seconds of an alert, the agent has:
The on-call responder gets paged and simultaneously gets a context-rich starting point. This compresses the first 5 minutes of every incident into the agent's response time. Real productivity gain, with no risk to production because it's all reading.
Cross-referencing past incidents. "Have we seen this error message before?" — the agent searches incident postmortems and past Slack threads, surfaces the 3 most similar. Saves "I feel like we've debugged this before but can't find it" time.
Routine status updates. During a multi-hour incident, the agent watches the relevant dashboards and writes a status update every 15 minutes summarizing what's changed. Frees the incident commander to focus on remediation instead of comms.
Drafting incident postmortems. After resolution, the agent pulls the Slack thread, the alert history, the deploy timeline, and drafts the first version of the postmortem. Humans edit and add interpretation. Cuts postmortem write-up time by ~70%.
These are all low-risk, high-value uses. Reading + summarizing. The agent can't break anything by being wrong; it can just save time.
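The past-incident lookup can be illustrated with a simple similarity ranking. This is a minimal sketch under stated assumptions, not the retrieval method described in this post: it scores hypothetical postmortem summaries by Jaccard token overlap and returns the closest three. The incident IDs and summaries are made up for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-set overlap: 0.0 for disjoint sets, 1.0 for identical sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(error: str, postmortems: dict[str, str], k: int = 3) -> list[tuple[str, float]]:
    """Return up to k (incident_id, score) pairs, best first, dropping zero scores."""
    query = tokenize(error)
    scored = [(pid, jaccard(query, tokenize(text))) for pid, text in postmortems.items()]
    scored = [pair for pair in scored if pair[1] > 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # stable: ties keep input order
    return scored[:k]


# Hypothetical postmortem summaries, for illustration only.
POSTMORTEMS = {
    "INC-101": "database connection pool exhausted after deploy",
    "INC-102": "TLS certificate expired on edge proxy",
    "INC-103": "connection pool exhausted under retry storm",
    "INC-104": "disk full on log aggregator",
}

matches = most_similar("connection pool exhausted", POSTMORTEMS)
```

A production version would more likely use embeddings or a search index. The point is only that the agent returns a short ranked list that a human can scan, not a verdict.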
Suggesting remediation steps. When a known signature appears (e.g. "database connection pool exhausted"), the agent suggests known remediations from the runbook. Humans decide whether to execute. Quality is OK: agent suggestions match the runbook 80%+ of the time and sometimes add value by combining patterns. Not autonomous, just a smart cheat-sheet.
Targeted action with explicit confirmation. "Roll back the last deploy?" The agent proposes the action, shows the diff of what would change, and waits for a human "yes" in the incident channel. We use this for a small set of well-understood rollback patterns. It's useful when the on-call is typing on a phone at 3am, because the agent's prep saves keystrokes.
The key phrase: explicit confirmation. The agent doesn't act on its own analysis. A human approves each action, knowing exactly what it'll do.
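The propose, show diff, wait-for-yes flow can be sketched as follows. Everything here is an assumption for illustration: `ProposedAction`, `run_with_confirmation`, and the `ask` callable (which stands in for posting to the incident channel and returning the human's reply) are hypothetical names, not part of any real tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    description: str            # e.g. "Roll back checkout to v41"
    diff: str                   # human-readable summary of what would change
    execute: Callable[[], str]  # only called after an explicit approval


def run_with_confirmation(action: ProposedAction, ask: Callable[[str], str]) -> str:
    """Show the proposal and act only if the reply is exactly 'yes'."""
    prompt = f"{action.description}\n{action.diff}\nReply 'yes' to proceed."
    reply = ask(prompt)
    if reply.strip().lower() != "yes":
        # Anything other than an explicit yes (silence, "ok?", "maybe") is a no.
        return "declined: no action taken"
    return action.execute()
```

Treating every reply other than "yes" as a refusal keeps the default safe: ambiguity never results in an action.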
Autonomous remediation of unfamiliar failures. We tried this. The agent took a remediation action based on misreading a metric; the action made the incident worse; we had to revert the agent's revert. The category of incidents that's actually hard is the one where the situation is genuinely novel — the exact case where the agent has the lowest-quality reasoning.
Auto-paging based on agent analysis. "The agent thinks this metric is concerning; page the on-call." Generated paging noise without corresponding signal quality. Humans got paged for non-issues; the trust in the page eroded. Reverted.
Long-running multi-step plans. "Plan a rollback strategy and execute it" with no human checkpoints. The agent goes off-script in unexpected ways; intermediate state can be hard to recover. We cap agent action loops at one step + confirmation; anything more requires explicit re-planning.
Replacing on-call. This was a fantasy we briefly entertained. On-call isn't just running runbooks; it's noticing what's not in the runbook, recognizing patterns that haven't been documented, and making judgment calls under pressure. Agents don't do these well.
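The one-step cap on action loops mentioned above can be enforced mechanically. This is a minimal sketch under assumptions: `ActionSession` and `StepLimitExceeded` are hypothetical names, and "re-planning" is modeled simply as opening a new session.

```python
from typing import Callable, Optional


class StepLimitExceeded(Exception):
    """Raised when the agent tries to act again without re-planning."""


class ActionSession:
    """Allows at most max_steps confirmed actions; default is one."""

    def __init__(self, max_steps: int = 1) -> None:
        self.max_steps = max_steps
        self.steps_taken = 0

    def act(self, step: Callable[[], str], confirmed: bool) -> Optional[str]:
        if self.steps_taken >= self.max_steps:
            raise StepLimitExceeded("step budget spent; re-plan and open a new session")
        if not confirmed:
            return None  # unconfirmed proposals do not run and do not use the budget
        self.steps_taken += 1
        return step()
```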
After a year of trial, this is what we've kept:
Investigation agent. Reads logs, metrics, deploy history. Comments in the incident channel with context. Always read-only. Activated automatically on page.
Postmortem drafting agent. Generates the first draft of the postmortem after incident resolution. Human edits.
Confirmation-gated rollback action. For a narrow set of patterns ("most recent deploy + correlated error spike"), suggests rollback with diff; human confirms.
Status update bot. During incidents, posts hourly summaries based on dashboard state. Helpful for executive/stakeholder comms.
That's it. Four narrow tools. Each is low-risk; each saves real time. None is autonomous in the sense vendors mean.
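The "most recent deploy + correlated error spike" trigger for the rollback suggestion could look something like the check below. It is a sketch under assumptions: the 15-minute window, the 3x spike factor, and the baseline floor are illustrative values, not the thresholds used in production, and `should_suggest_rollback` is a hypothetical name.

```python
from datetime import datetime, timedelta


def should_suggest_rollback(
    deploy_at: datetime,
    samples: list[tuple[datetime, float]],   # (timestamp, error_rate) pairs
    window: timedelta = timedelta(minutes=15),
    spike_factor: float = 3.0,
    baseline_floor: float = 0.001,
) -> bool:
    """True if the mean error rate after the deploy is at least
    spike_factor times the mean in the same-length window before it."""
    before = [rate for ts, rate in samples if deploy_at - window <= ts < deploy_at]
    after = [rate for ts, rate in samples if deploy_at <= ts <= deploy_at + window]
    if not before or not after:
        return False  # not enough data to correlate, so stay quiet
    # The floor avoids treating any nonzero rate as a spike over a zero baseline.
    baseline = max(sum(before) / len(before), baseline_floor)
    current = sum(after) / len(after)
    return current >= spike_factor * baseline
```

The function only decides whether to raise the suggestion. The rollback itself still goes through human confirmation.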
Tool definitions. Each capability the agent has is a narrowly defined tool, such as get_recent_logs(service, minutes), get_metric_value(query), or propose_rollback(service, target_version). There is no general-purpose "execute SQL" or "run shell command."
Read vs write separation. Read tools (logs, metrics, queries) are exposed broadly. Write tools (rollback, scale, restart) require human confirmation and are scoped to specific, well-tested patterns.
Audit log. Every tool call the agent makes is logged. Postmortem-grade traceability.
Token budgets. Per-incident, the agent has a token budget; it can't loop indefinitely. Prevents runaway costs from a stuck investigation.
These are the same patterns from our broader AI agent tool design post — applied specifically to ops.
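The four patterns above fit together in a small tool runner. The sketch below is illustrative only: the tool bodies are stubs, `execute_rollback` is a hypothetical write tool, and `ToolRunner`, `charge`, and `make_runner` are made-up names. get_recent_logs keeps the signature mentioned above.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


class BudgetExceeded(Exception):
    pass


class ConfirmationRequired(Exception):
    pass


@dataclass
class Tool:
    name: str
    func: Callable[..., Any]
    writes: bool = False  # write tools need a named human approver


@dataclass
class ToolRunner:
    tools: dict[str, Tool]
    token_budget: int
    tokens_used: int = 0
    audit_log: list[str] = field(default_factory=list)

    def charge(self, tokens: int) -> None:
        """Record model token usage; refuse to continue past the budget."""
        if self.tokens_used + tokens > self.token_budget:
            raise BudgetExceeded(f"{self.tokens_used + tokens} > {self.token_budget}")
        self.tokens_used += tokens

    def call(self, name: str, confirmed_by: Optional[str] = None, **kwargs: Any) -> Any:
        tool = self.tools[name]  # KeyError for anything not registered
        if tool.writes and confirmed_by is None:
            self._audit(name, kwargs, "blocked: needs confirmation", None)
            raise ConfirmationRequired(name)
        result = tool.func(**kwargs)
        self._audit(name, kwargs, "ok", confirmed_by)
        return result

    def _audit(self, name: str, args: dict, outcome: str, confirmed_by: Optional[str]) -> None:
        entry = {"ts": time.time(), "tool": name, "args": args,
                 "outcome": outcome, "confirmed_by": confirmed_by}
        self.audit_log.append(json.dumps(entry))


# Stub tool bodies, for illustration only.
def get_recent_logs(service: str, minutes: int) -> list[str]:
    return [f"{service}: stub log line from the last {minutes} minutes"]


def execute_rollback(service: str, target_version: str) -> str:
    return f"{service} -> {target_version}"


def make_runner(token_budget: int) -> ToolRunner:
    tools = [Tool("get_recent_logs", get_recent_logs),
             Tool("execute_rollback", execute_rollback, writes=True)]
    return ToolRunner(tools={t.name: t for t in tools}, token_budget=token_budget)
```

Because the registry is the only path to a side effect, blocked attempts land in the audit log alongside successful calls, and a tool that was never registered cannot be called at all.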
Six months in, a subtle thing happens: the team gets used to the agent's summaries and starts trusting them implicitly. When the agent's summary is wrong (it happens), the on-call responder might miss it because they didn't independently verify.
We've seen one incident where the agent's misreading of a metric led the responder down the wrong path for 15 minutes. Not a disaster, but instructive.
Counter-discipline: the agent's summaries explicitly cite sources. "Logs show X (link to log stream)." "Metric is Y (link to dashboard)." The responder can click through. We discourage relying purely on the summary; the citations make verification trivial.
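One way to make citations non-optional is to refuse to render any claim that lacks a source. This is a minimal sketch under assumptions: the `Finding` structure and `render_summary` function are hypothetical, and the URLs in the usage are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    claim: str          # what the agent asserts
    source_label: str   # e.g. "log stream" or "dashboard"
    source_url: str     # where the responder can verify it


def render_summary(findings: list[Finding]) -> str:
    """Format findings as 'claim (label: url)' lines; reject uncited claims."""
    lines = []
    for f in findings:
        if not f.source_url.strip():
            raise ValueError(f"uncited claim: {f.claim!r}")
        lines.append(f"{f.claim} ({f.source_label}: {f.source_url})")
    return "\n".join(lines)


summary = render_summary([
    Finding("Logs show connection timeouts", "log stream", "https://example.com/logs/checkout"),
    Finding("p99 latency is 2.4s", "dashboard", "https://example.com/dash/latency"),
])
```

The validation only guarantees that a link exists, not that it supports the claim. Clicking through is still the responder's job.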
Agentic ops is one of those areas where the demo and the production reality differ. The patterns that hold up are the boring, gated, read-mostly ones. The flashy autonomous-remediation pitches are mostly aspirational. The good news is the boring patterns deliver real value; the team gets faster incident response without giving up oversight.