AI agents for incident triage sound great in demos. We've tried them in production. Here are the patterns that earn their keep, the ones that backfire, and where humans still beat agents.

Agentic Ops — When (and When Not) to Use AI Agents for Incident Response

For most of 2025 every vendor pitch involved AI agents that "triage your incidents" or "auto-remediate." We've been running this in production for about a year now — agents that can read logs, query metrics, summarize incident state, and (for a narrow set of cases) take remediation actions. The results are messier than the demos.

This post covers where it works, where it doesn't, and the patterns we've kept.

The mental model #

The hype version: "AI agents handle incidents automatically." The reality, after a year of building:

Agents are great at summarization and context-gathering.
They're acceptable at suggesting next steps.
They're risky at taking actions — especially mutating ones.
They're consistently bad at reasoning about novel failure modes.

The shape of agentic ops we run reflects this: agents are read-mostly assistants that compress a lot of investigation work into seconds. Action-taking is gated, narrow, and reversible.

Where agents work well #

Initial context-gathering when a page fires. Within 30 seconds of an alert, the agent has:

Pulled the last 100 log lines from the affected service
Queried the metrics dashboard for the relevant time window
Listed recent deploys (likely correlation candidates)
Pulled the related runbook
Drafted a summary in the incident channel

The on-call responder gets paged and, at the same time, gets a context-rich starting point. This compresses the first 5 minutes of every incident into the agent's response time. Real productivity gain, and no risk to production systems because it's all reading.
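The context-gathering steps above could be sketched roughly like this. It is a minimal illustration, not our production code: the four `fetch_*` functions are hypothetical stubs standing in for whatever logging, metrics, deploy, and runbook systems you have, and the 30-minute metrics window and example URL are assumptions.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class IncidentContext:
    logs: list[str]
    metrics: dict[str, float]
    recent_deploys: list[str]
    runbook_url: str


# Hypothetical read-only fetchers. Real ones would call your own backends.
async def fetch_logs(service: str, limit: int = 100) -> list[str]:
    return [f"{service}: log line {i}" for i in range(limit)]


async def fetch_metrics(service: str, window_minutes: int) -> dict[str, float]:
    return {"error_rate": 0.12, "p99_latency_ms": 840.0}


async def fetch_recent_deploys(service: str) -> list[str]:
    return [f"{service}@v42", f"{service}@v41"]


async def fetch_runbook(service: str) -> str:
    return f"https://runbooks.example.com/{service}"


async def gather_context(service: str, timeout_s: float = 30.0) -> IncidentContext:
    # Run all reads concurrently so total latency is roughly the slowest call,
    # and bound the whole step so a hung backend cannot stall the summary.
    logs, metrics, deploys, runbook = await asyncio.wait_for(
        asyncio.gather(
            fetch_logs(service),
            fetch_metrics(service, window_minutes=30),
            fetch_recent_deploys(service),
            fetch_runbook(service),
        ),
        timeout=timeout_s,
    )
    return IncidentContext(logs, metrics, deploys, runbook)


def draft_summary(service: str, ctx: IncidentContext) -> str:
    # Plain-text summary suitable for posting to an incident channel.
    return (
        f"Context for {service}: {len(ctx.logs)} log lines pulled, "
        f"error_rate={ctx.metrics['error_rate']}, "
        f"latest deploy {ctx.recent_deploys[0]}, runbook {ctx.runbook_url}"
    )
```

Running the reads concurrently under one timeout is what keeps the whole step inside a fixed response window, even when one backend is slow.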

Cross-referencing past incidents. "Have we seen this error message before?" The agent searches incident postmortems and past Slack threads and surfaces the three most similar. Saves the "I feel like we've debugged this before but can't find it" time.
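To make "most similar" concrete, here is a small sketch that ranks past incidents by token overlap (Jaccard similarity) with the error message. This is a stand-in chosen for brevity: the post doesn't say which retrieval method is used, and a real setup might use full-text or embedding search instead. `PastIncident` and the sample corpus are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class PastIncident:
    incident_id: str
    text: str  # postmortem body or thread excerpt


def tokens(text: str) -> set[str]:
    # Lowercased word tokens; crude, but enough for a sketch.
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(
    error_message: str, corpus: list[PastIncident], k: int = 3
) -> list[tuple[float, PastIncident]]:
    query = tokens(error_message)
    scored = [(jaccard(query, tokens(p.text)), p) for p in corpus]
    # Drop incidents with no overlap at all, then keep the top k by score.
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]


corpus = [
    PastIncident("INC-1", "connection pool exhausted on orders db"),
    PastIncident("INC-2", "TLS certificate expired on gateway"),
    PastIncident("INC-3", "database connection pool exhausted after deploy"),
    PastIncident("INC-4", "disk full on kafka broker"),
]
matches = most_similar("database connection pool exhausted", corpus)
```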

Routine status updates. During a multi-hour incident, the agent watches the relevant dashboards and writes a status update every 15 minutes summarizing what's changed. Frees the incident commander to focus on remediation instead of comms.

Drafting incident postmortems. After resolution, the agent pulls the Slack thread, the alert history, the deploy timeline, and drafts the first version of the postmortem. Humans edit and add interpretation. Cuts postmortem write-up time by ~70%.

These are all low-risk, high-value uses. Reading + summarizing. The agent can't break anything by being wrong; it can just save time.

Where agents are acceptable with guardrails #

Suggesting remediation steps. When a known signature appears (e.g. "database connection pool exhausted"), the agent suggests known remediations from the runbook. Humans decide whether to execute. Quality is OK: agent suggestions match the runbook 80%+ of the time and sometimes add value by combining patterns. Not autonomous, just a smart cheat-sheet.

Targeted action with explicit confirmation. "Roll back the last deploy?" The agent proposes the action, shows the diff of what would change, and waits for a human "yes" in the incident channel. We use this for a small set of well-understood rollback patterns. Useful when the on-call is typing on a phone at 3am: the agent's prep saves keystrokes.

The key word: explicit confirmation. The agent doesn't act on its own analysis. A human approves each action, knowing exactly what it'll do.
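One way to picture the propose, show diff, wait for "yes" flow is the sketch below. The `Proposal` type, the accepted replies, and the `author_is_human` flag are all assumptions for illustration; a real integration would take the author identity from the chat platform.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Proposal:
    description: str
    diff: str                     # shown to the human before they answer
    execute: Callable[[], str]    # the single action that approval unlocks
    executed: bool = False


AFFIRMATIVE = {"yes", "y"}


def handle_reply(proposal: Proposal, author_is_human: bool, reply: str) -> str:
    # A proposal can run at most once, and only on an explicit human "yes".
    if proposal.executed:
        return "already executed"
    if not author_is_human:
        return "ignored: only a human can approve"
    if reply.strip().lower() not in AFFIRMATIVE:
        return "not approved"
    proposal.executed = True
    return proposal.execute()
```

Anything other than an unambiguous "yes" leaves the proposal untouched, so a vague or mistyped reply never triggers the action.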

Where agents backfire #

Autonomous remediation of unfamiliar failures. We tried this. The agent took a remediation action based on misreading a metric; the action made the incident worse; we had to revert the agent's revert. The category of incidents that's actually hard is the one where the situation is genuinely novel — the exact case where the agent has the lowest-quality reasoning.

Auto-paging based on agent analysis. "The agent thinks this metric is concerning; page the on-call." This generated paging noise without matching signal quality. Humans got paged for non-issues, and trust in pages eroded. Reverted.

Long-running multi-step plans. "Plan a rollback strategy and execute it" with no human checkpoints. The agent goes off-script in unexpected ways; intermediate state can be hard to recover. We cap agent action loops at one step + confirmation; anything more requires explicit re-planning.

Replacing on-call. This was a fantasy we briefly entertained. On-call isn't just running runbooks; it's noticing what's not in the runbook, recognizing patterns that haven't been documented, and making judgment calls under pressure. Agents don't do these well.

What we run today #

After a year of trial:

Investigation agent. Reads logs, metrics, deploy history. Comments in the incident channel with context. Always read-only. Activated automatically on page.

Postmortem drafting agent. Generates the first draft of the postmortem after incident resolution. Human edits.

Confirmation-gated rollback action. For a narrow set of patterns ("most recent deploy + correlated error spike"), suggests rollback with diff; human confirms.
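The "most recent deploy + correlated error spike" pattern can be approximated with a simple time-window check, sketched here. The 3x-baseline spike threshold and the 15-minute correlation window are illustrative assumptions, not values from our setup, and the function only decides whether to suggest a rollback, never whether to perform one.

```python
from datetime import datetime, timedelta
from typing import Optional

Sample = tuple[datetime, float]  # (timestamp, error rate)


def spike_start(
    samples: list[Sample], baseline: float, factor: float = 3.0
) -> Optional[datetime]:
    # First sample whose error rate exceeds factor * baseline, if any.
    for ts, error_rate in samples:
        if error_rate > baseline * factor:
            return ts
    return None


def is_rollback_candidate(
    deploy_time: datetime,
    samples: list[Sample],
    baseline: float,
    window: timedelta = timedelta(minutes=15),
) -> bool:
    # Suggest a rollback only if the spike began shortly AFTER the deploy.
    # A spike that predates the deploy is not attributed to it.
    start = spike_start(samples, baseline)
    if start is None:
        return False
    return deploy_time <= start <= deploy_time + window
```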

Status update bot. During incidents, posts hourly summaries based on dashboard state. Helpful for executive/stakeholder comms.

That's it. Four narrow tools. Each is low-risk; each saves real time. None is autonomous in the sense vendors mean.

The infrastructure #

Tool definitions. Each capability the agent has is a narrowly-defined tool. get_recent_logs(service, minutes). get_metric_value(query). propose_rollback(service, target_version). No general-purpose "execute SQL" or "run shell command."
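A narrow tool surface might look something like the following registry. The three tool names come from the paragraph above; the decorator, the stubbed bodies, the 60-minute cap, and `call_tool` are assumptions made for the sketch.

```python
import inspect
from typing import Any, Callable

TOOLS: dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    # Register a function as an agent-callable tool under its own name.
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_recent_logs(service: str, minutes: int) -> list[str]:
    if not 1 <= minutes <= 60:  # assumed cap to keep queries bounded
        raise ValueError("minutes must be between 1 and 60")
    return [f"{service}: stubbed log line"]


@tool
def get_metric_value(query: str) -> float:
    return 0.0  # stub; a real tool would query the metrics backend


@tool
def propose_rollback(service: str, target_version: str) -> dict[str, str]:
    # Only produces a proposal; it does not perform the rollback.
    return {"service": service, "target_version": target_version, "status": "proposed"}


def call_tool(name: str, **kwargs: Any) -> Any:
    # Anything not explicitly registered is rejected, so there is no path
    # to a general-purpose "run shell command" capability.
    if name not in TOOLS:
        raise LookupError(f"unknown tool: {name}")
    fn = TOOLS[name]
    inspect.signature(fn).bind(**kwargs)  # TypeError on missing or extra args
    return fn(**kwargs)
```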

Read vs write separation. Read tools (logs, metrics, queries) are exposed broadly. Write tools (rollback, scale, restart) require human confirmation and are scoped to specific, well-tested patterns.
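As a rough sketch of the read/write split, each tool can carry an access level that the dispatcher checks before running it. `Access`, `ToolSpec`, `ConfirmationRequired`, and the `approved_by` parameter are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Optional


class Access(Enum):
    READ = "read"
    WRITE = "write"


@dataclass(frozen=True)
class ToolSpec:
    fn: Callable[..., Any]
    access: Access


class ConfirmationRequired(Exception):
    pass


def run_tool(spec: ToolSpec, approved_by: Optional[str] = None, **kwargs: Any) -> Any:
    # Read tools run freely; write tools need a named human approver.
    if spec.access is Access.WRITE and not approved_by:
        raise ConfirmationRequired(f"{spec.fn.__name__} needs human approval")
    return spec.fn(**kwargs)


def get_recent_logs(service: str) -> list[str]:
    return [f"{service}: stubbed log line"]


def restart_service(service: str) -> str:
    return f"restarted {service}"


LOGS = ToolSpec(get_recent_logs, Access.READ)
RESTART = ToolSpec(restart_service, Access.WRITE)
```

Putting the check in the dispatcher rather than in each tool means a newly added write tool is gated by default.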

Audit log. Every tool call the agent makes is logged. Postmortem-grade traceability.
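A minimal way to get one record per tool call is a decorator, sketched below. The in-memory `AUDIT_LOG` list is a stand-in for an append-only store, and the record fields are assumptions.

```python
import functools
import json
import time
from typing import Any, Callable

AUDIT_LOG: list[str] = []  # stand-in for a durable, append-only sink


def audited(fn: Callable[..., Any]) -> Callable[..., Any]:
    # Wraps a keyword-argument tool so every call, successful or not,
    # leaves exactly one JSON record behind.
    @functools.wraps(fn)
    def wrapper(**kwargs: Any) -> Any:
        entry = {"tool": fn.__name__, "args": kwargs, "ts": time.time()}
        try:
            result = fn(**kwargs)
            entry["outcome"] = "ok"
            return result
        except Exception as exc:
            entry["outcome"] = f"error: {type(exc).__name__}"
            raise
        finally:
            AUDIT_LOG.append(json.dumps(entry, default=str))

    return wrapper


@audited
def get_metric_value(query: str) -> float:
    if not query:
        raise ValueError("empty query")
    return 1.5  # stubbed value
```

Logging in `finally` means failed calls are recorded too, which is usually what a postmortem needs most.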

Token budgets. Per-incident, the agent has a token budget; it can't loop indefinitely. Prevents runaway costs from a stuck investigation.
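A per-incident budget can be as simple as a counter that refuses further spend. The sketch below assumes the caller knows, or can estimate, each step's token cost up front; `TokenBudget`, `BudgetExceeded`, and `investigate` are hypothetical names.

```python
class BudgetExceeded(Exception):
    pass


class TokenBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    @property
    def remaining(self) -> int:
        return self.limit - self.used

    def charge(self, tokens: int) -> None:
        # Refuse the charge up front rather than overspending and noticing later.
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.limit:
            raise BudgetExceeded(f"need {tokens}, only {self.remaining} left")
        self.used += tokens


def investigate(budget: TokenBudget, step_costs: list[int]) -> int:
    # Run investigation steps until done or until the budget stops the loop.
    completed = 0
    for cost in step_costs:
        try:
            budget.charge(cost)
        except BudgetExceeded:
            break
        completed += 1
    return completed
```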

These are the same patterns from our broader AI agent tool design post — applied specifically to ops.

The unspoken concern: trust drift #

Six months in, a subtle thing happens: the team gets used to the agent's summaries and starts trusting them implicitly. When the agent's summary is wrong (it happens), the on-call responder might miss it because they didn't independently verify.

We've seen one incident where the agent's misreading of a metric led the responder down the wrong path for 15 minutes. Not a disaster, but instructive.

Counter-discipline: the agent's summaries explicitly cite sources. "Logs show X (link to log stream)." "Metric is Y (link to dashboard)." The responder can click through. We discourage relying purely on the summary; the citations make verification trivial.
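One way to enforce the citation habit is to make a summary impossible to render unless every claim carries a link. A small sketch, with a hypothetical `Claim` type and example URLs:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    source_url: str


def render_summary(claims: list[Claim]) -> str:
    # Reject the whole summary if any claim lacks a clickable source.
    lines = []
    for claim in claims:
        if not claim.source_url.startswith(("http://", "https://")):
            raise ValueError(f"claim has no source link: {claim.text!r}")
        lines.append(f"{claim.text} ({claim.source_url})")
    return "\n".join(lines)


summary = render_summary([
    Claim("Logs show connection timeouts", "https://logs.example.com/stream/checkout"),
    Claim("Error rate is 12%", "https://dash.example.com/d/checkout"),
])
```

Failing on an uncited claim is deliberately strict: a missing link is treated as a defect in the summary rather than something the responder has to notice.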

What I'd tell a team starting #

Start with read-only context gathering. That's where the high-ROI low-risk wins are.
Be explicit about write boundaries. Never autonomous mutation on unfamiliar incidents.
Cite sources in summaries. Without citations, agents' confidence becomes the operator's blind spot.
Don't replace on-call. Augment.
Postmortem the agent's behavior too. After each incident, ask whether the agent helped or hindered, then adjust.

What to read next #

AI agent tool design — boundaries and confirmations — the general pattern for tool design with agents
Incident postmortems that actually prevent repeat failures — the discipline the agent helps with
Practical guide: incident response for platform teams — the broader response model
Multi-agent AI systems — building collaborative AI applications — adjacent agentic patterns

Agentic ops is one of those areas where the demo and the production reality differ. The patterns that hold up are the boring, gated, read-mostly ones. The flashy autonomous-remediation pitches are mostly aspirational. The good news is the boring patterns deliver real value; the team gets faster incident response without giving up oversight.

About Kiril Urbonas
