FREE TEMPLATE

AI Agent Incident Response Runbook

When your agent fails in production, you need a plan — not panic. This runbook template gives your team a step-by-step playbook for classifying incidents, activating kill switches, communicating with stakeholders, and recovering gracefully.

Section 1: Severity Classification

The first step in any incident is classification. Getting the severity right determines who gets paged, how fast you respond, and what resources are allocated. Misclassify and you either over-react (wasting time) or under-react (causing damage).

P0 — Critical

  • Definition: Agent is causing active harm, data loss, unauthorized actions, or safety violations. Users are being harmed or company reputation is at immediate risk
  • Response time: Immediate — drop everything, all hands on deck
  • First action: Activate kill switch immediately. Stop the agent from processing any new requests before doing anything else
  • Examples: Agent leaking PII in responses, executing unauthorized transactions, providing dangerous medical/legal advice, prompt injection causing data exfiltration
  • Escalation: Engineering lead, VP of Engineering, Legal, and Communications are all notified within 15 minutes

P1 — High

  • Definition: Agent is producing incorrect, misleading, or low-quality outputs at a rate that impacts user trust or business outcomes. No active harm, but significant degradation
  • Response time: Within 1 hour of detection
  • First action: Switch to fallback behavior — static responses, human handoff, or reduced functionality mode while the root cause is investigated
  • Examples: Agent hallucinating facts in 30%+ of responses, wrong tool being called consistently, context window overflowing causing incoherent outputs, API rate limits causing cascading failures
  • Escalation: Engineering lead and product manager notified. Customer support briefed with talking points

P2 — Medium

  • Definition: Degraded performance that users notice but can work around. Agent is functional but slower, less accurate, or missing some capabilities
  • Response time: Within 4 hours of detection
  • First action: Log the issue, create a ticket, and begin investigation. Monitor for escalation to P1
  • Examples: Response latency doubled, one tool integration intermittently failing, agent occasionally losing conversation context, cost per conversation 50% above baseline
  • Escalation: Assigned engineer investigates. Team lead updated at next standup

P3 — Low

  • Definition: Cosmetic issues, minor inconsistencies, or non-critical bugs that don't meaningfully impact user experience or business outcomes
  • Response time: Next business day
  • First action: Create a ticket and add to the backlog. Fix in the next regular release cycle
  • Examples: Formatting inconsistencies in agent responses, minor grammar issues in generated text, non-critical log warnings, unused tool parameters
  • Escalation: No escalation needed. Standard ticket workflow
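The severity matrix above is easiest to apply consistently when it's encoded in your tooling rather than memorized. A minimal sketch — all names and values here are hypothetical placeholders mirroring the P0–P3 definitions, not a prescribed schema:

```python
# Hypothetical severity matrix mirroring the runbook's P0-P3 definitions.
# response_time_minutes of 0 means "immediate"; None means "next business day".
SEVERITY = {
    "P0": {"response_time_minutes": 0, "first_action": "activate_kill_switch",
           "notify": ["eng_lead", "vp_eng", "legal", "comms"]},
    "P1": {"response_time_minutes": 60, "first_action": "switch_to_fallback",
           "notify": ["eng_lead", "product_manager", "support"]},
    "P2": {"response_time_minutes": 240, "first_action": "ticket_and_investigate",
           "notify": ["assigned_engineer"]},
    "P3": {"response_time_minutes": None, "first_action": "backlog_ticket",
           "notify": []},
}

def escalation_plan(severity: str) -> dict:
    """Return who to page and how fast, given a severity label."""
    return SEVERITY[severity]
```

Wiring a table like this into your pager and dashboard config means a misclassification is a one-line fix instead of a tribal-knowledge problem.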

Section 2: Immediate Response Steps

When an incident is detected and classified, follow these five steps in order. Do not skip steps — each one builds on the previous. The goal is to stop the bleeding first, then understand what happened.

Step 1: Acknowledge & Classify

Confirm the incident is real (not a false alarm), assign a severity level using the classification above, and designate an incident commander who owns the response from this point forward. Document the time of detection and initial symptoms.

Step 2: Activate Circuit Breaker

For P0 and P1 incidents, immediately activate the circuit breaker or kill switch. This stops the agent from processing new requests while preserving in-flight conversations for analysis. Every production agent must have a one-command kill switch tested monthly.
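One way to implement the "one-command kill switch" described above is a process-wide flag that rejects new requests while letting in-flight work drain. A sketch under those assumptions — your real switch would likely live behind a feature-flag service or load balancer rather than in-process:

```python
import threading

class KillSwitch:
    """Rejects new requests once tripped; in-flight work is left to finish."""

    def __init__(self) -> None:
        self._tripped = threading.Event()

    def trip(self) -> None:
        # The "one command": flip the flag atomically for all worker threads.
        self._tripped.set()

    def reset(self) -> None:
        self._tripped.clear()

    def allow_new_request(self) -> bool:
        return not self._tripped.is_set()
```

Whatever the implementation, the monthly test matters as much as the switch itself: a kill switch that has never been pulled is a hypothesis, not a control.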

Step 3: Switch to Fallback

Activate fallback behavior: static responses for common queries, human handoff for complex requests, or a simplified version of the agent with reduced capabilities. Users should never see a blank page — they should see a degraded but functional experience.
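The fallback behavior described above can be as simple as a keyword-matched table of canned answers with human handoff as the default. A minimal sketch — the queries and responses here are invented examples:

```python
# Hypothetical canned responses for common queries during an incident.
CANNED = {
    "hours": "We're open 9am-5pm, Monday to Friday.",
    "pricing": "See our pricing page, or reply and a human will follow up.",
}

def fallback_response(query: str) -> str:
    """Static answer if a keyword matches; otherwise hand off to a human."""
    lowered = query.lower()
    for keyword, answer in CANNED.items():
        if keyword in lowered:
            return answer
    return ("Our assistant is running in a reduced mode right now. "
            "A human agent will reply to your message shortly.")
```

Crude as it is, a table like this beats a blank page: users get an answer or a clear handoff, never silence.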

Step 4: Notify Stakeholders

Using the communication templates below, notify the appropriate stakeholders based on severity. For P0: everyone immediately. For P1: engineering and product within 1 hour. For P2: team lead at next standup. Include what happened, current impact, and estimated resolution time.

Step 5: Begin Root Cause Analysis

With the agent stabilized and stakeholders informed, begin investigating the root cause. Pull logs, review traces, check recent deployments, and examine the specific inputs that triggered the failure. Document findings in real-time in the incident channel.

Section 3: Communication Templates

Clear communication during an incident prevents confusion, reduces panic, and keeps everyone aligned. Use these templates as starting points — customize for your organization's tone and structure.

Internal Notification Template

  • Subject: [P0/P1/P2/P3] Agent Incident — [Agent Name] — [Brief Description]
  • Status: Active / Investigating / Mitigated / Resolved
  • Impact: [Number of users affected, business impact, data exposure risk]
  • Current action: [What is being done right now — kill switch activated, fallback deployed, etc.]
  • Estimated resolution: [Best estimate — if unknown, be honest and say "still investigating" rather than guessing]
  • Incident commander: [Name and contact information]
  • Next update in: [30 minutes for P0, 1 hour for P1, 4 hours for P2]
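A template like this is most useful when it's impossible to send a malformed version of it. One approach is a small formatter that fills every field and derives the update cadence from severity — a sketch with hypothetical field names, not a required format:

```python
def format_internal_notice(severity: str, agent: str, summary: str,
                           status: str, impact: str, action: str,
                           eta: str, commander: str) -> str:
    """Render the internal notification template with the severity-based
    update cadence (30 min for P0, 1 hour for P1, 4 hours for P2)."""
    cadence = {"P0": "30 minutes", "P1": "1 hour", "P2": "4 hours"}
    next_update = cadence.get(severity, "next business day")
    return (
        f"[{severity}] Agent Incident — {agent} — {summary}\n"
        f"Status: {status}\n"
        f"Impact: {impact}\n"
        f"Current action: {action}\n"
        f"Estimated resolution: {eta}\n"
        f"Incident commander: {commander}\n"
        f"Next update in: {next_update}"
    )
```

Posting through a formatter like this (for example, via a Slack webhook) keeps every update structurally identical, which matters when people are skimming a noisy incident channel.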

Customer Communication Template

  • Subject: Service Update — [Agent/Feature Name]
  • Opening: We're aware of an issue affecting [specific feature] and are actively working to resolve it
  • Impact: You may experience [specific symptoms — slower responses, reduced accuracy, temporary unavailability]
  • Workaround: In the meantime, you can [alternative action — contact support directly, use manual process, etc.]
  • Timeline: We expect to have this resolved within [honest estimate]. We'll update you when the service is fully restored
  • Closing: We apologize for the inconvenience and appreciate your patience

Post-Mortem Template Structure

  • Incident summary: One-paragraph description of what happened, when, and who was affected
  • Timeline: Chronological list of events from detection to resolution with timestamps
  • Root cause: Technical explanation of why the incident occurred — not "human error" but the systemic issue
  • Impact assessment: Users affected, duration, financial impact, reputation impact, data exposure (if any)
  • What went well: Parts of the response that worked — detection speed, communication, fallback effectiveness
  • What could be improved: Gaps in monitoring, response time, communication, or documentation
  • Action items: Specific, assigned, time-bound improvements to prevent recurrence

Section 4: Recovery Procedures

Once the root cause is identified and fixed, recovery must be gradual and validated. Rushing back to full operation is how you turn one incident into two.

Rollback to Last Known Good

Deploy the last known good version of the agent — including system prompt, model version, tool configurations, and infrastructure settings. This should be a single command that's tested monthly. If your rollback process takes more than 5 minutes, it's not production-ready.

Gradual Traffic Restoration

Don't flip the switch back to 100% traffic immediately. Start with 10% of traffic routed to the fixed agent, monitor for 30 minutes, then increase to 25%, 50%, and finally 100%. Each stage must pass your validation checks before proceeding to the next.
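The 10% → 25% → 50% → 100% ramp above can be captured in a few lines: route each request probabilistically at the current stage, and advance only when that stage's checks pass. A sketch, with the stage fractions taken from the text:

```python
import random

# Ramp stages from the runbook: 10%, 25%, 50%, then full traffic.
RAMP_STAGES = [0.10, 0.25, 0.50, 1.00]

def route_to_fixed_agent(fraction: float) -> bool:
    """Randomly route this request to the fixed agent at the given fraction."""
    return random.random() < fraction

def next_stage(current: float, checks_passed: bool) -> float:
    """Advance the ramp one stage, but only if validation checks passed."""
    if not checks_passed:
        return current  # hold (or consider rolling back) on failed checks
    idx = RAMP_STAGES.index(current)
    return RAMP_STAGES[min(idx + 1, len(RAMP_STAGES) - 1)]
```

In practice the routing decision usually lives at the load balancer or feature-flag layer rather than in application code, but the invariant is the same: no stage advances without passing its checks.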

Validation Before Full Restoration

Before declaring the incident resolved: run your automated test suite, verify error rates are below baseline, confirm latency is normal, check that all tool integrations are responding, and review a sample of live outputs for quality. Only then can you close the incident.
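The pre-closure checklist above is a natural fit for a single gate function that refuses to close the incident unless every check passes. A sketch — the metric names and thresholds are illustrative, not prescribed:

```python
# Hypothetical validation gate: ALL checks must pass before closing the incident.
def validation_gate(metrics: dict) -> bool:
    checks = [
        metrics["test_suite_passed"],                                   # automated tests
        metrics["error_rate"] <= metrics["error_rate_baseline"],        # errors at/below baseline
        metrics["p95_latency_ms"] <= metrics["latency_threshold_ms"],   # latency normal
        all(metrics["tool_integrations_ok"].values()),                  # every tool responding
        metrics["sampled_output_quality"] >= metrics["quality_threshold"],  # live-output review
    ]
    return all(checks)
```

The point of a single boolean gate is that no one can close the incident by eyeballing three of the five checks: either everything is green, or the incident stays open.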

Download the Runbook Template

Get a customizable version of this runbook in Markdown, Notion, and PDF formats. Pre-filled with the templates above and ready to adapt for your team's specific agent architecture and communication channels.

  • Markdown format for GitHub/GitLab wikis
  • Notion template with database integration
  • Print-ready PDF for physical incident binders
  • Slack message templates for each severity level

Choose the Right Framework Before You Build

The runbook prepares you for failures. The Framework Comparison helps you pick the right foundation so you have fewer failures to begin with.