Version 2026-06 · CC-BY-4.0

AI Incident Response Playbook

14 production AI incident classes — detection signals, immediate triage, communication templates, root cause patterns, prevention updates.

Save this as a PDF: press Ctrl+P (Windows/Linux) or +P (Mac), then choose “Save as PDF”. The page is print-styled for clean output.
How to use this playbook. Companion to the Failure Modes Catalog: failure modes are caught before launch; incidents are what reaches production despite review. For each class: confirm the detection signals match the situation, run the immediate triage steps, fire the communication template (substitute the variables), document the root cause against the listed patterns, file the prevention update so the next instance is caught by a control instead of by a human at 2 AM. Compiled from 150+ engagements where production AI systems experienced incidents requiring documented response.
P0 Critical — page on-call P1 High — mitigate within hours P2 Medium — within business day P3 Low — investigation only
INC-01

LLM cost explosion

P1
Detection Signals
  • Spend metric exceeds daily budget by >2x
  • Token-per-query metric step-change
  • Vendor billing alert
Immediate Triage
  • Activate per-key rate limit at 25% of normal
  • Identify top-N consumers from logs
  • Roll back recent prompt/template changes if correlated
Communication Template
“Cost anomaly detected on LLM service at HH:MM. Throttling in effect; user impact is rate-limited responses. Root cause investigation in progress; ETA for normal capacity within Xh.”
Root Cause Patterns
  • Loose top-k in RAG retrieval
  • Removed reranker by mistake
  • Prompt change ballooning context
  • Abuse / scraping pattern
Prevention Update
Add token-per-query alarm at 1.5x baseline. Enforce per-key rate limit and budget cap. Add reranker required by policy.
INC-02

Hallucination at scale on a specific topic

P1
Detection Signals
  • Customer complaint cluster mentioning same wrong fact
  • Quality eval regression on a topic subset
  • Citation rate drop on retrieved chunks
Immediate Triage
  • Add hard-coded refusal for the topic pending fix
  • Inspect retrieval against ground truth on representative queries
  • Confirm RAG corpus actually contains the correct info
Communication Template
“We have identified incorrect responses on topic X. The assistant is now declining the topic while we fix retrieval. ETA for return to service: Xh.”
Root Cause Patterns
  • Retrieval not surfacing the right chunk
  • Embedding model misalignment on terminology
  • Corpus stale or missing the fact
  • System prompt allowing speculation
Prevention Update
Add golden-set evaluation including the topic. Enforce citation requirement in prompt. Schedule corpus freshness audit.
INC-03

Silent model drift (vendor update)

P2
Detection Signals
  • Eval score drop without code change
  • Output distribution shift
  • Customer feedback shift over a week
Immediate Triage
  • Pin to a known-good model version if vendor allows
  • A/B compare current vs prior version on golden set
  • Roll back affected prompts to the version-pinned model
Communication Template
“Internal: model drift suspected on vendor X. Performance regression of Y% observed. Pinning to prior version while we investigate.”
Root Cause Patterns
  • Vendor silently updated underlying model
  • Vendor deprecated feature relied on by prompt
  • Tokenizer changed under the same model name
Prevention Update
Version-pin every model in production. Subscribe to vendor changelog. Add eval to CI to catch drift.
INC-04

Prompt injection through retrieved content

P0
Detection Signals
  • Anomalous tool calls
  • Output containing data not from current user
  • Suspicious system-prompt leakage in responses
Immediate Triage
  • Disable tool calling for retrieved-content sessions
  • Quarantine the suspect documents
  • Force re-auth on affected sessions
Communication Template
“Internal P0: prompt injection vector detected through document upload. Tools disabled while we investigate scope. No customer notification yet pending impact assessment.”
Root Cause Patterns
  • Retrieved content treated as instruction
  • User-uploaded document not filtered
  • RAG corpus polluted with adversarial entries
Prevention Update
System prompt that forbids following retrieved instructions. Output validation. Content scanning before ingestion. Tool-call allowlist independent of LLM output.
INC-05

PII leak in LLM output

P0
Detection Signals
  • DLP alert
  • Customer report
  • Audit log entry showing PII in response
Immediate Triage
  • Disable affected endpoint
  • Engage legal and privacy officer
  • Identify scope: which users, what data, how many calls
Communication Template
“Internal P0: PII appearance in LLM output identified on endpoint X. Endpoint disabled. Legal engaged. Detailed scope assessment within 4h.”
Root Cause Patterns
  • PII in RAG corpus that should have been redacted
  • Insufficient output filtering
  • Cross-tenant retrieval leak
  • Model trained on PII (vendor side)
Prevention Update
PII scrubbing on ingestion AND on output. Tenant filter at retrieval layer. Output DLP scan. Periodic red-team.
INC-06

Vendor outage

P1
Detection Signals
  • Vendor status page yellow/red
  • Spike in 5xx from vendor API
  • Latency p95 exceeds threshold
Immediate Triage
  • Activate self-hosted fallback OR alternative vendor via gateway
  • Set feature flag to degraded mode for non-critical AI paths
  • Update status page
Communication Template
“Some AI features are running in degraded mode due to upstream vendor outage. Full functionality will return when the vendor recovers; we will update status hourly.”
Root Cause Patterns
  • Vendor incident
  • Regional vendor outage with no multi-region fallback
  • Rate-limit hit because our usage pattern changed
Prevention Update
Gateway-based multi-vendor with automatic failover. Self-hosted fallback for compliance-critical paths. Graceful degradation feature flags.
INC-07

Latency degradation

P2
Detection Signals
  • p95 over SLA for >15 min
  • Queue depth growing
  • Customer satisfaction metric drop
Immediate Triage
  • Add caching layer for hot prompts
  • Shed non-critical traffic
  • Check vendor status; switch region if applicable
Communication Template
“We are seeing higher response times in AI features. Mitigations in effect; normal latency expected within Xh.”
Root Cause Patterns
  • Vendor regional latency
  • Context window grew due to corpus growth
  • Reranker is the bottleneck
  • Cold cache after deploy
Prevention Update
Latency SLO alarms. Hot-prompt cache. Streaming responses where applicable. Reranker capacity planning.
INC-08

Quality regression after prompt change

P2
Detection Signals
  • Eval score drop after deploy
  • Customer feedback shift
  • Internal QA flag
Immediate Triage
  • Roll back the prompt to previous version
  • Compare A/B against rolled-back version
  • Annotate failing samples for analysis
Communication Template
“Internal: quality regression detected on feature X after prompt change at time Y. Reverted; investigating root cause.”
Root Cause Patterns
  • Prompt change tested only on happy path
  • Edge cases broke
  • Negative-space examples not in eval
Prevention Update
Mandatory golden-set eval before prompt deploy. Negative-space examples in eval. Canary deployment for prompts.
INC-09

Tool-use loop / runaway agent

P1
Detection Signals
  • Tool call count spike per session
  • Long-running session metric
  • Cost-per-session anomaly
Immediate Triage
  • Hard cap tool calls per session
  • Kill long sessions over a limit
  • Disable specific tool if it is the loop driver
Communication Template
“Internal: agent loop detected. Per-session call cap reduced. Investigation in progress.”
Root Cause Patterns
  • Tool returns ambiguous result leading to retry
  • Plan-and-execute prompt encourages repetition
  • No completion criterion in prompt
Prevention Update
Per-session call budget. Completion criterion in system prompt. Tool result validation before next step.
INC-10

Compliance audit finding on AI

P1
Detection Signals
  • External or internal audit finding
  • Regulator inquiry
  • Customer compliance review failure
Immediate Triage
  • Document scope and timeline of finding
  • Engage compliance and legal
  • Suspend specific affected feature if recommended
Communication Template
“Internal: audit finding requires response by date D. Engaging legal and compliance. Customer-facing communication pending legal review.”
Root Cause Patterns
  • Missing audit log
  • Insufficient evidence for compliance baseline control
  • Configuration drift since last audit
Prevention Update
Audit log retention review. Map every control in 12-control baseline to evidence. Schedule quarterly pre-audit.
INC-11

Embedding corpus poisoning

P0
Detection Signals
  • Anomalous response cluster from specific retrieval pattern
  • Unusual documents appearing high in retrieval
  • External report of malicious upload
Immediate Triage
  • Disable user-content ingestion temporarily
  • Re-embed the corpus from trusted source
  • Quarantine suspicious documents
Communication Template
“Internal P0: potential corpus poisoning. User-content ingestion disabled. Re-indexing in progress.”
Root Cause Patterns
  • Unauthenticated user-content ingestion
  • Insufficient content moderation
  • Adversarial uploader exploiting public path
Prevention Update
Authenticate every ingestion path. Content classifier at ingestion. Provenance tracking on every chunk.
INC-12

Token / credential leak in prompt

P0
Detection Signals
  • Credential appearing in vendor logs
  • Customer credential rotation alert
  • Secrets scanner alert
Immediate Triage
  • Rotate the credential immediately
  • Audit access logs for misuse during exposure window
  • Disable the integration that leaked
Communication Template
“Internal P0: credential exposure. Rotation complete; access logs under review for misuse window of Xh.”
Root Cause Patterns
  • Credential included in retrieved chunk
  • User pasted secret into chat which was logged
  • Debug logging exposing secret
Prevention Update
Secret pre-filter on every user input and retrieved chunk. Vendor log redaction policy. Never log raw prompts in production.
INC-13

Stale data served by RAG after source change

P2
Detection Signals
  • Customer report of outdated info
  • Reindex job failure log
  • Cache invalidation alarm
Immediate Triage
  • Force reindex of affected source
  • Disable cache for the affected topic
  • Confirm freshness via spot-check
Communication Template
“Customer-facing: we identified that some responses were based on older data. Index is refreshed; please retry.”
Root Cause Patterns
  • Failed reindex job that was not alerted
  • Cache TTL too long for source change cadence
  • Webhook from source dropped
Prevention Update
Reindex job health alarm. Source-change webhook retries with DLQ. Per-source freshness SLO.
INC-14

Regulatory boundary crossing (data residency)

P0
Detection Signals
  • Vendor processing log showing cross-border data flow
  • Compliance review finding
  • Customer escalation
Immediate Triage
  • Route affected traffic to in-region endpoint
  • Engage legal
  • Document scope of cross-border processing
Communication Template
“Internal P0: data-residency boundary may have been crossed. Routing corrected. Legal evaluating notification obligations.”
Root Cause Patterns
  • Vendor request routed to unexpected region
  • Self-hosted fallback in different jurisdiction
  • DPA not covering the routing
Prevention Update
Vendor SLA includes region pinning. Self-hosted fallback in-region. Map every data flow against DPA matrix.

Common Post-Incident Actions

ActionOwnerDeliverable
Postmortem within 5 business daysIncident CommanderDocument with timeline, impact, root cause, action items, owners, dates.
Update the AI Failure Modes Catalog if a new mode emergedAI Platform LeadPR to internal failure-modes register.
Update governance baseline if a control failedAI Risk LeadUpdated baseline + assessment of other systems against the same control.
Update vendor scorecard if vendor-relatedProcurementScore adjustment + next-review date pulled forward.