Version 2026-06 · CC-BY-4.0

AI Incident Response Playbook

14 production AI incident classes — detection signals, immediate triage, communication templates, root cause patterns, prevention updates.

Save this as a PDF: press Ctrl+P (Windows/Linux) or ⌘+P (Mac), then choose “Save as PDF”. The page is print-styled for clean output.

How to use this playbook. Companion to the Failure Modes Catalog: failure modes are caught before launch; incidents are what reaches production despite review. For each class: confirm the detection signals match the situation, run the immediate triage steps, fire the communication template (substitute the variables), document the root cause against the listed patterns, file the prevention update so the next instance is caught by a control instead of by a human at 2 AM. Compiled from 150+ engagements where production AI systems experienced incidents requiring documented response.

P0 Critical — page on-call P1 High — mitigate within hours P2 Medium — within business day P3 Low — investigation only

INC-01

LLM cost explosion

Detection Signals

Spend metric exceeds daily budget by >2x
Token-per-query metric step-change
Vendor billing alert

Immediate Triage

Activate per-key rate limit at 25% of normal
Identify top-N consumers from logs
Roll back recent prompt/template changes if correlated

Communication Template

“Cost anomaly detected on LLM service at HH:MM. Throttling in effect; user impact is rate-limited responses. Root cause investigation in progress; ETA for normal capacity within Xh.”

Root Cause Patterns

Loose top-k in RAG retrieval
Removed reranker by mistake
Prompt change ballooning context
Abuse / scraping pattern

Prevention Update

Add token-per-query alarm at 1.5x baseline. Enforce per-key rate limit and budget cap. Add reranker required by policy.

INC-02

Hallucination at scale on a specific topic

Detection Signals

Customer complaint cluster mentioning same wrong fact
Quality eval regression on a topic subset
Citation rate drop on retrieved chunks

Immediate Triage

Add hard-coded refusal for the topic pending fix
Inspect retrieval against ground truth on representative queries
Confirm RAG corpus actually contains the correct info

Communication Template

“We have identified incorrect responses on topic X. The assistant is now declining the topic while we fix retrieval. ETA for return to service: Xh.”

Root Cause Patterns

Retrieval not surfacing the right chunk
Embedding model misalignment on terminology
Corpus stale or missing the fact
System prompt allowing speculation

Prevention Update

Add golden-set evaluation including the topic. Enforce citation requirement in prompt. Schedule corpus freshness audit.

INC-03

Silent model drift (vendor update)

Detection Signals

Eval score drop without code change
Output distribution shift
Customer feedback shift over a week

Immediate Triage

Pin to a known-good model version if vendor allows
A/B compare current vs prior version on golden set
Roll back affected prompts to the version-pinned model

Communication Template

“Internal: model drift suspected on vendor X. Performance regression of Y% observed. Pinning to prior version while we investigate.”

Root Cause Patterns

Vendor silently updated underlying model
Vendor deprecated feature relied on by prompt
Tokenizer changed under the same model name

Prevention Update

Version-pin every model in production. Subscribe to vendor changelog. Add eval to CI to catch drift.

INC-04

Prompt injection through retrieved content

Detection Signals

Anomalous tool calls
Output containing data not from current user
Suspicious system-prompt leakage in responses

Immediate Triage

Disable tool calling for retrieved-content sessions
Quarantine the suspect documents
Force re-auth on affected sessions

Communication Template

“Internal P0: prompt injection vector detected through document upload. Tools disabled while we investigate scope. No customer notification yet pending impact assessment.”

Root Cause Patterns

Retrieved content treated as instruction
User-uploaded document not filtered
RAG corpus polluted with adversarial entries

Prevention Update

System prompt that forbids following retrieved instructions. Output validation. Content scanning before ingestion. Tool-call allowlist independent of LLM output.

INC-05

PII leak in LLM output

Detection Signals

DLP alert
Customer report
Audit log entry showing PII in response

Immediate Triage

Disable affected endpoint
Engage legal and privacy officer
Identify scope: which users, what data, how many calls

Communication Template

“Internal P0: PII appearance in LLM output identified on endpoint X. Endpoint disabled. Legal engaged. Detailed scope assessment within 4h.”

Root Cause Patterns

PII in RAG corpus that should have been redacted
Insufficient output filtering
Cross-tenant retrieval leak
Model trained on PII (vendor side)

Prevention Update

PII scrubbing on ingestion AND on output. Tenant filter at retrieval layer. Output DLP scan. Periodic red-team.

INC-06

Vendor outage

Detection Signals

Vendor status page yellow/red
Spike in 5xx from vendor API
Latency p95 exceeds threshold

Immediate Triage

Activate self-hosted fallback OR alternative vendor via gateway
Set feature flag to degraded mode for non-critical AI paths
Update status page

Communication Template

“Some AI features are running in degraded mode due to upstream vendor outage. Full functionality will return when the vendor recovers; we will update status hourly.”

Root Cause Patterns

Vendor incident
Regional vendor outage with no multi-region fallback
Rate-limit hit because our usage pattern changed

Prevention Update

Gateway-based multi-vendor with automatic failover. Self-hosted fallback for compliance-critical paths. Graceful degradation feature flags.

INC-07

Latency degradation

Detection Signals

p95 over SLA for >15 min
Queue depth growing
Customer satisfaction metric drop

Immediate Triage

Add caching layer for hot prompts
Shed non-critical traffic
Check vendor status; switch region if applicable

Communication Template

“We are seeing higher response times in AI features. Mitigations in effect; normal latency expected within Xh.”

Root Cause Patterns

Vendor regional latency
Context window grew due to corpus growth
Reranker is the bottleneck
Cold cache after deploy

Prevention Update

Latency SLO alarms. Hot-prompt cache. Streaming responses where applicable. Reranker capacity planning.

INC-08

Quality regression after prompt change

Detection Signals

Eval score drop after deploy
Customer feedback shift
Internal QA flag

Immediate Triage

Roll back the prompt to previous version
Compare A/B against rolled-back version
Annotate failing samples for analysis

Communication Template

“Internal: quality regression detected on feature X after prompt change at time Y. Reverted; investigating root cause.”

Root Cause Patterns

Prompt change tested only on happy path
Edge cases broke
Negative-space examples not in eval

Prevention Update

Mandatory golden-set eval before prompt deploy. Negative-space examples in eval. Canary deployment for prompts.

INC-09

Tool-use loop / runaway agent

Detection Signals

Tool call count spike per session
Long-running session metric
Cost-per-session anomaly

Immediate Triage

Hard cap tool calls per session
Kill long sessions over a limit
Disable specific tool if it is the loop driver

Communication Template

“Internal: agent loop detected. Per-session call cap reduced. Investigation in progress.”

Root Cause Patterns

Tool returns ambiguous result leading to retry
Plan-and-execute prompt encourages repetition
No completion criterion in prompt

Prevention Update

Per-session call budget. Completion criterion in system prompt. Tool result validation before next step.

INC-10

Compliance audit finding on AI

Detection Signals

External or internal audit finding
Regulator inquiry
Customer compliance review failure

Immediate Triage

Document scope and timeline of finding
Engage compliance and legal
Suspend specific affected feature if recommended

Communication Template

“Internal: audit finding requires response by date D. Engaging legal and compliance. Customer-facing communication pending legal review.”

Root Cause Patterns

Missing audit log
Insufficient evidence for compliance baseline control
Configuration drift since last audit

Prevention Update

Audit log retention review. Map every control in 12-control baseline to evidence. Schedule quarterly pre-audit.

INC-11

Embedding corpus poisoning

Detection Signals

Anomalous response cluster from specific retrieval pattern
Unusual documents appearing high in retrieval
External report of malicious upload

Immediate Triage

Disable user-content ingestion temporarily
Re-embed the corpus from trusted source
Quarantine suspicious documents

Communication Template

“Internal P0: potential corpus poisoning. User-content ingestion disabled. Re-indexing in progress.”

Root Cause Patterns

Unauthenticated user-content ingestion
Insufficient content moderation
Adversarial uploader exploiting public path

Prevention Update

Authenticate every ingestion path. Content classifier at ingestion. Provenance tracking on every chunk.

INC-12

Token / credential leak in prompt

Detection Signals

Credential appearing in vendor logs
Customer credential rotation alert
Secrets scanner alert

Immediate Triage

Rotate the credential immediately
Audit access logs for misuse during exposure window
Disable the integration that leaked

Communication Template

“Internal P0: credential exposure. Rotation complete; access logs under review for misuse window of Xh.”

Root Cause Patterns

Credential included in retrieved chunk
User pasted secret into chat which was logged
Debug logging exposing secret

Prevention Update

Secret pre-filter on every user input and retrieved chunk. Vendor log redaction policy. Never log raw prompts in production.

INC-13

Stale data served by RAG after source change

Detection Signals

Customer report of outdated info
Reindex job failure log
Cache invalidation alarm

Immediate Triage

Force reindex of affected source
Disable cache for the affected topic
Confirm freshness via spot-check

Communication Template

“Customer-facing: we identified that some responses were based on older data. Index is refreshed; please retry.”

Root Cause Patterns

Failed reindex job that was not alerted
Cache TTL too long for source change cadence
Webhook from source dropped

Prevention Update

Reindex job health alarm. Source-change webhook retries with DLQ. Per-source freshness SLO.

INC-14

Regulatory boundary crossing (data residency)

Detection Signals

Vendor processing log showing cross-border data flow
Compliance review finding
Customer escalation

Immediate Triage

Route affected traffic to in-region endpoint
Engage legal
Document scope of cross-border processing

Communication Template

“Internal P0: data-residency boundary may have been crossed. Routing corrected. Legal evaluating notification obligations.”

Root Cause Patterns

Vendor request routed to unexpected region
Self-hosted fallback in different jurisdiction
DPA not covering the routing

Prevention Update

Vendor SLA includes region pinning. Self-hosted fallback in-region. Map every data flow against DPA matrix.

Common Post-Incident Actions

Action	Owner	Deliverable
Postmortem within 5 business days	Incident Commander	Document with timeline, impact, root cause, action items, owners, dates.
Update the AI Failure Modes Catalog if a new mode emerged	AI Platform Lead	PR to internal failure-modes register.
Update governance baseline if a control failed	AI Risk Lead	Updated baseline + assessment of other systems against the same control.
Update vendor scorecard if vendor-related	Procurement	Score adjustment + next-review date pulled forward.

About this playbook. Compiled from 150+ Slavin AI & SLAtech engagements 2022–2026 where production AI systems experienced incidents requiring documented response. Each incident class has been observed at least twice across independent clients. Playbook steps reflect what was actually done; entries are documented as patterns, not specific events.

License. Creative Commons Attribution 4.0 International (CC-BY-4.0). Reuse with credit to “Slavin AI” and a link to slavin.ai/Checklist/AI-Incident-Response.

Machine-readable version. The same playbook as JSON: slavin.ai/data/ai-incident-response-playbook.json.

Companion catalog. Failure modes caught before launch: AI Pre-Production Review Checklist.

Read the position page →

Slavin AI is a brand of SLAtech LTD. slatech.co.il · Contact