Incident Response
Structured production incident response: triage severity, contain the blast, create a P0 ticket, gather evidence, run the investigation, and generate a post-mortem.
How It Works
- Production incident: Outage · bad data · revenue loss · bot down
- Triage + contain the blast: Assess SEV · halt the affected service first
- Open P0 incident file: .incidents/{slug}.md · timeline + hypotheses, append-only
- Gather evidence: Logs · errors · disk/mem/load · listening ports
- Investigate hypotheses: Test ONE at a time · eliminate + append, never re-test
- Resolve + verify healthy: Fix · restart · confirm via test transaction
- Post-mortem: Root cause · prevention items · lessons.md updated
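As an illustration of the evidence step, here is a minimal Python sketch that reads the tail of a log file and keeps only the lines matching an error keyword. The 100-line tail and the keyword list (error, exception, fatal, critical) come from this page. The function name `scan_log_tail` and its parameters are hypothetical and not part of the skill itself.

```python
import re
from collections import deque
from pathlib import Path

# Keywords named on this page; matching is case-insensitive (an assumption).
ERROR_PATTERN = re.compile(r"error|exception|fatal|critical", re.IGNORECASE)


def scan_log_tail(log_path, tail=100):
    """Return the lines among the last `tail` lines that match an error keyword.

    Hypothetical helper: the name and signature are illustrative only.
    """
    with Path(log_path).open("r", encoding="utf-8", errors="replace") as fh:
        # deque with maxlen keeps only the final `tail` lines in memory.
        last_lines = deque(fh, maxlen=tail)
    return [line.rstrip("\n") for line in last_lines if ERROR_PATTERN.search(line)]
```

Each returned line would then be appended to the incident Timeline. Disk, memory, load, and listening-port checks are platform-specific and are left out of this sketch.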
Invocation Triggers
/incident · production down · service down · outage · emergency · bot down
Use Cases
- Service is down and you need a structured response immediately
- Trading bot executing bad trades — halt and investigate
- Generate a post-mortem after a resolved incident
The Problem
Production goes down and your instincts work against you. You start debugging while the service is still writing bad data, compounding the damage. You hold the timeline in your head, so the moment you context-switch it is gone. You test three hypotheses at once and cannot tell which action caused which effect. By the time it is fixed, you cannot reconstruct what happened — so there is no post-mortem, no prevention item, and the same incident comes back next month.
What It Does
- 1. Triage severity in under 60 seconds
Classifies the incident as SEV-1, SEV-2, or SEV-3. Active revenue loss, wrong data being written, or a security breach is SEV-1, and the service is halted immediately. When in doubt, it halts: a stopped service beats a broken one that keeps running.
- 2. Contain the blast
Finds the process and stops it with kill -9, launchctl unload, or systemctl stop. The order is fixed: contain first, then investigate. Containment is skipped only when the blast radius is demonstrably zero.
- 3. Open the incident file
Writes .incidents/{slug}.md with severity, impact, a Current Action line overwritten before every move, and an append-only Timeline. The file is the single source of truth — not your head.
- 4. Gather evidence
Pulls the last 100 log lines, greps for error/exception/fatal/critical, and checks disk, memory, load, and listening ports. Every finding lands in the Timeline.
- 5. Investigate one hypothesis at a time
Forms specific, falsifiable hypotheses and tests exactly one. Eliminated hypotheses are appended, never re-tested. No parallel testing — it contaminates results under pressure.
- 6. Resolve, then post-mortem
Implements the fix, restarts, and verifies health with a real test transaction. Within 24h it writes the post-mortem: root cause, contributing factors, detection latency, and at least one prevention item into lessons.md.
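The incident file from step 3 can be sketched in Python as follows. The `.incidents/{slug}.md` path, the Current Action line, and the append-only Timeline come from the description above. The exact file layout, the UTC timestamp format, and the function names (`open_incident`, `set_current_action`, `append_timeline`) are assumptions made for illustration and may differ from what the skill writes.

```python
from datetime import datetime, timezone
from pathlib import Path

INCIDENT_DIR = Path(".incidents")  # directory name taken from the page


def _now():
    # UTC timestamps are an assumption; the page does not specify a format.
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def open_incident(slug, severity, impact, base_dir=INCIDENT_DIR):
    """Create the incident file; refuse to clobber an existing one."""
    base_dir = Path(base_dir)
    base_dir.mkdir(parents=True, exist_ok=True)
    path = base_dir / f"{slug}.md"
    if path.exists():
        raise FileExistsError(f"incident file already exists: {path}")
    path.write_text(
        f"# Incident: {slug}\n\n"
        f"Severity: {severity}\n"
        f"Impact: {impact}\n"
        "Current Action: (none yet)\n\n"
        "## Timeline\n"
        f"- {_now()} Incident file opened\n",
        encoding="utf-8",
    )
    return path


def set_current_action(path, action):
    """Overwrite the single Current Action line. Call this BEFORE each move."""
    path = Path(path)
    lines = path.read_text(encoding="utf-8").splitlines()
    for i, line in enumerate(lines):
        if line.startswith("Current Action:"):
            lines[i] = f"Current Action: {action}"
            break
    else:
        raise ValueError("no Current Action line found")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def append_timeline(path, entry):
    """Append a timestamped entry; earlier entries are never edited."""
    # Timeline is the last section, so appending to the file extends it.
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(f"- {_now()} {entry}\n")
```

A typical sequence would be `set_current_action(path, "stopping the service")`, then the action itself, then `append_timeline(path, "service stopped")`. Keeping the Timeline as the final section lets new entries be plain file appends, so a crash mid-incident cannot damage earlier entries.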
What You Get
- A halted service before the damage compounds
- An .incidents/{slug}.md file with an append-only timeline
- A confirmed root cause, tested one hypothesis at a time
- A post-mortem with real prevention items, not template headers
- A lessons.md entry with detection latency and the alerting gap
What It Doesn't Do
- Decide the business trade-off of halting a revenue service for you
- Fix the root cause without confirming it with a test transaction
- Skip the post-mortem because the incident felt minor
- Re-test a hypothesis it already eliminated to feel thorough
Tips
The instinct to diagnose first is the expensive one. A degraded service writing bad data causes compounding damage. Stop it, then figure out why — the order is not negotiable.
Update the incident file before you act, not after. The moment you crash, context-switch, or hand off, anything in your head is lost. The Current Action line and the Timeline are what survive you.
Looping back to a dead hypothesis means you did not record why it was dead precisely enough. Fix the elimination note, not your patience.
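The one-at-a-time rule and the ban on re-testing can be enforced mechanically. The sketch below is one possible way to do it in Python. The class `HypothesisLedger`, its methods, and the whitespace- and case-insensitive matching are all hypothetical choices for illustration, not the skill's actual mechanism.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


def _key(hypothesis):
    # Normalise case and whitespace so trivial rewordings still match (an assumption).
    return " ".join(hypothesis.lower().split())


@dataclass
class HypothesisLedger:
    """Tracks one active hypothesis and an append-only record of eliminated ones."""

    active: Optional[str] = None
    eliminated: Dict[str, str] = field(default_factory=dict)  # key -> reason

    def start(self, hypothesis):
        if self.active is not None:
            # No parallel testing: finish the current hypothesis first.
            raise RuntimeError(f"already testing: {self.active!r}")
        key = _key(hypothesis)
        if key in self.eliminated:
            # Never re-test: surface the recorded reason instead.
            raise ValueError(f"already eliminated: {self.eliminated[key]}")
        self.active = hypothesis

    def eliminate(self, reason):
        if self.active is None:
            raise RuntimeError("no hypothesis is being tested")
        if not reason.strip():
            # The elimination note is what prevents looping back later.
            raise ValueError("record why the hypothesis is dead")
        self.eliminated[_key(self.active)] = reason
        self.active = None
```

Requiring a non-empty reason at elimination time reflects the last tip: a vague note is what invites a re-test, so the ledger refuses to accept one.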
Get the Skill
Unlock the full Incident Response SKILL.md — drop it into ~/.claude/skills/ and trigger it by name.