DevOps & Ops

Incident Response

Structured production incident response: triage severity, contain the blast, create P0 ticket, gather evidence, run investigation, and generate post-mortem.

How It Works

Incident Response · Workflow
Contain first, investigate one hypothesis at a time, post-mortem.
Trigger: Production incident (outage · bad data · revenue loss · bot down)
  1. Triage + contain the blast: Assess SEV · halt the affected service first
  2. Open P0 incident file: .incidents/{slug}.md · timeline + hypotheses, append-only
  3. Gather evidence: Logs · errors · disk/mem/load · listening ports
  4. Investigate hypotheses (gate): Test ONE at a time · eliminate + append, never re-test. On fail: eliminate and form the next hypothesis
  5. Resolve + verify healthy: Fix · restart · confirm via test transaction
Post-mortem: Root cause · prevention items · lessons.md updated

Invocation Triggers

/incident · production down · service down · outage · emergency · bot down

Use Cases

  • Service is down and you need a structured response immediately
  • Trading bot executing bad trades — halt and investigate
  • Generate a post-mortem after a resolved incident

The Problem

Production goes down and your instincts work against you. You start debugging while the service is still writing bad data, compounding the damage. You hold the timeline in your head, so the moment you context-switch it is gone. You test three hypotheses at once and cannot tell which action caused which effect. By the time it is fixed, you cannot reconstruct what happened — so there is no post-mortem, no prevention item, and the same incident comes back next month.

What It Does

  1. Triage severity in under 60 seconds

    Classifies the incident SEV-1 through SEV-3. Active revenue loss, wrong data being written, or a security breach is SEV-1 and halts immediately. When in doubt, it halts — a stopped service beats a broken one running.

  2. Contain the blast

    Finds the process and hard-stops it: kill -9, launchctl unload, or systemctl stop. Contain first and investigate second; the only exception is a blast radius that is demonstrably zero.

  3. Open the incident file

    Writes .incidents/{slug}.md with severity, impact, a Current Action line overwritten before every move, and an append-only Timeline. The file is the single source of truth — not your head.

  4. Gather evidence

    Pulls the last 100 log lines, greps for error/exception/fatal/critical, and checks disk, memory, load, and listening ports. Every finding lands in the Timeline.

  5. Investigate one hypothesis at a time

    Forms specific, falsifiable hypotheses and tests exactly one. Eliminated hypotheses are appended, never re-tested. No parallel testing — it contaminates results under pressure.

  6. Resolve, then post-mortem

    Implements the fix, restarts, and verifies health with a real test transaction. Within 24 hours it writes the post-mortem (root cause, contributing factors, detection latency) and adds at least one prevention item to lessons.md.
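
The evidence-gathering step above can be sketched in Python. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the names `tail_lines`, `find_error_lines`, and `host_snapshot` are hypothetical, and the log path is whatever your service writes to. It reads the last 100 lines of a log, keeps the ones matching error/exception/fatal/critical, and records disk and load figures using only the standard library.

```python
import os
import re
import shutil
from collections import deque

# Keywords taken from the step description; matching is case-insensitive.
ERROR_PATTERN = re.compile(r"error|exception|fatal|critical", re.IGNORECASE)


def tail_lines(path, count=100):
    """Return the last `count` lines of a text file (hypothetical helper)."""
    with open(path, "r", errors="replace") as handle:
        # A bounded deque keeps only the final `count` lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def find_error_lines(lines):
    """Keep only the lines that mention one of the error keywords."""
    return [line for line in lines if ERROR_PATTERN.search(line)]


def host_snapshot(path="/"):
    """Collect disk usage and, where the platform supports it, the 1-minute load."""
    usage = shutil.disk_usage(path)
    used_pct = 100 * usage.used / usage.total if usage.total else 0.0
    snapshot = {"disk_used_pct": round(used_pct, 1)}
    if hasattr(os, "getloadavg"):  # not available on Windows
        try:
            snapshot["load_1m"] = os.getloadavg()[0]
        except OSError:
            pass  # load average could not be obtained on this host
    return snapshot
```

Memory and listening ports are left out of the sketch because the standard library has no portable call for them; in practice each result would be appended to the incident Timeline as it is collected.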

What You Get / What It Doesn't Do

What you get
  • A halted service before the damage compounds
  • An .incidents/{slug}.md file with an append-only timeline
  • A confirmed root cause, tested one hypothesis at a time
  • A post-mortem with real prevention items, not template headers
  • A lessons.md entry with detection latency and the alerting gap
What it doesn't do
  • Decide the business trade-off of halting a revenue service for you
  • Fix the root cause without confirming it with a test transaction
  • Skip the post-mortem because the incident felt minor
  • Re-test a hypothesis it already eliminated to feel thorough

Tips

Halt before you understand

The instinct to diagnose first is the expensive one. A degraded service writing bad data causes compounding damage. Stop it, then figure out why — the order is not negotiable.
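
The halt-first rule can be written down as a small decision function. The sketch below is an illustration under stated assumptions, not the skill's own logic: the page only defines what counts as SEV-1 (active revenue loss, wrong data being written, or a security breach), so the SEV-2/SEV-3 split via `service_down` is a hypothetical placeholder, as are all the names.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Observations available in the first 60 seconds (hypothetical fields)."""
    revenue_loss: bool = False
    writing_bad_data: bool = False
    security_breach: bool = False
    service_down: bool = False        # assumption: only used to split SEV-2 from SEV-3
    blast_radius_zero: bool = False   # set True only when demonstrably zero


def classify(signals):
    """Return 1, 2, or 3. Only the SEV-1 criteria come from the text above."""
    if signals.revenue_loss or signals.writing_bad_data or signals.security_breach:
        return 1
    return 2 if signals.service_down else 3


def should_halt(signals):
    """Halt unless the blast radius is demonstrably zero; SEV-1 always halts."""
    if classify(signals) == 1:
        return True
    return not signals.blast_radius_zero
```

Because `blast_radius_zero` defaults to `False`, an incident with no information at all still halts, which mirrors the "when in doubt, halt" rule.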

Update the file before every action

Not after. The moment you crash, context-switch, or hand off, anything in your head is lost. The Current Action line and the Timeline are what survive you.
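
A minimal sketch of that write-before-acting discipline follows. The file layout is an assumption: the page says the incident file holds severity, impact, a Current Action line, and an append-only Timeline, but does not show the exact format, so the headings and the helper names `open_incident` and `record_action` are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path

CURRENT_PREFIX = "Current Action: "


def open_incident(directory, slug, severity, impact):
    """Create {directory}/{slug}.md with an empty Timeline (assumed layout)."""
    path = Path(directory) / f"{slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        f"# Incident: {slug}\n"
        f"Severity: {severity}\n"
        f"Impact: {impact}\n"
        f"{CURRENT_PREFIX}(none)\n"
        "\n## Timeline\n",
        encoding="utf-8",
    )
    return path


def record_action(path, action):
    """Call BEFORE acting: overwrite Current Action, then append to the Timeline."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    lines = path.read_text(encoding="utf-8").splitlines()
    for index, line in enumerate(lines):
        if line.startswith(CURRENT_PREFIX):
            lines[index] = CURRENT_PREFIX + action  # the only line ever overwritten
            break
    lines.append(f"- {stamp} {action}")  # Timeline is the last section, so append
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Typical use would be `record_action(path, "stopping the bot")` immediately before running the stop command, so that a crash or handoff mid-action still leaves the intent on disk.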

If you re-test, your elimination was sloppy

Looping back to a dead hypothesis means you did not record why it was dead precisely enough. Fix the elimination note, not your patience.
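
One way to make the no-re-test rule mechanical is a small ledger that refuses both parallel testing and re-opening an eliminated hypothesis. This is a hedged sketch, not part of the skill itself; the class name `HypothesisLedger` and its methods are hypothetical.

```python
class HypothesisLedger:
    """Tracks hypotheses so that only one is under test and none is re-tested."""

    def __init__(self):
        self.active = None     # the single hypothesis currently under test
        self.eliminated = {}   # hypothesis -> recorded reason it was ruled out

    def start(self, hypothesis):
        """Begin testing a hypothesis; rejects parallel tests and re-tests."""
        if self.active is not None:
            raise RuntimeError(f"already testing: {self.active}")
        if hypothesis in self.eliminated:
            raise ValueError(
                f"already eliminated: {hypothesis} ({self.eliminated[hypothesis]})"
            )
        self.active = hypothesis

    def eliminate(self, reason):
        """Close the active hypothesis; a non-empty reason is mandatory."""
        if self.active is None:
            raise RuntimeError("no hypothesis under test")
        if not reason.strip():
            raise ValueError("an elimination needs a recorded reason")
        self.eliminated[self.active] = reason
        self.active = None
```

The error raised on a re-test attempt echoes the stored reason, so a vague elimination note becomes visible the moment someone is tempted to loop back.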

Get the Skill

Elite Skill

Unlock the full Incident Response SKILL.md — drop it into ~/.claude/skills/ and trigger it by name.
