DevOps & Ops

Incident Response

Structured production incident response: triage severity, contain the blast radius, open a P0 incident file, gather evidence, investigate one hypothesis at a time, and write the post-mortem.

How It Works

Incident Response · Workflow

Contain first, investigate one hypothesis at a time, post-mortem.

TRIGGER: Production incident
Outage · bad data · revenue loss · bot down

STEP 1: Triage + contain the blast
Assess SEV · halt the affected service first

STEP 2: Open P0 incident file
.incidents/{slug}.md · timeline + hypotheses, append-only

STEP 3: Gather evidence
Logs · errors · disk/mem/load · listening ports

GATE: Investigate hypotheses
Test ONE at a time · eliminate + append, never re-test
↻ fail → Eliminate and form next hypothesis
root cause found → STEP 4

STEP 4: Resolve + verify healthy
Fix · restart · confirm via test transaction

OUTPUT: Post-mortem
Root cause · prevention items · lessons.md updated

Invocation Triggers

/incident · production down · service down · outage · emergency · bot down

Use Cases

Service is down and you need a structured response immediately
Trading bot executing bad trades — halt and investigate
Generate a post-mortem after a resolved incident

The Problem

Production goes down and your instincts work against you. You start debugging while the service is still writing bad data, compounding the damage. You hold the timeline in your head, so the moment you context-switch it is gone. You test three hypotheses at once and cannot tell which action caused which effect. By the time it is fixed, you cannot reconstruct what happened — so there is no post-mortem, no prevention item, and the same incident comes back next month.

What It Does

1. Triage severity in under 60 seconds
Classifies the incident SEV-1 through SEV-3. Active revenue loss, wrong data being written, or a security breach is SEV-1 and halts immediately. When in doubt, it halts — a stopped service beats a broken one running.
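
The triage rule above can be sketched as a small function. This is an illustrative sketch, not the skill's actual logic: the `Signals` fields and the meanings given to SEV-2 and SEV-3 are assumptions, since the text only defines SEV-1.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Signals:
    """Hypothetical triage inputs; field names are assumptions."""
    revenue_loss: bool = False      # money is actively being lost
    bad_data_written: bool = False  # wrong data is being persisted
    security_breach: bool = False
    user_visible: bool = False      # assumed SEV-2 signal: degraded, not destructive
    unsure: bool = False            # responder cannot tell how bad it is


def triage(s: Signals) -> Tuple[str, bool]:
    """Return (severity, halt_now)."""
    if s.revenue_loss or s.bad_data_written or s.security_breach:
        severity = "SEV-1"
    elif s.user_visible:
        severity = "SEV-2"  # assumed meaning
    else:
        severity = "SEV-3"  # assumed meaning
    # SEV-1 always halts; doubt also halts, whatever the provisional severity.
    halt_now = severity == "SEV-1" or s.unsure
    return severity, halt_now
```

For example, `triage(Signals(bad_data_written=True))` classifies the incident as SEV-1 and asks for an immediate halt.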
2. Contain the blast
Finds the process and hard-stops it: kill -9, launchctl unload, or systemctl stop. The rule is absolute — contain unless the blast radius is demonstrably zero, then investigate.
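
A minimal sketch of the containment step, assuming the responder already knows how the service is supervised. The `kind` labels and both function names are hypothetical. The function only builds the command so it can be written to the incident file before anything runs.

```python
from typing import List


def should_contain(blast_radius_is_zero: bool) -> bool:
    """Contain unless the blast radius is demonstrably zero."""
    return not blast_radius_is_zero


def halt_command(kind: str, target: str) -> List[str]:
    """Build (but do not run) the command that stops a service.

    kind:   "pid", "launchd", or "systemd" (hypothetical labels)
    target: a PID, a launchd plist path, or a systemd unit name
    """
    if kind == "pid":
        return ["kill", "-9", target]
    if kind == "launchd":
        return ["launchctl", "unload", target]
    if kind == "systemd":
        return ["systemctl", "stop", target]
    raise ValueError(f"unknown service kind: {kind!r}")
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the Current Action line has been updated.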
3. Open the incident file
Writes .incidents/{slug}.md with severity, impact, a Current Action line overwritten before every move, and an append-only Timeline. The file is the single source of truth — not your head.
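
One way to get both behaviors, an overwritten Current Action line and an append-only Timeline, is sketched below. The file layout and function names are assumptions for illustration; the real SKILL.md may structure the file differently. Keeping the Timeline as the last section means a plain file append is enough.

```python
import re
from datetime import datetime, timezone
from pathlib import Path


def now() -> str:
    """UTC timestamp for timeline entries."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def open_incident(slug: str, severity: str, impact: str,
                  root: Path = Path(".incidents")) -> Path:
    """Create .incidents/{slug}.md with an assumed layout."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{slug}.md"
    path.write_text(
        f"# Incident: {slug}\n\n"
        f"Severity: {severity}\n"
        f"Impact: {impact}\n"
        f"Current Action: opened incident file\n\n"
        f"## Timeline\n"
        f"- {now()} incident opened\n"
    )
    return path


def set_current_action(path: Path, action: str) -> None:
    """Overwrite the single Current Action line before acting."""
    text = path.read_text()
    text = re.sub(r"^Current Action: .*$",
                  lambda m: f"Current Action: {action}",
                  text, count=1, flags=re.MULTILINE)
    path.write_text(text)


def append_timeline(path: Path, entry: str) -> None:
    """Append only; earlier timeline entries are never rewritten."""
    with path.open("a") as f:
        f.write(f"- {now()} {entry}\n")
```

A typical sequence is `set_current_action(path, "stopping bot")`, then the action itself, then `append_timeline(path, "bot stopped")`.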
4. Gather evidence
Pulls the last 100 log lines, greps for error/exception/fatal/critical, and checks disk, memory, load, and listening ports. Every finding lands in the Timeline.
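
The log part of this step can be approximated with the standard library alone. The sketch below is an assumption about how the checks might look, not the skill's own commands. Memory and listening ports are left out because the standard library has no portable call for them, and `os.getloadavg` is Unix-only.

```python
import os
import re
import shutil
from collections import deque
from typing import Dict, List

ERROR_PATTERN = re.compile(r"error|exception|fatal|critical", re.IGNORECASE)


def tail(path: str, n: int = 100) -> List[str]:
    """Return the last n lines of a log file."""
    with open(path, errors="replace") as f:
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


def error_lines(lines: List[str]) -> List[str]:
    """Keep lines mentioning error, exception, fatal, or critical."""
    return [line for line in lines if ERROR_PATTERN.search(line)]


def host_snapshot(mount: str = "/") -> Dict[str, float]:
    """Disk usage and 1-minute load average (Unix only)."""
    usage = shutil.disk_usage(mount)
    load_1m, _, _ = os.getloadavg()
    return {
        "disk_used_pct": round(100 * usage.used / usage.total, 1),
        "load_1m": load_1m,
    }
```

Each result would then be written to the Timeline, so the evidence survives a handoff.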
5. Investigate one hypothesis at a time
Forms specific, falsifiable hypotheses and tests exactly one. Eliminated hypotheses are appended, never re-tested. No parallel testing — it contaminates results under pressure.
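
The two rules here, one active hypothesis and no re-testing, can be enforced mechanically. The class below is a hypothetical sketch; its name and methods are not taken from the skill.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class HypothesisLog:
    """Tracks one active hypothesis and an append-only eliminated set."""
    eliminated: Dict[str, str] = field(default_factory=dict)  # hypothesis -> reason
    active: Optional[str] = None

    def start(self, hypothesis: str) -> None:
        key = hypothesis.strip().lower()
        if self.active is not None:
            # No parallel testing: finish the current hypothesis first.
            raise RuntimeError(f"still testing: {self.active!r}")
        if key in self.eliminated:
            # Never re-test: surface the recorded reason instead.
            raise ValueError(f"already eliminated: {self.eliminated[key]}")
        self.active = key

    def eliminate(self, reason: str) -> None:
        if self.active is None:
            raise RuntimeError("no active hypothesis")
        if not reason.strip():
            raise ValueError("record why the hypothesis is dead")
        self.eliminated[self.active] = reason
        self.active = None
```

Requiring a non-empty reason in `eliminate` is a design choice that makes the elimination note precise enough to trust later.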
6. Resolve, then post-mortem
Implements the fix, restarts, and verifies health with a real test transaction. Within 24h it writes the post-mortem: root cause, contributing factors, detection latency, and at least one prevention item into lessons.md.
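
A rough sketch of the post-mortem record follows. It assumes detection latency means the time between when the incident started and when it was detected; the function names and output layout are hypothetical.

```python
from datetime import datetime
from typing import List


def detection_latency_minutes(started_at: datetime, detected_at: datetime) -> float:
    """Minutes between incident start and detection (assumed definition)."""
    return (detected_at - started_at).total_seconds() / 60


def render_postmortem(root_cause: str, contributing_factors: List[str],
                      started_at: datetime, detected_at: datetime,
                      prevention_items: List[str]) -> str:
    """Render a plain-text post-mortem; refuses to render without prevention items."""
    if not prevention_items:
        raise ValueError("at least one prevention item is required")
    latency = detection_latency_minutes(started_at, detected_at)
    lines = [
        f"Root cause: {root_cause}",
        "Contributing factors: " + "; ".join(contributing_factors),
        f"Detection latency: {latency:.0f} min",
        "Prevention items:",
    ]
    lines += [f"- {item}" for item in prevention_items]
    return "\n".join(lines) + "\n"
```

The returned text could be appended to lessons.md once the test transaction has confirmed the service is healthy.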

What You Get / What It Doesn't Do

What you get

A halted service before the damage compounds
An .incidents/{slug}.md file with an append-only timeline
A confirmed root cause, tested one hypothesis at a time
A post-mortem with real prevention items, not template headers
A lessons.md entry with detection latency and the alerting gap

What it doesn't do

Decide the business trade-off of halting a revenue service for you
Fix the root cause without confirming it with a test transaction
Skip the post-mortem because the incident felt minor
Re-test a hypothesis it already eliminated to feel thorough

Tips

Halt before you understand

The instinct to diagnose first is the expensive one. A degraded service writing bad data causes compounding damage. Stop it, then figure out why — the order is not negotiable.

Update the file before every action

Not after. The moment you crash, context-switch, or hand off, anything in your head is lost. The Current Action line and the Timeline are what survive you.

If you re-test, your elimination was sloppy

Looping back to a dead hypothesis means you did not record why it was dead precisely enough. Fix the elimination note, not your patience.

Get the Skill

Elite Skill

Unlock the full Incident Response SKILL.md — drop it into ~/.claude/skills/ and trigger it by name.

...

Commonly Used With

Auto Fix (Build & Deploy)
Classifies runtime failures, locates root cause in the codebase, implements a targeted fix, runs tests, and redeploys. Full autonomous loop. No manual steps.

Debug Investigate (Intelligence)
Scientific method debugging with a persistent eliminated-hypotheses log. Prevents the #1 AI debugging failure: re-testing disproven theories across context resets.
