
Engineering
Position-Level Orphan Detection: Why Heartbeat Monitoring Isn't Enough
A 16-hour outage taught me that bot liveness and position safety are separate monitoring concerns. The gap between them is where losses live.
June 4, 2026 · 7m read

A trading bot ran dead for seven hours before I noticed. That failure taught me that 'process alive' and 'system healthy' are not the same sentence. Here's how I built infrastructure that heals itself, documents itself, and gets smarter every time it breaks.