My Live Trading Bot Was Hung for 7 Hours. Here's the System That Fixed It.
A $50-bet live trading bot silently hung for 7 hours while generating STRONG signals. No alerts. No restarts. I diagnosed the asyncio event loop failure, killed the process manually, and then built Horus — a self-healing watchdog daemon that would have caught it in under 10 minutes.

At 4:19 PM, my live trading bot stopped logging.
Not crashed. Not dead. Technically still running — the process showed up in ps aux, file descriptors were open, CPU was ticking. But the asyncio event loop had silently hung on an external API call, the heartbeat stopped, and for the next 7 hours the bot sat there generating STRONG signals it could never execute.
I discovered it at 11 PM when I checked the dashboard. By then, ETH/daily had been scoring 89/100 — STRONG conviction UP — every single second. For seven hours. Against a window with 19 hours left. Never fired a single order.
That's not a bug. That's a platform design failure.
The Diagnosis
The bot (polymarket-bot) runs as a Python asyncio process under a launchd daemon on my Mac Mini. It evaluates crypto prediction markets — BTC, ETH, SOL, XRP across multiple timeframes — and places bets on Polymarket's CLOB when it detects an edge.
The architecture has:
- A main asyncio event loop that scans active markets every second
- A strategy engine that scores each market (PolyEdge v2.0, 9-factor, 0-100)
- Execution via place_maker_order() → wait_for_fill() → taker fallback
- A heartbeat log line every 31 seconds
When the heartbeat stopped at 16:18:06 and the log froze at 16:19:27, the diagnosis was straightforward in hindsight:
The bot reached _execute_signal(), called place_maker_order(), and the awaitable never returned.
No timeout. No cancellation. Just await self.polymarket.place_maker_order(...) sitting there indefinitely while the event loop starved every other coroutine — including the heartbeat, the market scanner, and every other trade evaluation.
# What the code looked like (simplified)
order_id = await self.polymarket.place_maker_order(market, direction, bet_size)
# If this hangs → entire event loop is blocked → everything stops
The launchd restart loop only fires when the process exits. A hung process doesn't exit. It just... waits. Forever.
A hung asyncio event loop is worse than a crash. A crash triggers your restart logic. A hang looks alive to your process monitor while being completely inert to your trading logic. You get the worst of both worlds: the system thinks it's healthy while it's completely dead.
What Should Have Been There
This failure exposed three missing layers that every long-running production process needs:
- Layer 1: timeouts on every external awaitable, so a hung call raises instead of blocking forever
- Layer 2: an in-process watchdog that kills the process when the event loop stalls
- Layer 3: an external watchdog daemon that catches whatever the first two layers miss
The fix for Layer 1 is simple: wrap every external awaitable in asyncio.wait_for():
# Before (hangs indefinitely)
order_id = await self.polymarket.place_maker_order(market, direction, bet_size)
# After (times out and raises TimeoutError after 30s)
order_id = await asyncio.wait_for(
    self.polymarket.place_maker_order(market, direction, bet_size),
    timeout=30.0
)
The fix for Layer 2 is an in-process watchdog thread that runs independently of the event loop:
import threading, os, signal, time

class EventLoopWatchdog:
    """Separate thread that kills the process if the event loop stalls."""

    def __init__(self, timeout_seconds: int = 120):
        self.timeout = timeout_seconds
        self._last_beat = time.monotonic()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def beat(self):
        """Call this in your main loop to signal liveness."""
        self._last_beat = time.monotonic()

    def start(self):
        self._thread.start()

    def _run(self):
        while True:
            time.sleep(15)
            stale = time.monotonic() - self._last_beat
            if stale > self.timeout:
                # Process is hung — kill it so launchd can restart
                os.kill(os.getpid(), signal.SIGTERM)
But Layers 1 and 2 are in-process fixes. They require code changes to every affected service, and they still leave you exposed when a new blocking call inevitably slips through. You need the external layer regardless.
That's why I built Horus.
Building the All-Seeing Eye
Horus is a config-driven self-healing watchdog daemon. Named after the Egyptian god whose eye is the symbol of protection, royal power, and omniscience — which felt right for something that watches everything and fixes what breaks.
The design mandate:
- No dependencies — pure Python stdlib, runs on any machine without a virtualenv
- Config-driven — adding a new watch is editing a JSON file, not code
- Self-healing first — not just monitoring and alerting, but actively restarting failed services
- Notification-independent — Horus must survive even when the systems it notifies are down
- Launchd-backed — KeepAlive=true so Horus itself never dies
The core loop is dead simple:
for monitor in config["monitors"]:
    healthy, reason = check_fn(monitor["check"])
    if not healthy:
        state.consecutive_failures += 1
        if state.consecutive_failures >= min_failures and cooldown_expired:
            heal_fn(monitor["heal"])
            notify_openclaw(f"Healed {monitor['name']}: {reason}")
Six check types:
- log_staleness — is the log file fresh within N seconds?
- process_alive — is the process name running?
- http_health — does the URL return 2xx?
- port_open — is the TCP port accepting connections?
- launchd_service — is the launchd service active with a live PID?
- file_exists — does a required file exist and is it recent?
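The log_staleness check is the one that matters for tonight's incident, and it needs nothing beyond the stdlib. A minimal sketch (check_log_staleness is a hypothetical name; Horus's actual implementation may differ):

```python
import os
import time

def check_log_staleness(log_file: str, max_age_seconds: int) -> tuple[bool, str]:
    """Healthy if the log file exists and was modified within max_age_seconds."""
    path = os.path.expanduser(log_file)  # configs use ~ paths
    if not os.path.exists(path):
        return False, f"log file missing: {path}"
    age = time.time() - os.path.getmtime(path)
    if age > max_age_seconds:
        return False, f"log stale for {age:.0f}s (limit {max_age_seconds}s)"
    return True, "fresh"
```

Checking the file's mtime rather than tailing its contents keeps the probe cheap enough to run every cycle against every service.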
Four heal actions:
- kill_process — SIGTERM/SIGKILL by process pattern (launchd restarts it)
- launchctl_restart — launchctl kickstart -k a named service
- docker_restart — docker compose restart <container>
- run_script — run any arbitrary bash command
The minimum failure threshold before healing is configurable per service — polymarket-bot uses 2, while mc-backend uses 3. A single check failure could be a transient network blip or a process mid-restart. Requiring consecutive failures before acting prevents Horus from triggering restart loops on healthy services.
The config for the trading bot looks like this:
{
  "name": "polymarket-bot",
  "enabled": true,
  "min_failures_to_heal": 2,
  "heal_cooldown_seconds": 180,
  "check": {
    "type": "log_staleness",
    "log_file": "~/Documents/Dev/polymarket-bot/bot.log",
    "max_age_seconds": 300
  },
  "heal": {
    "action": "kill_process",
    "process_pattern": "main.py --mode conservative",
    "signal": "SIGTERM"
  }
}
With this configuration, here's what would have happened tonight:
- 4:19 PM — bot hangs, log stops updating
- 4:24 PM — Horus detects log is stale (5 min threshold) → failure #1
- 4:24:30 PM — Horus detects log still stale → failure #2 → sends SIGTERM
- 4:25 PM — launchd restarts bot → bot begins scanning markets
- 4:26 PM — ETH/daily STRONG signal evaluates → order placed
Instead of 7 hours of missed opportunity, the gap would have been under 10 minutes.
What Horus Watches
Seven services are now under Horus watch. The polymarket-bot watch alone would have caught tonight's incident:
- polymarket-bot — log_staleness 5m → kill_process
- mc-backend — port_open :8001 → docker_restart
- mc-frontend — port_open :5174 → docker_restart
- openclaw-gateway — port_open :18789 → launchctl_restart
- excalidraw-mcp — port_open :3000 → launchctl_restart
- invictus-backend — port_open :8000 → docker_restart
- invictus-frontend — port_open :5173 → docker_restart
It runs on a 30-second cycle under its own launchd daemon (com.knox.horus, KeepAlive=true), so Horus itself is guarded by the same self-healing mechanism it provides to everything else.
The Night It Proved Itself
Horus was running for less than two minutes before it found a problem.
The mc-backend container (my Mission Control dashboard backend) was returning 404 on the health endpoint. Horus waited for three consecutive failures, then issued docker compose restart mc-backend. The container restarted. Horus logged the recovery and moved on.
I hadn't even noticed mc-backend was down.
That's the point. Self-healing infrastructure should be boring. The fix should happen before you ever know there was a problem.
The Mandate Going Forward
Here's the rule I'm enforcing across every system in this stack going forward:
Every long-running process needs three things: (1) timeouts on ALL external API calls, (2) an in-process watchdog thread that kills the process if the event loop stalls, and (3) a Horus watch entry so the external layer catches what the internal layers miss. No exceptions.
The specific implementation pattern for asyncio services:
# 1. Timeout every external call
result = await asyncio.wait_for(external_call(), timeout=30.0)
# 2. Add a watchdog thread that calls os.kill() if the loop stalls
watchdog = EventLoopWatchdog(timeout_seconds=120)
watchdog.start()
# 3. Beat the watchdog in your main loop
while True:
    watchdog.beat()
    await evaluate_markets()
    await asyncio.sleep(1)
# 4. Register in Horus config (log_staleness, 5 min threshold)
For non-asyncio services: any HTTP endpoint, port, or log file age is enough for Horus to detect and heal.
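A port probe, for instance, needs only the socket module. A minimal sketch of what a port_open check could look like (check_port_open is a hypothetical name, not Horus's actual code):

```python
import socket

def check_port_open(host: str, port: int, timeout: float = 3.0) -> tuple[bool, str]:
    """Healthy if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"port {port} accepting connections"
    except OSError as exc:
        return False, f"port {port} unreachable: {exc}"
```

This catches a dead container or daemon but not an unresponsive one, which is why HTTP health endpoints and log staleness remain the stronger checks where available.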
The bot lost 7 hours tonight. It won't lose 7 minutes again.
Horus is available at ~/Documents/Dev/horus/ if you're running the same Mac Mini infrastructure stack. The config is JSON, the daemon is pure Python stdlib, and the whole thing is under 300 lines.