My Live Trading Bot Was Hung for 7 Hours. Here's the System That Fixed It.
A $50-bet live trading bot silently hung for 7 hours while generating STRONG signals. No alerts. No restarts. I diagnosed the asyncio event loop failure, killed the process manually, and then built Horus — a self-healing watchdog daemon that would have caught it in under 10 minutes.

At 4:19 PM, my live trading bot stopped logging.
Not crashed. Not dead. Technically still running — the process showed up in ps aux, file descriptors were open, CPU was ticking. But the asyncio event loop had silently hung on an external API call, the heartbeat stopped, and for the next 7 hours the bot sat there generating STRONG signals it could never execute.
I discovered it at 11 PM when I checked the dashboard. By then, ETH/daily had been scoring 89/100 — STRONG conviction UP — every single second. For seven hours. Against a window with 19 hours left. Never fired a single order.
That's not a bug. That's a platform design failure.
The Diagnosis
The bot (polymarket-bot) runs as a Python asyncio process under a launchd daemon on my Mac Mini. It evaluates crypto prediction markets — BTC, ETH, SOL, XRP across multiple timeframes — and places bets on Polymarket's CLOB when it detects an edge.
The architecture has:
- A main asyncio event loop that scans active markets every second
- A strategy engine that scores each market (PolyEdge v2.0, 9-factor, 0-100)
- Execution via place_maker_order() → wait_for_fill() → taker fallback
- A heartbeat log line every 31 seconds
When the heartbeat stopped at 16:18:06 and the log froze at 16:19:27, the diagnosis was straightforward in hindsight:
The bot reached _execute_signal(), called place_maker_order(), and the awaitable never returned.
No timeout. No cancellation. Just await self.polymarket.place_maker_order(...) sitting there indefinitely while the event loop starved every other coroutine — including the heartbeat, the market scanner, and every other trade evaluation.
# What the code looked like (simplified)
order_id = await self.polymarket.place_maker_order(market, direction, bet_size)
# If this hangs → entire event loop is blocked → everything stops
The launchd restart loop only fires when the process exits. A hung process doesn't exit. It just... waits. Forever.
A hung asyncio event loop is worse than a crash. A crash triggers your restart logic. A hang looks alive to your process monitor while being completely inert to your trading logic. You get the worst of both worlds: the system thinks it's healthy while it's completely dead.
What Should Have Been There
This failure exposed three missing layers that every long-running production process needs:
- Layer 1: timeouts on every external awaitable, so a hung call raises instead of blocking forever
- Layer 2: an in-process watchdog that kills the process when the event loop stalls
- Layer 3: an external watchdog daemon that catches whatever the first two layers miss
The fix for Layer 1 is simple: wrap every external awaitable in asyncio.wait_for():
# Before (hangs indefinitely)
order_id = await self.polymarket.place_maker_order(market, direction, bet_size)
# After (times out and raises TimeoutError after 30s)
order_id = await asyncio.wait_for(
    self.polymarket.place_maker_order(market, direction, bet_size),
    timeout=30.0
)
The fix for Layer 2 is an in-process watchdog thread that runs independently of the event loop:
import threading, os, signal, time

class EventLoopWatchdog:
    """Separate thread that kills the process if the event loop stalls."""

    def __init__(self, timeout_seconds: int = 120):
        self.timeout = timeout_seconds
        self._last_beat = time.monotonic()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def beat(self):
        """Call this in your main loop to signal liveness."""
        self._last_beat = time.monotonic()

    def start(self):
        self._thread.start()

    def _run(self):
        while True:
            time.sleep(15)
            stale = time.monotonic() - self._last_beat
            if stale > self.timeout:
                # Process is hung — kill it so launchd can restart
                os.kill(os.getpid(), signal.SIGTERM)
But Layers 1 and 2 are in-process fixes. They require code changes to every affected service, and they still leave you exposed when a new blocking call inevitably slips through. You need the external layer regardless.
That's why I built Horus.
Building the All-Seeing Eye
Horus is a config-driven self-healing watchdog daemon. Named after the Egyptian god whose eye is the symbol of protection, royal power, and omniscience — which felt right for something that watches everything and fixes what breaks.
The design mandate:
- No dependencies — pure Python stdlib, runs on any machine without a virtualenv
- Config-driven — adding a new watch is editing a JSON file, not code
- Self-healing first — not just monitoring and alerting, but actively restarting failed services
- Notification-independent — Horus must survive even when the systems it notifies are down
- Launchd-backed — KeepAlive=true so Horus itself never dies
The core loop is dead simple:
for monitor in config["monitors"]:
    healthy, reason = check_fn(monitor["check"])
    if not healthy:
        state.consecutive_failures += 1
        if state.consecutive_failures >= min_failures and cooldown_expired:
            heal_fn(monitor["heal"])
            notify_openclaw(f"Healed {monitor['name']}: {reason}")
Six check types:
- log_staleness — is the log file fresh within N seconds?
- process_alive — is the process name running?
- http_health — does the URL return 2xx?
- port_open — is the TCP port accepting connections?
- launchd_service — is the launchd service active with a live PID?
- file_exists — does a required file exist and is it recent?
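The log_staleness check is the one that matters for tonight's incident, and it needs nothing beyond the stdlib. A minimal sketch (check_log_staleness is a hypothetical name; Horus's actual implementation may differ):

```python
import os
import time

def check_log_staleness(log_file: str, max_age_seconds: int) -> tuple[bool, str]:
    """Healthy if the log file exists and was modified within max_age_seconds."""
    path = os.path.expanduser(log_file)  # configs use ~ paths
    if not os.path.exists(path):
        return False, f"log file missing: {path}"
    age = time.time() - os.path.getmtime(path)
    if age > max_age_seconds:
        return False, f"log stale for {age:.0f}s (limit {max_age_seconds}s)"
    return True, "fresh"
```

Checking the file's mtime rather than tailing its contents keeps the probe cheap enough to run every cycle against every service.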
Four heal actions:
- kill_process — SIGTERM/SIGKILL by process pattern (launchd restarts it)
- launchctl_restart — launchctl kickstart -k a named service
- docker_restart — docker compose restart <container>
- run_script — run any arbitrary bash command
The minimum failure threshold before healing is configurable per service — polymarket-bot uses 2, while mc-backend uses 3. A single check failure could be a transient network blip or a process mid-restart. Requiring consecutive failures before acting prevents Horus from triggering restart loops on healthy services.
The config for the trading bot looks like this:
{
  "name": "polymarket-bot",
  "enabled": true,
  "min_failures_to_heal": 2,
  "heal_cooldown_seconds": 180,
  "check": {
    "type": "log_staleness",
    "log_file": "~/Documents/Dev/polymarket-bot/bot.log",
    "max_age_seconds": 300
  },
  "heal": {
    "action": "kill_process",
    "process_pattern": "main.py --mode conservative",
    "signal": "SIGTERM"
  }
}
With this configuration, here's what would have happened tonight:
- 4:19 PM — bot hangs, log stops updating
- 4:24 PM — Horus detects log is stale (5 min threshold) → failure #1
- 4:24:30 PM — Horus detects log still stale → failure #2 → sends SIGTERM
- 4:25 PM — launchd restarts bot → bot begins scanning markets
- 4:26 PM — ETH/daily STRONG signal evaluates → order placed
Instead of 7 hours of missed opportunity, the gap would have been under 10 minutes.
What Horus Watches
Seven services are now under Horus watch. The polymarket-bot watch alone would have caught tonight's incident:
- polymarket-bot — log_staleness 5m → kill_process
- mc-backend — port_open :8001 → docker_restart
- mc-frontend — port_open :5174 → docker_restart
- openclaw-gateway — port_open :18789 → launchctl_restart
- excalidraw-mcp — port_open :3000 → launchctl_restart
- invictus-backend — port_open :8000 → docker_restart
- invictus-frontend — port_open :5173 → docker_restart
It runs on a 30-second cycle under its own launchd daemon (com.knox.horus, KeepAlive=true), so Horus itself is guarded by the same self-healing mechanism it provides to everything else.
The Night It Proved Itself
Horus was running for less than two minutes before it found a problem.
The mc-backend container (my Mission Control dashboard backend) was returning 404 on the health endpoint. Horus waited for three consecutive failures, then issued docker compose restart mc-backend. The container restarted. Horus logged the recovery and moved on.
I hadn't even noticed mc-backend was down.
That's the point. Self-healing infrastructure should be boring. The fix should happen before you ever know there was a problem.
The Mandate Going Forward
Here's the rule I'm enforcing across every system in this stack going forward:
Every long-running process needs three things: (1) timeouts on ALL external API calls, (2) an in-process watchdog thread that kills the process if the event loop stalls, and (3) a Horus watch entry so the external layer catches what the internal layers miss. No exceptions.
The specific implementation pattern for asyncio services:
# 1. Timeout every external call
result = await asyncio.wait_for(external_call(), timeout=30.0)
# 2. Add a watchdog thread that calls os.kill() if the loop stalls
watchdog = EventLoopWatchdog(timeout_seconds=120)
watchdog.start()
# 3. Beat the watchdog in your main loop
while True:
    watchdog.beat()
    await evaluate_markets()
    await asyncio.sleep(1)
# 4. Register in Horus config (log_staleness, 5 min threshold)
For non-asyncio services: any HTTP endpoint, port, or log file age is enough for Horus to detect and heal.
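A port probe, for instance, needs only the socket module. A minimal sketch of what a port_open check could look like (check_port_open is a hypothetical name, not Horus's actual code):

```python
import socket

def check_port_open(host: str, port: int, timeout: float = 3.0) -> tuple[bool, str]:
    """Healthy if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"port {port} accepting connections"
    except OSError as exc:
        return False, f"port {port} unreachable: {exc}"
```

This catches a dead container or daemon but not an unresponsive one, which is why HTTP health endpoints and log staleness remain the stronger checks where available.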
The bot lost 7 hours tonight. It won't lose 7 minutes again.
Horus is available at ~/Documents/Dev/horus/ if you're running the same Mac Mini infrastructure stack. The config is JSON, the daemon is pure Python stdlib, and the whole thing is under 300 lines.