Self-Healing Architecture: Systems That Fix Themselves at 3 AM
Monitoring that doesn't heal is just alerting. Alerting that requires a human to wake up is a design failure. Your systems should fix their own common failures — and only escalate the novel ones.
Lesson 10 taught you to detect failure. Detection is table stakes. If all your watchdog does is send a Discord alert at 3 AM, you have built a very expensive alarm clock — one that requires a human to wake up, read the alert, SSH in, diagnose the problem, and restart the service.
That is not a production system. That is an on-call rotation for a team of one.
The real architecture goes further: detect the failure, confirm it is real, fix it automatically, log what happened, and only escalate when the automated fix does not work. Self-healing is the difference between a system that tells you it broke and a system that fixes itself while you sleep.
The Problem: Hung Processes and Silent Degradation
The gateway incident happened at 3 AM on a Tuesday. The OpenClaw gateway process was alive — PID running, port open, memory allocated. But it had stopped processing requests. The event loop was hung on a deadlocked async operation. No crash. No exit code. No signal for launchd to act on.
launchd saw a running process and did nothing. That is correct behavior — launchd's contract is to restart processes that exit. A process that hangs is invisible to launchd. It occupies the same space as a healthy process: alive, consuming resources, doing nothing useful.
The gateway stayed hung for five minutes before Horus caught it. Horus was not checking if the process was alive — it was checking if the log file had been written to in the last five minutes. Log staleness exceeded the threshold. Horus killed the process with SIGTERM. launchd detected the exit and restarted it within seconds. The gateway was back and processing requests. Total automated recovery time: under 30 seconds. Humans involved: zero.
Monitoring that does not heal is just alerting. Alerting that requires a human to wake up at 3 AM is a design failure. Build the recovery action into the detection layer.
The Three-Layer Healing Architecture
Self-healing is not a single mechanism. It is a layered defense, where each layer catches what the layer below it cannot.
Layer 1: launchd KeepAlive. The process-level safety net. When a service crashes — segfault, unhandled exception, OOM kill — launchd detects the exit and restarts it within seconds. This is the fastest recovery layer. It is also the dumbest: it can only detect process death, not process dysfunction.
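The plist side of this layer is small. A minimal sketch, assuming a hypothetical `com.example.gateway` label and binary path (adapt both to your service):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.gateway</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/example-gateway</string>
    </array>
    <!-- Restart on any exit, crash or clean -->
    <key>KeepAlive</key>
    <true/>
    <!-- Minimum seconds between restarts; prevents tight crash loops -->
    <key>ThrottleInterval</key>
    <integer>10</integer>
</dict>
</plist>
```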
Layer 2: Horus Watchdog. The application-level health check. Horus does not care if the process is alive. It cares if the process is doing its job. Log staleness checks detect hung processes. HTTP checks detect unresponsive endpoints. Port checks detect services that stopped listening. Each check has a failure threshold and a cooldown. When recovery triggers, Horus sends SIGTERM and lets launchd handle the restart.
Layer 3: Sentinel External Monitor. The infrastructure-level guard. Sentinel watches Horus itself, the Docker stack, cross-service dependencies, and system-wide health. If Horus crashes, Sentinel catches it. If the entire Docker network goes down, Sentinel detects it. This layer runs on an independent notification path — if the primary alerting channel is compromised, Sentinel still reaches you.
Each layer catches what the layer below cannot see. launchd cannot detect a hung process. Horus cannot detect its own crash. Sentinel watches them both. This is defense in depth applied to operational infrastructure.
The Horus Config Pattern
Every monitored service in Horus has four properties that govern its healing behavior:
```json
{
  "gateway": {
    "check_type": "log_staleness",
    "log_path": "/Users/knox/.openclaw/logs/gateway.log",
    "max_stale_seconds": 300,
    "min_failures": 3,
    "cooldown_seconds": 120,
    "action": "kill",
    "pid_source": "pgrep -f 'openclaw gateway'"
  }
}
```
check_type defines how Horus evaluates health. Log staleness is the most versatile — a healthy process writes logs. A hung process stops writing. HTTP and port checks are more direct but less universal.
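The staleness check itself fits in a few lines. A minimal sketch of the idea, not Horus's actual implementation (the function name is illustrative):

```python
import os
import time


def log_is_stale(log_path: str, max_stale_seconds: int) -> bool:
    """Return True when the log file has not been written to recently.

    A healthy process keeps writing; a hung one goes quiet. A missing
    log file is treated as stale so the failure surfaces instead of hiding.
    """
    try:
        mtime = os.path.getmtime(log_path)
    except FileNotFoundError:
        return True
    return (time.time() - mtime) > max_stale_seconds
```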
min_failures is the noise filter. A single failed check means nothing — networks blip, disks stall, garbage collection pauses. Three consecutive failures is signal. This threshold prevents heal storms triggered by transient conditions.
cooldown_seconds is the anti-bounce mechanism. After a recovery action, Horus waits before checking again. Without this, you get the kill-restart-check-fail-kill loop that turns a minor issue into a cascading outage.
action is always specific and bounded. Kill the process. Not restart it — Horus delegates restart to launchd. Separation of concerns. Horus detects and terminates. launchd resurrects.
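The threshold-and-cooldown logic generalizes beyond Horus. A minimal sketch under the same rules, with the check and recovery actions injected as callables (class and method names are hypothetical):

```python
import time


class HealthMonitor:
    """Counts consecutive failures and enforces a post-recovery cooldown."""

    def __init__(self, check, recover, min_failures=3, cooldown_seconds=120):
        self.check = check              # returns True when healthy
        self.recover = recover          # e.g. send SIGTERM; launchd restarts
        self.min_failures = min_failures
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.cooldown_until = 0.0

    def tick(self, now=None):
        now = time.time() if now is None else now
        if now < self.cooldown_until:   # anti-bounce: skip checks after a heal
            return "cooldown"
        if self.check():
            self.failures = 0
            return "healthy"
        self.failures += 1              # noise filter: one failure is not signal
        if self.failures < self.min_failures:
            return "suspect"
        self.recover()
        self.failures = 0
        self.cooldown_until = now + self.cooldown_seconds
        return "recovered"
```

Note that recovery fires only on the `min_failures`-th consecutive failure, and the cooldown window then suppresses further checks until the service has had time to come back.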
Anti-Patterns That Will Burn You
Unbounded recovery loops. A service that crashes on startup, gets restarted by launchd, crashes again, gets restarted, and cycles forever. Without a backoff mechanism or a max-restart count, this loop consumes resources and fills logs without ever producing a healthy process. Horus's cooldown prevents this at Layer 2. At Layer 1, launchd's ThrottleInterval provides a minimum delay between restarts.
Heal storms. Five services depend on the gateway. Gateway goes down. Horus kills and restarts the gateway. During the 10-second restart window, the five dependent services all fail their health checks. Horus kills all five. launchd restarts all six simultaneously. They all race to reconnect, fail, and trigger another round. The recovery action itself has become the outage. The fix: dependency-aware cooldowns. Core services get shorter cooldowns. Dependent services get longer ones, giving the core service time to stabilize before dependents are checked.
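In config terms, dependency-aware cooldowns are just different numbers per service. A hypothetical fragment (service names and values are illustrative):

```json
{
  "gateway": {
    "check_type": "log_staleness",
    "cooldown_seconds": 120
  },
  "worker": {
    "check_type": "http",
    "cooldown_seconds": 300
  }
}
```

The dependent `worker` waits more than twice as long as the core `gateway`, so a gateway restart finishes before the worker's health is judged.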
Healing without logging. A service gets killed and restarted silently. It happens three times a day. Nobody knows because there is no record. The underlying bug — the actual cause of the hangs — never gets fixed because nobody knows it exists. Every healing action must produce a log entry: what was killed, why, what the check values were, and when the service came back.
If your system heals silently, you will never fix the root cause. Every recovery action must be logged. The log is what converts a band-aid into a diagnostic signal.
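A healing record only needs a handful of fields. One possible shape, emitting a single JSON line per action (the function name is illustrative):

```python
import json
import time


def log_heal_action(service, reason, check_value, threshold):
    """Build a structured record of a recovery action.

    One JSON line per heal makes "it restarted three times today"
    a grep away instead of an unknown.
    """
    entry = {
        "ts": time.time(),
        "event": "heal",
        "service": service,
        "reason": reason,
        "check_value": check_value,
        "threshold": threshold,
    }
    return json.dumps(entry)
```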
The asyncio Timeout Pattern
Most hangs in Python async services come from the same root cause: an await that never completes. A network call with no timeout. A database query against a locked table. An MCP tool invocation that hangs indefinitely.
The defensive pattern is simple and non-negotiable:
```python
result = await asyncio.wait_for(some_coroutine(), timeout=30)
```
Every external call gets a timeout. Every timeout gets an exception handler. Every exception handler logs what happened and moves on. The 30-second default is not arbitrary — it is long enough for any reasonable operation and short enough to prevent cascading stalls.
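In full defensive form, with the handler the rule calls for, the pattern looks roughly like this (the wrapper name and `default` parameter are illustrative, not part of asyncio):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def call_with_timeout(coro_factory, timeout=30, default=None):
    """Run an external call with a hard deadline; log and move on if it stalls."""
    try:
        return await asyncio.wait_for(coro_factory(), timeout=timeout)
    except asyncio.TimeoutError:
        logger.warning("External call timed out after %ss", timeout)
        return default
```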
This is Layer 0 — the in-process defense that prevents the hang before Horus needs to intervene. Defense in depth starts inside the application, not outside it.
Be polite, be professional, but have a plan to kill everybody you meet.
— General James Mattis · Call Sign Chaos
Every process in your stack should be polite and professional. But every process should also have a plan to kill itself if it stops functioning — and a supervisor that verifies it is still alive.
The In-Process Watchdog Thread
For long-running daemons, add a watchdog thread inside the application itself. The main event loop updates a heartbeat timestamp on every cycle. A background thread checks that timestamp every 60 seconds. If the heartbeat is stale — the main loop is hung — the watchdog thread calls os._exit(1), forcing a process exit that launchd can detect and recover from.
This converts a hang (invisible to launchd) into a crash (visible to launchd). It moves the detection from Layer 2 down to Layer 1, reducing recovery time from minutes to seconds.
```python
import logging
import os
import threading
import time

logger = logging.getLogger(__name__)


class InProcessWatchdog(threading.Thread):
    def __init__(self, max_stale=120):
        super().__init__(daemon=True)
        self.max_stale = max_stale
        self.last_heartbeat = time.time()

    def beat(self):
        self.last_heartbeat = time.time()

    def run(self):
        while True:
            time.sleep(60)
            if time.time() - self.last_heartbeat > self.max_stale:
                logger.critical("Heartbeat stale — forcing exit")
                os._exit(1)
```
The daemon=True flag ensures the watchdog thread does not prevent clean shutdown. The os._exit(1) bypasses Python's cleanup machinery — because if the event loop is hung, cleanup is likely hung too.
Putting It All Together
The complete self-healing stack for any persistent service:
- In-process: asyncio.wait_for() timeouts on every external call. Watchdog thread with heartbeat.
- Layer 1: launchd plist with KeepAlive: true. Process crashes are auto-recovered.
- Layer 2: Horus config with log staleness check, min_failures: 3, cooldown_seconds: 120. Hangs are detected and terminated.
- Layer 3: Sentinel monitors Horus itself and cross-service health. Infrastructure failures escalate to notification.
The goal is not zero failures. The goal is zero failures that require a human to wake up. Common failure modes get automated recovery. Novel failures get escalated with full context. The system handles what it can and asks for help only when it must.
Lesson 28 Drill
Pick the most critical persistent service in your stack — the one whose failure costs you the most.
Audit its current healing layers. Does it have in-process timeouts on external calls? Does launchd (or systemd, or Docker restart policies) handle crash recovery? Do you have an external health check that detects hangs, not just crashes?
For every gap you find, implement the corresponding layer this week. Start with Layer 1 if you have nothing — add KeepAlive: true to your plist. Then add asyncio.wait_for() timeouts. Then add an external staleness check with a failure threshold and cooldown.
One service. Three layers. Zero 3 AM wake-ups.
Bottom Line
Detection without healing is just observation. A system that sends you a 3 AM alert about a hung process — and then waits for you to fix it — is a system that treats your sleep as a deployment resource.
Build the healing into the detection layer. Layered recovery — in-process timeouts, launchd crash restart, Horus hang detection, Sentinel infrastructure monitoring — ensures that common failures are handled automatically and only novel failures require human judgment. The finish line is a system that fixes itself while you sleep.