My Infrastructure Writes Its Own Post-Mortems While I Sleep

The Seven-Hour Wake-Up Call

I woke up to a bot that had been silently dead for seven hours.

No crash. No alert. No error email. The process was technically alive: launchd showed it running, the port was bound, the health endpoint returned 200. From the outside, everything looked fine. From the inside, the bot had frozen mid-cycle: no trades executed, no positions updated, no signals evaluated. Seven hours of market time I'll never get back.

A stream of light frozen mid-flow, suspended motionless: a process alive on the outside but completely stalled within.

That incident was the most expensive lesson I've bought in years. Not because of the losses — the position sizing kept the damage manageable. Because of what it revealed: I had built a monitoring layer that measured the wrong thing.

I was checking liveness. I should have been checking value production.

Those are not the same thing. A process can be alive and producing exactly zero value. It can pass every health check you throw at it and still be useless. The gap between "is it up?" and "is it working?" is where operational failures live — and most engineers never build monitoring that actually closes it.

That seven-hour outage is what birthed everything I'm about to describe. It's also why I no longer trust a green status indicator unless I know exactly what that green is measuring.

What "Healthy" Actually Means

Let me be precise about the failure mode, because it's subtle and it's common.

A standard health check looks like this: ping an endpoint, get a 200, mark the service green. Or simpler: check if the process is running, mark it green. Both of those checks answer the question "did this thing start?" They do not answer "is this thing doing its job?"

Liveness is not health. Liveness is a precondition for health.

The distinction matters enormously in a production system running unattended. My trading infrastructure processes market data, writes position records to a database, emits signal evaluations with timestamps, updates trade logs. All of those are downstream artifacts — things that should exist and should be growing if the system is healthy.

A real health check asks: when did we last write a row? When did the last signal fire? Is the artifact count increasing at the rate it should be? If the last write or the last signal was over an hour ago, or the artifact count has stopped growing, something is wrong, regardless of what the process monitor says.

This is the principle I now build every health check around: prove downstream artifact growth, not upstream process liveness. It's a harder check to write. It requires knowing what your system should produce and at what rate. But it's the only check that catches the failure mode I lived through.
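A minimal sketch of an artifact-growth check, assuming the service writes timestamped rows to a SQLite database. The table name, the column name, and the one-hour threshold are all hypothetical placeholders, not the schema my bots actually use:

```python
import sqlite3
import time


def seconds_since_last_row(conn, table, ts_column, now=None):
    """Return the age in seconds of the newest row, or None if the table is empty.

    Assumes ts_column stores Unix timestamps. Table and column names are
    interpolated into the SQL, so they must come from trusted config.
    """
    now = time.time() if now is None else now
    row = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
    if row[0] is None:
        return None
    return now - float(row[0])


def is_producing(conn, table, ts_column, max_age_seconds, now=None):
    """True only if a row was written within the last max_age_seconds."""
    age = seconds_since_last_row(conn, table, ts_column, now=now)
    return age is not None and age <= max_age_seconds
```

A call such as `is_producing(conn, "trades", "created_at", 3600)` says nothing about whether the process is up. It only reports whether the thing the process exists to produce is still being produced, which is the point.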

Horus: The Watchdog That Fixes What It Finds

After the seven-hour incident, I built a watchdog. I named it Horus — the Egyptian god with the all-seeing eye. The name felt right.

A warm pulse of light surging back through a darkened conduit, re-igniting it end to end: kill and restart.

Horus is deliberately simple. It does three things:

Log staleness check. It reads the last-modified timestamp on a service's log file. If that timestamp hasn't moved in the configured threshold — say, 10 minutes for an active trading bot — something is wrong.
HTTP health check. It pings the service's health endpoint and validates the response, including response body content where meaningful.
Port liveness check. It confirms the service is bound to its expected port.

If any of those checks fails, Horus kills the process and restarts it. No waiting. No retrying. Kill, restart, log. The assumption is that a stale process is a dead process, and the correct response is to start fresh.
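The three checks and the kill step can be sketched roughly as follows. This is an illustration under assumptions, not Horus's actual source: the function names, the default 600-second threshold, and the injectable `kill` parameter are all hypothetical. The sketch sends SIGTERM and relies on launchd's KeepAlive to bring the process back.

```python
import http.client
import os
import signal
import socket
import time
import urllib.request


def log_is_stale(log_path, max_age_seconds, now=None):
    """True if the log is missing or was last modified before the threshold."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(log_path)
    except OSError:
        return True
    return (now - mtime) > max_age_seconds


def http_is_healthy(url, expected_text=None, timeout=5.0):
    """True if the endpoint returns 200 and, optionally, the expected body text."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            if resp.status != 200:
                return False
            return expected_text is None or expected_text in body
    except (OSError, ValueError, http.client.HTTPException):
        return False


def port_is_open(host, port, timeout=2.0):
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(pid, log_path, health_url, host, port,
               max_log_age=600, kill=os.kill):
    """Run all three checks; on any failure, SIGTERM the process.

    Returns the names of the failed checks so the caller can log them.
    """
    failed = []
    if log_is_stale(log_path, max_log_age):
        failed.append("log_staleness")
    if not http_is_healthy(health_url):
        failed.append("http_health")
    if not port_is_open(host, port):
        failed.append("port_liveness")
    if failed:
        kill(pid, signal.SIGTERM)
    return failed
```

Of the three, only `log_is_stale` can catch a frozen process. The other two pass happily while the service does nothing, which is why a supervisor that only watches for exits is not enough.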

This sounds aggressive. It is. And it's the right call for a system where the cost of a false-positive restart (a few seconds of downtime) is far lower than the cost of a false-negative (another seven-hour outage).

The restart is not the clever part. Any process supervisor can restart a crashed service. The clever part is the staleness detection — catching the case where the process hasn't crashed but has stopped working. That's the gap that launchd's KeepAlive=true alone cannot fill. KeepAlive restarts a process that exits. It cannot detect a process that is running and producing nothing.

Horus fills that gap.

Watchdog service self-healing loop: continuous monitor, detect, kill, restart, verify, document.

1. Monitor: log staleness, HTTP health, port liveness.
2. Detect stale: process alive but frozen, no downstream growth.
3. Kill: SIGTERM the hung process.
4. Restart: launchd KeepAlive brings it back.
5. Verify healthy: confirm real work resumes (rows/trades, not just 200 OK).
6. Write post-mortem: what failed, root cause, recovery.
7. File to semantic memory: memory write, so the fleet gets smarter.

Then back to Monitor.

Liveness is not health. The watchdog service checks downstream artifact growth, heals, and documents, while I sleep.

The launchd Layer: Where macOS Reliability Lives

Everything in my fleet runs as a launchd plist with KeepAlive set to true. If you're building persistent services on macOS and you're not using launchd, you're building on sand.

But launchd has quirks that will burn you if you don't know them.

The PATH problem. A launchd job does not inherit your shell's PATH. If your service calls python3 or docker and those live in /opt/homebrew/bin, launchd cannot find them unless you explicitly set the PATH in the plist's EnvironmentVariables dictionary. I have broken a service by forgetting this. The symptom is confusing: the plist loads, the job starts, and then it silently fails because a binary it depends on isn't on the path. The fix is always the same — add /opt/homebrew/bin to the front of PATH in the plist.
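One way to make the PATH entry hard to forget is to generate the plist rather than hand-edit it. A minimal sketch using Python's standard `plistlib`; the label and program arguments in the usage note are hypothetical, and the default PATH shown is an assumption about a typical Homebrew-on-Apple-Silicon layout:

```python
import plistlib

# Assumed default: Homebrew first, then the standard system directories.
DEFAULT_PATH = "/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin"


def build_plist(label, program_arguments, extra_env=None):
    """Return XML plist bytes for a KeepAlive job with PATH set explicitly."""
    env = {"PATH": DEFAULT_PATH}
    env.update(extra_env or {})
    return plistlib.dumps({
        "Label": label,
        "ProgramArguments": list(program_arguments),
        "KeepAlive": True,
        "EnvironmentVariables": env,
    })
```

Calling `build_plist("com.example.trading-bot", ["/opt/homebrew/bin/python3", "bot.py"])` yields bytes that can be written to a `.plist` file. `Label`, `ProgramArguments`, `KeepAlive`, and `EnvironmentVariables` are standard launchd keys; everything else here is illustrative.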

The TCC problem. macOS's Transparency, Consent, and Control framework will block a launchd service from accessing files in locations like Documents or Downloads unless the backing executable has been granted access. The shell you use in the plist matters: I always specify /opt/homebrew/bin/bash rather than /bin/bash, because the Homebrew bash has consistent TCC behavior in my stack.

The environment variable problem. When you migrate a service from Docker Compose to launchd, every environment variable that was in your docker-compose.yml environment block must be manually transferred to the plist's EnvironmentVariables dict. Compose sets them automatically. launchd does not. I have migrated services and watched them fail silently for hours because a single required env var didn't make the trip. This is now a mandatory checklist item: before any launchd service is declared healthy post-migration, audit every env var the service reads and verify it exists in the plist.
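The audit itself is easy to automate. A small sketch, assuming you keep a list of the variable names the service reads (for example, copied by hand from the old docker-compose.yml environment block). The function name and the choice to treat empty values as missing are my assumptions for illustration:

```python
import plistlib


def missing_env_vars(plist_bytes, required):
    """Return the required variable names that are absent or empty in the plist."""
    plist = plistlib.loads(plist_bytes)
    defined = plist.get("EnvironmentVariables", {})
    return sorted(name for name in required if not defined.get(name))
```

Run it against the contents of the plist file before declaring the migration done. An empty list means every variable made the trip; anything else is the name of the variable that would have cost you hours.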

These are the kinds of operational details that don't appear in any tutorial. They appear in post-mortems. Which brings me to the more interesting part of this system.

Invictus Sentinel: The Fleet-Level Intelligence Layer

Horus is a watchdog. Invictus Sentinel is something more ambitious.

Where Horus focuses on individual service health with simple restart logic, Sentinel operates across the entire fleet — 50+ monitors spanning trading bots, content pipelines, Discord integrations, data ingest services, the AI memory system, and the API layer. It correlates signals across services, tracks incident history, and does something I have not seen described anywhere in the standard SRE literature at the solo-operator scale:

It writes its own post-mortems.

When Sentinel detects and resolves an incident, it doesn't just log "service X restarted at timestamp Y." It constructs an analysis. What failed. What the probable root cause was, based on what it knows about the service's dependencies and recent change history. What it did to recover. What the recovery timeline looked like. Whether the same failure pattern has appeared before.

That analysis gets filed to Akashic — my semantic memory system that indexes operational knowledge across the entire stack. So the next time Sentinel encounters a similar failure pattern, it has context. The system doesn't just fix the same problem twice. It recognizes that it's the same problem and adjusts its response accordingly.
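To make the shape of that record concrete, here is a deliberately simplified sketch. It is not Sentinel's or Akashic's actual implementation: the `PostMortem` class, its field names, and the `failure_signature` label are hypothetical, and the sketch swaps semantic search for an exact match on service and signature.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PostMortem:
    service: str
    detected_at: datetime
    recovered_at: datetime
    failure_signature: str      # hypothetical short label, e.g. "log_staleness"
    probable_root_cause: str
    recovery_action: str

    @property
    def recovery_seconds(self):
        """Time from detection to confirmed recovery, in seconds."""
        return (self.recovered_at - self.detected_at).total_seconds()


def prior_matches(history, current, window_days=90):
    """Return earlier incidents on the same service with the same signature.

    Exact matching stands in for the semantic lookup described above.
    """
    cutoff = current.detected_at - timedelta(days=window_days)
    return [
        pm for pm in history
        if pm.service == current.service
        and pm.failure_signature == current.failure_signature
        and cutoff <= pm.detected_at < current.detected_at
    ]
```

Even this crude version answers the question that matters: is this failure new, or is it the third time this quarter?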

This is the compounding effect I care about most. Every incident makes the fleet smarter. Not because I sat down and wrote a runbook. Because the system wrote it.

The Operational Calculus of Running Solo

I want to be direct about why all of this matters, because it's easy to read this as infrastructure nerdery and miss the point.

I run this entire operation alone. No SRE team. No on-call rotation. No NOC. Just me, a fleet of always-on services, and a lot of automation.

The systems that run here include trading bots executing real money on Polymarket prediction markets and crypto perpetual futures, content pipelines generating and publishing articles, Discord bots managing community channels, AI agents processing incoming requests around the clock. Any one of these failing silently costs me — in money, in reputation, in user trust, in the compounding effects of a data gap that can't be backfilled.

The only way a solo operator runs a fleet like this is if the fleet can run itself. Not partially. Not "mostly." Actually run itself — detect its own failures, recover from them, document them, and hand me a summary when I wake up.

That's not a nice-to-have. It's the architectural prerequisite for everything else I'm trying to build. Without self-healing infrastructure, every ambitious project I start becomes a liability the moment I step away from the keyboard.

The self-healing layer is what buys me the operational slack to build new things. Every hour I'm not debugging a stale process is an hour I can spend on the next project.

What the Post-Mortem Looks Like

Here's a concrete example of what Sentinel produces after an incident.

A content pipeline service recently went stale — it was running, passing HTTP health checks, but had not produced any output files in over two hours. Horus's staleness check caught it. Sentinel received the event, killed and restarted the service, confirmed the restart via artifact growth (new files appeared within three minutes), and then generated this analysis:

Service: content-pipeline-v2
Failure detected: 03:14 UTC, log staleness exceeded 120-minute threshold
Process state at detection: Running, port bound, HTTP 200
Root cause (probable): Upstream API rate limit reached without graceful backoff; service entered infinite retry wait loop without logging heartbeat
Recovery action: SIGKILL + launchd restart
Recovery confirmed: 03:17 UTC, artifact count resumed growth (3 new files in 4 minutes)
Prior incidents matching this pattern: 2 in last 90 days (same failure mode)
Recommendation: Add explicit rate-limit detection and exponential backoff with jitter; log a heartbeat every 60 seconds regardless of work state

That recommendation went into Akashic. When the developer context for that service is loaded next session, the recommendation is surfaced automatically as part of the knowledge about how that service behaves. It doesn't require anyone to remember the incident or find the log file.
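For reference, the fix that recommendation describes might look something like this. It is a sketch, not the patch that was applied to the service: `RateLimited`, the retry limits, and the logger name are hypothetical, and the jitter scheme shown is the common "full jitter" variant.

```python
import logging
import random
import time


class RateLimited(Exception):
    """Hypothetical exception the upstream client raises on a rate-limit response."""


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full-jitter backoff: a random delay below min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(fn, max_attempts=6, base=1.0, cap=60.0,
                      sleep=time.sleep, log=logging.getLogger("heartbeat")):
    """Call fn, retrying on RateLimited with bounded, jittered waits.

    A heartbeat line is logged before every wait, so the log keeps moving
    even when no real work is getting done.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up loudly instead of waiting forever
            delay = backoff_delay(attempt, base, cap)
            log.info("heartbeat: rate limited, attempt %d, sleeping %.1fs",
                     attempt + 1, delay)
            sleep(delay)
```

Two properties matter here. The retries are bounded, so the service fails visibly rather than waiting indefinitely. And because each wait is capped at 60 seconds with a log line before it, a log-staleness check sees activity during backoff instead of mistaking it for a freeze.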

The infrastructure is training itself.

The Principle That Ties This Together

I've been building toward a single architectural thesis for the better part of a year, and the self-healing layer is where it becomes most visible:

A system that can only be healthy when you're watching it is not a system. It's a job.

Jobs don't scale. Systems do.

The difference between a job and a system is not the technology stack. It's whether the system can maintain its own invariants without human intervention. Can it detect when something has gone wrong? Can it recover without waking you up? Can it explain what happened so you can make it better?

Horus answers the first two questions. Sentinel answers the third.

And the depth of the answer matters. Logging "service restarted" is table stakes. Writing an analysis of why it failed, what the recovery timeline was, and whether this is a repeat pattern — that's the difference between infrastructure that records history and infrastructure that learns from it.

The seven-hour outage taught me that liveness is a lie if you're not also measuring production. Building Horus taught me that restart logic is only half the battle — you have to catch the silent failures too. Building Sentinel taught me that the most valuable thing a monitoring system can do is not just fix the problem but document the lesson.

Every incident is a lesson. The question is whether you capture it.

I now have a system that captures every one of them, files them semantically, and surfaces the relevant ones at the start of every session. My infrastructure is not just self-healing. It's self-improving.

That is the only architecture that makes sense for a solo operator running a production fleet. And it's the only reason I can build new things without the existing things falling apart.

Build infrastructure that heals itself. Then build infrastructure that explains itself. The second is harder than the first, and it matters more.