The Bot Never Sleeps: Zero-Downtime Hot Reload for Live Trading Systems

Nearly every deployment guide assumes you can afford to stop the thing you're deploying to.

That assumption breaks the moment your system is generating revenue on a 5-second poll cycle, holding open positions across 16 active market slots, and running 24/7 on a server you can't babysit. The Polymarket bot — Foresight — runs BTC, ETH, SOL, XRP, DOGE, AVAX, LINK, and MATIC on 5-minute and 15-minute timeframes, placing bets on binary price direction markets. It does not care about your deployment window. The markets don't close. The positions don't pause. And a snipe window that triggers during a restart catches nothing but silence.

The conventional answer is blue-green deployment: spin up a parallel instance, cut traffic over, kill the old one. That works beautifully for stateless web services. It's architecturally illiterate for a trading bot with live WebSocket feeds, in-flight market evaluations, and shared state tracking which of your 16 slots are currently occupied.

So I built a different answer.

The Control File Pattern Solves the Wrong Problem First

The breakthrough wasn't some exotic Python interpreter trick. It was recognizing that the restart problem is actually two separate problems wearing the same coat.

Problem one: you need to change the bot's behavior without interrupting its event loop. Problem two: you need to communicate intent to a running process without killing it.

Most engineers go straight at problem one — hot-swapping modules, reloading Python files at runtime, reimporting strategy classes. That's the technically interesting problem, so that's where attention goes. But problem two is what creates the architecture. Solve it first and problem one becomes tractable.

The implementation I shipped in PR #124 uses a pair of shared JSON files: control.json and status.json. The bot's asyncio event loop polls control.json on every cycle. If it sees a reload flag, it triggers the reload sequence internally — no signals, no restarts, no process manager gymnastics. The Mission Control frontend writes to control.json via the backend router (bot_control.py), which exposes 8 endpoints for runtime control. The bot reads. The UI writes. The process boundary stays intact.
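A minimal sketch of the polling side, under stated assumptions: the `reload_id` field, the `read_control` and `run_bot` names, and the stand-in "scan" and "reload" events are hypothetical and not taken from PR #124. A monotonically increasing request id is one way to keep the bot strictly read-only on control.json, since it never has to clear a flag the UI wrote.

```python
import asyncio
import json
from pathlib import Path


def read_control(path: Path) -> dict:
    """Parse the control file; treat a missing or half-written file as empty."""
    try:
        data = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


async def run_bot(control_path: Path, cycles: int, poll_seconds: float = 5.0) -> list:
    """Run a fixed number of cycles, checking the control file after each scan."""
    events = []
    last_reload_id = 0  # highest reload request already handled (hypothetical field)
    for _ in range(cycles):
        events.append("scan")  # stand-in for the real market scanner poll
        requested = read_control(control_path).get("reload_id", 0)
        if isinstance(requested, int) and requested > last_reload_id:
            events.append("reload")  # stand-in for the real reload sequence
            last_reload_id = requested
        await asyncio.sleep(poll_seconds)
    return events
```

With this shape, the writer only ever bumps `reload_id`, and a request that arrives mid-scan is simply picked up at the end of that cycle.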

◈INSIGHT

The shared JSON approach trades elegance for operational clarity. Any engineer who can open a file can understand the control surface. That debuggability is worth more at 2am than any IPC mechanism.

This is the control file pattern — and it's not new. Unix daemons have been reading config files for state changes since before most engineers writing Python today were born. What's new is applying it deliberately to a financial system where the cost of getting the handshake wrong is a missed position, not a 404.

The Asyncio Event Loop Is Your Deployment Surface

Here's the head-fake: hot reload sounds like a deployment problem. It's actually a concurrency problem.

Foresight runs as a single asyncio event loop, with a 5-second market scanner poll and Binance WebSocket candle feeds running concurrently on it. The event loop is not just the runtime; it's the scheduling authority for everything the bot does. When a reload happens, you're not just swapping code. You're negotiating with a cooperative multitasking scheduler that only switches tasks at the points where they choose to yield.
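To make the cooperative part concrete, here is a toy sketch of two coroutines sharing one loop. The names `candle_feed`, `scanner`, and `demo` are hypothetical, and the feed is a plain list rather than a real WebSocket. Control changes hands only at an `await`, which is why the scanner sees the feed's writes appear one yield at a time.

```python
import asyncio


async def candle_feed(latest: dict, candles: list) -> None:
    """Stand-in for a WebSocket feed: publish one candle, then yield to the loop."""
    for symbol, close in candles:
        latest[symbol] = close
        await asyncio.sleep(0)  # the only point where another task can run


async def scanner(latest: dict, cycles: int, poll_seconds: float = 5.0) -> list:
    """Stand-in for the market scanner: snapshot shared state once per cycle."""
    snapshots = []
    for _ in range(cycles):
        snapshots.append(dict(latest))
        await asyncio.sleep(poll_seconds)
    return snapshots


async def demo(poll_seconds: float = 5.0) -> list:
    latest = {}
    feed = asyncio.create_task(candle_feed(latest, [("BTC", 1.0), ("ETH", 2.0)]))
    snapshots = await scanner(latest, cycles=3, poll_seconds=poll_seconds)
    await feed
    return snapshots
```

Nothing here is preempted. Each task runs until its next `await`, so any state swap has to be placed at a point where no other coroutine is partway through using that state.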

The naive implementation reloads on the next loop tick, which means you may reload mid-evaluation if a market assessment coroutine is in flight. Early versions had exactly this problem: the reload would complete, but reimporting a module does not update objects that already exist, so the strategy object the in-flight coroutine still referenced was orphaned. It was a ghost of the previous version, finishing its work against market data the new strategy code would handle differently.

The fix required treating the reload as a cooperative yield point — a designated moment in the control flow where the loop knows it is safe to swap state. After the market scanner completes its 5-second poll and before it schedules the next one, the loop checks for a reload flag. If set, it tears down the old strategy objects cleanly, reimports the updated modules, reconstructs the strategy instances, and resumes. No in-flight evaluations get orphaned. No positions get stranded mid-assessment.
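The sequence above can be sketched roughly as follows. This is an illustration under assumptions, not the shipped code: `run_loop`, `evaluate_market`, the `reload_requested` callback, and a strategy module exposing a `Strategy` class with an `evaluate` method are all hypothetical. The one real library call doing the work is `importlib.reload`, which re-executes a module's source and returns the module object.

```python
import asyncio
import importlib
from types import ModuleType
from typing import Callable


async def evaluate_market(strategy, market):
    """Stand-in for one market assessment that awaits I/O before deciding."""
    await asyncio.sleep(0)
    return strategy.evaluate(market)


async def run_loop(module: ModuleType, reload_requested: Callable[[], bool],
                   markets: list, cycles: int, poll_seconds: float = 5.0) -> list:
    strategy = module.Strategy()
    results = []
    for _ in range(cycles):
        # gather() returns only after every evaluation in this pass has finished.
        results.append(await asyncio.gather(
            *(evaluate_market(strategy, m) for m in markets)))
        # Safe point: nothing is in flight, so no coroutine can hold a stale strategy.
        if reload_requested():
            module = importlib.reload(module)  # re-execute the updated source
            strategy = module.Strategy()       # rebuild from the new class
        await asyncio.sleep(poll_seconds)
    return results
```

The check sits between the end of one pass and the sleep before the next, so the old instance is dropped only when no coroutine still depends on it.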

He who can handle the quickest rate of change survives.
— John Boyd · Patterns of Conflict

Boyd was talking about OODA loops in aerial combat. The principle maps exactly. The bot that can absorb a strategy update without breaking stride has a structural advantage over one that requires a maintenance window.

Position State Is the Only Thing That Cannot Be Reconstructed

Every other piece of system state — market data, candle feeds, the current signal evaluations — is ephemeral. Lose it on a reload and the next poll cycle reconstructs it from Binance and Polymarket API calls. No meaningful information is lost. The 5-second recovery is not a problem.

Open positions are different. The bot tracks which of its 16 slots are occupied, what was placed, at what odds, with what expected resolution window. If a reload nukes that state, the bot comes back up thinking it has 16 empty slots and immediately starts placing into markets it's already committed to. The result is position overlap — two bets on the same market direction at different points in the same window, compounding exposure that was never intended to compound.

The solution is status.json functioning as a persistence membrane — a lightweight checkpoint file that the bot writes to on every state change and reads from on every reload. Before any module reimport, the reload sequence reads the current status.json and holds those position references in memory. After reimport, it injects them back into the reconstructed strategy instance. The new code picks up mid-flight, with full awareness of what was already committed.
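A rough sketch of the checkpoint-and-restore idea, with the caveat that the `SlotBook` class, its method names, and the `positions` layout are hypothetical and the real status.json schema is not shown here. The write goes to a temporary file and is swapped into place with `os.replace`, an assumption on my part, so that a reader never sees a half-written checkpoint.

```python
import json
import os
import tempfile
from pathlib import Path


class SlotBook:
    """Hypothetical tracker mapping a slot key (e.g. "BTC-5m") to its open position."""

    def __init__(self, status_path, max_slots: int = 16):
        self.status_path = Path(status_path)
        self.max_slots = max_slots
        self.positions = {}

    def checkpoint(self) -> None:
        """Persist positions via temp file + atomic rename."""
        fd, tmp = tempfile.mkstemp(dir=self.status_path.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as fh:
            json.dump({"positions": self.positions}, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, self.status_path)

    def restore(self) -> None:
        """Load positions written before the reload; start empty if no file exists."""
        try:
            data = json.loads(self.status_path.read_text())
        except FileNotFoundError:
            data = {}
        self.positions = dict(data.get("positions", {}))

    def place(self, slot: str, position: dict) -> bool:
        """Record a position unless the slot is taken or every slot is in use."""
        if slot in self.positions or len(self.positions) >= self.max_slots:
            return False
        self.positions[slot] = position
        self.checkpoint()  # write on every state change
        return True
```

After a reload, a freshly constructed tracker that calls `restore()` refuses a second bet on a slot the previous code already filled, which is the position-overlap failure described above.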

Active market slots: tracked across reload cycles without position loss.

This is the design decision I would've gotten wrong on the first attempt if I hadn't spent time thinking through failure modes before writing code. The temptation is to treat reload as a fresh start — clean state, clean slate. But for a revenue-generating system, "clean state" is a bug disguised as a feature. The positions that existed before the reload are real money in real markets. The new code needs to inherit that context, not ignore it.

Deployment Velocity Is a Trading Edge

The generalized principle here extends well past trading bots.

Any system where downtime has asymmetric cost — where the loss from being offline exceeds the cost of operating on slightly stale code — needs to treat zero-downtime deployment as a first-class architectural requirement, not a DevOps afterthought bolted on at the end. The typical engineering team gets this backwards: they build the system, then figure out how to deploy it. The deployment constraints should shape the architecture from the first design session.

⚔DOCTRINE

The system that can update itself without stopping is fundamentally more resilient than the system that cannot — not because restarts are catastrophic, but because the discipline required to build hot reload forces you to model your own state correctly.

For Foresight specifically, the operational impact was immediate. Strategy updates that previously required a maintenance window — pausing the bot, deploying, restarting, watching it re-establish WebSocket connections — now ship in under 30 seconds with zero market exposure gap. The strategy audit findings from February 27 identified improvements to the snipe-window timing logic that would have been sitting in a staging branch waiting for a safe restart window. Instead, they went live the same session they were tested.

Deployment velocity is not a convenience metric. When your system is calling market direction at a 91.3% win rate across 100 live trades, every hour the improved strategy sits undeployed is optionality you're leaving on the table.

The deeper lesson from building this: the quality of your deployment architecture is a direct reflection of how well you understand your own system's state. You cannot build hot reload for a system whose state you haven't fully modeled. The constraint forces the clarity.
