
The Bot Never Sleeps: Hot Reload Architecture for Live Trading Systems

Most deployment strategies assume you can afford to restart. A live trading bot with open positions cannot. Here's the architecture that changed how we ship.

April 12, 2026

Killing a process is the easiest thing a deployment pipeline can do. It's also the most expensive thing you can do when that process is holding 16 active trading positions across six crypto assets with money on the line.

We merged PR #124 on March 2nd. The Polymarket bot has not required a single restart for a code update since. That's not a convenience feature — it's a correctness requirement. The architecture decision that made it possible reveals something fundamental about how stateful financial systems should be designed from the start.

Restart Is Not a Deployment Strategy — It's an Assumption

Every standard deployment pattern I've seen in production codebases assumes the same thing: the service can be killed cleanly, the new version boots, and the world continues. For stateless services — API handlers, webhooks, data processors — this assumption is harmless. The worst case is a few dropped requests and a momentary gap in metrics.

For a trading system, the assumption is catastrophically wrong.

The Polymarket bot runs a continuous asyncio event loop polling 16 market slots every 5 seconds across 5-minute and 15-minute timeframes. At any given moment, it may have entered a snipe window — a narrow timing gate where a position must be executed within seconds of a signal trigger or the entry edge disappears. A restart doesn't just pause execution. It loses context: which markets are in active evaluation, which signals are mid-confirmation, which positions were placed but haven't received their outcome callbacks yet.

WARNING

A restart during a snipe window doesn't just miss a trade. It orphans state. The bot comes back online with no memory of what it was doing, potentially double-entering markets that were already live or missing exits on open positions.

The conventional answer, "just build better restart logic with state persistence," solves the wrong problem. Writing application state to disk on shutdown and reloading it on boot is a recovery pattern, not a deployment pattern. It designs for failure instead of continuity. The distinction matters because a recovery pattern depends on a shutdown step that can itself fail: a missed hook or a partial write leaves nothing valid to reload. A continuity pattern has no shutdown step to get wrong, because the state it needs was never held only in process memory.

The Shared File Bus Is Not a Hack — It's an Interface Boundary

When we built the bot control panel in PR #81, the architectural choice that would later enable hot reload was the communication layer: instead of direct IPC, in-memory queues, or a message broker, we routed runtime control through two shared JSON files — control.json and status.json.

Eight backend endpoints write to control.json. The bot's main loop reads it. The bot writes execution state to status.json. The frontend reads that. This is a file-mediated control plane — and it reads like a step backward until you follow the consequence chain forward.

A shared file is not just a simple IPC mechanism. It's a persistence boundary that survives process restarts by definition. Any runtime state you write through this interface is inherently durable. When the new version of the bot process starts, it doesn't need a handoff protocol from the old process — it just reads the file that was already there. The state never lived in process memory to begin with.
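A minimal sketch of what a file-mediated control plane can look like. This is not the bot's actual code: the helper names and the keys inside the JSON payload are hypothetical, and the write-to-temp-then-rename step is a common technique assumed here so that a reader never sees a half-written file.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write payload so a reader sees either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename it over the target.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # atomic on POSIX within one filesystem
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_json(path: Path, default: dict) -> dict:
    """Return the file's contents, or a copy of default if it does not exist yet."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return dict(default)
```

A backend endpoint would call `write_json_atomic(Path("control.json"), {...})`, and any version of the bot process would call `read_json` on the same path. Neither side needs to know which process wrote the file.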

This is where the hot reload design comes from. It's not a clever trick applied after the fact. It's the natural consequence of choosing the right abstraction for runtime control at the start.

Operate inside the adversary's OODA loop — not by moving faster, but by making their observations irrelevant.

John Boyd · Patterns of Conflict

The "adversary" in a deployment context is the state discontinuity that kills services during updates. A file-mediated control plane makes process identity irrelevant. The new process doesn't need to know what the old process was doing — it reads the same ground truth the old process was writing to.

The Asyncio Event Loop Does Not Care Which Version You Are

The bot's runtime is a Python asyncio event loop with Binance WebSocket feeds and a 5-second market scanner. This architecture has a property that makes hot reload tractable: all meaningful state is external.

The Binance WebSocket feeds are reconnected on startup — the exchange holds the canonical candle data, not the bot. The market evaluation logic reads from control files and external price feeds each cycle. The 16 active slots are defined in configuration, not accumulated in memory over the process lifetime. When the new version starts, it doesn't need to reconstruct internal state because there is no internal state worth reconstructing.
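To make "no internal state worth reconstructing" concrete, here is a hedged sketch of a scan loop that re-reads its control file on every cycle. The `enabled_markets` key, the `evaluate` callback, and the `max_cycles` parameter (there only so the loop can be exercised in a test) are assumptions for illustration, not names from the bot.

```python
import asyncio
import json
from pathlib import Path
from typing import Awaitable, Callable, Optional


def load_control(path: Path) -> dict:
    """Read the control file; fall back to trading nothing if it is missing or invalid."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {"enabled_markets": []}


async def scan_loop(
    control_path: Path,
    evaluate: Callable[[str], Awaitable[None]],
    interval: float = 5.0,
    max_cycles: Optional[int] = None,
) -> None:
    """Evaluate every enabled market once per cycle, re-reading control each time."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        # Ground truth comes from the file, not from anything remembered
        # by an earlier cycle or an earlier process.
        control = load_control(control_path)
        for market in control.get("enabled_markets", []):
            await evaluate(market)
        cycles += 1
        await asyncio.sleep(interval)
```

Because each cycle starts from the file, a freshly booted process behaves the same on its first cycle as a long-running one does on its thousandth.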

INSIGHT

This is the deep design principle: a system that can be hot-reloaded was designed to treat its own process memory as a cache, not a source of truth. If your process is the source of truth for anything critical, you have already paid the restart tax — you just haven't been billed yet.

The actual hot reload mechanism in PR #124 works because of this property. The file watcher detects source changes, signals the process manager, and the new version boots while the old version completes its current evaluation cycle. Because no critical state lives exclusively in-process, the transition window is just one 5-second poll interval. No positions are orphaned. No signals are lost mid-confirmation. The bot running on Tesseract 24/7 continues executing without a gap.
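The "old version completes its current evaluation cycle" step can be sketched as a drain loop. This is an illustration under stated assumptions rather than the code from PR #124: it assumes the process manager sends SIGTERM to the outgoing process, and `run_until_drained` and `run_cycle` are hypothetical names.

```python
import asyncio
import signal
from typing import Awaitable, Callable


async def run_until_drained(
    run_cycle: Callable[[], Awaitable[None]],
    stop: asyncio.Event,
    interval: float = 5.0,
) -> int:
    """Run cycles until stop is set; a cycle already in progress always finishes."""
    completed = 0
    while not stop.is_set():
        await run_cycle()  # the stop flag is only checked between cycles
        completed += 1
        try:
            # Wait for the next cycle, but wake early if a stop was requested.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
    return completed


async def main(run_cycle: Callable[[], Awaitable[None]]) -> None:
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    # Assumption: the process manager signals the old version with SIGTERM.
    loop.add_signal_handler(signal.SIGTERM, stop.set)  # Unix event loops only
    await run_until_drained(run_cycle, stop)
```

The design choice is that the signal handler only sets a flag. It never cancels the cycle, so an evaluation that has started always runs to completion before the process exits.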

The implementation is not complex. What's complex is the prior architectural discipline that makes the implementation possible.

What This Means for Any Stateful Financial Service

The pattern generalizes beyond trading bots. Any system handling real financial transactions — order management, position sizing engines, settlement processors — carries the same restart risk. The standard advice is "design for graceful shutdown." That's necessary but insufficient.

Graceful shutdown assumes you control the timing of the shutdown. In production, you often don't. Infrastructure failures, OOM kills, forced deploys under pressure — these don't honor shutdown hooks. The architecture needs to be safe even when the shutdown is ungraceful.

The design principle that survives this: externalize any state that must survive a process boundary. This is not the same as "persist everything." It means identifying exactly which pieces of state are semantically required for correctness across a restart boundary, and making those pieces live somewhere that isn't your process heap.

Bot uptime since PR #124: 41 days. Zero restarts required for code updates.

For the Polymarket bot, that list is short: which markets are enabled, the current control flags, and the last known position outcomes. Everything else — signal calculations, candle buffers, evaluation state — is either recomputable from external feeds in one poll cycle or cheap enough to lose. The file-mediated control plane handles the short list. The rest rebuilds itself.
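One way to keep that split honest is to encode it in types. The sketch below is hypothetical: the class and field names are invented for illustration and only mirror the categories named above, with the durable list serialized to JSON and the scratch state created empty on every boot.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DurableState:
    """The short list that must survive a process boundary."""
    enabled_markets: list[str] = field(default_factory=list)
    control_flags: dict[str, bool] = field(default_factory=dict)
    last_outcomes: dict[str, str] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "DurableState":
        return cls(**json.loads(raw))


@dataclass
class ScratchState:
    """Recomputable from external feeds; never written to disk."""
    candle_buffers: dict[str, list[float]] = field(default_factory=dict)
    signals: dict[str, float] = field(default_factory=dict)


def boot(raw_durable: str) -> tuple[DurableState, ScratchState]:
    """Restore the durable list and start with empty scratch state."""
    return DurableState.from_json(raw_durable), ScratchState()
```

Anything that someone later wants to add to `ScratchState` but cannot afford to lose is a signal that it belongs on the durable side instead.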

The 91.3% win rate across live trades is a strategy metric. The zero-downtime deploy is an infrastructure guarantee. Both matter. The second one is what lets you keep improving the first one without ever forcing a choice between shipping and staying live.

Hot reload isn't about being clever with deployment tooling. It's about having already made the right architectural call before you ever needed it.
