LESSON 29

Shared Rate Limiting: When Five Processes Share One API Quota

Each process thinks it is alone. Each process has its own rate limiter. Combined, they blow past the quota. This is not a rate limiting problem — it is a coordination problem.


You build a blog autopilot that calls Anthropic's API. You build a smart-engage pipeline that calls Anthropic's API. You run OpenClaw, which calls Anthropic's API. You spawn Claude Code sessions that call Anthropic's API. Each one has its own rate limiter. Each one thinks it is the only process on the machine.

None of them are. And when all four fire within the same 60-second window — which they will, because cron jobs cluster and pipelines overlap — your combined request rate exceeds the provider quota. You start getting 429s. Your retries make it worse. Your costs spike because the retries that do succeed are now contending with the next scheduled batch.

This is not a rate limiting problem. This is a coordination problem.


The Multi-Process Collision

The failure mode is invisible until it hits. Each process operates correctly in isolation. Each process respects its own limits. The math only breaks when you add them together.

Consider a real production stack with five concurrent API consumers, each averaging around 50 requests per minute.

Each process might be well within its own limits. But the provider does not see five polite processes. The provider sees one API key making 250 requests per minute when the quota is 100. The 429 responses begin. And because each process retries independently, the retry traffic makes the contention worse.

WARNING

Per-process rate limiting is a false safety net. Five well-behaved processes can collectively produce a rate-limit disaster. The quota belongs to the API key, not to the process.

The Shared State Architecture

The fix is a single source of truth for rate consumption. Every process reads from and writes to one shared state file before making any API call.

{
  "anthropic": {
    "requests": [
      {"ts": 1709280001, "model": "claude-sonnet-4-6", "caller": "blog-autopilot"},
      {"ts": 1709280003, "model": "claude-haiku-4-6", "caller": "smart-engage"},
      {"ts": 1709280005, "model": "claude-opus-4-6", "caller": "claude-code"}
    ],
    "window_seconds": 60,
    "max_requests": 80,
    "last_429": null,
    "backoff_until": null
  }
}

The state file lives at ~/.config/anthropic-ratelimit/state.json. Every process follows the same protocol:

  1. Acquire file lock (fcntl exclusive advisory lock). Every cooperating process takes this lock before touching the state, so no two can run the read-check-write cycle at the same time.
  2. Read current state. Count requests within the rolling 60-second window.
  3. Check if under quota. If the rolling count is below max_requests, proceed. If not, calculate wait time and sleep.
  4. Check backoff. If a 429 was recently received and backoff_until is in the future, sleep until the backoff expires.
  5. Record the request — timestamp, model, and caller identity.
  6. Release file lock.
  7. Make the API call.

The file lock is the critical mechanism. Without it, two processes can read the same count, both decide they are under quota, and both fire — pushing the combined count over the limit. File locking makes the read-check-write sequence atomic across all processes on the machine.

  Rolling window: 60s. Requests older than 60s are pruned.
  Max requests: 80/min. Set below the provider quota (100) for a safety margin.
  Backoff multiplier: 2x. Exponential on consecutive 429s.

The Rolling Window

The state file uses a rolling window, not a fixed calendar window. This distinction matters.

A fixed window resets at a hard boundary — midnight, top of the minute, top of the hour. A burst of 80 requests at 11:59:59 followed by another burst at 12:00:01 sends 160 requests in two seconds and looks legal to a fixed-window limiter. The provider disagrees.

A rolling window tracks every request timestamp and counts how many fall within the last N seconds. There is no reset boundary to exploit. The count at any moment reflects the actual recent load.

When a process reads the state file, it first prunes entries older than window_seconds. What remains is the true current consumption. If that count exceeds max_requests, the process sleeps until enough entries age out to create headroom.
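A minimal sketch of that check, with `rolling_allowed` as a hypothetical helper; the window and limit values match the table above:

```python
def rolling_allowed(timestamps, now, window=60, limit=80):
    """True if one more request fits in the rolling window ending at `now`."""
    live = [ts for ts in timestamps if now - ts < window]  # prune aged entries
    return len(live) < limit

# The boundary exploit from the fixed-window example: 80 requests at t=59.9
# and 80 more at t=60.1. A fixed per-minute counter sees two legal windows;
# the rolling window sees 160 requests in the last 60 seconds and refuses more.
burst = [59.9] * 80 + [60.1] * 80
```

Run against `burst` at t=60.2, the rolling check correctly refuses the next request, where a fixed-window limiter would have allowed it.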

The 429 Backoff Protocol

When any process receives a 429 response, it writes the backoff state to the shared file:

{
  "last_429": 1709280120,
  "backoff_until": 1709280180,
  "consecutive_429s": 1
}

Now every process — not just the one that got the 429 — respects the backoff. This is the key difference from per-process backoff. When process A gets rate-limited, processes B through E also pause. They do not independently discover the same wall and pile more 429s onto the provider.

Consecutive 429s double the backoff period: 60 seconds, then 120, then 240. The escalation continues until a request succeeds, which resets the counter to zero. This is system-wide exponential backoff, not per-process backoff.
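The escalation maps directly onto the shared state fields. `record_429` and `record_success` are hypothetical helper names; the field names and timestamps match the JSON above:

```python
import time

BASE_BACKOFF = 60  # seconds; doubled for each consecutive 429

def record_429(state, now=None):
    """Write system-wide backoff state after any process receives a 429."""
    now = time.time() if now is None else now
    state["consecutive_429s"] = state.get("consecutive_429s", 0) + 1
    state["last_429"] = now
    # 60s for the first 429, then 120s, then 240s, ...
    state["backoff_until"] = now + BASE_BACKOFF * 2 ** (state["consecutive_429s"] - 1)
    return state

def record_success(state):
    """Any successful call resets the escalation for every process."""
    state["consecutive_429s"] = 0
    state["backoff_until"] = None
    return state
```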

DOCTRINE

A 429 response is not private information. It is a signal about the shared quota. When one process gets rate-limited, all processes must back off — because the quota they share is the same quota that is exhausted.

Model Routing as Rate-Limit Strategy

Lesson 9 covered model routing as a cost discipline. Here is the rate-limit angle.

Every provider has per-model rate limits. Anthropic limits Opus calls more aggressively than Sonnet calls, and Sonnet more than Haiku. If four processes are all routing to Sonnet, they saturate the Sonnet quota while the Flash and Haiku quotas sit unused.

Model routing is rate-limit arbitrage. Route the simple tasks — classification, extraction, formatting — to Flash, where the quota is generous and the cost is near zero. This preserves the Sonnet and Opus quota for the tasks that actually need it. The blog-autopilot gets its Sonnet calls. Claude Code gets its Opus calls. Neither is contending with smart-engage's 40 classification calls that should have been Flash all along.
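As a sketch, the routing decision can be a plain lookup table. The task names are illustrative assumptions, and the model IDs follow the ones used earlier in this lesson:

```python
# Cheap tier for simple tasks preserves Sonnet and Opus quota headroom.
ROUTES = {
    "classify": "gemini-2.0-flash",
    "extract":  "gemini-2.0-flash",
    "format":   "gemini-2.0-flash",
    "draft":    "claude-sonnet-4-6",
    "refactor": "claude-opus-4-6",
}

def route(task_type: str) -> str:
    # Default to the cheap tier so a new task type never silently burns Sonnet quota.
    return ROUTES.get(task_type, "gemini-2.0-flash")
```

Defaulting unknown tasks to the cheap tier is the conservative choice here: misrouting a hard task to Flash costs one bad answer, while misrouting forty classification calls to Sonnet costs quota for everyone.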

INSIGHT

Rate limiting and model routing are two sides of the same coin. Routing simple tasks to cheaper models does not just save money — it preserves quota headroom for the tasks that need expensive models. Every Flash call is a Sonnet call saved for later.

The Cost Explosion Pattern

Here is the bill scenario nobody plans for.

Four processes run daily, each making a moderate number of API calls. In isolation, each costs about $2/day. Total expected: $8/day, $240/month. Manageable.

But when they collide, when three cron jobs fire at the same time and trigger retries against the rate limit, the actual pattern is: 100 primary calls + 60 retried calls + 30 re-retried calls = 190 calls. The retries are not free. Every attempt counts against the shared quota, and the retries that finally succeed consume the same tokens a clean first attempt would have. A call that is retried twice occupies three request slots, and only one of those attempts produces useful output.
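The amplification is easy to reproduce. Assuming the retry rates implied by the numbers above (60% of primary calls retried once, half of those retried again):

```python
def total_calls(primary: int, retry_rate: float = 0.6, reretry_rate: float = 0.5) -> int:
    """Retry amplification: primary calls plus first and second retries."""
    retried = int(primary * retry_rate)       # e.g. 100 -> 60
    reretried = int(retried * reretry_rate)   # e.g. 60 -> 30
    return primary + retried + reretried      # e.g. 190
```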

Without shared rate limiting, the retry storm is invisible until the invoice arrives.

No battle plan survives contact with the enemy. The key is not the plan but the ability to adapt.

General Colin Powell · My American Journey

Your rate-limit plan will meet reality the moment your second cron job fires. The shared state file is your adaptation mechanism — it absorbs the collision and distributes the delay, so no process fights the provider wall alone.

Anti-Patterns

Per-process rate limiting. Each process tracks its own requests. Combined, they exceed the provider quota. This is the default behavior of every HTTP client with a built-in rate limiter. It works for single-process applications. It fails the moment you scale to two.

No rate limiting at all. The hope-based strategy: "our volume is low enough that we won't hit limits." This works until it doesn't — until a retry loop fires, a cron job overlaps, or you spawn a fourth Claude Code session during a complex refactor.

Retry without backoff. Getting a 429 and immediately retrying is not persistence. It is aggression. The provider returned 429 because you hit the limit. Retrying instantly hits the limit again, and each retry adds another request to the rolling window the provider is counting, keeping you pinned at the ceiling. Retry without backoff makes 429s longer, not shorter.

Ignoring Retry-After headers. When a provider returns Retry-After: 30, that is not a suggestion. It is the provider telling you exactly how long to wait. Ignoring it and retrying at your own cadence burns goodwill and risks account-level throttling.
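Honoring the header takes a few lines. This sketch assumes a dict-like `headers` object, and `retry_after_seconds` is a hypothetical helper name. Note that Retry-After may carry either delta-seconds or an HTTP-date; the sketch treats the date form as unparseable and falls back to its own default:

```python
def retry_after_seconds(headers, default_backoff=60.0):
    """Return how long to wait, preferring the provider's Retry-After header."""
    raw = headers.get("Retry-After") or headers.get("retry-after")
    if raw is None:
        return default_backoff
    try:
        return float(raw)         # delta-seconds form, e.g. "30"
    except ValueError:
        return default_backoff    # HTTP-date form; punt to our own backoff here

# Usage: time.sleep(retry_after_seconds(response.headers))
```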

SIGNAL

Treat the shared state file as infrastructure, not a nice-to-have. Every process that makes API calls must participate. A single uncoordinated process can blow the quota for everyone else.

Implementation Checklist

  1. Create the state directory: ~/.config/anthropic-ratelimit/
  2. Implement the lock-read-check-write-unlock cycle in a shared library that every API caller imports.
  3. Set max_requests at 80% of the provider's stated limit. Leave a 20% safety margin for request timing jitter.
  4. Log every rate-limit pause with the caller name, current count, and wait duration. This data drives future tuning.
  5. Add the state file path to your backup exclusions — it is ephemeral operational state, not persistent data.
  6. Monitor the state file's 429 counter. If it is non-zero more than once a week, your max_requests is set too high or your routing needs adjustment.

Lesson 29 Drill

List every process on your machine that makes API calls to the same provider. Count them. Multiply each process's average request rate by the number of processes. Compare that combined rate to the provider's stated quota.

If the combined rate exceeds 80% of the quota, you have a contention risk. Implement a shared state file this week. If the combined rate is below 80%, set a calendar reminder to re-check when you add the next automated pipeline — because the next pipeline is the one that will push you over.
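The drill arithmetic as a sketch, with entirely hypothetical per-process rates:

```python
def contention_risk(rates_per_min: dict, quota: int, margin: float = 0.8):
    """Sum per-process request rates and flag when they cross 80% of the quota."""
    combined = sum(rates_per_min.values())
    return combined, combined > margin * quota

# Hypothetical average rates for the processes named in this lesson
rates = {"blog-autopilot": 30, "smart-engage": 40, "openclaw": 15, "claude-code": 20}
```

At these made-up rates the combined load is 105 requests per minute against a quota of 100, which is well past the 80% threshold: a contention risk before a single retry fires.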

Bottom Line

Per-process rate limiting is a local solution to a global problem. The API quota does not belong to any single process — it belongs to the API key, which every process on your machine shares.

The shared state file — file-locked, rolling-window, system-wide backoff — is the coordination mechanism that prevents five well-behaved processes from collectively misbehaving. Build it before the 429 storm hits, because by the time you are debugging a retry avalanche at 2 AM, the damage is already on your invoice.
