Engineering

Thoth: How I Built an Automated Documentation System That Caught Up 455 PRs in One Night

47 repos. 455 merged PRs. 24 knowledge base docs generated automatically. Documentation doesn't drift when a god of knowledge is watching.

February 28, 2026
8 min read
#building-in-public #documentation #automation

Nobody wakes up excited to write documentation.

You merge the PR. The tests pass. CI goes green. You move on to the next feature. Meanwhile, the knowledge base — that collection of markdown files that supposedly describes what your system does — falls another commit behind. Multiply this across 47 repositories and 72 services, and you don't have a documentation problem. You have a documentation black hole.

I named the service that fixes this Thoth — after the Egyptian god of knowledge, writing, and record-keeping. Because if documentation is going to stay current, something has to be keeping the books. And it can't be you.

The Problem: Silent Drift at Scale

Here's what I discovered when I audited my knowledge base in late February 2026:

| Metric | Value | Detail |
| --- | --- | --- |
| Repos in the Org | 47 | Invictus-Labs GitHub |
| Apps in Portfolio | 72 | across 7 categories |
| KB Docs on Disk | 13 | covering 18% of the ecosystem |
| Broken References | 15 | dead links in portfolio.json |

Thirteen documents for seventy-two services. That's an 82% documentation gap. And it wasn't because nobody cared — every PR we ship has a structured body with Summary, Architecture, Test Plan, and a CodeRabbit walkthrough. The raw material for documentation was already being written. It was just trapped inside GitHub PR descriptions, where it would never surface again after the merge.

WARNING

The paradox: We were writing more documentation than ever — inside PRs. But none of it was reaching the knowledge base. The act of merging was burying the docs.

The knowledge base is supposed to be the canonical source of truth. When a new team member — or a future version of yourself — needs to understand what polymarket-bot does, they shouldn't have to dig through 92 closed PRs to piece it together. That's not documentation. That's archaeology.

The Design: No LLM Required

The first instinct when automating documentation is to throw an LLM at it: feed it the code, ask it to describe what it does. But running 7 AI agents 24/7 has taught me that LLMs are expensive, slow, and prone to hallucination. In this case, they're also unnecessary.

Why? Because the PR bodies already contain exactly what the documentation needs:

  • Summary sections with bullet points describing changes
  • Architecture tables with file paths, additions, deletions
  • Test plan checklists verifying behavior
  • CodeRabbit walkthroughs explaining every file change

All Thoth needs to do is extract, template, and publish. No generation. No interpretation. Pure structured text transformation.
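The extract step can be sketched as a small section parser. This is a minimal illustration, not Thoth's actual code; it assumes the structured PR template described above (`## Summary`, `## Architecture`, and so on):

```python
import re

def extract_section(pr_body: str, heading: str) -> str:
    """Pull the text under a '## <heading>' up to the next '## ' heading.

    Sketch only: assumes the structured PR template (## Summary, ## Test Plan)
    described in the post.
    """
    pattern = rf"^##\s+{re.escape(heading)}\s*\n(.*?)(?=^##\s|\Z)"
    match = re.search(pattern, pr_body, re.MULTILINE | re.DOTALL)
    return match.group(1).strip() if match else ""

body = "## Summary\n- Add retry logic\n\n## Test Plan\n- [x] unit tests"
print(extract_section(body, "Summary"))  # → - Add retry logic
```

Because the input is template-shaped, a regex over headings is enough; there is nothing here for a model to "understand."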

DOCTRINE

Design principle: If the raw material is already structured, don't generate — transform. Save the LLM credits for problems that actually require intelligence.

The Architecture: Four-Stage Pipeline

Thoth runs as a daily cron job at 11:30 PM ET, after the day's PRs have landed. Four stages, each with a single responsibility:

THOTH — DOCUMENTATION SYNC PIPELINE
Daily cron · 11:30 PM ET · No LLM required

  • Input: GitHub · Invictus-Labs — 47 repos · merged PRs · gh CLI
  • Stage 01 — Scanner: gh pr list · 5s timeout · skip archived
  • Stage 02 — Analyzer: classify PRs · ≥50 adds · no .md changes
  • Stage 03 — Generator: extract + template · parse body · sanitize tables
  • Stage 04 — Publisher: gh pr create · max 5 per run · human review
  • Output: Knowledge Base — kb/projects/*.md · doc PRs
  • Output: report.json — scan results · KB health metrics → Mission Control /doc-health panel · stat cards

Stage 1: Scanner

Hits the GitHub API via gh CLI across all 47 Invictus-Labs repos. For each repo, it lists PRs merged on the target date, pulling the full JSON payload — title, body, files changed, additions, deletions, merge timestamp. Skip archived repos. 5-second timeout per call. The scanner doesn't make decisions; it just collects.
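The scanner loop amounts to a thin wrapper around the gh CLI. A sketch, where the `--json` field list and the `merged:` search qualifier are my assumptions about the exact invocation:

```python
import json
import subprocess

def build_scan_cmd(repo: str, date: str) -> list[str]:
    """Build the gh CLI call for one repo (field list is illustrative)."""
    return [
        "gh", "pr", "list",
        "--repo", repo,
        "--state", "merged",
        "--search", f"merged:{date}",
        "--json", "number,title,body,additions,deletions,mergedAt",
    ]

def scan_repo(repo: str, date: str) -> list[dict]:
    """Collect merged PRs for one repo: 5s timeout, empty list on failure.

    The scanner never raises mid-sweep; one slow repo must not kill the run.
    """
    try:
        out = subprocess.run(
            build_scan_cmd(repo, date),
            capture_output=True, text=True, timeout=5, check=True,
        )
        return json.loads(out.stdout)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return []
```

Swallowing failures here is deliberate: stage 1 only collects, so a repo that times out simply contributes nothing to that night's report.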

Stage 2: Analyzer

Classifies each PR. The logic is deliberately simple:

  • Needs docs if: more than 50 additions AND no documentation files (.md) already changed in the PR
  • High priority: feat: prefix or 200+ additions (new capabilities)
  • Medium priority: Bug fixes with root cause analysis, significant refactors
  • Skip: PRs under 20 additions, Dependabot bumps, PRs that already touch docs

The analyzer also maps each repo to its existing knowledge base document (if one exists) or determines the target path for a new one. No ambiguity — the output is a list of PRs with needs_docs: true and a target file path.
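The classification rules above fit in one small function. A sketch with illustrative field names (`additions`, `title`, `files`, `author` are my assumptions about the scanner's JSON shape):

```python
def classify(pr: dict) -> dict:
    """Apply the analyzer rules from the post; field names are illustrative."""
    adds = pr.get("additions", 0)
    touches_docs = any(f.endswith(".md") for f in pr.get("files", []))

    # Skip: tiny PRs, Dependabot bumps, PRs that already touch docs
    if adds < 20 or pr.get("author") == "dependabot" or touches_docs:
        return {"needs_docs": False, "priority": None}

    # Needs docs: >50 additions and no .md changed (checked above)
    if adds <= 50:
        return {"needs_docs": False, "priority": None}

    # High: new capability (feat: prefix or 200+ additions); else medium
    high = pr.get("title", "").startswith("feat:") or adds >= 200
    return {"needs_docs": True, "priority": "high" if high else "medium"}
```

The point of keeping the rules this dumb is auditability: when a PR is misclassified, the reason is a threshold you can read, not a prompt you have to debug.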

Stage 3: Generator

The pure text transformation engine. For each undocumented PR:

  1. Parse the PR body — extract ## Summary, architecture tables, test plans
  2. If a KB doc already exists → append a new entry to the ## Recent Changes table
  3. If no KB doc exists → generate one from the template: Overview, Status, Location, Architecture, Recent Changes

Every piece of text destined for a markdown table gets sanitized — whitespace collapsed, pipes escaped, truncated to 120 characters. This is a lesson I learned the hard way when multi-line bullet lists from PR summaries broke table rendering across an entire page.
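The append path (step 2) can be sketched as a line-level insert. The `## Recent Changes` heading comes from the template named above, but the exact table layout here is my assumption, and the cells are presumed already sanitized:

```python
def append_change_row(doc: str, row_cells: list[str]) -> str:
    """Insert a new row at the end of the '## Recent Changes' table.

    Sketch only: assumes the table directly follows the heading and that
    every cell has already been sanitized for table safety.
    """
    lines = doc.splitlines()
    try:
        start = lines.index("## Recent Changes")
    except ValueError:
        return doc  # no such section: leave the doc untouched

    # Find the last table row in the section
    last_row = start
    for i in range(start + 1, len(lines)):
        if lines[i].startswith("|"):
            last_row = i
        elif lines[i].strip() and last_row > start:
            break  # table ended

    lines.insert(last_row + 1, "| " + " | ".join(row_cells) + " |")
    return "\n".join(lines)
```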

Stage 4: Publisher

Creates the actual documentation PRs. For each generated doc:

  1. Clone the knowledge-base repo to a temp directory
  2. Create a branch: docs/sync-{date}-{repo}-{pr_number}
  3. Write the generated markdown file
  4. Commit, push, open a PR via gh pr create
  5. Rate limit: maximum 5 doc PRs per run to avoid noise

The PRs are created but NOT auto-merged. Every doc update gets human review. Thoth proposes; Knox disposes.
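The publisher steps above can be sketched as a short git/gh sequence. Assumptions: the knowledge-base repo is already cloned into `workdir`, and the commit message and `gh pr create` flags are illustrative:

```python
import subprocess

def branch_name(date: str, repo: str, pr_number: int) -> str:
    """docs/sync-{date}-{repo}-{pr_number}; the PR number keeps names unique."""
    return f"docs/sync-{date}-{repo}-{pr_number}"

def publish_doc(workdir: str, date: str, repo: str, pr_number: int,
                rel_path: str, markdown: str) -> None:
    """One publisher iteration: branch, write, commit, push, open a PR."""
    branch = branch_name(date, repo, pr_number)

    def git(*args: str) -> None:
        subprocess.run(["git", "-C", workdir, *args], check=True)

    git("checkout", "-b", branch)
    with open(f"{workdir}/{rel_path}", "w") as f:
        f.write(markdown)
    git("add", rel_path)
    git("commit", "-m", f"docs: sync {repo}#{pr_number}")
    git("push", "-u", "origin", branch)
    # Open the PR for human review; nothing is auto-merged
    subprocess.run(["gh", "pr", "create", "--head", branch, "--fill"],
                   cwd=workdir, check=True)
```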

The Catchup Sweep: 455 PRs in One Night

The first real test wasn't a daily run — it was a catchup sweep. Seven days of accumulated PRs, February 21 through 27, 2026.

| Metric | Value | Detail |
| --- | --- | --- |
| PRs Scanned | 455 | across 7 days |
| PRs Needing Docs | 87 | 19% documentation gap |
| KB Docs Generated | 24 | 11 new + 13 updated |
| Doc PRs Merged | 50+ | across 4 convergence passes |

The sweep ran in waves. Each pass created up to 5 doc PRs, waited for CodeRabbit review, resolved feedback, merged. Then closed any PRs that conflicted with the newly merged state, and ran again. Four passes to converge on a fully documented knowledge base.

CONVERGENCE PATTERN — BATCH MERGE
Multiple PRs editing the same file require sequential merge passes

| Pass | Created | Merged | Conflicts | Status |
| --- | --- | --- | --- | --- |
| Pass 1 | 35 | 11 | 24 | Conflicts |
| Pass 2 | 12 | 9 | 3 | Conflicts |
| Pass 3 | 5 | 4 | 1 | Conflicts |
| Pass 4 | 2 | 2 | 0 | Converged |

Loop: Create PRs → CodeRabbit Review → Merge → Close Conflicts → Re-run
Budget N/5 merge passes, where N = total PRs. Each merge invalidates overlapping branches.
INSIGHT

Convergence pattern: When multiple PRs edit the same file (e.g., appending changelog entries to mission-control.md), merging one invalidates the others. You can't batch-merge — you have to merge sequentially, close conflicts, and re-run. Budget for N/5 merge passes where N = total PRs.

Three Bugs That Taught Me Three Rules

Building Thoth was straightforward. Running it at scale was not. Three bugs surfaced during the catchup sweep, and each one became a permanent rule.

1. Table Rendering

Multi-line PR summaries — the kind with bullet lists from ## Summary sections — broke markdown table cells when inserted directly. A table row is a single line. Bullet points are not. The fix: a _sanitize_for_table() helper that collapses all whitespace to a single line, escapes pipe characters, and truncates to 120 characters.

Rule: Always sanitize text destined for markdown table cells.

2. Absolute Path Resolution

Generated KB docs were being written to Users/knox/Documents/Dev/knowledge-base/kb/projects/ inside the cloned repo — an absolute path turned into a deeply nested relative path. The fix: don't strip the filesystem anchor and hope for the best. Find the semantic root (kb/) in the path parts and reconstruct from there.

Rule: When converting absolute paths to relative paths within a cloned repo, find the semantic root.
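The fix can be sketched in a few lines; the function name is illustrative, and `kb` is the semantic root named in the post:

```python
from pathlib import PurePosixPath

def rebase_on_semantic_root(path: str, root: str = "kb") -> str:
    """Rebuild a repo-relative path from the first occurrence of `root`.

    Rather than stripping the filesystem anchor and hoping, find the
    semantic root in the path parts and reconstruct from there.
    """
    parts = PurePosixPath(path).parts
    if root not in parts:
        raise ValueError(f"semantic root {root!r} not found in {path!r}")
    return str(PurePosixPath(*parts[parts.index(root):]))
```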

3. Branch Name Collision

When multiple PRs from the same repo on the same day needed docs, the publisher pushed the same branch name for each. The first succeeded; the rest failed. The fix: append the PR number to the branch name.

Rule: Unique inputs demand unique identifiers. Date + repo is not unique when you're shipping fast.

The Mission Control Panel

Thoth writes a report JSON after every run. Mission Control reads that report and surfaces it on a dedicated Doc Health panel — four stat cards (PRs Merged, Documented, Doc PRs Created, KB Coverage), a PR activity table color-coded by documentation status, and a 7-day trend chart.

The panel is read-only. No scanning, no computation. Thoth does the work; Mission Control displays the results. This follows the same pattern as every other panel in the dashboard — each backend router reads a JSON file written by an upstream service.
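The read-only pattern is almost the whole implementation. A sketch, where the report field names are my assumptions about what Thoth writes:

```python
import json
from pathlib import Path

def doc_health(report_path: str) -> dict:
    """Read-only: surface Thoth's latest report.json as panel stat cards.

    No scanning, no computation; the panel only reflects what the
    upstream service already wrote. Field names are illustrative.
    """
    report = json.loads(Path(report_path).read_text())
    return {
        "prs_merged": report.get("prs_merged", 0),
        "documented": report.get("documented", 0),
        "doc_prs_created": report.get("doc_prs_created", 0),
        "kb_coverage": report.get("kb_coverage", "0%"),
    }
```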

SIGNAL

KB coverage went from 18% to 46% in one night. That's still not 100%, but it's a trajectory. Thoth runs every night. The gap only shrinks.

Documentation Hygiene Is Infrastructure

Most engineering teams treat documentation as a chore. Something you do after the real work is done. But documentation is infrastructure — it's the substrate that enables onboarding, debugging, knowledge transfer, and architectural decision-making. When it drifts, every other process degrades.

The insight behind Thoth is simple: documentation should be a side effect of shipping, not a separate activity. If your PRs already describe what changed and why, you don't need to write docs. You need a system that extracts the docs from the PRs and puts them where people will actually find them.

That system is Thoth. Named for the god who invented writing — because someone has to keep the records. And it shouldn't be you at midnight after a long day of shipping features.

The cron runs at 11:30 PM. By morning, the knowledge base is current. That's the whole point.
