Engineering

Building the Akashic Records: A Unified Knowledge System

100+ markdown files across 7 locations. Zero semantic search. I built a vector-powered knowledge system that indexes everything I know and serves it to every consumer in the stack — and named it after the cosmic library of all existence.

March 1, 2026
8 min read
#knowledge-management #vector-search #chromadb

I couldn't find my own knowledge.

That's the sentence that triggered a weekend build. I had 100+ markdown files spread across seven different locations on a Mac Mini that runs 24/7 — Claude Code memory files, project configuration docs, trading pattern analysis, daily operational logs, a knowledge base repo, lessons learned files. Thousands of lines of hard-won operational intelligence accumulated over months of building an AI agent ecosystem. And when I needed to answer a question like "how does the Polymarket bot handle momentum scoring?" — I was grepping. Manually. Across directories. Hoping I remembered which file used mom_score versus momentum_threshold versus price_momentum.

The knowledge existed. The retrieval was broken.

The Naming Wasn't Accidental

In Hindu and theosophical traditions, the Akashic Records are the cosmic compendium of all human experience — every thought, word, and action encoded in the fabric of existence itself. A universal memory accessible to those who know how to query it.

That's exactly what I needed. Not a search engine. Not a file browser. A system where I could express an intent — "how does the spread gate affect signal strength?" — and receive the precise chunk of knowledge that answers it, regardless of which file it lives in, what variable names it uses, or how I phrased the concept when I originally wrote it.

Semantic search. Across everything. Available to every consumer in the stack.

INSIGHT

The difference between keyword search and semantic search is the difference between finding files that contain your words and finding knowledge that answers your question. One requires you to remember how you said it. The other only requires you to know what you meant.

The Knowledge Landscape

Before building, I inventoried what existed. The numbers were worse than I expected.

Total files: 104 (markdown files across 7 source categories)
Source locations: 7 (each with different path conventions and content patterns)
Total chunks: 1,211 (after markdown-aware splitting with heading hierarchy preservation)

Seven distinct knowledge sources, each serving a different purpose:

Claude Code Memory — behavioral rules, project conventions, ecosystem maps, debugging patterns. The institutional memory that makes every coding session productive. 139 chunks.

Trading KB — technical analysis patterns, Elliott Wave theory, candlestick formations, chart pattern recognition guides. The analytical foundation for the InDecision Framework. 375 chunks.

Project CLAUDE.md files — per-project configuration docs across 11 active repositories. Architecture decisions, key file locations, testing patterns. 149 chunks.

Lessons files — mistake-driven learning. Every correction, every failure, every rule that emerged from getting something wrong. The compound learning system that prevents repeat errors. 126 chunks.

Knowledge base — structured project documentation, operational playbooks, API integration guides. The reference library. 205 chunks.

Workspace daily logs — operational changelogs, decision records, daily progress notes. The timeline of what happened and why. 217 chunks.

All of it markdown. All of it valuable. None of it searchable by meaning.

Architecture: REST-First, Vector-Powered

The design constraint was simple: every consumer in the ecosystem needs to query this knowledge, and they all speak HTTP. The Tesseract Intelligence stack runs on a Mac Mini with Docker services, launchd daemons, cron jobs, Discord bots, and Claude Code sessions. The only protocol they all share is REST.

So the Akashic Records is a FastAPI service on port 8002 with ChromaDB for vector storage and all-MiniLM-L6-v2 for embeddings. No external dependencies. No cloud vector database. No API keys for inference. The entire embedding model is baked into the Docker image at build time — 80MB of weights that convert any text query into a 384-dimensional vector in milliseconds.

Query → Embed → ChromaDB Cosine Similarity → Ranked Results

```
POST /api/query     — structured JSON query with filters
GET  /api/search    — simple query string for curl
POST /api/reindex   — full or incremental rebuild
GET  /api/status    — chunk counts, categories, per-source breakdown
GET  /api/health    — liveness check
```
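Under the hood, the ranking step is just cosine similarity over embedding vectors. A minimal pure-Python sketch of that step, with toy 3-dimensional vectors standing in for the 384-dimensional MiniLM embeddings (in the real service, ChromaDB performs this ranking):

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank(query_vec, chunks):
    # chunks: (chunk_id, embedding) pairs; most similar first.
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunks]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# A query vector close in direction to the first chunk ranks it first.
results = rank([1.0, 0.0, 0.5], [
    ("momentum-doc", [0.9, 0.1, 0.4]),
    ("unrelated-doc", [0.0, 1.0, 0.0]),
])
```

The chunk IDs and vectors above are illustrative; the point is that ranking is direction, not keyword overlap.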
SIGNAL

The embedding model runs on CPU. No GPU required. The entire system — indexing 104 files into 1,211 chunks — completes in 21 seconds. Incremental reindex with no changes: 0.03 seconds.

The Chunking Problem

Markdown is not a uniform format. A CLAUDE.md file has ## sections with configuration tables. A trading KB article has ### subsections with bullet-point patterns. A lessons file has dated entries structured as mistake, root cause, and resulting rule. Naive splitting — fixed character counts or paragraph boundaries — destroys the semantic coherence of each knowledge unit.

The chunker I built is heading-aware. It splits on ## boundaries first, then ### within those sections, then falls back to paragraph boundaries for content that exceeds the chunk size. Each chunk carries metadata: the source file path, the knowledge category, and the full heading hierarchy that contextualizes it.

A chunk from the trading KB doesn't just contain "Wave 3 is the longest and strongest wave." It carries the heading path Key Definition > Elliott Wave Theory and the category trading. When a query about wave analysis arrives, the heading hierarchy adds semantic weight to the match — the system knows this chunk is definitionally about Elliott Wave, not just mentioning it in passing.

The 64-token overlap between consecutive chunks handles concepts that span section boundaries. A rule that starts in one paragraph and has its justification in the next isn't lost to the split.
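The splitting logic described above can be condensed into a short sketch — simplified to the ## / ### walk and heading-path metadata, omitting the paragraph-boundary fallback and the 64-token overlap:

```python
import re

def chunk_markdown(text, source, category):
    """Heading-aware splitter sketch: break on ## and ### boundaries,
    carrying the heading hierarchy as metadata on every chunk."""
    chunks, path, buf = [], [], []

    def flush():
        body = "\n".join(buf).strip()
        buf.clear()
        if body:
            chunks.append({
                "text": body,
                "heading_path": " > ".join(path),
                "source": source,
                "category": category,
            })

    for line in text.splitlines():
        m = re.match(r"^(#{2,3})\s+(.+)", line)
        if m:
            flush()                              # close the previous chunk
            level = len(m.group(1))              # 2 for ##, 3 for ###
            # A ## resets the path; a ### nests under the current ##.
            path[:] = path[: level - 2] + [m.group(2).strip()]
        else:
            buf.append(line)
    flush()
    return chunks
```

A `###` section thus inherits its parent `##` in `heading_path`, which is what lets a query about wave analysis see that a chunk is definitionally about Elliott Wave.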

What Semantic Search Actually Reveals

The moment the first full index completed and I ran test queries, the power became obvious. Keyword search finds documents that contain your terms. Semantic search finds knowledge that answers your question — even when the terminology doesn't match.

Query: "momentum scoring"
Returns: the Polymarket bot's momentum strategy configuration, the strategy audit's analysis of price momentum timing gaps, and the early window predictive scoring tables (score: 0.78). Three different files, three different variable naming conventions (MOMENTUM_THRESHOLD, mom_score, price_momentum), all semantically unified by the concept of momentum-based scoring.

Query: "stop and replan on failure"
Returns: the behavioral discipline rule about mandatory pause after two failures, the escalation history showing when the rule was violated, and the root CLAUDE.md's agent behavior section (score: 0.87). The system found the concept across its evolution — the original rule, its violations, and its escalation to a hard mandate.

Query: "Elliott Wave" (category: trading)
Returns: the INDEX.md summary, the full theory definition, and the wave type breakdown (score: 0.83). Category filtering narrows to the trading KB, and the heading-aware chunking ensures the definitional content ranks highest.

Query latency: <100ms (embed + cosine similarity + ranked retrieval)

None of these queries would work with grep. "Momentum scoring" wouldn't find mom_score. "Stop and replan" wouldn't find the behavioral discipline file unless you already knew its name. The category filter for "Elliott Wave" would require knowing which directory to search.
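The category-filtered query above maps onto the structured POST endpoint. A consumer-side sketch of what a request body might look like — the field names (query, category, top_k) are assumptions, since the post doesn't document the exact schema:

```python
import json

# Illustrative body for POST /api/query on port 8002.
# Field names are assumed, not taken from the actual API.
payload = {
    "query": "Elliott Wave",
    "category": "trading",  # optional filter narrowing to one source
    "top_k": 5,
}
body = json.dumps(payload)
```

Any consumer that can serialize JSON and speak HTTP — cron job, Discord bot, dashboard — can issue this.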

Docker Service #9

The Akashic Records runs as the ninth service in the Docker Compose stack, on the invictus-net bridge network alongside Mission Control, the Invictus Dashboard, Agent 1:1, and the other infrastructure services.

The Docker integration required solving one non-obvious problem: path remapping. On the host, knowledge files live at paths like ~/.claude/projects/-Users-knox-Documents-Dev/memory/ and ~/Documents/Dev/polymarket-bot/CLAUDE.md. Inside the container, those same files are mounted read-only at /sources/claude-memory/ and /sources/dev/polymarket-bot/CLAUDE.md. The source registry maintains a mount map that transparently remaps paths based on whether the system detects Docker mode.
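The remapping itself reduces to a longest-prefix substitution. A sketch under assumed paths — the mount map and the /.dockerenv detection heuristic below are illustrative, not the actual source registry:

```python
import os

# Illustrative host-to-container prefix map mirroring the compose
# volumes; the real registry covers all seven source locations.
MOUNT_MAP = {
    "/Users/knox/Documents/Dev": "/sources/dev",
}

def in_docker():
    # Common heuristic (an assumption here): Docker creates /.dockerenv.
    return os.path.exists("/.dockerenv")

def to_container_path(host_path, docker_mode=None):
    """Remap a host path onto its read-only container mount when in
    Docker mode; return it unchanged when running on the host."""
    if docker_mode is None:
        docker_mode = in_docker()
    if not docker_mode:
        return host_path
    for prefix in sorted(MOUNT_MAP, key=len, reverse=True):  # longest prefix wins
        if host_path.startswith(prefix):
            return MOUNT_MAP[prefix] + host_path[len(prefix):]
    raise ValueError(f"no mount covers {host_path}")
```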

```yaml
akashic-records:
  build:
    context: ./akashic-records
  ports:
    - "8002:8002"
  volumes:
    - akashic-records-data:/app/data
    - ~/.claude/.../memory:/sources/claude-memory:ro
    - ~/.openclaw/.../trading-kb:/sources/trading-kb:ro
    - ~/Documents/Dev:/sources/dev:ro
    - ~/Documents/Dev/knowledge-base:/sources/knowledge-base:ro
    - ~/.openclaw/workspace/memory:/sources/workspace-memory:ro
```

Every source directory is mounted read-only. The Akashic Records reads knowledge but never modifies it. The only writable volume is the ChromaDB persistence store.

The MCP Bridge

For Claude Code sessions — the primary consumer of this knowledge — the Akashic Records also exposes an MCP (Model Context Protocol) server. Three tools: mind_query for semantic search, mind_status for index health, and mind_reindex to trigger rebuilds.

The MCP shim is a thin FastMCP wrapper that translates tool calls into REST API requests against localhost:8002. It follows the same pattern as every other MCP server in the stack — stdio transport, JSON-RPC protocol, tool definitions with typed parameters. Claude Code can now query the full knowledge base without leaving the conversation context.

ALPHA

Every system in the ecosystem can now ask a question in natural language and get the precise knowledge chunk that answers it. Claude Code via MCP tools. OpenClaw agents via REST. Cron jobs via curl. Mission Control via the Docker network. One knowledge store, many consumers.

Incremental Reindex

Knowledge changes. Files get updated, new lessons get written, daily logs accumulate. The full reindex takes 21 seconds — fast enough for manual triggers, too slow for continuous freshness.

The incremental reindex solves this. It maintains a metadata sidecar with file modification times from the last run. On each incremental pass, it compares current mtimes against the sidecar, re-chunks and re-embeds only the files that changed, and cleans up orphaned chunks from deleted files. A launchd plist triggers incremental reindex every six hours.

When nothing has changed: 0.03 seconds. When a single file is modified: under 2 seconds for re-chunking and re-embedding. The knowledge base stays fresh without the cost of a full rebuild.
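The core of the incremental pass is an mtime diff against the sidecar. A self-contained sketch — the sidecar is modeled here as a plain dict, whereas the real service persists it to disk between runs:

```python
import os

def diff_since_last_run(paths, sidecar):
    """Compare current mtimes against the sidecar from the previous
    pass. Returns (changed, deleted): files to re-chunk and re-embed,
    and deleted files whose orphaned chunks should be purged.
    Mutates the sidecar in place for the next pass."""
    previous = dict(sidecar)
    current = {p: os.path.getmtime(p) for p in paths if os.path.exists(p)}
    changed = [p for p, mt in current.items() if previous.get(p) != mt]
    deleted = [p for p in previous if p not in current]
    sidecar.clear()
    sidecar.update(current)
    return changed, deleted
```

When nothing changed, both lists come back empty and the pass exits almost immediately, which is where the 0.03-second clean run comes from.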

What This Unlocks

The Akashic Records is infrastructure. Its value isn't in the system itself — it's in what becomes possible when every agent, dashboard, and automation in the ecosystem can query accumulated knowledge semantically.

Mission Control's existing knowledge search did naive full-text matching against a single directory. Now it can proxy to the Akashic Records API and search across all sources with semantic understanding. A support query about "how the trading bot handles low-spread markets" returns the specific spread gate configuration from the Polymarket bot's CLAUDE.md, the strategy audit's analysis of spread thresholds, and the lessons file entry about the InDecision spread gate at line 607 of strategy.py.

Rewired Minds content generation can query the knowledge base for technical accuracy before producing articles. OpenClaw's cron jobs can validate their own configuration against the canonical docs. The entire ecosystem becomes self-documenting and self-queryable.

Coverage: 92% (76 tests, 90% coverage floor enforced in CI)

The Name That Writes the Roadmap

I named this system the Akashic Records deliberately. In the mythology, the Akashic Records don't maintain themselves. They have a keeper — Thoth, the Egyptian god of knowledge, writing, and wisdom. The ibis-headed deity who invented hieroglyphics, maintained the library of the gods, and recorded the judgments of the dead.

In this ecosystem, we already have a Thoth. It's been running for weeks, quietly maintaining documentation, syncing knowledge across repositories, ensuring the written record stays current and complete.

The next article will be about how Thoth and the Akashic Records converge — in mythology and in our technology. How the keeper and the library become a closed loop where knowledge is not just stored and searched, but actively maintained, validated, and evolved.

The cosmic library is online. The scribe is already at work.

Lessons From the Build

Bake models into Docker images. Downloading an 80MB embedding model at container startup is a race condition against network reliability. The first boot failed because HuggingFace returned a transient error during the download. Moving the model download to the Dockerfile RUN step made startup deterministic.
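The fix can be sketched as a Dockerfile build step. Assuming the model is loaded via the sentence-transformers package (an assumption — the post names only the model, not the loader):

```dockerfile
# Pull the all-MiniLM-L6-v2 weights into the image at build time so
# container startup never races HuggingFace's availability.
# (Loader call is illustrative; assumes the sentence-transformers package.)
RUN python -c "from sentence_transformers import SentenceTransformer; \
    SentenceTransformer('all-MiniLM-L6-v2')"
```

The download happens once, in the build cache; every container start after that is offline-safe.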

Chunk IDs must use full content hashes. The first version truncated content to 128 characters for ID generation. Overlapping chunks with the same heading and similar opening text collided. Using the full content in the SHA256 hash eliminated duplicates entirely.
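The fix reduces to hashing the full chunk content. A sketch — the source|heading|text composition is an assumption about what feeds the hash:

```python
import hashlib

def chunk_id(source, heading_path, text):
    """Stable chunk ID over the FULL content. Hashing only a 128-char
    prefix made overlapping chunks with the same heading and similar
    openings collide; the full-content hash removes the collisions."""
    raw = f"{source}|{heading_path}|{text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()
```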

Heading hierarchy is metadata, not decoration. A chunk that says "Wave 3 is the longest wave" means different things under the heading "Elliott Wave Theory > Key Rules" versus "Common Misconceptions > Things Beginners Get Wrong." The heading path is semantic context that the embedding model captures when included in the chunk metadata.

Incremental beats full. The six-hour incremental reindex costs essentially nothing (0.03s when clean) and catches file changes within a reasonable window. Full reindex is a manual operation for schema changes or recovery. Design for the incremental path first.
