Knowledge Genome System
A distributed, encrypted, multi-domain personal knowledge base. No vector database. No embedding pipeline. No external retrieval server.
Built on the LLM Wiki pattern by Andrej Karpathy — extended with a multi-domain submodule architecture, AES-256-CTR encryption via git-crypt, Vaultwarden runtime key injection, and a human-in-the-loop Git Flow for quality control.
Table of Contents
- Core Philosophy
- Architecture
- System Requirements
- Prerequisites
- Configuration
- Quick Start
- Makefile Reference
- Testing
- Genome Lifecycle
- Security Model
- Key Management
- Agent Sessions
- Workflows
- Knowledge Quality
- Knowledge Schema
- Collaboration Model
- Optional Extensions
- Troubleshooting
Core Philosophy
Most RAG systems make the LLM rediscover knowledge from scratch on every query. A document is indexed; at query time, relevant chunks are retrieved; an answer is generated. Nothing accumulates. Ask a question requiring synthesis across five documents and the LLM pieces it together from fragments every single time.
This system is different. Instead of retrieval at query time, the LLM incrementally builds and maintains a persistent wiki that sits between you and the raw sources. When a new source arrives, the LLM reads it, extracts key information, updates entity and concept pages, flags contradictions with existing claims, and strengthens the evolving synthesis. Knowledge is compiled once and kept current.
The wiki is a compounding artifact. Cross-references are already there. Contradictions have been flagged. The synthesis already reflects everything ingested.
This means:
- No vector database.
- No embedding pipeline.
- No external retrieval infrastructure.
The wiki/index.md of each genome is the retrieval layer. At moderate scale
(~100 sources, hundreds of pages) this performs better than RAG because cross-references,
contradictions, and syntheses are already resolved — not re-derived per query.
The human's job: curate sources, direct analysis, ask good questions, review PRs. The LLM's job: everything else — summarising, cross-referencing, filing, maintaining consistency.
Architecture
Repository structure
master-knowledge-genome/ ← Root orchestrator (submodule registry)
├── core-karpathy/ ← LLM Wiki reference pattern (read-only submodule)
├── genome-dev/ ← Submodule: web development, Angular, TUI
├── genome-finance/ ← Submodule: personal finance, investments
├── genome-homelab/ ← Submodule: Keru infrastructure, network configs
└── AGENTS.md ← Global coordination schema (cross-genome rules)
The genome names above (genome-dev, genome-finance, genome-homelab) are illustrative: they show the kind of multi-domain layout this orchestrator targets. The shipped registry.sh defines a single disposable sandbox, genome-test; you create real genomes yourself with make add-genome (see the registry examples below).
Each genome is an independent git repository:
genome-{name}/
├── .gitattributes ← Encryption rules — **/private/** wildcard
├── .gitignore
├── .git/hooks/pre-commit ← Security hook (dynamic git check-attr)
├── AGENTS.md ← Per-genome agent contract and workflow rules
│
├── raw/ ← Immutable sources — LLM reads, never writes
│ ├── articles/ ← Web clips, saved articles
│ ├── transcripts/ ← Audio/video transcripts
│ ├── code-packs/ ← Code snippets and repositories
│ ├── assets/ ← Images, PDFs, binary files
│ └── private/ ← AES-256-CTR encrypted — owner only
│
└── wiki/ ← LLM-owned — agent creates and maintains
├── index.md ← Primary catalog (read first every session)
├── log.md ← Append-only operations ledger
├── sources/ ← One page per processed raw source
├── entities/ ← People, tools, organisations, projects
├── concepts/ ← Patterns, theories, architectural decisions
├── queries/ ← Preserved answers and conflict notes
└── private/ ← AES-256-CTR encrypted — owner only
Three layers
| Layer | Path | Owner | Rule |
|---|---|---|---|
| Raw sources | raw/ | Human | Immutable. LLM reads only. Never modified. |
| Wiki | wiki/ | LLM | Agent creates, updates, cross-links, maintains. |
| Schema | AGENTS.md | Human + LLM | Co-evolved contract defining structure and workflows. |
Linked projects (optional)
A genome can optionally declare a linked project repository — a separate repo where
the knowledge in that genome is meant to be applied (e.g. genome-dev linked to an app
repo). The link is recorded as a third field in the registry and rendered into the
genome's AGENTS.md (## Linked Project). A genome with no link is knowledge-only and
behaves exactly as before. See Configuration.
Framework structure
knowledge-genome-orchestrator/ ← This repository (setup tooling)
├── globals.env ← Static KEY=VALUE config (Make-includable)
├── registry.sh ← Bash-only: GENOMES array + dynamic paths
├── Makefile ← Entry point for all operations
├── lib/
│ ├── output.sh ← Terminal helpers (colors, log levels)
│ ├── deps.sh ← Dependency validation
│ ├── scaffold.sh ← Template rendering engine
│ ├── structure.sh ← Canonical genome layout (single source of truth)
│ ├── lint.sh ← Per-file validation functions
│ └── git-crypt.sh ← git-crypt lifecycle (init, export, verify, rotate)
├── providers/
│ ├── forgejo.sh ← Forgejo REST API provider
│ └── github.sh ← GitHub REST API provider
├── scripts/
│ ├── setup.sh ← Main entry point
│ ├── setup-master.sh ← Master repo initialisation
│ ├── setup-genomes.sh ← Genome provisioning loop
│ ├── add-genome.sh ← Add a single new genome
│ ├── lint-genomes.sh ← Quality control across all genomes
│ └── verify-genomes.sh ← Structure verify / --sync across all genomes
├── templates/
│ ├── agents-genome.md ← Per-genome agent contract template
│ ├── agents-master.md ← Master coordination schema template
│ ├── readme-master.md ← Master repo README template
│ ├── wiki-index.md ← Index template (rendered per genome)
│ ├── wiki-log.md ← Log template (rendered per genome)
│ ├── pr-description.md ← PR review checklist template
│ ├── pre-commit.sh ← Security hook template
│ ├── gitattributes ← Git encryption rules template
│ └── gitignore ← Git ignore template
├── skills/
│ └── ingest/ ← pi skill: deployed to the AI node (vm101)
│ ├── SKILL.md ← Semantic-only contract (read/edit, emits manifest)
│ ├── references/ ← On-demand reference docs for the agent
│ └── scripts/ ← Deterministic post-processor (runs outside the agent)
│ ├── run-ingest.sh ← Orchestrator: consumes the manifest, emits one JSON line
│ ├── slug.sh ← Slug normalisation
│ ├── index-append.py ← Sorted insert into wiki/index.md + last_updated bump
│ ├── log-append.sh ← Append a wiki/log.md entry
│ ├── scoped-lint.sh ← Lint only the pages touched this run (reuses lib/lint.sh)
│ └── open-pr.sh ← Branch / commit / push / open PR (DRY_RUN seam for tests)
└── tests/ ← bats suite — deterministic, no LLM/GPU (see Testing)
├── helpers.bash
├── scripts.bats
├── lint.bats
├── structure.bats
└── run-ingest.bats
The skills/ingest/ directory is version-controlled here but deployed to the AI node (vm101) under ~/.pi/agent/skills/ingest. The agent (pi) does only semantic work and writes a manifest; run-ingest.sh does the mechanical steps. See Workflows → Ingest.
Division of labour: ingest-semantic.py makes one schema-constrained call to the local model and returns JSON (semantic extraction); run-ingest.sh handles index, log, lint, and PR (deterministic wiki conformance driven by the manifest).
Deployment: after make setup, copy the skill with cp -r skills/ingest/* ~/.pi/agent/skills/ingest/. Updates arrive via git pull on the laptop and are pushed to vm101 over SSH in the n8n flow.
System Requirements
Linux — full support (primary target)
All scripts are written for GNU/bash on Linux. Tested on Ubuntu 22.04+. All tools (git-crypt, bw, qmd) have native Linux binaries.
macOS — full support
All scripts are compatible with macOS. Requirements:
- bash 3.2+ (macOS default): supported for the setup scripts (make targets, scaffolding). Two things need bash 4+: the ingest skill (mapfile), which runs on the Linux AI node (not a constraint on the macOS setup machine); and gcrypt_rotate_key (compgen -G), which does run on the laptop. For key rotation on macOS, use Homebrew bash (brew install bash).
- GNU coreutils not required: BSD variants of date, grep, sed are all handled.
- git-crypt: install via Homebrew (brew install git-crypt)
- jq, curl: pre-installed or via Homebrew
If you use Homebrew bash (brew install bash), the scripts work identically to Linux.
Windows — WSL2 only
Git Bash and native Windows are not supported.
Reasons:
- git-crypt has no native Windows binary.
- Process substitution <(...) used for runtime key injection is not available in Git Bash or PowerShell.
- Several bash builtins used throughout (compgen, BASH_SOURCE, arrays) are not available outside a POSIX-compliant shell.
WSL2 (Windows Subsystem for Linux) with Ubuntu gives full compatibility. All setup and runtime operations work identically to native Linux inside WSL2.
Hardware recommendations
The system is designed for a homelab architecture:
| Component | Recommended | Role |
|---|---|---|
| Storage node | Any Linux server with NFS | Hosts Forgejo, stores genome repos |
| AI compute node | GPU server (16GB+ VRAM) | Runs local LLM agent sessions |
| VRAM | 16GB minimum | 14B model at Q5_K_M ≈ 10GB weights; ~6GB for KV cache |
| Local LLM | 14B–32B quantised | Active wiki maintenance sessions |
| Large LLM | 70B (async) | Deep reflection, complex synthesis (scheduled, not interactive) |
On VRAM constraints: with a 16GB card and a 14B model, the KV cache budget is ~6GB, approximately 32k tokens of effective context. Every token in AGENTS.md, the index, and the log tail is a cost. This is why all agent files are token-optimised and sessions are kept to one source at a time.
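The ~6GB ≈ 32k tokens figure can be checked with a short calculation. The sketch below is illustrative only: the layer count, KV head count, head dimension, and fp16 cache precision are assumed values for a hypothetical 14B-class model with grouped-query attention, not numbers taken from this repository.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: keys plus values across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_context_tokens(budget_bytes: int, per_token_bytes: int) -> int:
    """How many tokens fit in the given KV cache budget."""
    return budget_bytes // per_token_bytes


# Assumed (hypothetical) configuration: 48 layers, 8 KV heads, head dim 128, fp16 cache.
PER_TOKEN = kv_cache_bytes_per_token(layers=48, kv_heads=8, head_dim=128)
BUDGET = 6 * 1024**3  # roughly 6 GiB left after the model weights

print(PER_TOKEN)                              # 196608 bytes = 192 KiB per token
print(max_context_tokens(BUDGET, PER_TOKEN))  # 32768 tokens
```

Under these assumptions every 1,000 tokens of prompt costs close to 190 MiB of cache, which is why trimming AGENTS.md and the log tail pays off directly.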
Reference deployment: the table above is a target profile, not a hard requirement. The current setup runs a single 16GB GPU (RTX 5060 Ti) with a ~9B model for interactive ingest, and offloads heavy/async synthesis to a cloud model. Smaller models work — they just make the "one source per session" discipline and the token budget matter more.
Prerequisites
Required
| Tool | Purpose |
|---|---|
| git | Version control |
| git-crypt | Transparent file encryption |
| curl | REST API calls to Forgejo/GitHub |
| jq | JSON parsing |
Optional
| Tool | Purpose |
|---|---|
| bw | Bitwarden CLI: runtime key injection from Vaultwarden (no key on disk) |
| qmd | Local BM25 + vector search for Markdown files with MCP server interface |
bw vs bws: Use bw (standard Bitwarden CLI). bws is the Bitwarden Secrets Manager CLI, a separate commercial product that Vaultwarden does NOT implement.
Install on Ubuntu/Debian
sudo apt update && sudo apt install -y git git-crypt curl jq
Install on macOS
brew install git git-crypt curl jq
Install Bitwarden CLI
# Linux
npm install -g @bitwarden/cli
# macOS
brew install bitwarden-cli
Verify all tools
make doctor
Configuration
Configuration is split into two files with distinct purposes:
globals.env — static KEY=VALUE
Safe for make include, docker-compose, shell source, and any standard env parser.
Contains only simple scalar values — no bash syntax, no arrays.
# Provider selection
PROVIDER=forgejo # forgejo | github
# Forgejo (active when PROVIDER=forgejo)
FORGEJO_URL=https://git.yourserver.com
FORGEJO_USER=yourusername
FORGEJO_SSH_PORT=222 # Default for many homelab Forgejo setups; 22 for standard
# GitHub (active when PROVIDER=github — uncomment to use)
# GITHUB_USER=your-username
# GITHUB_ORG=your-org # Optional: for org repos, overrides GITHUB_USER
# Vaultwarden
VAULTWARDEN_URL=https://vault.yourserver.com
# Master repository
MASTER_REPO=master-knowledge-genome
GIST_URL=https://gist.github.com/442a6bf555914893e9891c11519de94f.git
registry.sh — bash runtime config
Sourced by shell scripts only. Contains the genome registry array and dynamic path resolution. Never included by Make.
# Dynamic paths (resolved at source time)
WORK_DIR="${HOME}/knowledge-genome-orchestrator"
KEYS_DIR="${WORK_DIR}/keys"
# Genome registry. Format: "name|description|linked_repo|cross_source"
# The third and fourth fields are OPTIONAL:
# - linked_repo empty → knowledge-only genome (no linked project)
# - linked_repo = owner/repo → genome is linked to that project repository (rendered into AGENTS.md)
# - cross_source → yes|no (default no): whether the cross-genome collector may read this genome as a source
GENOMES=(
"genome-dev|Web development, TUI, Angular, software architecture|myorg/my-app|no"
"genome-finance|Personal finance, investments, market analysis||no"
"genome-homelab|Infrastructure, network configs, architecture logs||no"
)
To add a genome to the registry before running setup, append a line to GENOMES.
After initial setup, use make add-genome instead.
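To make the registry line format concrete, here is a minimal Python sketch of how one entry could be split into its four fields with the documented defaults. The real parsing happens in bash inside the setup scripts; GenomeEntry and parse_registry_entry are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GenomeEntry:
    name: str
    description: str
    linked_repo: str    # "" means knowledge-only genome
    cross_source: bool  # may the cross-genome collector read this genome?


def parse_registry_entry(entry: str) -> GenomeEntry:
    """Split "name|description|linked_repo|cross_source", padding the optional fields."""
    fields = entry.split("|")
    if len(fields) < 2 or len(fields) > 4:
        raise ValueError(f"expected 2-4 pipe-separated fields, got {len(fields)}")
    fields += [""] * (4 - len(fields))
    name, description, linked_repo, cross = (f.strip() for f in fields)
    if not name:
        raise ValueError("genome name is required")
    if cross not in ("", "yes", "no"):
        raise ValueError(f"cross_source must be yes or no, got {cross!r}")
    return GenomeEntry(name, description, linked_repo, cross == "yes")


entry = parse_registry_entry("genome-finance|Personal finance, investments, market analysis||no")
print(entry.linked_repo == "", entry.cross_source)  # True False
```

An empty third field and a missing fourth field both fall back to the defaults (no linked project, cross_source no).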
Tokens
Tokens are never stored in config files. Export them in your shell before running setup:
export FORGEJO_TOKEN="your_forgejo_token"
# or
export GITHUB_TOKEN="your_github_token"
Quick Start
# 1. Clone the setup framework
git clone <setup-repo-url> knowledge-genome-orchestrator
cd knowledge-genome-orchestrator
# 2. Configure your environment
cp globals.env.example globals.env # edit with your values
# Edit registry.sh to define your genomes
# 3. Export your provider token
export FORGEJO_TOKEN="your_token_here"
# 4. Verify dependencies
make doctor
# 5. Run full setup
make setup
make setup executes in order:
- Dependency check: verifies all required tools are installed
- Git identity check: warns if user.name / user.email are not configured
- Master repo: creates master-knowledge-genome on Forgejo, scaffolds with AGENTS.md and README.md, initialises git, adds core-karpathy as submodule, pushes
- Genome provisioning: for each genome in registry.sh:
  - Creates remote repository on Forgejo
  - Adds it as a submodule in the master repo
  - Initialises git-crypt (before any files are created)
  - Scaffolds directory structure and renders all templates
  - Installs pre-commit security hook
  - Commits, pushes genome to remote
  - Exports symmetric key to keys/<genome>.key
  - Prints Vaultwarden upload instructions
  - Commits submodule pointer in master repo
After setup completes:
- Upload all files in keys/ to Vaultwarden (see Key Management)
- Delete key files from disk: rm keys/*.key
Makefile Reference
| Target | Description |
|---|---|
| make setup | Full system initialisation: master repo + all genomes in registry.sh |
| make add-genome NAME=x DESC="y" [LINKED=owner/repo] | Scaffold and register a single new genome (optional linked project) |
| make lint | Run quality checks across all genomes (schema, privacy, decay, page size) |
| make verify-structure | Report directory drift of each genome vs the canonical layout (lib/structure.sh) |
| make sync-structure | Create any missing canonical directories across all genomes (safe, idempotent) |
| make test | Run the bats test suite (deterministic; no LLM/GPU/network); see Testing |
| make status | Show submodule status and per-genome git-crypt encryption state |
| make lock | Lock all encrypted repos (master + all genome submodules) |
| make doctor | Verify required tools: git, git-crypt, curl, jq; warn if bw missing |
| make sync | git submodule update --init --recursive + report unpushed commits per genome |
| make help | Print all available targets |
Examples
# Check system health
make doctor
# Add a new genome after initial setup
make add-genome NAME=genome-research DESC="Academic papers and deep research"
# Add a genome linked to a project repository
make add-genome NAME=genome-dev DESC="Web development" LINKED=myorg/my-app
# Check every genome against the canonical directory layout
make verify-structure
# Run full lint pass (bash deterministic checks)
make lint
# Sync all nodes after pulling on another machine
make sync
# Emergency lock — secures all repos before leaving a session
make lock
Testing
The mechanical layer (slug, index, log, lint, structure, the ingest orchestrator) is covered by a bats suite. The tests are deterministic and have zero dependency on the LLM, the GPU, or the network — they simulate the agent's output with fixtures and exercise the scripts directly, so they run anywhere git + bash live (laptop, CI, a git hook). They are not meant to run on the AI node or via n8n.
sudo apt install bats # once
make test # or: bats tests/
| File | Covers |
|---|---|
| scripts.bats | slug.sh, log-append.sh, index-append.py (insert, sort, bump, idempotent) |
| lint.bats | lib/lint.sh validators + scoped-lint.sh |
| structure.bats | lib/structure.sh report / sync |
| run-ingest.bats | run-ingest.sh end-to-end (DRY_RUN, local bare remote); needs jq |
Each test builds its own throwaway genome with a local bare remote, configured to ignore
the operator's global git settings (signing, global hooks) so the suite is hermetic. The
run-ingest tests auto-skip if jq is absent. If you change the canonical layout in
lib/structure.sh, update FIXTURE_DIRS in tests/helpers.bash to match.
Why this matters: the only non-deterministic part of the system is the model. Pinning the mechanical layer with tests means that when an ingest misbehaves, you know it's the model or the prompt — not the plumbing.
Genome Lifecycle
Initial setup
All genomes defined in registry.sh are provisioned by make setup.
Adding a genome after initial setup
make add-genome NAME=genome-newname DESC="Domain description"
This: creates the remote repo, adds it as a submodule, initialises git-crypt, scaffolds the directory structure, installs the pre-commit hook, commits and pushes, exports the key, and commits the submodule pointer in master.
After adding: upload the new key to Vaultwarden and delete the key file.
Removing a genome
Manual process:
# In master repo
git submodule deinit genome-name
git rm genome-name
git commit -m "chore: remove genome-name submodule"
git push
# Archive or delete the remote repository on Forgejo
Template rendering
When a genome is scaffolded, render_template replaces these placeholders in all
template files:
| Placeholder | Source | Example |
|---|---|---|
| {{GENOME_NAME}} | registry.sh | genome-dev |
| {{GENOME_NAME_UPPER}} | derived | GENOME-DEV |
| {{GENOME_DESC}} | registry.sh | Web development... |
| {{LINKED_PROJECT}} | registry.sh | myorg/my-app (or none) |
| {{FORGEJO_URL}} | globals.env | https://git.yourserver.com |
| {{FORGEJO_USER}} | globals.env | yourusername |
| {{VAULTWARDEN_URL}} | globals.env | https://vault.yourserver.com |
| {{MASTER_REPO}} | globals.env | master-knowledge-genome |
| {{DATE}} | runtime | 2026-05-11 |
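The placeholder substitution described above can be sketched in a few lines. This is an illustrative Python version, not the shipped bash render_template in lib/scaffold.sh; in particular, failing on an unknown placeholder is an assumption made here, and the real engine may leave such tokens untouched.

```python
import re

PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")


def render_template(text: str, values: dict) -> str:
    """Replace every {{NAME}} with values[NAME]; raise on names with no value."""
    def substitute(match):
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value for placeholder {key}")
        return values[key]
    return PLACEHOLDER.sub(substitute, text)


values = {"GENOME_NAME": "genome-dev", "DATE": "2026-05-11"}
# GENOME_NAME_UPPER is derived from GENOME_NAME rather than configured.
values["GENOME_NAME_UPPER"] = values["GENOME_NAME"].upper()

print(render_template("# {{GENOME_NAME_UPPER}} ({{DATE}})", values))
# prints: # GENOME-DEV (2026-05-11)
```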
Security Model
Encryption architecture
Each genome uses a unique symmetric AES-256-CTR key managed by git-crypt. Two directories in every genome are always encrypted:
| Directory | Contents | On remote |
|---|---|---|
| raw/private/ | Sensitive source material | Opaque binary blob |
| wiki/private/ | Private synthesis and notes | Opaque binary blob |
All other directories (raw/articles/, wiki/sources/, etc.) are plaintext.
Collaborators without the key can contribute to public directories normally —
git handles encrypted files transparently.
.gitattributes — dynamic encryption rules
Encryption rules use a glob wildcard that catches any private/ directory at
any depth in the repository — including directories created at runtime by the LLM:
# Text rules first
*.md text eol=lf
*.sh text eol=lf
# Encryption rules LAST (later rules override per-attribute)
# **/private/** ensures -text overrides *.md text eol=lf, preventing EOL corruption
**/private/** filter=git-crypt diff=git-crypt -text
Rule ordering matters: in .gitattributes, the last matching rule wins per attribute. Encryption rules must come after text rules so -text overrides text eol=lf for encrypted markdown files.
Pre-commit hook — dynamic validation
The security hook installed at .git/hooks/pre-commit validates every staged file
dynamically — it reads encryption requirements from .gitattributes at runtime
rather than checking hardcoded paths:
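# For each staged file, check if git-crypt encryption is required.
# The loop and the exit below are an illustrative completion of the excerpt;
# the shipped templates/pre-commit.sh may differ in detail.
while IFS= read -r file; do
  filter=$(git check-attr filter -- "$file" | sed 's/.*: //')
  if [[ "$filter" == "git-crypt" ]]; then
    # Verify the file is actually encrypted
    if git-crypt status "$file" | grep -q "not encrypted"; then
      echo "BLOCKED: $file requires encryption but is staged in plaintext" >&2
      exit 1  # a non-zero exit from pre-commit blocks the commit
    fi
  fi
done < <(git diff --cached --name-only --diff-filter=ACM)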
This means: any file matching **/private/** in .gitattributes is protected,
including future private/ directories created anywhere in the repo.
The hook never needs updating when the encryption rules change.
Untrusted agent output — manifest validation
The ingest agent's output is stochastic: a hallucinated manifest could carry a missing field,
a wrong type, or a malicious path such as wiki/../../etc/passwd. run-ingest.sh therefore
validates the manifest before trusting any field — it must be well-formed JSON with a
string raw_source and an array pages, and every path must be a string under wiki/
with no .. component. Anything else fails fast with a structured {"status":"error"} and no
filesystem access outside the wiki, so a bad path can't drive a read or a lint outside the
knowledge tree. This is the trust boundary between the (stochastic) model and the
(deterministic, tested) post-processor.
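The checks listed above can be sketched as a pure function that never touches the filesystem. This is a hedged Python illustration, not the validation code inside run-ingest.sh: validate_manifest and the reason strings are hypothetical, and it assumes each item in pages is a plain path string.

```python
import json
from pathlib import PurePosixPath


def validate_manifest(raw: str) -> dict:
    """Return {"status": "ok", ...} or {"status": "error", ...} for a manifest string."""
    try:
        manifest = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"status": "error", "reason": f"invalid JSON: {exc.msg}"}
    if not isinstance(manifest, dict):
        return {"status": "error", "reason": "manifest must be an object"}
    if not isinstance(manifest.get("raw_source"), str):
        return {"status": "error", "reason": "raw_source must be a string"}
    pages = manifest.get("pages")
    if not isinstance(pages, list):
        return {"status": "error", "reason": "pages must be an array"}
    for page in pages:
        # Assumption: every entry in pages is a plain path string.
        if not isinstance(page, str):
            return {"status": "error", "reason": "page path must be a string"}
        parts = PurePosixPath(page).parts
        # Must start with wiki/, name something below it, and contain no ".." component.
        if len(parts) < 2 or parts[0] != "wiki" or ".." in parts:
            return {"status": "error", "reason": f"path outside wiki/: {page}"}
    return {"status": "ok", "pages": pages}


bad = '{"raw_source": "raw/articles/a.md", "pages": ["wiki/../../etc/passwd"]}'
print(validate_manifest(bad)["status"])  # error
```

Comparing path components rather than substrings means a filename that merely contains two dots is not rejected, while any traversal component is.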
PRIVATE_CONTEXT toggle
The PRIVATE_CONTEXT toggle in AGENTS.md controls whether the LLM agent
accesses encrypted directories. It must be declared explicitly by the operator
at the start of every session:
PRIVATE_CONTEXT: disabled ← Default. private/ directories are treated as non-existent.
PRIVATE_CONTEXT: enabled ← Agent may read/write private/. Requires git-crypt unlock.
Rules:
- Never inferred. Never carried over from a previous session.
- enabled requires the operator to confirm that git-crypt unlock has run on the host.
- Per-genome, per-session: enabling for genome-finance does NOT enable for genome-dev.
- Cloud LLM models: PRIVATE_CONTEXT must always be disabled. Private data never leaves the local network.
- All outputs derived from private data are prefixed [PRIVATE DATA INCLUDED].
- Private synthesis goes exclusively to wiki/private/, never to public wiki paths.
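The rules above amount to a small decision function that an orchestrator could evaluate once per genome and per session. The following Python sketch is only an illustration of that logic; private_context_enabled and its parameters are hypothetical and do not exist in the repository.

```python
from typing import Optional


def private_context_enabled(declared: Optional[str], unlock_confirmed: bool,
                            cloud_model: bool) -> bool:
    """Decide whether private/ may be touched in this session for one genome."""
    if declared is None:
        return False  # never inferred, never carried over: absent means disabled
    state = declared.strip().lower()
    if state not in ("enabled", "disabled"):
        raise ValueError(f"PRIVATE_CONTEXT must be enabled or disabled, got {declared!r}")
    if state == "disabled":
        return False
    if cloud_model:
        raise PermissionError("PRIVATE_CONTEXT must stay disabled for cloud models")
    if not unlock_confirmed:
        raise PermissionError("enabled requires a confirmed git-crypt unlock on the host")
    return True


# Per-genome, per-session: each genome gets its own explicit declaration.
session = {"genome-finance": "enabled", "genome-dev": None}
for genome, declared in session.items():
    print(genome, private_context_enabled(declared, unlock_confirmed=True, cloud_model=False))
```

Raising instead of silently downgrading to disabled is a design choice in this sketch: a misconfigured session stops before any private path is read.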
Runtime key injection — zero disk policy
Encryption keys are never stored as persistent files on the AI server.
They are injected at session start via the Bitwarden CLI (bw) against
your self-hosted Vaultwarden instance, using process substitution:
# Step 1: authenticate
bw config server https://vault.yourserver.com
export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw)
# Step 2: unlock genome (key lives only in a kernel file descriptor — never touches disk)
git-crypt unlock <(
bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d
)
The key flows: Vaultwarden → bw get notes → base64 -d → kernel pipe → git-crypt.
At no point is the key written to any file on disk.
Lock a genome when the session ends:
git-crypt lock
Key Management
This section is for the operator. These commands are never issued by the LLM agent.
Vaultwarden Secure Notes
Each genome key is stored as a base64-encoded Secure Note in Vaultwarden:
| Genome | Vaultwarden Note Name |
|---|---|
| genome-dev | genome-dev key |
| genome-finance | genome-finance key |
| genome-homelab | genome-homelab key |
After make setup or make add-genome, key files are exported to keys/.
Upload procedure:
# Encode the key
base64 < keys/genome-dev.key
# Paste the output into a Vaultwarden Secure Note named "genome-dev key"
# Then delete the key file
rm keys/genome-dev.key
Cloning on a new machine
# Full clone with all submodules
git clone --recurse-submodules \
https://git.yourserver.com/yourusername/master-knowledge-genome.git
# Unlock a specific genome (with key file — development only)
cd master-knowledge-genome/genome-dev
git-crypt unlock /path/to/genome-dev.key
# Unlock via Vaultwarden (recommended — no key on disk)
export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw)
git-crypt unlock <(bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d)
# Sparse clone — collaborator who only needs one genome
git clone https://git.yourserver.com/yourusername/genome-dev.git
Key rotation (emergency)
If a key is lost or compromised:
# From the knowledge-genome-orchestrator/ directory
source lib/git-crypt.sh
# If gcrypt_rotate_key operates on the CWD: cd into .../master-knowledge-genome/genome-dev
# If it navigates by name instead: cd into .../master-knowledge-genome
cd ~/knowledge-genome-orchestrator/master-knowledge-genome
gcrypt_rotate_key "genome-dev"
macOS: gcrypt_rotate_key uses compgen -G (bash 4+). The stock macOS bash 3.2 is not enough; run rotation under Homebrew bash (brew install bash).
gcrypt_rotate_key performs:
- Unlocks repo with existing key
- Removes old key material
- Generates new symmetric key via git-crypt init
- Re-stages and commits private files (encrypted with new key)
- Exports new key to keys/
- Prints Vaultwarden update instructions
Limitation: git history still contains blobs encrypted with the old key. Anyone with the old key and git history access can decrypt them. To purge old encrypted blobs from history:
git filter-repo --invert-paths --path raw/private --path wiki/private
git push --force origin main
This rewrites all commit hashes. Coordinate with any collaborators first.
After rotation:
- Upload new key to Vaultwarden (replace existing note)
- Delete both keys/genome-dev.key and keys/genome-dev-rotated-*.key from disk
- Revoke access from previous key holders
Agent Sessions
Prerequisites for every session
Before starting an LLM agent session on a genome:
- The host (AI server) runs git-crypt unlock for the required genomes
- The orchestrator prepares context: tail -n 20 wiki/log.md
- Declare PRIVATE_CONTEXT state explicitly in the opening prompt
Session start protocol
The agent executes in this order at the start of every session:
- Read wiki/index.md: primary catalog of all pages and maturity
- Read last 20 log entries (injected by orchestrator; the agent does NOT open wiki/log.md directly)
- For tasks involving related pages: if the optional qmd extension is installed, run qmd search "<query>" before opening files; otherwise navigate from wiki/index.md
- Operate on individual files; never scan entire directories
One source per session
With a 14B model and ~6GB KV cache budget, long sessions degrade. As the session extends, the context fills with pages already created, attention dilutes, and later entities receive worse cross-references than earlier ones.
Hard rule: one source per session.
If multiple sources are queued in raw/, process only the first.
Commit, close the session. The orchestrator (n8n or script) starts a new session
for the next source with a clean KV cache.
For automated pipelines: if 5 files arrive in raw/, trigger 5 agent sessions
sequentially — not one session with 5 files.
n8n automation
For Forgejo webhook → automated ingest:
- Forgejo sends webhook on push to raw/
- n8n receives webhook, identifies new files
- n8n starts one agent session per new file (sequential, not parallel)
- Each session: realign the checkout to the base (git switch <base> && git reset --hard origin/<base>), then inject tail -n 20 wiki/log.md + PRIVATE_CONTEXT state + source path
- Phase 1 agent (/skill:ingest) writes the manifest; Phase 2 run-ingest.sh opens the PR, then stops
- Human reviews: merge to accept, or close the PR and delete the feat branch to reject
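The "identifies new files" step can be sketched as a filter over the push payload. The Python below assumes the Gitea-style push payload shape (a commits array whose items carry an added list of paths); treat that shape as an assumption to verify against your Forgejo version. new_raw_sources is a hypothetical helper, not part of the n8n flow.

```python
def new_raw_sources(payload: dict) -> list:
    """Collect files added under raw/ across all commits of a push, in order, deduplicated."""
    seen = set()
    queue = []
    for commit in payload.get("commits", []):
        for path in commit.get("added", []):
            if path.startswith("raw/") and path not in seen:
                seen.add(path)
                queue.append(path)
    return queue


payload = {
    "commits": [
        {"added": ["raw/articles/a.md", "wiki/sources/x.md"]},
        {"added": ["raw/transcripts/b.md"], "modified": ["raw/articles/c.md"]},
    ]
}
# One agent session per queued file, started sequentially by the orchestrator.
for source in new_raw_sources(payload):
    print("queue session for", source)
```

Only added paths are queued in this sketch, which matches the rule that raw/ sources are immutable once committed.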
Workflows
Ingest
Triggered by a new file in raw/ (manual or via webhook). Ingest is split into two
phases so that the small local model spends its limited context only on judgement, and
all the deterministic bookkeeping happens outside the model's loop.
Phase 1 — agent (semantic only). The ingest skill gives the agent read/edit tools
only (no shell). It:
- Reads the source once
- Creates wiki/sources/<slug>.md (summary and key points)
- Per entity (person, tool, organisation): creates or updates wiki/entities/<name>.md
- Per concept (pattern, theory, decision): creates or updates wiki/concepts/<name>.md
- Checks each touched page for contradictions → applies Conflict Resolution if found
- Writes .ingest-manifest.json (the list of pages it created/modified, the model name, a one-line reasoning, the PR summary, and any contradictions), then stops
Phase 2 — run-ingest.sh (deterministic, outside the agent). The post-processor first
validates the manifest — well-formed JSON, expected shape, and every page path confined to
wiki/ with no .. (see Security Model) — then does the mechanical work the
model must not waste context on:
- Inserts each page into the correct wiki/index.md section in alphabetical order, deduplicated by wikilink (a re-ingest updates the entry, never duplicates it), and bumps the index last_updated (index-append.py)
- Appends the INGEST | <slug> entry to wiki/log.md (the model name comes from the orchestrator via INGEST_MODEL; the agent cannot reliably know its own tag)
- Runs scoped lint on exactly the pages touched this run (scoped-lint.sh, reusing lib/lint.sh)
- Commits only wiki/ on feat/ai-ingest-<slug> and opens a PR against the integration base (INGEST_BASE, default main); the body matches the templates/pr-description.md structure (Summary / Pages / Contradictions / Scoped Lint)
- Emits a single compact JSON line (status, slug, PR url, lint_clean, conflict) for n8n
The agent never runs git, never edits the index/log mechanically, and never lints — those are deterministic and tested (see Testing). Invocation on the AI node:
pi --mode json -p "/skill:ingest raw/articles/<file>.md" # phase 1 → writes manifest
run-ingest.sh <genome> # phase 2 → index/log/lint/PR
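The index step of phase 2 (sorted insert, deduplicated by wikilink) can be illustrated as follows. This is a sketch of the technique, not the contents of index-append.py: the entry format "- [[target]]: summary" and the helper names are assumptions.

```python
import re

WIKILINK = re.compile(r"\[\[([^\]|]+)")


def link_target(entry: str) -> str:
    """Extract the wikilink target from an index line such as '- [[concepts/rag]]: summary'."""
    match = WIKILINK.search(entry)
    if match is None:
        raise ValueError(f"no wikilink in index entry: {entry!r}")
    return match.group(1).strip()


def upsert_entry(section: list, new_entry: str) -> list:
    """Return the section with new_entry placed alphabetically, replacing any same-link entry."""
    target = link_target(new_entry)
    kept = [line for line in section if link_target(line) != target]
    kept.append(new_entry)
    return sorted(kept, key=lambda line: link_target(line).lower())


section = ["- [[concepts/alpha]]: first", "- [[concepts/gamma]]: third"]
section = upsert_entry(section, "- [[concepts/beta]]: second")
section = upsert_entry(section, "- [[concepts/beta]]: second, revised")  # re-ingest
print(len(section))  # 3
```

Keying on the link target rather than the whole line is what makes a re-ingest idempotent: the summary can change without producing a second entry.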
For private sources (PRIVATE_CONTEXT: enabled required):
- All output goes to wiki/private/<slug>.md only
- PR title: [PRIVATE] ingest: <slug>
Branch lifecycle & the manual gate. run-ingest.sh / open-pr.sh are deliberately
"dumb": they create the feat/ai-ingest-<slug> branch, commit only wiki/, open the PR, and
stop. They never reset, revert, or touch the integration branch — that lifecycle belongs to
the orchestrator, around the human gate:
- Before each session the orchestrator realigns the checkout to the base (git fetch && git switch <base> && git reset --hard origin/<base>). This is a reset of the local checkout to match the remote, never a force-push to the shared branch.
- After the PR opens, everything stops until a human approves: one source per session, sequential, no new ingest until the pending PR is closed.
- Approve = merge. Reject = close the PR and delete the remote feat branch. To undo an already-merged ingest, open a revert PR against the base; never rewrite history on a shared branch.
The PR base is configurable via INGEST_BASE (default main). Per-page maturity already
encodes stability and tags/releases mark versioned snapshots, so main is the integration
branch today. If a linked project later consumes a genome, set INGEST_BASE=develop to
buffer ingests on develop and cut manual develop → main releases — no code change.
Query
Triggered by an operator question.
- qmd search "<query>" (if the optional qmd extension is installed) → identify candidate pages; otherwise start from wiki/index.md
- Read candidate pages directly (qmd already returns file paths, so no intermediate index lookup is needed)
- Synthesise answer with [[wikilink]] citations
- If answer is non-trivial: save as wiki/queries/<slug>.md and append to index
- Append log entry: QUERY | <subject>
For general orientation without a specific query: read wiki/index.md directly.
Lint
The lint workflow is split between deterministic bash checks and semantic LLM judgment.
Step 1 — operator runs bash linter:
make lint
The bash linter checks automatically:
- YAML frontmatter validity (all mandatory fields present)
- Domain consistency (domain field matches genome name)
- Type validity (value from allowed list)
- Privacy consistency (private/ directories have private: true)
- Page size (warn at 400 lines, error at 800 lines)
- Knowledge decay (stable > 180 days, draft > 90 days)
- Broken internal wikilinks (warnings only — cross-type links produce expected false positives)
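Two of the deterministic checks above are simple enough to sketch. The Python below illustrates the page size thresholds and the privacy consistency rule; the shipped validators live in lib/lint.sh (bash), and check_page_size and check_privacy are hypothetical names. Treating the thresholds as inclusive is an assumption.

```python
def check_page_size(line_count: int, warn_at: int = 400, error_at: int = 800) -> str:
    """Classify a page by length: ok, warn (>= 400 lines), or error (>= 800 lines)."""
    if line_count >= error_at:
        return "error"
    if line_count >= warn_at:
        return "warn"
    return "ok"


def check_privacy(path: str, frontmatter: dict) -> bool:
    """Pages under any private/ directory must carry private: true in frontmatter."""
    directories = path.split("/")[:-1]
    if "private" not in directories:
        return True
    return frontmatter.get("private") is True


print(check_page_size(399), check_page_size(400), check_page_size(800))  # ok warn error
print(check_privacy("wiki/private/tax-notes.md", {"private": True}))     # True
```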
Step 2 — operator provides bash output to LLM agent:
The agent applies semantic judgment to findings the bash linter cannot make:
- Orphan pages (from bash list): for each orphan, identify 1-3 existing pages that should link to it; propose specific additions
- Implicit concepts (from bash term frequency list): determine if a candidate term warrants a dedicated page; draft stub if yes
- Duplicate concepts: qmd search "<concept>" for suspected duplicates; propose merge if confirmed
- Maturity promotion: pages with 2+ sources still marked draft → propose stable
The agent reports all findings as a structured list. It does not modify files
without operator approval. It then appends a LINT | <summary> log entry.
Knowledge Quality
PR review workflow
Every agent session that modifies wiki pages opens a PR.
The PR description uses templates/pr-description.md:
## Summary
One sentence: goal of this session and source processed.
## Pages Created
| Path | Type | Maturity |
## Pages Modified
| Path | Change |
## Contradictions Found
[ ] None / [ ] n conflict file(s) created
## Private Data Accessed
[ ] No (PRIVATE_CONTEXT: disabled) / [ ] Yes
## Scoped Lint (post-ingest)
[ ] Frontmatter valid [ ] No broken links [ ] No issues found
This makes human review fast and structured: read the table, scan the diff, approve or request changes. No exploration required to understand what the agent did.
Conflict resolution
When new evidence contradicts an existing wiki claim:
- Keep the existing page unchanged
- Create wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md with:
  - The existing claim and its source
  - The contradicting evidence and its source
  - Agent confidence assessment for each
  - Recommendation: accept_b | keep_a | requires_human_review
- Add entry to wiki/index.md → Conflicts Pending Review section
- Log entry: CONFLICT | <concept>
- Open PR: [CONFLICT] <concept> — human review required
The operator resolves the conflict, updates the relevant pages, and closes the PR.
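The conflict file name follows a fixed pattern, so it can be derived mechanically. A minimal Python sketch, assuming the concept is slugified to lowercase hyphen-separated text (the slug rule and the function name conflict_page_path are illustrative, not taken from the project's scripts):

```python
import re
from datetime import date

# Recommendation values listed in the conflict page description.
RECOMMENDATIONS = {"accept_b", "keep_a", "requires_human_review"}


def conflict_page_path(concept: str, day: date) -> str:
    """Build wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md for a concept."""
    # Assumption: collapse anything that is not a-z or 0-9 into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", concept.lower()).strip("-")
    if not slug:
        raise ValueError("concept must contain at least one letter or digit")
    return f"wiki/queries/conflict-{slug}-{day.isoformat()}.md"
```

Including the date in the name means a second conflict on the same concept, raised on a later day, gets its own file instead of overwriting the first.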
Knowledge decay
Pages have a last_updated field in frontmatter. During lint passes:
| Maturity | Threshold | Action |
|---|---|---|
| stable | 180 days | Flag as stale; add ⚠️ STALE callout |
| draft | 90 days | Flag as stale; add ⚠️ STALE callout |
The agent proposes re-validation but does not change maturity without new source evidence.
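The decay thresholds reduce to a date comparison. A minimal Python sketch, assuming last_updated has already been parsed from frontmatter; the real check lives in the bash linter, and the names STALE_AFTER_DAYS and is_stale are illustrative:

```python
from datetime import date

# Thresholds from the decay table; deprecated pages have no threshold.
STALE_AFTER_DAYS = {"stable": 180, "draft": 90}


def is_stale(maturity: str, last_updated: date, today: date) -> bool:
    """Return True when a page is older than the threshold for its maturity."""
    limit = STALE_AFTER_DAYS.get(maturity)
    if limit is None:
        # No decay rule is defined for this maturity (e.g. deprecated).
        return False
    return (today - last_updated).days > limit
```

Passing today explicitly rather than calling date.today() inside the function keeps the check reproducible in tests.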
Cross-genome references
Cross-domain knowledge moves by pull, never push: the genome you are working in draws material in; nothing is ever written into another genome. There are no cross-genome wikilinks — submodule pointers make relative paths brittle.
When the working genome needs a concept that lives elsewhere, the navigation skill handles it in the same two-phase shape as ingest:
- A deterministic collector clones the relevant genomes read-only at HEAD (fresh, never the pinned submodule state) and assembles a dossier of excerpts with provenance.
- A semantic pass reads only that dossier; the skill then deposits one abstract, non-private raw into the working genome at raw/articles/crossgen-<topic>-<date>.md.
- That raw goes through the working genome's normal ingest → PR → human gate, like any source.
Which genomes may be read as sources is gated by a per-genome cross_source: yes|no flag: a
confidential genome (e.g. a client file) is marked no and is never read as a source — the wall
is structural, not a matter of the agent's discipline. The master AGENTS.md holds the full
boundary contract.
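The cross_source gate can be read as a default-deny filter applied before anything is cloned. A minimal Python sketch, assuming the flags have already been loaded into a dict; where the flag is stored is not stated here, and the genome names in the usage note are hypothetical:

```python
def readable_sources(flags: dict, working: str) -> list:
    """Return the genomes the collector may read, given name -> cross_source flag."""
    allowed = []
    for name, flag in flags.items():
        if name == working:
            # Assumption: the working genome is never its own cross-genome source.
            continue
        if flag == "yes":
            # Default-deny: "no", a missing value, or a typo all exclude the genome.
            allowed.append(name)
    return sorted(allowed)
```

For example, with flags {"genome-ops": "yes", "genome-client": "no"} and working genome "genome-dev", only "genome-ops" is returned. Because the filter runs in the deterministic collector, a confidential genome never reaches the semantic pass at all.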
Knowledge Schema
Frontmatter
Every wiki page must start with valid YAML frontmatter:
---
title: "Strict String Title"
type: source | entity | concept | query | conflict | private
domain: genome-name
tags: [lowercase, hyphen-separated]
maturity: draft | stable | deprecated
last_updated: YYYY-MM-DD
private: true | false
---
| Field | Rules |
|---|---|
| type | Must be one of: source entity concept query conflict private index log |
| maturity: draft | Single source or unvalidated |
| maturity: stable | Confirmed by 2+ independent sources |
| maturity: deprecated | Superseded; add > **DEPRECATED:** <reason> callout at top |
| private: true | Required on all pages in wiki/private/ and raw/private/ |
Do not use semantic versioning for content. Git history tracks every change.
maturity captures epistemic state; last_updated tracks recency.
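The frontmatter rules can be checked mechanically. A minimal Python sketch using only the standard library, assuming flat key: value frontmatter and treating all seven fields in the example as mandatory; the project's own check is a bash linter, so parse_frontmatter and validate are illustrative names:

```python
import re

# Assumption: all seven fields shown in the example are mandatory.
MANDATORY = ("title", "type", "domain", "tags", "maturity", "last_updated", "private")
ALLOWED_TYPES = {"source", "entity", "concept", "query", "conflict", "private", "index", "log"}
ALLOWED_MATURITY = {"draft", "stable", "deprecated"}


def parse_frontmatter(text: str) -> dict:
    """Read flat `key: value` pairs between the opening and closing --- lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("page does not start with ---")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    raise ValueError("frontmatter is not closed with ---")


def validate(fields: dict, genome: str) -> list:
    """Return a list of problems; an empty list means the frontmatter passes."""
    problems = [f"missing field: {name}" for name in MANDATORY if name not in fields]
    if "type" in fields and fields["type"] not in ALLOWED_TYPES:
        problems.append(f"invalid type: {fields['type']}")
    if "maturity" in fields and fields["maturity"] not in ALLOWED_MATURITY:
        problems.append(f"invalid maturity: {fields['maturity']}")
    if "domain" in fields and fields["domain"] != genome:
        problems.append(f"domain {fields['domain']} does not match genome {genome}")
    if "last_updated" in fields and not re.fullmatch(r"\d{4}-\d{2}-\d{2}", fields["last_updated"]):
        problems.append("last_updated is not YYYY-MM-DD")
    if "private" in fields and fields["private"] not in ("true", "false"):
        problems.append("private must be true or false")
    return problems
```

The parser deliberately handles only the flat shape shown in the example; nested YAML would need a real YAML library.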
Page types and directories
| Type | Directory | Description |
|---|---|---|
| source | wiki/sources/ | One page per processed raw source |
| entity | wiki/entities/ | People, tools, organisations, projects |
| concept | wiki/concepts/ | Patterns, theories, architectural decisions |
| query | wiki/queries/ | Preserved answers and analyses |
| conflict | wiki/queries/conflict-*.md | Unresolved contradictions |
| private | wiki/private/ | Private synthesis (PRIVATE_CONTEXT: enabled) |
| index | wiki/index.md | Primary navigation catalog (singleton) |
| log | wiki/log.md | Operations ledger (singleton) |
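The type-to-directory mapping lends itself to a simple consistency check between a page's type field and its path. A minimal Python sketch; the function name is illustrative, and treating conflict-*.md as reserved for the conflict type is an assumption made here because conflicts share wiki/queries/ with ordinary queries:

```python
from fnmatch import fnmatchcase

# Glob pattern per page type, taken from the page types table.
TYPE_LOCATIONS = {
    "source": "wiki/sources/*",
    "entity": "wiki/entities/*",
    "concept": "wiki/concepts/*",
    "query": "wiki/queries/*",
    "conflict": "wiki/queries/conflict-*.md",
    "private": "wiki/private/*",
    "index": "wiki/index.md",
    "log": "wiki/log.md",
}


def type_matches_path(page_type: str, path: str) -> bool:
    """Return True when a page of this type is stored where the table says."""
    pattern = TYPE_LOCATIONS.get(page_type)
    if pattern is None or not fnmatchcase(path, pattern):
        return False
    # Assumption: a file named conflict-*.md must carry type conflict, not query.
    if page_type == "query" and fnmatchcase(path, TYPE_LOCATIONS["conflict"]):
        return False
    return True
```

fnmatchcase is used instead of fnmatch so the result does not depend on the operating system's case or path-separator rules.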
Page size limits
| Limit | Lines | Action |
|---|---|---|
| Soft cap | 400 | Bash linter warns |
| Hard cap | 800 | Bash linter errors — split the page |
These limits ensure pages fit within the LLM context window without attention degradation and keep the wiki atomically navigable.
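As a rough Python illustration of the two caps (the linter itself is bash), assuming a page trips a cap only once it exceeds it; whether exactly 400 or 800 lines counts is not specified, so the strict comparison is an assumption:

```python
SOFT_CAP = 400  # linter warns above this
HARD_CAP = 800  # linter errors above this; split the page


def size_status(text: str) -> str:
    """Classify a page as "ok", "warn", or "error" by its line count."""
    line_count = len(text.splitlines())
    if line_count > HARD_CAP:
        return "error"
    if line_count > SOFT_CAP:
        return "warn"
    return "ok"
```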
Linking conventions
- Intra-genome: [[folder/file]] (Obsidian wikilinks only).
- Cross-genome: NOT supported via wikilink, because submodule pointers make relative paths brittle. When the working genome needs a concept that lives elsewhere, the navigation skill pulls it in as one abstract raw under this genome's raw/articles/, which then goes through normal ingest. See Cross-genome references.
- External: [text](https://...) (standard Markdown).
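A broken-link check for intra-genome wikilinks can be sketched in a few lines of Python. This assumes [[folder/file]] resolves relative to the genome's wiki/ directory and maps to folder/file.md, which is not spelled out here; the function names are illustrative:

```python
import re
from pathlib import Path

# Captures the target of [[target]], [[target|alias]] and [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def wikilink_targets(text: str) -> list:
    """Return every wikilink target in the text, in order of appearance."""
    return [match.group(1).strip() for match in WIKILINK.finditer(text)]


def page_names(wiki_root: Path) -> set:
    """Collect existing pages as folder/file names without the .md suffix."""
    return {
        page.relative_to(wiki_root).with_suffix("").as_posix()
        for page in wiki_root.rglob("*.md")
    }


def broken_links(text: str, existing: set) -> list:
    """Return wikilink targets that do not correspond to an existing page."""
    return [target for target in wikilink_targets(text) if target not in existing]
```

A typical call would be broken_links(page_text, page_names(Path("wiki"))). As with the bash linter, results are best treated as warnings rather than errors.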
Log format
Every operation appends one entry to wiki/log.md:
## [YYYY-MM-DD] TYPE | Subject
- run_id: `<uuid>`
- model: `<model-name>`
- context_read: `[[path/A]]`, `[[path/B]]`
- output_written: `[[path/C]]`
- reasoning: One sentence — what changed and why.
Valid TYPEs: INGEST LINT QUERY CONFLICT CONFIG SECURITY
Parse examples:
grep "^## \[" wiki/log.md | tail -5 # Last 5 entries
grep "^## \[" wiki/log.md | grep "CONFLICT" # All conflicts
grep "^## \[2026-05" wiki/log.md # Entries from a specific month
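The same header convention can be parsed in Python when a script needs structured entries rather than grep output. A minimal sketch, assuming headers follow the format above exactly; parse_log_headers is an illustrative name, not part of the project's scripts:

```python
import re

VALID_TYPES = {"INGEST", "LINT", "QUERY", "CONFLICT", "CONFIG", "SECURITY"}

# Matches lines such as: ## [YYYY-MM-DD] TYPE | Subject
HEADER = re.compile(r"^## \[(\d{4}-\d{2}-\d{2})\] ([A-Z]+) \| (.+)$")


def parse_log_headers(log_text: str) -> list:
    """Return (date, type, subject) tuples for every valid entry header."""
    entries = []
    for line in log_text.splitlines():
        match = HEADER.match(line)
        # Skip detail lines and headers whose TYPE is not in the valid list.
        if match and match.group(2) in VALID_TYPES:
            entries.append((match.group(1), match.group(2), match.group(3)))
    return entries
```

Filtering the result by the second tuple element reproduces the CONFLICT grep, and slicing the last five items reproduces the tail example.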
ingest-semantic.py receives source text + existing entity/concept names (from index) as prompt context. The LLM never loads the full log.
Collaboration Model
| Role | Key access | Permitted operations |
|---|---|---|
| Owner | Full — key holder | Read/write everywhere |
| Collaborator | None | Push to raw/articles/, raw/transcripts/, raw/code-packs/, raw/assets/ |
| Local AI agent | Conditional | private/ only when PRIVATE_CONTEXT: enabled |
| Cloud AI model | Never | PRIVATE_CONTEXT must be disabled; private data stays on local network |
Grant collaborator access: add as Forgejo contributor with Write role. Never share the git-crypt key — collaborators operate exclusively in public directories.
Optional Extensions
qmd — local Markdown search
qmd is a local, on-device BM25 + vector search engine for Markdown files. It has both a CLI (for shell scripts and agent tool calls) and an MCP server (for native LLM tool use).
Recommended at scale: once a genome exceeds ~150 pages, qmd search is significantly
faster and more accurate than navigating wiki/index.md manually.
# Index a genome
qmd index genome-dev/wiki/
# Search
qmd search "graph-based state management"
# Start MCP server (for Claude Code / Codex integration)
qmd serve --port 3333
Obsidian integration
Obsidian is the recommended wiki browser. Open any genome directory as an Obsidian vault.
Recommended setup:
- Graph view: visualise page connections; spot orphans and hubs instantly
- Obsidian Web Clipper: browser extension to clip articles directly to raw/articles/ as Markdown
- Download attachments: Settings → Hotkeys → "Download attachments for current file". Binds to a hotkey (e.g. Ctrl+Shift+D). After clipping, downloads all images to raw/assets/
- Dataview plugin: query YAML frontmatter across the wiki; TABLE maturity, last_updated WHERE domain = "genome-dev" generates dynamic tables
- Marp plugin: render Markdown as slide decks directly from wiki content
Note: .obsidian/ is in .gitignore. Workspace and plugin settings are local — not synced.
n8n automation
n8n → SSH → ingest-semantic.py → run-ingest.sh
n8n (running on the storage node) can automate the ingest pipeline:
- Forgejo webhook fires on push to a genome's raw/ directory
- n8n flow identifies new files
- For each new file: starts one agent session (sequential, never parallel)
- Each session receives: tail -n 20 wiki/log.md + PRIVATE_CONTEXT state + source path
- Phase 1: agent runs /skill:ingest (semantic → writes manifest); Phase 2: run-ingest.sh does index/log/lint and opens the PR, returning one JSON line to n8n
- Human reviews the PR
Key constraint: one source per session, sessions sequential. Never batch multiple sources into one agent session.
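The one-source-per-session rule amounts to a plain loop that waits for each session before starting the next. A Python sketch of that shape, assuming a caller-supplied run_session function that blocks until the session ends and returns its stdout; both names and the JSON fields in the usage note are hypothetical, since the actual orchestration is an n8n flow:

```python
import json
from typing import Callable, Iterable


def ingest_all(sources: Iterable[str], run_session: Callable[[str], str]) -> list:
    """Run one session per source, strictly in order, and collect the results."""
    results = []
    for source in sources:
        # run_session blocks, so the next session cannot start early.
        stdout = run_session(source)
        # The pipeline is described as returning one JSON line; read the last line.
        results.append(json.loads(stdout.strip().splitlines()[-1]))
    return results
```

For instance, a run_session that returns '{"source": "raw/articles/a.md"}' yields a list with that one parsed object. Keeping the loop free of threads or async tasks makes the sequential guarantee hold by construction.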
Intel NPU offloading
If the AI compute node has an Intel NPU (e.g. Core Ultra series):
- Background/auxiliary tasks (OCR of raw/assets/, async summarisation, or qmd re-indexing if the optional qmd extension is in use) → Intel NPU via OpenVINO
- Active reasoning sessions (ingest, query, synthesis) → GPU
Note: the core system has no embedding pipeline (see Core Philosophy), so there is nothing to embed here — the NPU is only for auxiliary work. This keeps the GPU's KV cache free for interactive sessions and lowers power draw for background jobs.
Troubleshooting
git-crypt: command not found
# Ubuntu/Debian
sudo apt install git-crypt
# macOS
brew install git-crypt
make setup fails with "MISSING: jq"
make doctor # identifies all missing tools
sudo apt install git git-crypt curl jq
Pre-commit hook blocks a commit with "PLAINTEXT LEAK DETECTED"
The staged file is in a path matching **/private/** but is not encrypted.
Fix options:
- Verify .gitattributes contains **/private/** filter=git-crypt diff=git-crypt -text
- Run git-crypt init if git-crypt is not initialised in this repo
- Run git-crypt status to check the encryption state of all files
Never use git commit --no-verify to bypass this check.
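The first fix option can be automated with a small text check. A minimal Python sketch that looks for the rule in the contents of .gitattributes; it is not the project's pre-commit hook, the function name is illustrative, and it ignores the fact that a later line in the file could override the rule:

```python
REQUIRED_PATTERN = "**/private/**"
REQUIRED_ATTRS = {"filter=git-crypt", "diff=git-crypt", "-text"}


def has_private_rule(gitattributes_text: str) -> bool:
    """Return True when some line applies all required attributes to the pattern."""
    for line in gitattributes_text.splitlines():
        parts = line.split()
        # parts[0] is the path pattern; the remaining tokens are attributes.
        if parts and parts[0] == REQUIRED_PATTERN and REQUIRED_ATTRS.issubset(parts[1:]):
            return True
    return False
```

A usage example would be has_private_rule(open(".gitattributes").read()). For the authoritative answer on a specific file, git check-attr filter -- <path> reports what Git actually resolves.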
git-crypt status shows files as "not encrypted" after init
The .gitattributes rule must be committed before files in private/ are staged.
If files were staged before .gitattributes was committed:
git rm -r --cached raw/private/ wiki/private/
git add raw/private/ wiki/private/
git commit -m "fix: re-stage private files for encryption"
Agent returns stale or missing cross-references
Likely causes:
- Session was too long and the KV cache degraded. Use one source per session.
- wiki/index.md was not read at session start, so the agent lacked the page catalog.
- qmd index is stale; re-index:
qmd index <genome>/wiki/
Submodules show as "modified" after make sync
This is normal if genome repos have new commits. Update master's pointers:
cd master-knowledge-genome
git add .
git commit -m "chore: update submodule pointers"
git push
bw unlock fails
Verify you are using bw (standard Bitwarden CLI), not bws (Secrets Manager CLI).
bws does not work with self-hosted Vaultwarden.
bw --version # should print e.g. "2024.x.x"
bw config server https://vault.yourserver.com
bw login