Matteo Cherubini 6615d9b1d6 docs: Clarify genome naming convention and registry setup in README

2026-06-19 05:49:41 +02:00

Knowledge Genome System

A distributed, encrypted, multi-domain personal knowledge base. No vector database. No embedding pipeline. No external retrieval server.

Built on the LLM Wiki pattern by Andrej Karpathy — extended with a multi-domain submodule architecture, AES-256-CTR encryption via git-crypt, Vaultwarden runtime key injection, and a human-in-the-loop Git Flow for quality control.

Core Philosophy
Architecture
System Requirements
Prerequisites
Configuration
Quick Start
Makefile Reference
Testing
Genome Lifecycle
Security Model
Key Management
Agent Sessions
Workflows
Knowledge Quality
Knowledge Schema
Collaboration Model
Optional Extensions
Troubleshooting

Core Philosophy

Most RAG systems make the LLM rediscover knowledge from scratch on every query. A document is indexed; at query time, relevant chunks are retrieved; an answer is generated. Nothing accumulates. Ask a question requiring synthesis across five documents and the LLM pieces it together from fragments every single time.

This system is different. Instead of retrieval at query time, the LLM incrementally builds and maintains a persistent wiki that sits between you and the raw sources. When a new source arrives, the LLM reads it, extracts key information, updates entity and concept pages, flags contradictions with existing claims, and strengthens the evolving synthesis. Knowledge is compiled once and kept current.

The wiki is a compounding artifact. Cross-references are already there. Contradictions have been flagged. The synthesis already reflects everything ingested.

This means:

No vector database.
No embedding pipeline.
No external retrieval infrastructure.

The wiki/index.md of each genome is the retrieval layer. At moderate scale (~100 sources, hundreds of pages) this performs better than RAG because cross-references, contradictions, and syntheses are already resolved — not re-derived per query.

The human's job: curate sources, direct analysis, ask good questions, review PRs. The LLM's job: everything else — summarising, cross-referencing, filing, maintaining consistency.

Architecture

Repository structure

master-knowledge-genome/              ← Root orchestrator (submodule registry)
├── core-karpathy/                    ← LLM Wiki reference pattern (read-only submodule)
├── genome-dev/                       ← Submodule: web development, Angular, TUI
├── genome-finance/                   ← Submodule: personal finance, investments
├── genome-homelab/                   ← Submodule: Keru infrastructure, network configs
└── AGENTS.md                         ← Global coordination schema (cross-genome rules)

The genome names above (genome-dev, genome-finance, genome-homelab) are illustrative — they show the kind of multi-domain layout this orchestrator targets. The shipped registry.sh defines a single disposable sandbox, genome-test; you create real genomes yourself with make add-genome (see the registry examples below).

Each genome is an independent git repository:

genome-{name}/
├── .gitattributes                    ← Encryption rules — **/private/** wildcard
├── .gitignore
├── .git/hooks/pre-commit             ← Security hook (dynamic git check-attr)
├── AGENTS.md                         ← Per-genome agent contract and workflow rules
│
├── raw/                              ← Immutable sources — LLM reads, never writes
│   ├── articles/                     ← Web clips, saved articles
│   ├── transcripts/                  ← Audio/video transcripts
│   ├── code-packs/                   ← Code snippets and repositories
│   ├── assets/                       ← Images, PDFs, binary files
│   └── private/                      ← AES-256-CTR encrypted — owner only
│
└── wiki/                             ← LLM-owned — agent creates and maintains
    ├── index.md                      ← Primary catalog (read first every session)
    ├── log.md                        ← Append-only operations ledger
    ├── sources/                      ← One page per processed raw source
    ├── entities/                     ← People, tools, organisations, projects
    ├── concepts/                     ← Patterns, theories, architectural decisions
    ├── queries/                      ← Preserved answers and conflict notes
    └── private/                      ← AES-256-CTR encrypted — owner only

Three layers

Layer	Path	Owner	Rule
Raw sources	`raw/`	Human	Immutable. LLM reads only. Never modified.
Wiki	`wiki/`	LLM	Agent creates, updates, cross-links, maintains.
Schema	`AGENTS.md`	Human + LLM	Co-evolved contract defining structure and workflows.

Linked projects (optional)

A genome can optionally declare a linked project repository — a separate repo where the knowledge in that genome is meant to be applied (e.g. genome-dev linked to an app repo). The link is recorded as a third field in the registry and rendered into the genome's AGENTS.md (## Linked Project). A genome with no link is knowledge-only and behaves exactly as before. See Configuration.

Framework structure

knowledge-genome-orchestrator/        ← This repository (setup tooling)
├── globals.env                       ← Static KEY=VALUE config (Make-includable)
├── registry.sh                       ← Bash-only: GENOMES array + dynamic paths
├── Makefile                          ← Entry point for all operations
├── lib/
│   ├── output.sh                     ← Terminal helpers (colors, log levels)
│   ├── deps.sh                       ← Dependency validation
│   ├── scaffold.sh                   ← Template rendering engine
│   ├── structure.sh                  ← Canonical genome layout (single source of truth)
│   ├── lint.sh                       ← Per-file validation functions
│   └── git-crypt.sh                  ← git-crypt lifecycle (init, export, verify, rotate)
├── providers/
│   ├── forgejo.sh                    ← Forgejo REST API provider
│   └── github.sh                     ← GitHub REST API provider
├── scripts/
│   ├── setup.sh                      ← Main entry point
│   ├── setup-master.sh               ← Master repo initialisation
│   ├── setup-genomes.sh              ← Genome provisioning loop
│   ├── add-genome.sh                 ← Add a single new genome
│   ├── lint-genomes.sh               ← Quality control across all genomes
│   └── verify-genomes.sh             ← Structure verify / --sync across all genomes
├── templates/
│   ├── agents-genome.md              ← Per-genome agent contract template
│   ├── agents-master.md              ← Master coordination schema template
│   ├── readme-master.md              ← Master repo README template
│   ├── wiki-index.md                 ← Index template (rendered per genome)
│   ├── wiki-log.md                   ← Log template (rendered per genome)
│   ├── pr-description.md             ← PR review checklist template
│   ├── pre-commit.sh                 ← Security hook template
│   ├── gitattributes                 ← Git encryption rules template
│   └── gitignore                     ← Git ignore template
├── skills/
│   └── ingest/                       ← pi skill: deployed to the AI node (vm101)
│       ├── SKILL.md                  ← Semantic-only contract (read/edit, emits manifest)
│       ├── references/               ← On-demand reference docs for the agent
│       └── scripts/                  ← Deterministic post-processor (runs outside the agent)
│           ├── run-ingest.sh         ← Orchestrator: consumes the manifest, emits one JSON line
│           ├── slug.sh               ← Slug normalisation
│           ├── index-append.py       ← Sorted insert into wiki/index.md + last_updated bump
│           ├── log-append.sh         ← Append a wiki/log.md entry
│           ├── scoped-lint.sh        ← Lint only the pages touched this run (reuses lib/lint.sh)
│           └── open-pr.sh            ← Branch / commit / push / open PR (DRY_RUN seam for tests)
└── tests/                            ← bats suite — deterministic, no LLM/GPU (see Testing)
    ├── helpers.bash
    ├── scripts.bats
    ├── lint.bats
    ├── structure.bats
    └── run-ingest.bats

The skills/ingest/ directory is version-controlled here but deployed to the AI node (vm101) under ~/.pi/agent/skills/ingest. The agent (pi) does only semantic work and writes a manifest; run-ingest.sh does the mechanical steps. See Workflows → Ingest.

ingest-semantic.py makes one schema-constrained call to the local model and returns JSON; run-ingest.sh then handles the index, log, lint, and PR steps. In short: semantic JSON extraction first, then deterministic wiki conformance plus the manifest.

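Deploy with cp -r skills/ingest/* ~/.pi/agent/skills/ingest/ after make setup (-r is required because the skill contains the references/ and scripts/ subdirectories). Updates are pulled with git pull on the laptop and pushed to vm101 over SSH by the n8n flow.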

System Requirements

Linux — full support (primary target)

All scripts are written for GNU/bash on Linux. Tested on Ubuntu 22.04+. All tools (git-crypt, bw, qmd) have native Linux binaries.

macOS — full support

All scripts are compatible with macOS. Requirements:

bash 3.2+ (macOS default) — supported for the setup scripts (make targets, scaffolding). Two things need bash 4+: the ingest skill (mapfile), which runs on the Linux AI node (not a constraint on the macOS setup machine); and gcrypt_rotate_key (compgen -G), which does run on the laptop. For key rotation on macOS, use Homebrew bash (brew install bash).
GNU coreutils not required — BSD variants of date, grep, sed all handled.
git-crypt: install via Homebrew — brew install git-crypt
jq, curl: pre-installed or via Homebrew

If you use Homebrew bash (brew install bash), the scripts work identically to Linux.

Windows — WSL2 only

Git Bash and native Windows are not supported.

Reasons:

git-crypt has no native Windows binary.
Process substitution <(...) used for runtime key injection is not available in Git Bash or PowerShell.
Several bash features used throughout (the compgen builtin, BASH_SOURCE, arrays) are bash-specific and not available in PowerShell or cmd.exe.

WSL2 (Windows Subsystem for Linux) with Ubuntu gives full compatibility. All setup and runtime operations work identically to native Linux inside WSL2.

Hardware recommendations

The system is designed for a homelab architecture:

Component	Recommended	Role
Storage node	Any Linux server with NFS	Hosts Forgejo, stores genome repos
AI compute node	GPU server (16GB+ VRAM)	Runs local LLM agent sessions
VRAM	16GB minimum	14B model at Q5_K_M ≈ 10GB weights; ~6GB for KV cache
Local LLM	14B–32B quantised	Active wiki maintenance sessions
Large LLM	70B (async)	Deep reflection, complex synthesis (scheduled, not interactive)

On VRAM constraints: with a 16GB card and a 14B model, the KV cache budget is ~6GB — approximately 32k tokens of effective context. Every token in AGENTS.md, the index, and the log tail is a cost. This is why all agent files are token-optimised and sessions are kept to one source at a time.
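
The budget arithmetic above can be reproduced in a few lines. This is a rough sketch that uses only the figures quoted in this section (16GB card, ~10GB of weights, ~32k tokens of context); the function names are illustrative, GB is treated as GiB, and runtime overhead is ignored, so treat the result as an order-of-magnitude estimate rather than a measurement.

```python
GIB = 2**30

def kv_budget_bytes(vram_gib: float, weights_gib: float) -> int:
    """VRAM left for the KV cache once the model weights are loaded."""
    return int((vram_gib - weights_gib) * GIB)

def bytes_per_token(budget_bytes: int, context_tokens: int) -> float:
    """KV-cache cost per token implied by a budget and a context length."""
    return budget_bytes / context_tokens

# Figures from this section: 16 GB card, ~10 GB of weights, ~32k tokens.
budget = kv_budget_bytes(vram_gib=16, weights_gib=10)
per_token = bytes_per_token(budget, context_tokens=32 * 1024)
print(f"KV budget: {budget / GIB:.0f} GiB, about {per_token / 1024:.0f} KiB per token")
```

Under these figures each token of context costs roughly 192 KiB of VRAM, which is why every token spent on AGENTS.md, the index, and the log tail counts against the session.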

Reference deployment: the table above is a target profile, not a hard requirement. The current setup runs a single 16GB GPU (RTX 5060 Ti) with a ~9B model for interactive ingest, and offloads heavy/async synthesis to a cloud model. Smaller models work — they just make the "one source per session" discipline and the token budget matter more.

Prerequisites

Required

Tool	Purpose
`git`	Version control
`git-crypt`	Transparent file encryption
`curl`	REST API calls to Forgejo/GitHub
`jq`	JSON parsing

Optional

Tool	Purpose
`bw`	Bitwarden CLI — runtime key injection from Vaultwarden (no key on disk)
`qmd`	Local BM25 + vector search for Markdown files with MCP server interface

bw vs bws: Use bw (standard Bitwarden CLI). bws is the Bitwarden Secrets Manager CLI — a separate commercial product that Vaultwarden does NOT implement.

Install on Ubuntu/Debian

sudo apt update && sudo apt install -y git git-crypt curl jq

Install on macOS

brew install git git-crypt curl jq

Install Bitwarden CLI

# Linux
npm install -g @bitwarden/cli

# macOS
brew install bitwarden-cli

Verify all tools

make doctor

Configuration

Configuration is split into two files with distinct purposes:

`globals.env` — static KEY=VALUE

Safe for make include, docker-compose, shell source, and any standard env parser. Contains only simple scalar values — no bash syntax, no arrays.

# Comments sit on their own lines: Make keeps trailing whitespace before an
# inline comment as part of the value, and some env parsers keep the comment too.

# Provider selection: forgejo | github
PROVIDER=forgejo

# Forgejo (active when PROVIDER=forgejo)
FORGEJO_URL=https://git.yourserver.com
FORGEJO_USER=yourusername
# SSH port: 222 is the default for many homelab Forgejo setups; use 22 for standard
FORGEJO_SSH_PORT=222

# GitHub (active when PROVIDER=github; uncomment to use)
# GITHUB_USER=your-username
# Optional: for org repos, overrides GITHUB_USER
# GITHUB_ORG=your-org

# Vaultwarden
VAULTWARDEN_URL=https://vault.yourserver.com

# Master repository
MASTER_REPO=master-knowledge-genome
GIST_URL=https://gist.github.com/442a6bf555914893e9891c11519de94f.git

`registry.sh` — bash runtime config

Sourced by shell scripts only. Contains the genome registry array and dynamic path resolution. Never included by Make.

# Dynamic paths (resolved at source time)
WORK_DIR="${HOME}/knowledge-genome-orchestrator"
KEYS_DIR="${WORK_DIR}/keys"

# Genome registry format: "name|description|linked_repo|cross_source"
# The third and fourth fields are OPTIONAL:
#   - linked_repo empty      → knowledge-only genome (no linked project)
#   - linked_repo owner/repo → genome is linked to that project repository (rendered into AGENTS.md)
#   - cross_source yes|no    → (default no) whether the cross-genome collector may read this genome as a source
GENOMES=(
  "genome-dev|Web development, TUI, Angular, software architecture|myorg/my-app|no"
  "genome-finance|Personal finance, investments, market analysis||no"
  "genome-homelab|Infrastructure, network configs, architecture logs||no"
)

To add a genome to the registry before running setup, append a line to GENOMES. After initial setup, use make add-genome instead.
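
To make the field layout concrete, here is a hedged sketch of how one registry entry could be split into its four fields. It is written in Python purely for illustration: the shipped scripts are bash and may parse entries differently. The Genome type and parse_entry function are hypothetical names, and the defaults follow the comments in registry.sh.

```python
from typing import NamedTuple, Optional

class Genome(NamedTuple):
    name: str
    description: str
    linked_repo: Optional[str]  # None means knowledge-only
    cross_source: bool          # may the cross-genome collector read it?

def parse_entry(entry: str) -> Genome:
    """Split "name|description|linked_repo|cross_source"; last two fields optional."""
    fields = entry.split("|")
    if len(fields) < 2 or len(fields) > 4:
        raise ValueError(f"expected 2 to 4 fields, got {len(fields)}: {entry!r}")
    fields += [""] * (4 - len(fields))  # pad missing optional fields
    name, description, linked, cross = (f.strip() for f in fields)
    if not name or not description:
        raise ValueError(f"name and description are required: {entry!r}")
    if cross not in ("", "yes", "no"):
        raise ValueError(f"cross_source must be yes or no: {entry!r}")
    return Genome(name, description, linked or None, cross == "yes")

print(parse_entry("genome-dev|Web development|myorg/my-app|no"))
print(parse_entry("genome-finance|Personal finance"))
```

An empty third field and a missing third field are treated the same way here (knowledge-only), and an absent fourth field falls back to no.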

Tokens

Tokens are never stored in config files. Export them in your shell before running setup:

export FORGEJO_TOKEN="your_forgejo_token"
# or
export GITHUB_TOKEN="your_github_token"

Quick Start

# 1. Clone the setup framework
git clone <setup-repo-url> knowledge-genome-orchestrator
cd knowledge-genome-orchestrator

# 2. Configure your environment
cp globals.env.example globals.env   # edit with your values
# Edit registry.sh to define your genomes

# 3. Export your provider token
export FORGEJO_TOKEN="your_token_here"

# 4. Verify dependencies
make doctor

# 5. Run full setup
make setup

make setup executes in order:

Dependency check — verifies all required tools are installed
Git identity check — warns if user.name / user.email are not configured
Master repo — creates master-knowledge-genome on Forgejo, scaffolds with AGENTS.md and README.md, initialises git, adds core-karpathy as submodule, pushes
Genome provisioning — for each genome in registry.sh:
- Creates remote repository on Forgejo
- Adds it as a submodule in the master repo
- Initialises git-crypt (before any files are created)
- Scaffolds directory structure and renders all templates
- Installs pre-commit security hook
- Commits, pushes genome to remote
- Exports symmetric key to keys/<genome>.key
- Prints Vaultwarden upload instructions
- Commits submodule pointer in master repo

After setup completes:

Upload all files in keys/ to Vaultwarden (see Key Management)
Delete key files from disk: rm keys/*.key

Makefile Reference

Target	Description
`make setup`	Full system initialisation — master repo + all genomes in `registry.sh`
`make add-genome NAME=x DESC="y" [LINKED=owner/repo]`	Scaffold and register a single new genome (optional linked project)
`make lint`	Run quality checks across all genomes (schema, privacy, decay, page size)
`make verify-structure`	Report directory drift of each genome vs the canonical layout (`lib/structure.sh`)
`make sync-structure`	Create any missing canonical directories across all genomes (safe, idempotent)
`make test`	Run the bats test suite (deterministic; no LLM/GPU/network) — see Testing
`make status`	Show submodule status and per-genome git-crypt encryption state
`make lock`	Lock all encrypted repos (master + all genome submodules)
`make doctor`	Verify required tools: git, git-crypt, curl, jq; warn if bw missing
`make sync`	`git submodule update --init --recursive` + report unpushed commits per genome
`make help`	Print all available targets

Examples

# Check system health
make doctor

# Add a new genome after initial setup
make add-genome NAME=genome-research DESC="Academic papers and deep research"

# Add a genome linked to a project repository
make add-genome NAME=genome-dev DESC="Web development" LINKED=myorg/my-app

# Check every genome against the canonical directory layout
make verify-structure

# Run full lint pass (bash deterministic checks)
make lint

# Sync all nodes after pulling on another machine
make sync

# Emergency lock — secures all repos before leaving a session
make lock

Testing

The mechanical layer (slug, index, log, lint, structure, the ingest orchestrator) is covered by a bats suite. The tests are deterministic and have zero dependency on the LLM, the GPU, or the network — they simulate the agent's output with fixtures and exercise the scripts directly, so they run anywhere git + bash live (laptop, CI, a git hook). They are not meant to run on the AI node or via n8n.

sudo apt install bats        # once
make test                    # or: bats tests/

File	Covers
`scripts.bats`	`slug.sh`, `log-append.sh`, `index-append.py` (insert, sort, bump, idempotent)
`lint.bats`	`lib/lint.sh` validators + `scoped-lint.sh`
`structure.bats`	`lib/structure.sh` report / sync
`run-ingest.bats`	`run-ingest.sh` end-to-end (DRY_RUN, local bare remote) — needs `jq`

Each test builds its own throwaway genome with a local bare remote, configured to ignore the operator's global git settings (signing, global hooks) so the suite is hermetic. The run-ingest tests auto-skip if jq is absent. If you change the canonical layout in lib/structure.sh, update FIXTURE_DIRS in tests/helpers.bash to match.

Why this matters: the only non-deterministic part of the system is the model. Pinning the mechanical layer with tests means that when an ingest misbehaves, you know it's the model or the prompt — not the plumbing.

Genome Lifecycle

Initial setup

All genomes defined in registry.sh are provisioned by make setup.

Adding a genome after initial setup

make add-genome NAME=genome-newname DESC="Domain description"

This creates the remote repo, adds it as a submodule, initialises git-crypt, scaffolds the directory structure, installs the pre-commit hook, commits and pushes, exports the key, and commits the submodule pointer in master.

After adding: upload the new key to Vaultwarden and delete the key file.

Removing a genome

Manual process:

# In master repo
git submodule deinit genome-name
git rm genome-name
git commit -m "chore: remove genome-name submodule"
git push
# Archive or delete the remote repository on Forgejo

Template rendering

When a genome is scaffolded, render_template replaces these placeholders in all template files:

Placeholder	Source	Example
`{{GENOME_NAME}}`	registry.sh	`genome-dev`
`{{GENOME_NAME_UPPER}}`	derived	`GENOME-DEV`
`{{GENOME_DESC}}`	registry.sh	`Web development...`
`{{LINKED_PROJECT}}`	registry.sh	`myorg/my-app` (or `none`)
`{{FORGEJO_URL}}`	globals.env	`https://git.yourserver.com`
`{{FORGEJO_USER}}`	globals.env	`yourusername`
`{{VAULTWARDEN_URL}}`	globals.env	`https://vault.yourserver.com`
`{{MASTER_REPO}}`	globals.env	`master-knowledge-genome`
`{{DATE}}`	runtime	`2026-05-11`
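
The substitution itself is simple to picture. The sketch below is a Python stand-in, not the render_template function from lib/scaffold.sh (which is bash); the name render_sketch is hypothetical, and failing on an unknown placeholder is an assumption about desirable behaviour rather than documented behaviour.

```python
import re

PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")

def render_sketch(text: str, values: dict) -> str:
    """Replace every {{NAME}} with values["NAME"]; fail on unknown placeholders."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value for placeholder {{{{{key}}}}}")
        return values[key]
    return PLACEHOLDER.sub(substitute, text)

values = {
    "GENOME_NAME": "genome-dev",
    "GENOME_NAME_UPPER": "genome-dev".upper(),  # derived from GENOME_NAME
    "LINKED_PROJECT": "none",                   # knowledge-only genome
}
print(render_sketch("# {{GENOME_NAME_UPPER}} (linked: {{LINKED_PROJECT}})", values))
```

Failing loudly on an unknown placeholder keeps a typo in a template from silently shipping literal braces into a genome's AGENTS.md.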

Security Model

Encryption architecture

Each genome uses a unique symmetric AES-256-CTR key managed by git-crypt. Two directories in every genome are always encrypted:

Directory	Contents	On remote
`raw/private/`	Sensitive source material	Opaque binary blob
`wiki/private/`	Private synthesis and notes	Opaque binary blob

All other directories (raw/articles/, wiki/sources/, etc.) are plaintext. Collaborators without the key can contribute to public directories normally — git handles encrypted files transparently.

`.gitattributes` — dynamic encryption rules

Encryption rules use a glob wildcard that catches any private/ directory at any depth in the repository — including directories created at runtime by the LLM:

# Text rules first
*.md     text eol=lf
*.sh     text eol=lf

# Encryption rules LAST (later rules override per-attribute)
# Placing **/private/** last lets -text override the earlier *.md text eol=lf, preventing EOL corruption
**/private/**   filter=git-crypt diff=git-crypt -text

Rule ordering matters: in .gitattributes, the last matching rule wins per attribute. Encryption rules must come after text rules so -text overrides text eol=lf for encrypted markdown files.

Pre-commit hook — dynamic validation

The security hook installed at .git/hooks/pre-commit validates every staged file dynamically — it reads encryption requirements from .gitattributes at runtime rather than checking hardcoded paths:

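# Simplified sketch of the hook's core loop: for each staged file,
# check whether git-crypt encryption is required
while IFS= read -r file; do
    filter=$(git check-attr filter -- "$file" | sed 's/.*: //')
    if [[ "$filter" == "git-crypt" ]]; then
        # Verify the file is actually encrypted (-i also catches an upper-case warning)
        if git-crypt status "$file" | grep -qi "not encrypted"; then
            echo "BLOCKED: $file must be encrypted but is staged as plaintext" >&2
            exit 1  # BLOCK THE COMMIT
        fi
    fi
done < <(git diff --cached --name-only --diff-filter=ACM)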

This means: any file matching **/private/** in .gitattributes is protected, including future private/ directories created anywhere in the repo. The hook never needs updating when the encryption rules change.

Untrusted agent output — manifest validation

The ingest agent's output is stochastic: a hallucinated manifest could carry a missing field, a wrong type, or a malicious path such as wiki/../../etc/passwd. run-ingest.sh therefore validates the manifest before trusting any field: it must be well-formed JSON with a string raw_source and an array pages, and every path must be a string under wiki/ that contains no ".." component. Anything else fails fast with a structured {"status":"error"} and no filesystem access outside the wiki, so a bad path can't drive a read or a lint outside the knowledge tree. This is the trust boundary between the (stochastic) model and the (deterministic, tested) post-processor.
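
The validation rules above can be sketched as follows. This is an illustrative Python version, not the code in run-ingest.sh; it assumes each entry in pages is a plain path string, and the reason field in the error object is an addition for readability (the text only specifies the status field).

```python
import json

def validate_manifest(raw: str) -> dict:
    """Return {"status": "ok", ...} or {"status": "error", "reason": ...}."""
    def error(reason: str) -> dict:
        return {"status": "error", "reason": reason}

    try:
        manifest = json.loads(raw)
    except json.JSONDecodeError:
        return error("manifest is not well-formed JSON")
    if not isinstance(manifest, dict):
        return error("manifest must be a JSON object")
    if not isinstance(manifest.get("raw_source"), str):
        return error("raw_source must be a string")
    pages = manifest.get("pages")
    if not isinstance(pages, list):
        return error("pages must be an array")
    for path in pages:
        # Assumption: each entry in pages is a plain path string.
        if not isinstance(path, str):
            return error("every page path must be a string")
        if not path.startswith("wiki/") or ".." in path:
            return error(f"path is not confined to wiki/: {path}")
    return {"status": "ok", "pages": pages}

print(validate_manifest('{"raw_source": "raw/articles/a.md", "pages": ["wiki/../../etc/passwd"]}'))
```

Every check runs before any path is opened, which is the point of the trust boundary: nothing from the model reaches the filesystem until the whole manifest has passed.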

PRIVATE_CONTEXT toggle

The PRIVATE_CONTEXT toggle in AGENTS.md controls whether the LLM agent accesses encrypted directories. It must be declared explicitly by the operator at the start of every session:

PRIVATE_CONTEXT: disabled   ← Default. private/ directories are treated as non-existent.
PRIVATE_CONTEXT: enabled    ← Agent may read/write private/. Requires git-crypt unlock.

Rules:

Never inferred. Never carried over from a previous session.
enabled requires the operator to confirm that git-crypt unlock has run on the host.
Per-genome, per-session: enabling for genome-finance does NOT enable for genome-dev.
Cloud LLM models: PRIVATE_CONTEXT must always be disabled. Private data never leaves the local network.
All outputs derived from private data are prefixed [PRIVATE DATA INCLUDED].
Private synthesis goes exclusively to wiki/private/ — never to public wiki paths.
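
The toggle is a prompt-level contract, but an orchestrator could apply the same rules as a pre-flight check before it starts a session. The sketch below is one hypothetical way to do that in Python; the function name and parameters are illustrative and nothing like it is claimed to exist in the repository.

```python
from typing import Optional

def private_context_enabled(declared: Optional[str],
                            unlock_confirmed: bool,
                            cloud_model: bool) -> bool:
    """Decide whether private/ may be touched in this session, for one genome."""
    if declared is None:
        return False  # never inferred, never carried over from an earlier session
    state = declared.strip().lower()
    if state not in ("enabled", "disabled"):
        raise ValueError(f"unknown PRIVATE_CONTEXT value: {declared!r}")
    if state == "disabled":
        return False
    if cloud_model:
        raise ValueError("PRIVATE_CONTEXT must be disabled for cloud models")
    if not unlock_confirmed:
        raise ValueError("enabled requires a confirmed git-crypt unlock on the host")
    return True

print(private_context_enabled(None, unlock_confirmed=True, cloud_model=False))
print(private_context_enabled("enabled", unlock_confirmed=True, cloud_model=False))
```

Calling it once per genome and per session mirrors the rule that enabling the toggle for one genome never enables it for another.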

Runtime key injection — zero disk policy

Encryption keys are never stored as persistent files on the AI server. They are injected at session start via the Bitwarden CLI (bw) against your self-hosted Vaultwarden instance, using process substitution:

# Step 1: authenticate
bw config server https://vault.yourserver.com
export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw)

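# Step 2: unlock genome (the key is streamed through a pipe, so no exported key file is ever created)
git-crypt unlock <(
  bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d
)

The key flows: Vaultwarden → bw get notes → base64 -d → kernel pipe → git-crypt. No exported key file is ever written to disk. Note that while a genome is unlocked, git-crypt itself keeps a working copy of the key inside the repository's .git/git-crypt/keys/ directory; git-crypt lock removes it, which is why locking at session end matters.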

Lock a genome when the session ends:

git-crypt lock

Key Management

This section is for the operator. These commands are never issued by the LLM agent.

Vaultwarden Secure Notes

Each genome key is stored as a base64-encoded Secure Note in Vaultwarden:

Genome	Vaultwarden Note Name
`genome-dev`	`genome-dev key`
`genome-finance`	`genome-finance key`
`genome-homelab`	`genome-homelab key`

After make setup or make add-genome, key files are exported to keys/. Upload procedure:

# Encode the key
base64 < keys/genome-dev.key

# Paste the output into a Vaultwarden Secure Note named "genome-dev key"
# Then delete the key file
rm keys/genome-dev.key

Cloning on a new machine

# Full clone with all submodules
git clone --recurse-submodules \
  https://git.yourserver.com/yourusername/master-knowledge-genome.git

# Unlock a specific genome (with key file — development only)
cd master-knowledge-genome/genome-dev
git-crypt unlock /path/to/genome-dev.key

# Unlock via Vaultwarden (recommended — no key on disk)
export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw)
git-crypt unlock <(bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d)

# Single-genome clone (collaborator who only needs one genome)
git clone https://git.yourserver.com/yourusername/genome-dev.git

Key rotation (emergency)

If a key is lost or compromised:

# From the knowledge-genome-orchestrator/ directory
source lib/git-crypt.sh
# If gcrypt_rotate_key operates on the CWD: cd into .../master-knowledge-genome/genome-dev
# If it navigates by name instead:          cd into .../master-knowledge-genome
cd ~/knowledge-genome-orchestrator/master-knowledge-genome
gcrypt_rotate_key "genome-dev"

macOS: gcrypt_rotate_key uses compgen -G (bash 4+). The stock macOS bash 3.2 is not enough — run rotation under Homebrew bash (brew install bash).

gcrypt_rotate_key performs:

Unlocks repo with existing key
Removes old key material
Generates new symmetric key via git-crypt init
Re-stages and commits private files (encrypted with new key)
Exports new key to keys/
Prints Vaultwarden update instructions

Limitation: git history still contains blobs encrypted with the old key. Anyone with the old key and git history access can decrypt them. To purge old encrypted blobs from history:
git filter-repo --invert-paths --path raw/private --path wiki/private
git push --force origin main
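This rewrites all commit hashes and removes raw/private and wiki/private from every commit, including the latest one, so keep a copy of the current private files and re-add them afterwards. git filter-repo also drops the origin remote by default, so re-add it before pushing. Coordinate with any collaborators first.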

After rotation:

Upload new key to Vaultwarden (replace existing note)
Delete both keys/genome-dev.key and keys/genome-dev-rotated-*.key from disk
Revoke access from previous key holders

Agent Sessions

Prerequisites for every session

Before starting an LLM agent session on a genome:

The host (AI server) runs git-crypt unlock for the required genomes
The orchestrator prepares context: tail -n 20 wiki/log.md
Declare PRIVATE_CONTEXT state explicitly in the opening prompt

Session start protocol

The agent executes in this order at the start of every session:

Read wiki/index.md — primary catalog of all pages and maturity
Read last 20 log entries (injected by orchestrator — does NOT open wiki/log.md directly)
For tasks involving related pages: if the optional qmd extension is installed, qmd search "<query>" before opening files; otherwise navigate from wiki/index.md
Operate on individual files — never scan entire directories

One source per session

With a 14B model and ~6GB KV cache budget, long sessions degrade. As the session extends, the context fills with pages already created, attention dilutes, and later entities receive worse cross-references than earlier ones.

Hard rule: one source per session. If multiple sources are queued in raw/, process only the first. Commit, close the session. The orchestrator (n8n or script) starts a new session for the next source with a clean KV cache.

For automated pipelines: if 5 files arrive in raw/, trigger 5 agent sessions sequentially — not one session with 5 files.
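
As a minimal sketch of that rule, the loop below starts one session per source, strictly in order. It is a hypothetical Python illustration: run_one stands in for whatever actually launches a fresh agent session (n8n or a script), and stopping at the first failure is an assumption, not documented behaviour.

```python
from typing import Callable, Iterable, List

def run_sessions(sources: Iterable[str], run_one: Callable[[str], bool]) -> List[str]:
    """Run one agent session per source, sequentially; stop at the first failure."""
    done = []
    for source in sources:
        # run_one is expected to start a new session with a clean KV cache
        # and return True once that single source has been handled.
        if not run_one(source):
            break
        done.append(source)
    return done

calls = []

def fake_session(source: str) -> bool:
    calls.append([source])  # each session receives exactly one source
    return True

print(run_sessions(["raw/articles/a.md", "raw/articles/b.md"], fake_session))
```

The important property is visible in calls: two files produce two sessions of one source each, never one session of two.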

n8n automation

For Forgejo webhook → automated ingest:

Forgejo sends webhook on push to raw/
n8n receives webhook, identifies new files
n8n starts one agent session per new file (sequential, not parallel)
Each session: realign the checkout to the base (git switch <base> && git reset --hard origin/<base>), then inject tail -n 20 wiki/log.md + PRIVATE_CONTEXT state + source path
Phase 1 agent (/skill:ingest) writes the manifest; Phase 2 run-ingest.sh opens the PR, then stops
Human reviews — merge to accept, or close the PR + delete the feat branch to reject

Workflows

Ingest

Triggered by a new file in raw/ (manual or via webhook). Ingest is split into two phases so that the small local model spends its limited context only on judgement, and all the deterministic bookkeeping happens outside the model's loop.

Phase 1 — agent (semantic only). The ingest skill gives the agent read/edit tools only (no shell). It:

Reads the source once
Creates wiki/sources/<slug>.md — summary and key points
Per entity (person, tool, organisation): creates or updates wiki/entities/<name>.md
Per concept (pattern, theory, decision): creates or updates wiki/concepts/<name>.md
Checks each touched page for contradictions → applies Conflict Resolution if found
Writes .ingest-manifest.json (the list of pages it created/modified, the model name, a one-line reasoning, the PR summary, and any contradictions) — then stops

Phase 2 — run-ingest.sh (deterministic, outside the agent). The post-processor first validates the manifest — well-formed JSON, expected shape, and every page path confined to wiki/ with no .. (see Security Model) — then does the mechanical work the model must not waste context on:

Inserts each page into the correct wiki/index.md section in alphabetical order, deduplicated by wikilink (a re-ingest updates the entry, never duplicates it), and bumps the index last_updated (index-append.py)
Appends the INGEST | <slug> entry to wiki/log.md (the model name comes from the orchestrator via INGEST_MODEL — the agent cannot reliably know its own tag)
Runs scoped lint on exactly the pages touched this run (scoped-lint.sh, reusing lib/lint.sh)
Commits only wiki/ on feat/ai-ingest-<slug> and opens a PR against the integration base (INGEST_BASE, default main); the body matches the templates/pr-description.md structure (Summary / Pages / Contradictions / Scoped Lint)
Emits a single compact JSON line (status, slug, PR url, lint_clean, conflict) for n8n
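
The sorted, deduplicated insert in the first step can be sketched like this. It is an illustration, not index-append.py itself: the entry format ("- [[slug]]: description") is an assumption, the function names are hypothetical, and the last_updated bump is left out.

```python
import re
from typing import List

WIKILINK = re.compile(r"\[\[([^\]|]+)")

def link_of(entry: str) -> str:
    """Return the wikilink target of an index entry, lower-cased for comparison."""
    match = WIKILINK.search(entry)
    if match is None:
        raise ValueError(f"entry has no [[wikilink]]: {entry!r}")
    return match.group(1).strip().lower()

def upsert_entry(section: List[str], entry: str) -> List[str]:
    """Insert entry into a section's lines, replacing any entry with the same link."""
    target = link_of(entry)
    kept = [line for line in section if link_of(line) != target]
    return sorted(kept + [entry], key=link_of)

section = ["- [[angular]]: framework notes", "- [[zsh]]: shell notes"]
section = upsert_entry(section, "- [[git-crypt]]: encryption tool")
section = upsert_entry(section, "- [[git-crypt]]: transparent file encryption")  # re-ingest
for line in section:
    print(line)
```

Keying on the wikilink rather than the whole line is what makes a re-ingest idempotent: the description can change, but the page appears once.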

The agent never runs git, never edits the index/log mechanically, and never lints — those are deterministic and tested (see Testing). Invocation on the AI node:

pi --mode json -p "/skill:ingest raw/articles/<file>.md"   # phase 1 → writes manifest
run-ingest.sh <genome>                                     # phase 2 → index/log/lint/PR

For private sources (PRIVATE_CONTEXT: enabled required):

All output goes to wiki/private/<slug>.md only
PR title: [PRIVATE] ingest: <slug>

Branch lifecycle & the manual gate. run-ingest.sh / open-pr.sh are deliberately "dumb": they create the feat/ai-ingest-<slug> branch, commit only wiki/, open the PR, and stop. They never reset, revert, or touch the integration branch — that lifecycle belongs to the orchestrator, around the human gate:

Before each session the orchestrator realigns the checkout to the base (git fetch && git switch <base> && git reset --hard origin/<base>) — a reset of the local checkout to match the remote, never a force-push to the shared branch.
After the PR opens, everything stops until a human approves: one source per session, sequential, no new ingest until the pending PR is closed.
Approve = merge. Reject = close the PR and delete the remote feat branch. To undo an already-merged ingest, open a revert PR against the base — never rewrite history on a shared branch.

The PR base is configurable via INGEST_BASE (default main). Per-page maturity already encodes stability and tags/releases mark versioned snapshots, so main is the integration branch today. If a linked project later consumes a genome, set INGEST_BASE=develop to buffer ingests on develop and cut manual develop → main releases — no code change.

Query

Triggered by an operator question.

qmd search "<query>" (if the optional qmd extension is installed) → identify candidate pages; otherwise start from wiki/index.md
Read candidate pages directly (qmd already returns file paths — no intermediate index lookup)
Synthesise answer with [[wikilink]] citations
If answer is non-trivial: save as wiki/queries/<slug>.md and append to index
Append log entry: QUERY | <subject>

For general orientation without a specific query: read wiki/index.md directly.

Lint

The lint workflow is split between deterministic bash checks and semantic LLM judgment.

Step 1 — operator runs bash linter:

make lint

The bash linter checks automatically:

YAML frontmatter validity (all mandatory fields present)
Domain consistency (domain field matches genome name)
Type validity (value from allowed list)
Privacy consistency (private/ directories have private: true)
Page size (warn at 400 lines, error at 800 lines)
Knowledge decay (stable > 180 days, draft > 90 days)
Broken internal wikilinks (warnings only — cross-type links produce expected false positives)

Step 2 — operator provides bash output to LLM agent:

The agent applies the semantic judgment that the bash linter cannot:

Orphan pages (from bash list): for each orphan, identify 1-3 existing pages that should link to it; propose specific additions
Implicit concepts (from bash term frequency list): determine if a candidate term warrants a dedicated page; draft stub if yes
Duplicate concepts: qmd search "<concept>" for suspected duplicates; propose merge if confirmed
Maturity promotion: pages with 2+ sources still marked draft → propose stable

The agent reports all findings as a structured list and does not modify files without operator approval. It then appends a LINT | <summary> log entry.
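The maturity-promotion rule is mechanical enough to sketch. The example below assumes each page is available as a dict with `path`, `maturity`, and a `source_count` field; that last field is hypothetical, since how sources are counted per page is not specified here.

```python
def propose_promotions(pages):
    """Return paths of draft pages that are backed by two or more sources."""
    return [
        page["path"]
        for page in pages
        if page["maturity"] == "draft" and page["source_count"] >= 2
    ]
```

The function only proposes; consistent with the rule above, applying the change still needs operator approval.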

Knowledge Quality

PR review workflow

Every agent session that modifies wiki pages opens a PR. The PR description uses templates/pr-description.md:

## Summary

One sentence: goal of this session and source processed.

## Pages Created

| Path | Type | Maturity |

## Pages Modified

| Path | Change |

## Contradictions Found

[ ] None / [ ] n conflict file(s) created

## Private Data Accessed

[ ] No (PRIVATE_CONTEXT: disabled) / [ ] Yes

## Scoped Lint (post-ingest)

[ ] Frontmatter valid [ ] No broken links [ ] No issues found

This makes human review fast and structured: read the table, scan the diff, approve or request changes. No exploration required to understand what the agent did.

Conflict resolution

When new evidence contradicts an existing wiki claim:

Keep the existing page unchanged
Create wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md with:
- The existing claim and its source
- The contradicting evidence and its source
- Agent confidence assessment for each
- Recommendation: accept_b | keep_a | requires_human_review
Add entry to wiki/index.md → Conflicts Pending Review section
Log entry: CONFLICT | <concept>
Open PR: [CONFLICT] <concept> — human review required

The operator resolves the conflict, updates relevant pages, closes the PR.
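A minimal sketch of the naming convention for conflict files follows. The slug rule (lowercase, non-alphanumerics collapsed to hyphens) is an assumption for illustration; only the `conflict-<concept>-<YYYY-MM-DD>.md` shape and the three recommendation values come from the steps above.

```python
import re
from datetime import date

RECOMMENDATIONS = {"accept_b", "keep_a", "requires_human_review"}


def conflict_path(concept, day):
    """Build wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md for a concept."""
    # Assumed slug rule: lowercase, runs of other characters become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", concept.lower()).strip("-")
    return f"wiki/queries/conflict-{slug}-{day.isoformat()}.md"


def check_recommendation(value):
    """Reject any recommendation outside the three allowed values."""
    if value not in RECOMMENDATIONS:
        raise ValueError(f"unknown recommendation: {value}")
    return value
```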

Knowledge decay

Pages have a last_updated field in frontmatter. During lint passes:

Maturity	Threshold	Action
`stable`	180 days	Flag as stale — add `⚠️ STALE` callout
`draft`	90 days	Flag as stale — add `⚠️ STALE` callout

The agent proposes re-validation but does not change maturity without new source evidence.
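The thresholds in the table reduce to a small date comparison. This is a sketch under the assumption that "180 days" means strictly more than 180 days since last_updated (matching the "stable > 180 days" wording in the Lint section); the function name is hypothetical and the real check lives in the bash linter.

```python
from datetime import date

STALE_AFTER_DAYS = {"stable": 180, "draft": 90}


def is_stale(maturity, last_updated, today):
    """Return True when a page is older than the threshold for its maturity."""
    limit = STALE_AFTER_DAYS.get(maturity)
    if limit is None:
        # The table gives no threshold for other values such as deprecated.
        return False
    return (today - last_updated).days > limit
```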

Cross-genome references

Cross-domain knowledge moves by pull, never push: the genome you are working in draws material in; nothing is ever written into another genome. There are no cross-genome wikilinks — submodule pointers make relative paths brittle.

When the working genome needs a concept that lives elsewhere, the navigation skill handles it in the same two-phase shape as ingest:

A deterministic collector clones the relevant genomes read-only at HEAD (fresh — never the pinned submodule state) and assembles a dossier of excerpts with provenance.
A semantic pass reads only that dossier; the skill then deposits one abstract, non-private raw into the working genome at raw/articles/crossgen-<topic>-<date>.md.
That raw goes through the working genome's normal ingest → PR → human gate, like any source.

Which genomes may be read as sources is gated by a per-genome cross_source: yes|no flag: a confidential genome (e.g. a client file) is marked no and is never read as a source — the wall is structural, not a matter of the agent's discipline. The master AGENTS.md holds the full boundary contract.
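The gate can be pictured as a filter applied before the collector clones anything. The sketch below assumes the flags are already loaded into a dict of genome name to "yes" or "no", and that the working genome itself is excluded from its own sources; both the data shape and the function names are assumptions, and the authoritative contract is the master AGENTS.md.

```python
def readable_sources(cross_source_flags, working_genome):
    """Return genomes that may be cloned read-only as cross-genome sources."""
    return sorted(
        name
        for name, flag in cross_source_flags.items()
        if flag == "yes" and name != working_genome
    )


def crossgen_raw_path(topic, day):
    """Build the deposit path raw/articles/crossgen-<topic>-<date>.md."""
    return f"raw/articles/crossgen-{topic}-{day}.md"
```

Because the filter runs before collection, a genome marked no never reaches the dossier, whatever the semantic pass later does.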

Knowledge Schema

Frontmatter

Every wiki page must start with valid YAML frontmatter:

---
title: "Strict String Title"
type: source | entity | concept | query | conflict | private | index | log
domain: genome-name
tags: [lowercase, hyphen-separated]
maturity: draft | stable | deprecated
last_updated: YYYY-MM-DD
private: true | false
---

Field	Rules
`type`	Must be one of: `source entity concept query conflict private index log`
`maturity: draft`	Single source or unvalidated
`maturity: stable`	Confirmed by 2+ independent sources
`maturity: deprecated`	Superseded — add `> DEPRECATED: <reason>` callout at top
`private: true`	Required on all pages in `wiki/private/` and `raw/private/`

Do not use semantic versioning for content. Git history tracks every change. maturity captures epistemic state; last_updated tracks recency.
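To make the rules concrete, here is a hedged sketch of a validator that works on already-parsed frontmatter (a plain dict), so it needs no YAML library. It assumes all seven fields shown above are mandatory; the function name and error strings are invented for this example and do not mirror the bash linter's output.

```python
from pathlib import PurePosixPath

MANDATORY = ("title", "type", "domain", "tags", "maturity", "last_updated", "private")
TYPES = {"source", "entity", "concept", "query", "conflict", "private", "index", "log"}
MATURITIES = {"draft", "stable", "deprecated"}


def frontmatter_errors(meta, genome, path):
    """Return a list of rule violations for one page's parsed frontmatter."""
    errors = [f"missing field: {key}" for key in MANDATORY if key not in meta]
    if "type" in meta and meta["type"] not in TYPES:
        errors.append(f"invalid type: {meta['type']}")
    if "maturity" in meta and meta["maturity"] not in MATURITIES:
        errors.append(f"invalid maturity: {meta['maturity']}")
    if "domain" in meta and meta["domain"] != genome:
        errors.append(f"domain mismatch: {meta['domain']}")
    # Any directory component named private requires private: true.
    in_private_dir = "private" in PurePosixPath(path).parts[:-1]
    if in_private_dir and meta.get("private") is not True:
        errors.append("private: true required under private/")
    return errors
```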

Page types and directories

Type	Directory	Description
`source`	`wiki/sources/`	One page per processed raw source
`entity`	`wiki/entities/`	People, tools, organisations, projects
`concept`	`wiki/concepts/`	Patterns, theories, architectural decisions
`query`	`wiki/queries/`	Preserved answers and analyses
`conflict`	`wiki/queries/conflict-*.md`	Unresolved contradictions
`private`	`wiki/private/`	Private synthesis (PRIVATE_CONTEXT: enabled)
`index`	`wiki/index.md`	Primary navigation catalog (singleton)
`log`	`wiki/log.md`	Operations ledger (singleton)

Page size limits

Limit	Lines	Action
Soft cap	400	Bash linter warns
Hard cap	800	Bash linter errors — split the page

These limits ensure pages fit within the LLM context window without attention degradation and keep the wiki atomically navigable.
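The two caps amount to a three-way classification. The sketch below assumes both thresholds are inclusive (a page of exactly 400 lines warns, one of exactly 800 errors); the table does not state this, so treat the boundary behaviour as a guess.

```python
def size_status(line_count, soft_cap=400, hard_cap=800):
    """Classify a page by line count against the soft and hard caps."""
    if line_count >= hard_cap:
        return "error"
    if line_count >= soft_cap:
        return "warn"
    return "ok"
```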

Linking conventions

Intra-genome: [[folder/file]] — Obsidian wikilinks only.
Cross-genome: NOT supported via wikilink — submodule pointers make relative paths brittle. When the working genome needs a concept that lives elsewhere, the navigation skill pulls it in as one abstract raw under this genome's raw/articles/, which then goes through normal ingest. See Cross-genome references.
External: [text](https://...) — standard Markdown.

Log format

Every operation appends one entry to wiki/log.md:

## [YYYY-MM-DD] TYPE | Subject

- run_id: `<uuid>`
- model: `<model-name>`
- context_read: `[[path/A]]`, `[[path/B]]`
- output_written: `[[path/C]]`
- reasoning: One sentence — what changed and why.

Valid TYPEs: INGEST LINT QUERY CONFLICT CONFIG SECURITY

Parse examples:

grep "^## \[" wiki/log.md | tail -5          # Last 5 entries
grep "^## \[" wiki/log.md | grep "CONFLICT"  # All conflicts
grep "^## \[2026-05" wiki/log.md             # Entries from a specific month
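The same headers can be parsed from Python when a script needs structured entries instead of grep output. This is a sketch: the regex follows the `## [YYYY-MM-DD] TYPE | Subject` format above, and the function name is hypothetical.

```python
import re

HEADER = re.compile(r"^## \[(\d{4}-\d{2}-\d{2})\] ([A-Z]+) \| (.+)$")
VALID_TYPES = {"INGEST", "LINT", "QUERY", "CONFLICT", "CONFIG", "SECURITY"}


def parse_log_headers(text):
    """Return (date, type, subject) tuples for every valid entry header."""
    entries = []
    for line in text.splitlines():
        match = HEADER.match(line)
        # Headers with a TYPE outside the valid list are skipped, not raised.
        if match and match.group(2) in VALID_TYPES:
            entries.append((match.group(1), match.group(2), match.group(3)))
    return entries
```

Taking the last five items of the result gives the same entries as the first grep example.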

ingest-semantic.py receives source text + existing entity/concept names (from index) as prompt context. The LLM never loads the full log.

Collaboration Model

Role	Key access	Permitted operations
Owner	Full — key holder	Read/write everywhere
Collaborator	None	Push to `raw/articles/`, `raw/transcripts/`, `raw/code-packs/`, `raw/assets/`
Local AI agent	Conditional	`private/` only when `PRIVATE_CONTEXT: enabled`
Cloud AI model	Never	`PRIVATE_CONTEXT` must be `disabled`; private data stays on local network

Grant collaborator access: add as Forgejo contributor with Write role. Never share the git-crypt key — collaborators operate exclusively in public directories.
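If a server-side or CI check on collaborator pushes is wanted, the table's path list can be tested as below. This is only a sketch: nothing above says such a check exists, and the function name is hypothetical. Paths are normalised first so that `..` segments cannot escape the allowed directories.

```python
import posixpath

COLLABORATOR_PREFIXES = (
    "raw/articles/",
    "raw/transcripts/",
    "raw/code-packs/",
    "raw/assets/",
)


def collaborator_may_push(path):
    """Return True when a repo-relative path is inside a collaborator directory."""
    normalised = posixpath.normpath(path)
    return normalised.startswith(COLLABORATOR_PREFIXES)
```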

Optional Extensions

qmd — local Markdown search

qmd is a local, on-device BM25 + vector search engine for Markdown files. It has both a CLI (for shell scripts and agent tool calls) and an MCP server (for native LLM tool use).

Recommended at scale: once a genome exceeds ~150 pages, qmd search is significantly faster and more accurate than navigating wiki/index.md manually.

# Index a genome
qmd index genome-dev/wiki/

# Search
qmd search "graph-based state management"

# Start MCP server (for Claude Code / Codex integration)
qmd serve --port 3333

Obsidian integration

Obsidian is the recommended wiki browser. Open any genome directory as an Obsidian vault.

Recommended setup:

Graph view — visualise page connections; spot orphans and hubs instantly
Obsidian Web Clipper — browser extension to clip articles directly to raw/articles/ as Markdown
Download attachments — Settings → Hotkeys → "Download attachments for current file". Binds to a hotkey (e.g. Ctrl+Shift+D). After clipping, downloads all images to raw/assets/
Dataview plugin — query YAML frontmatter across the wiki; TABLE maturity, last_updated WHERE domain = "genome-dev" generates dynamic tables
Marp plugin — render Markdown as slide decks directly from wiki content

Note: .obsidian/ is in .gitignore. Workspace and plugin settings are local — not synced.

n8n automation

n8n → SSH → ingest-semantic.py → run-ingest.sh

n8n (running on the storage node) can automate the ingest pipeline:

Forgejo webhook fires on push to a genome's raw/ directory
n8n flow identifies new files
For each new file: starts one agent session (sequential — never parallel)
Each session receives: tail -n 20 wiki/log.md + PRIVATE_CONTEXT state + source path
Phase 1 — agent runs /skill:ingest (semantic → writes manifest); Phase 2 — run-ingest.sh does index/log/lint and opens the PR, returning one JSON line to n8n
Human reviews the PR

Key constraint: one source per session, sessions sequential. Never batch multiple sources into one agent session.

Intel NPU offloading

If the AI compute node has an Intel NPU (e.g. Core Ultra series):

Background/auxiliary tasks (OCR of raw/assets/, async summarisation, or qmd re-indexing if the optional qmd extension is in use) → Intel NPU via OpenVINO
Active reasoning sessions (ingest, query, synthesis) → GPU

Note: the core system has no embedding pipeline (see Core Philosophy), so there is nothing to embed here — the NPU is only for auxiliary work. This keeps the GPU's KV cache free for interactive sessions and lowers power draw for background jobs.

Troubleshooting

`git-crypt: command not found`

# Ubuntu/Debian
sudo apt install git-crypt

# macOS
brew install git-crypt

`make setup` fails with "MISSING: jq"

make doctor   # identifies all missing tools
sudo apt install git git-crypt curl jq

Pre-commit hook blocks a commit with "PLAINTEXT LEAK DETECTED"

The staged file is in a path matching **/private/** but is not encrypted.

Fix options:

Verify .gitattributes contains **/private/** filter=git-crypt diff=git-crypt -text
Run git-crypt init if git-crypt is not initialised in this repo
Run git-crypt status to check the encryption state of all files

Never use git commit --no-verify to bypass this check.
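For intuition about what the hook is guarding against, the sketch below flags a staged blob as a leak when it sits under a private/ directory and does not start with the header that git-crypt writes at the front of encrypted files (a NUL byte, the ASCII text GITCRYPT, and another NUL). How the real hook reads staged content is not shown here, so `blob_head` (the first bytes of the staged blob) and the function name are assumptions.

```python
from pathlib import PurePosixPath

GITCRYPT_MAGIC = b"\x00GITCRYPT\x00"


def is_plaintext_leak(path, blob_head):
    """Return True when a file under private/ is staged without encryption."""
    # Only directory components count, mirroring the **/private/** pattern.
    in_private_dir = "private" in PurePosixPath(path).parts[:-1]
    return in_private_dir and not blob_head.startswith(GITCRYPT_MAGIC)
```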

`git-crypt status` shows files as "not encrypted" after init

The .gitattributes rule must be committed before files in private/ are staged. If files were staged before .gitattributes was committed:

git rm -r --cached raw/private/ wiki/private/
git add raw/private/ wiki/private/
git commit -m "fix: re-stage private files for encryption"

Agent returns stale or missing cross-references

Likely causes:

Session was too long — KV cache degraded. Use one source per session.
wiki/index.md was not read at session start — agent lacked the page catalog.
qmd index is stale — re-index: qmd index <genome>/wiki/

Submodules show as "modified" after `make sync`

This is normal if genome repos have new commits. Update master's pointers:

cd master-knowledge-genome
git add .
git commit -m "chore: update submodule pointers"
git push

bw unlock fails

Verify you are using bw (standard Bitwarden CLI), not bws (Secrets Manager CLI). bws does not work with self-hosted Vaultwarden.

bw --version     # should print e.g. "2024.x.x"
bw config server https://vault.yourserver.com
bw login
