# Knowledge Genome System
> A distributed, encrypted, multi-domain personal knowledge base.
> No vector database. No embedding pipeline. No external retrieval server.
Built on the [LLM Wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
by Andrej Karpathy — extended with a multi-domain submodule architecture,
AES-256-CTR encryption via git-crypt, Vaultwarden runtime key injection,
and a human-in-the-loop Git Flow for quality control.
---
## Table of Contents
1. [Core Philosophy](#core-philosophy)
2. [Architecture](#architecture)
3. [System Requirements](#system-requirements)
4. [Prerequisites](#prerequisites)
5. [Configuration](#configuration)
6. [Quick Start](#quick-start)
7. [Makefile Reference](#makefile-reference)
8. [Testing](#testing)
9. [Genome Lifecycle](#genome-lifecycle)
10. [Security Model](#security-model)
11. [Key Management](#key-management)
12. [Agent Sessions](#agent-sessions)
13. [Workflows](#workflows)
14. [Knowledge Quality](#knowledge-quality)
15. [Knowledge Schema](#knowledge-schema)
16. [Collaboration Model](#collaboration-model)
17. [Optional Extensions](#optional-extensions)
18. [Troubleshooting](#troubleshooting)
---
## Core Philosophy
Most RAG systems make the LLM rediscover knowledge from scratch on every query.
A document is indexed; at query time, relevant chunks are retrieved; an answer is generated.
Nothing accumulates. Ask a question requiring synthesis across five documents and the LLM
pieces it together from fragments every single time.
This system is different. Instead of retrieval at query time, the LLM
**incrementally builds and maintains a persistent wiki** that sits between you and the raw
sources. When a new source arrives, the LLM reads it, extracts key information, updates
entity and concept pages, flags contradictions with existing claims, and strengthens the
evolving synthesis. Knowledge is compiled once and kept current.
**The wiki is a compounding artifact.** Cross-references are already there.
Contradictions have been flagged. The synthesis already reflects everything ingested.
This means:
- No vector database.
- No embedding pipeline.
- No external retrieval infrastructure.
The `wiki/index.md` of each genome is the retrieval layer. At moderate scale
(~100 sources, hundreds of pages) this performs better than RAG because cross-references,
contradictions, and syntheses are already resolved — not re-derived per query.
The human's job: curate sources, direct analysis, ask good questions, review PRs.
The LLM's job: everything else — summarising, cross-referencing, filing, maintaining consistency.
---
## Architecture
### Repository structure
```text
master-knowledge-genome/ ← Root orchestrator (submodule registry)
├── core-karpathy/ ← LLM Wiki reference pattern (read-only submodule)
├── genome-dev/ ← Submodule: web development, Angular, TUI
├── genome-finance/ ← Submodule: personal finance, investments
├── genome-homelab/ ← Submodule: Keru infrastructure, network configs
└── AGENTS.md ← Global coordination schema (cross-genome rules)
```
> The genome names above (`genome-dev`, `genome-finance`, `genome-homelab`) are
> **illustrative** — they show the kind of multi-domain layout this orchestrator targets.
> The shipped `registry.sh` defines a single disposable sandbox, **`genome-test`**; you
> create real genomes yourself with `make add-genome` (see the registry examples below).
Each genome is an independent git repository:
```text
genome-{name}/
├── .gitattributes ← Encryption rules — **/private/** wildcard
├── .gitignore
├── .git/hooks/pre-commit ← Security hook (dynamic git check-attr)
├── AGENTS.md ← Per-genome agent contract and workflow rules
├── raw/ ← Immutable sources — LLM reads, never writes
│ ├── articles/ ← Web clips, saved articles
│ ├── transcripts/ ← Audio/video transcripts
│ ├── code-packs/ ← Code snippets and repositories
│ ├── assets/ ← Images, PDFs, binary files
│ └── private/ ← AES-256-CTR encrypted — owner only
└── wiki/ ← LLM-owned — agent creates and maintains
├── index.md ← Primary catalog (read first every session)
├── log.md ← Append-only operations ledger
├── sources/ ← One page per processed raw source
├── entities/ ← People, tools, organisations, projects
├── concepts/ ← Patterns, theories, architectural decisions
├── queries/ ← Preserved answers and conflict notes
└── private/ ← AES-256-CTR encrypted — owner only
```
### Three layers
| Layer | Path | Owner | Rule |
| ----------- | ----------- | ----------- | ----------------------------------------------------- |
| Raw sources | `raw/` | Human | Immutable. LLM reads only. Never modified. |
| Wiki | `wiki/` | LLM | Agent creates, updates, cross-links, maintains. |
| Schema | `AGENTS.md` | Human + LLM | Co-evolved contract defining structure and workflows. |
### Linked projects (optional)
A genome can optionally declare a **linked project repository** — a separate repo where
the knowledge in that genome is meant to be applied (e.g. `genome-dev` linked to an app
repo). The link is recorded as a third field in the registry and rendered into the
genome's `AGENTS.md` (`## Linked Project`). A genome with no link is _knowledge-only_ and
behaves exactly as before. See [Configuration](#configuration).
### Framework structure
```text
knowledge-genome-orchestrator/ ← This repository (setup tooling)
├── globals.env ← Static KEY=VALUE config (Make-includable)
├── registry.sh ← Bash-only: GENOMES array + dynamic paths
├── Makefile ← Entry point for all operations
├── lib/
│ ├── output.sh ← Terminal helpers (colors, log levels)
│ ├── deps.sh ← Dependency validation
│ ├── scaffold.sh ← Template rendering engine
│ ├── structure.sh ← Canonical genome layout (single source of truth)
│ ├── lint.sh ← Per-file validation functions
│ └── git-crypt.sh ← git-crypt lifecycle (init, export, verify, rotate)
├── providers/
│ ├── forgejo.sh ← Forgejo REST API provider
│ └── github.sh ← GitHub REST API provider
├── scripts/
│ ├── setup.sh ← Main entry point
│ ├── setup-master.sh ← Master repo initialisation
│ ├── setup-genomes.sh ← Genome provisioning loop
│ ├── add-genome.sh ← Add a single new genome
│ ├── lint-genomes.sh ← Quality control across all genomes
│ └── verify-genomes.sh ← Structure verify / --sync across all genomes
├── templates/
│ ├── agents-genome.md ← Per-genome agent contract template
│ ├── agents-master.md ← Master coordination schema template
│ ├── readme-master.md ← Master repo README template
│ ├── wiki-index.md ← Index template (rendered per genome)
│ ├── wiki-log.md ← Log template (rendered per genome)
│ ├── pr-description.md ← PR review checklist template
│ ├── pre-commit.sh ← Security hook template
│ ├── gitattributes ← Git encryption rules template
│ └── gitignore ← Git ignore template
├── skills/
│ └── ingest/ ← pi skill: deployed to the AI node (vm101)
│ ├── SKILL.md ← Semantic-only contract (read/edit, emits manifest)
│ ├── references/ ← On-demand reference docs for the agent
│ └── scripts/ ← Deterministic post-processor (runs outside the agent)
│ ├── run-ingest.sh ← Orchestrator: consumes the manifest, emits one JSON line
│ ├── slug.sh ← Slug normalisation
│ ├── index-append.py ← Sorted insert into wiki/index.md + last_updated bump
│ ├── log-append.sh ← Append a wiki/log.md entry
│ ├── scoped-lint.sh ← Lint only the pages touched this run (reuses lib/lint.sh)
│ └── open-pr.sh ← Branch / commit / push / open PR (DRY_RUN seam for tests)
└── tests/ ← bats suite — deterministic, no LLM/GPU (see Testing)
├── helpers.bash
├── scripts.bats
├── lint.bats
├── structure.bats
└── run-ingest.bats
```
> The `skills/ingest/` directory is version-controlled here but **deployed** to the AI
> node (vm101) under `~/.pi/agent/skills/ingest`. The agent (`pi`) does only semantic work
> and writes a manifest; `run-ingest.sh` does the mechanical steps. See [Workflows → Ingest](#ingest).
>
> `ingest-semantic.py` makes one schema-constrained call to the local model and returns JSON;
> `run-ingest.sh` then handles the index, log, lint, and PR steps. In short: semantic JSON
> extraction first, then deterministic wiki conformance plus the manifest.
>
> To deploy, run `cp -r skills/ingest/* ~/.pi/agent/skills/ingest/` after `make setup`. The
> skill is updated via `git pull` on the laptop and pushed to vm101 over SSH in the n8n flow.
---
## System Requirements
### Linux — full support (primary target)
All scripts are written for GNU/bash on Linux. Tested on Ubuntu 22.04+.
All tools (git-crypt, bw, qmd) have native Linux binaries.
### macOS — full support
All scripts are compatible with macOS. Requirements:
- bash 3.2+ (macOS default) — supported for the **setup scripts** (`make` targets, scaffolding).
Two things need bash 4+: the `ingest` skill (`mapfile`), which runs on the Linux AI node (not a
constraint on the macOS setup machine); and `gcrypt_rotate_key` (`compgen -G`), which **does**
run on the laptop. For key rotation on macOS, use Homebrew bash (`brew install bash`).
- GNU coreutils not required — BSD variants of `date`, `grep`, `sed` all handled.
- `git-crypt`: install via Homebrew — `brew install git-crypt`
- `jq`, `curl`: pre-installed or via Homebrew
If you use Homebrew bash (`brew install bash`), the scripts work identically to Linux.
### Windows — WSL2 only
**Git Bash and native Windows are not supported.**
Reasons:
- `git-crypt` has no native Windows binary.
- Process substitution `<(...)` used for runtime key injection is not available
in Git Bash or PowerShell.
- Several bash features used throughout (`compgen`, `BASH_SOURCE`, arrays) are
bash-specific and not available in PowerShell or cmd.
**WSL2 (Windows Subsystem for Linux)** with Ubuntu gives full compatibility.
All setup and runtime operations work identically to native Linux inside WSL2.
### Hardware recommendations
The system is designed for a homelab architecture:
| Component | Recommended | Role |
| --------------- | ------------------------- | --------------------------------------------------------------- |
| Storage node | Any Linux server with NFS | Hosts Forgejo, stores genome repos |
| AI compute node | GPU server (16GB+ VRAM) | Runs local LLM agent sessions |
| VRAM | 16GB minimum | 14B model at Q5_K_M ≈ 10GB weights; ~6GB for KV cache |
| Local LLM | 14B–32B quantised | Active wiki maintenance sessions |
| Large LLM | 70B (async) | Deep reflection, complex synthesis (scheduled, not interactive) |
> **On VRAM constraints:** with a 16GB card and a 14B model, the KV cache budget
> is ~6GB — approximately 32k tokens of effective context. Every token in `AGENTS.md`,
> the index, and the log tail is a cost. This is why all agent files are token-optimised
> and sessions are kept to one source at a time.
> **Reference deployment:** the table above is a target profile, not a hard requirement.
> The current setup runs a single 16GB GPU (RTX 5060 Ti) with a ~9B model for interactive
> ingest, and offloads heavy/async synthesis to a cloud model. Smaller models work — they
> just make the "one source per session" discipline and the token budget matter more.
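
The budget in the note above can be checked with shell arithmetic. This is a rough sketch that uses only the figures quoted in this section (16GB card, about 10GB of weights, about 32k tokens of context). Treating GB as GiB and ignoring runtime overhead are simplifying assumptions, so read the result as an order of magnitude.

```bash
#!/usr/bin/env bash
# Rough KV-cache budget from the figures quoted above.
# Assumption: GB is treated as GiB and runtime overhead is ignored.
vram_gib=16            # total VRAM on the card
weights_gib=10         # approx. 14B model at Q5_K_M
context_tokens=32768   # ~32k tokens of effective context

kv_budget_gib=$(( vram_gib - weights_gib ))
bytes_per_token=$(( kv_budget_gib * 1024 * 1024 * 1024 / context_tokens ))
kib_per_token=$(( bytes_per_token / 1024 ))

echo "KV cache budget: ${kv_budget_gib} GiB"
echo "Implied cost per token: ${kib_per_token} KiB"
```

Under these assumptions every token loaded into context costs a fixed slice of the cache, which is the reason the agent files are kept short.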
---
## Prerequisites
### Required
| Tool | Purpose |
| ----------- | -------------------------------- |
| `git` | Version control |
| `git-crypt` | Transparent file encryption |
| `curl` | REST API calls to Forgejo/GitHub |
| `jq` | JSON parsing |
### Optional
| Tool | Purpose |
| ----- | ----------------------------------------------------------------------- |
| `bw` | Bitwarden CLI — runtime key injection from Vaultwarden (no key on disk) |
| `qmd` | Local BM25 + vector search for Markdown files with MCP server interface |
> **`bw` vs `bws`:** Use `bw` (standard Bitwarden CLI). `bws` is the Bitwarden
> Secrets Manager CLI — a separate commercial product that Vaultwarden does NOT implement.
### Install on Ubuntu/Debian
```bash
sudo apt update && sudo apt install -y git git-crypt curl jq
```
### Install on macOS
```bash
brew install git git-crypt curl jq
```
### Install Bitwarden CLI
```bash
# Linux
npm install -g @bitwarden/cli
# macOS
brew install bitwarden-cli
```
### Verify all tools
```bash
make doctor
```
---
## Configuration
Configuration is split into two files with distinct purposes:
### `globals.env` — static KEY=VALUE
Safe for Make's `include` directive, `docker-compose`, shell `source`, and any standard env parser.
Contains only simple scalar values — no bash syntax, no arrays.
```bash
# Provider selection
PROVIDER=forgejo # forgejo | github
# Forgejo (active when PROVIDER=forgejo)
FORGEJO_URL=https://git.yourserver.com
FORGEJO_USER=yourusername
FORGEJO_SSH_PORT=222 # Default for many homelab Forgejo setups; 22 for standard
# GitHub (active when PROVIDER=github — uncomment to use)
# GITHUB_USER=your-username
# GITHUB_ORG=your-org # Optional: for org repos, overrides GITHUB_USER
# Vaultwarden
VAULTWARDEN_URL=https://vault.yourserver.com
# Master repository
MASTER_REPO=master-knowledge-genome
GIST_URL=https://gist.github.com/442a6bf555914893e9891c11519de94f.git
```
### `registry.sh` — bash runtime config
Sourced by shell scripts only. Contains the genome registry array and dynamic path
resolution. Never included by Make.
```bash
# Dynamic paths (resolved at source time)
WORK_DIR="${HOME}/knowledge-genome-orchestrator"
KEYS_DIR="${WORK_DIR}/keys"
# Genome registry, format: "name|description|linked_repo|cross_source"
# The third and fourth fields are OPTIONAL:
# - linked_repo empty → knowledge-only genome (no linked project)
# - linked_repo owner/repo → genome is linked to that project repository (rendered into AGENTS.md)
# - cross_source → yes|no (default no): whether the cross-genome collector may read this genome as a source
GENOMES=(
"genome-dev|Web development, TUI, Angular, software architecture|myorg/my-app|no"
"genome-finance|Personal finance, investments, market analysis||no"
"genome-homelab|Infrastructure, network configs, architecture logs||no"
)
```
To add a genome to the registry before running setup, append a line to `GENOMES`.
After initial setup, use `make add-genome` instead.
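
As an illustration of the entry format, the sketch below splits one registry line into its fields. `parse_genome_entry` is a hypothetical helper, not a function shipped in `registry.sh`; the `none` fallback for an empty link mirrors the `{{LINKED_PROJECT}}` placeholder, and the `no` fallback mirrors the documented `cross_source` default.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split one registry entry into its four fields.
# Missing optional fields come back empty and are replaced by their defaults.
parse_genome_entry() {
  local entry="$1" name desc linked cross
  IFS='|' read -r name desc linked cross <<< "$entry"
  printf '%s\n' \
    "name=${name}" \
    "desc=${desc}" \
    "linked=${linked:-none}" \
    "cross_source=${cross:-no}"
}

parse_genome_entry "genome-dev|Web development|myorg/my-app|no"
parse_genome_entry "genome-finance|Personal finance, investments"
```

Because `|` is not whitespace, `read` keeps empty fields between consecutive delimiters, so an entry such as `name|desc||no` still parses with an empty link.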
### Tokens
Tokens are never stored in config files. Export them in your shell before running setup:
```bash
export FORGEJO_TOKEN="your_forgejo_token"
# or
export GITHUB_TOKEN="your_github_token"
```
---
## Quick Start
```bash
# 1. Clone the setup framework
git clone <setup-repo-url> knowledge-genome-orchestrator
cd knowledge-genome-orchestrator
# 2. Configure your environment
cp globals.env.example globals.env # edit with your values
# Edit registry.sh to define your genomes
# 3. Export your provider token
export FORGEJO_TOKEN="your_token_here"
# 4. Verify dependencies
make doctor
# 5. Run full setup
make setup
```
`make setup` executes in order:
1. **Dependency check** — verifies all required tools are installed
2. **Git identity check** — warns if `user.name` / `user.email` are not configured
3. **Master repo** — creates `master-knowledge-genome` on Forgejo, scaffolds with
`AGENTS.md` and `README.md`, initialises git, adds `core-karpathy` as submodule, pushes
4. **Genome provisioning** — for each genome in `registry.sh`:
- Creates remote repository on Forgejo
- Adds it as a submodule in the master repo
- Initialises git-crypt (**before any files are created**)
- Scaffolds directory structure and renders all templates
- Installs pre-commit security hook
- Commits, pushes genome to remote
- Exports symmetric key to `keys/<genome>.key`
- Prints Vaultwarden upload instructions
- Commits submodule pointer in master repo
After setup completes:
- Upload all files in `keys/` to Vaultwarden (see [Key Management](#key-management))
- Delete key files from disk: `rm keys/*.key`
---
## Makefile Reference
| Target | Description |
| ----------------------------------------------------- | ------------------------------------------------------------------------------------- |
| `make setup` | Full system initialisation — master repo + all genomes in `registry.sh` |
| `make add-genome NAME=x DESC="y" [LINKED=owner/repo]` | Scaffold and register a single new genome (optional linked project) |
| `make lint` | Run quality checks across all genomes (schema, privacy, decay, page size) |
| `make verify-structure` | Report directory drift of each genome vs the canonical layout (`lib/structure.sh`) |
| `make sync-structure` | Create any missing canonical directories across all genomes (safe, idempotent) |
| `make test` | Run the bats test suite (deterministic; no LLM/GPU/network) — see [Testing](#testing) |
| `make status` | Show submodule status and per-genome git-crypt encryption state |
| `make lock` | Lock all encrypted repos (master + all genome submodules) |
| `make doctor` | Verify required tools: git, git-crypt, curl, jq; warn if bw missing |
| `make sync` | `git submodule update --init --recursive` + report unpushed commits per genome |
| `make help` | Print all available targets |
### Examples
```bash
# Check system health
make doctor
# Add a new genome after initial setup
make add-genome NAME=genome-research DESC="Academic papers and deep research"
# Add a genome linked to a project repository
make add-genome NAME=genome-dev DESC="Web development" LINKED=myorg/my-app
# Check every genome against the canonical directory layout
make verify-structure
# Run full lint pass (bash deterministic checks)
make lint
# Sync all nodes after pulling on another machine
make sync
# Emergency lock — secures all repos before leaving a session
make lock
```
---
## Testing
The mechanical layer (slug, index, log, lint, structure, the ingest orchestrator) is
covered by a [bats](https://github.com/bats-core/bats-core) suite. The tests are
**deterministic and have zero dependency on the LLM, the GPU, or the network** — they
simulate the agent's output with fixtures and exercise the scripts directly, so they run
anywhere git + bash live (laptop, CI, a git hook). They are **not** meant to run on the AI
node or via n8n.
```bash
sudo apt install bats # once
make test # or: bats tests/
```
| File | Covers |
| ----------------- | ------------------------------------------------------------------------------ |
| `scripts.bats` | `slug.sh`, `log-append.sh`, `index-append.py` (insert, sort, bump, idempotent) |
| `lint.bats` | `lib/lint.sh` validators + `scoped-lint.sh` |
| `structure.bats` | `lib/structure.sh` report / sync |
| `run-ingest.bats` | `run-ingest.sh` end-to-end (DRY_RUN, local bare remote) — needs `jq` |
Each test builds its own throwaway genome with a local bare remote, configured to ignore
the operator's global git settings (signing, global hooks) so the suite is hermetic. The
`run-ingest` tests auto-`skip` if `jq` is absent. If you change the canonical layout in
`lib/structure.sh`, update `FIXTURE_DIRS` in `tests/helpers.bash` to match.
> Why this matters: the only non-deterministic part of the system is the model. Pinning
> the mechanical layer with tests means that when an ingest misbehaves, you know it's the
> model or the prompt — not the plumbing.
---
## Genome Lifecycle
### Initial setup
All genomes defined in `registry.sh` are provisioned by `make setup`.
### Adding a genome after initial setup
```bash
make add-genome NAME=genome-newname DESC="Domain description"
```
This: creates the remote repo, adds it as a submodule, initialises git-crypt,
scaffolds the directory structure, installs the pre-commit hook, commits and pushes,
exports the key, and commits the submodule pointer in master.
After adding: upload the new key to Vaultwarden and delete the key file.
### Removing a genome
Manual process:
```bash
# In master repo
git submodule deinit genome-name
git rm genome-name
git commit -m "chore: remove genome-name submodule"
git push
# Archive or delete the remote repository on Forgejo
```
### Template rendering
When a genome is scaffolded, `render_template` replaces these placeholders in all
template files:
| Placeholder | Source | Example |
| ----------------------- | ----------- | ------------------------------ |
| `{{GENOME_NAME}}` | registry.sh | `genome-dev` |
| `{{GENOME_NAME_UPPER}}` | derived | `GENOME-DEV` |
| `{{GENOME_DESC}}` | registry.sh | `Web development...` |
| `{{LINKED_PROJECT}}` | registry.sh | `myorg/my-app` (or `none`) |
| `{{FORGEJO_URL}}` | globals.env | `https://git.yourserver.com` |
| `{{FORGEJO_USER}}` | globals.env | `yourusername` |
| `{{VAULTWARDEN_URL}}` | globals.env | `https://vault.yourserver.com` |
| `{{MASTER_REPO}}` | globals.env | `master-knowledge-genome` |
| `{{DATE}}` | runtime | `2026-05-11` |
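
To make the substitution step concrete, here is a minimal sketch of placeholder rendering with `sed`. It is a stand-in written for illustration, not the shipped `render_template`; the function name `render_sketch` and its `KEY=value` argument style are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for render_template: substitute {{KEY}} placeholders via sed.
# Usage: render_sketch <template-file> KEY=value [KEY=value ...]
render_sketch() {
  local template="$1"; shift
  local args=() pair key value
  for pair in "$@"; do
    key="${pair%%=*}"
    value="${pair#*=}"
    # Escape characters that are special on the replacement side of s|...|...|
    value=$(printf '%s' "$value" | sed -e 's/[\\&|]/\\&/g')
    args+=(-e "s|{{${key}}}|${value}|g")
  done
  sed "${args[@]}" "$template"
}
```

A call such as `render_sketch templates/wiki-index.md GENOME_NAME=genome-dev DATE=2026-05-11` would print the rendered file to stdout. Using `|` as the `sed` delimiter keeps URL values like `https://git.yourserver.com` from needing extra escaping.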
---
## Security Model
### Encryption architecture
Each genome uses a unique symmetric AES-256-CTR key managed by git-crypt.
Two directories in every genome are always encrypted:
| Directory | Contents | On remote |
| --------------- | --------------------------- | ------------------ |
| `raw/private/` | Sensitive source material | Opaque binary blob |
| `wiki/private/` | Private synthesis and notes | Opaque binary blob |
All other directories (`raw/articles/`, `wiki/sources/`, etc.) are plaintext.
Collaborators without the key can contribute to public directories normally —
git handles encrypted files transparently.
### `.gitattributes` — dynamic encryption rules
Encryption rules use a glob wildcard that catches any `private/` directory at
any depth in the repository — including directories created at runtime by the LLM:
```gitattributes
# Text rules first
*.md text eol=lf
*.sh text eol=lf
# Encryption rules LAST (later rules override per-attribute)
# **/private/** ensures -text overrides *.md text=lf, preventing EOL corruption
**/private/** filter=git-crypt diff=git-crypt -text
```
> Rule ordering matters: in `.gitattributes`, the last matching rule wins per attribute.
> Encryption rules must come after text rules so `-text` overrides `text eol=lf`
> for encrypted markdown files.
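
The ordering rule can be checked with `git check-attr`, which reports the attributes git resolves for a path. The sketch below runs in a throwaway repository; the file names are made up for illustration, and `git check-attr` does not need them to exist.

```bash
#!/usr/bin/env bash
# Sketch: confirm which attributes win for a public and a private Markdown file.
scratch=$(mktemp -d)
git -C "$scratch" init -q
cat > "$scratch/.gitattributes" <<'EOF'
*.md text eol=lf
**/private/** filter=git-crypt diff=git-crypt -text
EOF

# Print the resolved text and filter attributes for one path.
check_attrs() {
  git -C "$scratch" check-attr text filter -- "$1"
}

check_attrs wiki/concepts/example.md
check_attrs wiki/private/example.md
```

For the private path, `text` should resolve to `unset` and `filter` to `git-crypt`; swapping the two rules in the file is a quick way to see why the order matters.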
### Pre-commit hook — dynamic validation
The security hook installed at `.git/hooks/pre-commit` validates every staged file
dynamically — it reads encryption requirements from `.gitattributes` at runtime
rather than checking hardcoded paths:
```bash
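# For each staged file, check if git-crypt encryption is required
filter=$(git check-attr filter -- "$file" | sed 's/.*: //')
if [[ "$filter" == "git-crypt" ]]; then
  # Verify the file is actually encrypted (-i also matches an upper-case warning)
  if git-crypt status "$file" | grep -qi "not encrypted"; then
    # Block the commit: a non-zero exit from the hook aborts it
    echo "ERROR: $file must be encrypted but is staged in plaintext" >&2
    exit 1
  fi
fi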
```
This means: any file matching `**/private/**` in `.gitattributes` is protected,
including future `private/` directories created anywhere in the repo.
The hook never needs updating when the encryption rules change.
### Untrusted agent output — manifest validation
The ingest agent's output is stochastic: a hallucinated manifest could carry a missing field,
a wrong type, or a malicious path such as `wiki/../../etc/passwd`. `run-ingest.sh` therefore
**validates the manifest before trusting any field** — it must be well-formed JSON with a
string `raw_source` and an array `pages`, and **every `path` must be a string under `wiki/`
with no `..`**. Anything else fails fast with a structured `{"status":"error"}` and no
filesystem access outside the wiki, so a bad path can't drive a read or a lint outside the
knowledge tree. This is the trust boundary between the (stochastic) model and the
(deterministic, tested) post-processor.
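
The checks described above could look roughly like the sketch below. It is not the shipped `run-ingest.sh` code: both function names are hypothetical, and the `jq` filter assumes that each element of `pages` is an object with a `path` field.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the manifest checks; not the shipped run-ingest.sh code.

# Path rule from the text above: a string under wiki/ that contains no "..".
is_safe_wiki_path() {
  local path="$1"
  [[ "$path" == wiki/* && "$path" != *..* ]]
}

# Shape rule, assuming each element of .pages is an object with a .path field.
# Malformed JSON makes jq exit non-zero, so it is rejected as well.
manifest_shape_ok() {
  jq -e '
    (.raw_source | type == "string")
    and (.pages | type == "array")
    and all(.pages[]; (.path | type == "string")
                      and (.path | startswith("wiki/"))
                      and (.path | contains("..") | not))
  ' "$1" > /dev/null
}

for candidate in "wiki/entities/git-crypt.md" "wiki/../../etc/passwd" "raw/articles/x.md"; do
  if is_safe_wiki_path "$candidate"; then
    echo "accept $candidate"
  else
    echo "reject $candidate"
  fi
done
```

Rejecting any `..` substring is stricter than rejecting only `..` path segments. It can refuse an unusual but harmless file name, which seems an acceptable trade at a trust boundary.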
### PRIVATE_CONTEXT toggle
The `PRIVATE_CONTEXT` toggle in `AGENTS.md` controls whether the LLM agent
accesses encrypted directories. It must be declared explicitly by the operator
at the start of every session:
```text
PRIVATE_CONTEXT: disabled ← Default. private/ directories are treated as non-existent.
PRIVATE_CONTEXT: enabled ← Agent may read/write private/. Requires git-crypt unlock.
```
Rules:
- Never inferred. Never carried over from a previous session.
- `enabled` requires the operator to confirm that `git-crypt unlock` has run on the host.
- Per-genome, per-session: enabling for `genome-finance` does NOT enable for `genome-dev`.
- Cloud LLM models: `PRIVATE_CONTEXT` must always be `disabled`. Private data never leaves the local network.
- All outputs derived from private data are prefixed `[PRIVATE DATA INCLUDED]`.
- Private synthesis goes exclusively to `wiki/private/` — never to public wiki paths.
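
An orchestrator could enforce the first and fourth rules mechanically before a session starts. The sketch below is one possible guard, not part of the shipped tooling; the function name and the `local`/`cloud` argument are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical guard: resolve the effective PRIVATE_CONTEXT for one session.
# $1 = state declared by the operator (may be empty), $2 = model location: local | cloud
resolve_private_context() {
  local declared="${1:-disabled}" location="$2"
  if [[ "$location" == "cloud" ]]; then
    echo "disabled"   # cloud models never see private data
  elif [[ "$declared" == "enabled" ]]; then
    echo "enabled"    # explicit declaration for a local model
  else
    echo "disabled"   # anything else falls back to the default
  fi
}

resolve_private_context "enabled" "cloud"
resolve_private_context "" "local"
```

Silently downgrading to `disabled` is a design choice made for the sketch; failing loudly when a cloud session requests `enabled` would be an equally reasonable variant.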
### Runtime key injection — zero disk policy
Encryption keys are never stored as persistent files on the AI server.
They are injected at session start via the Bitwarden CLI (`bw`) against
your self-hosted Vaultwarden instance, using process substitution:
```bash
# Step 1: authenticate
bw config server https://vault.yourserver.com
export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw)
# Step 2: unlock genome (key lives only in a kernel file descriptor — never touches disk)
git-crypt unlock <(
bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d
)
```
The key flows: Vaultwarden → `bw get notes` → `base64 -d` → kernel pipe → `git-crypt`.
At no point is the key written to any file on disk.
Lock a genome when the session ends:
```bash
git-crypt lock
```
---
## Key Management
> This section is for the operator. These commands are never issued by the LLM agent.
### Vaultwarden Secure Notes
Each genome key is stored as a base64-encoded Secure Note in Vaultwarden:
| Genome | Vaultwarden Note Name |
| ---------------- | --------------------- |
| `genome-dev` | `genome-dev key` |
| `genome-finance` | `genome-finance key` |
| `genome-homelab` | `genome-homelab key` |
After `make setup` or `make add-genome`, key files are exported to `keys/`.
Upload procedure:
```bash
# Encode the key
base64 < keys/genome-dev.key
# Paste the output into a Vaultwarden Secure Note named "genome-dev key"
# Then delete the key file
rm keys/genome-dev.key
```
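
Before deleting a key file, it may be worth confirming that the stored note decodes back to the same bytes. The helper below is a hypothetical sketch: it reads base64 text on stdin and compares the decoded result with the key file.

```bash
#!/usr/bin/env bash
# Hypothetical check: does the base64 text on stdin decode to exactly this key file?
# Usage: <base64 text> | verify_encoded_key <key-file>
verify_encoded_key() {
  base64 -d | cmp -s - "$1"
}

# Example (requires an unlocked bw session, so it is left commented out):
# bw get notes "genome-dev key" --session "$BW_SESSION" \
#   | verify_encoded_key keys/genome-dev.key && rm keys/genome-dev.key
```

Chaining `rm` after the check means the file is only removed once the note has been read back successfully.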
### Cloning on a new machine
```bash
# Full clone with all submodules
git clone --recurse-submodules \
https://git.yourserver.com/yourusername/master-knowledge-genome.git
# Unlock a specific genome (with key file — development only)
cd master-knowledge-genome/genome-dev
git-crypt unlock /path/to/genome-dev.key
# Unlock via Vaultwarden (recommended — no key on disk)
export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw)
git-crypt unlock <(bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d)
# Single-genome clone (no master repo): for a collaborator who only needs one genome
git clone https://git.yourserver.com/yourusername/genome-dev.git
```
### Key rotation (emergency)
If a key is lost or compromised:
```bash
# From the knowledge-genome-orchestrator/ directory
source lib/git-crypt.sh
# If gcrypt_rotate_key operates on the CWD: cd into .../master-knowledge-genome/genome-dev
# If it navigates by name instead: cd into .../master-knowledge-genome
cd ~/knowledge-genome-orchestrator/master-knowledge-genome
gcrypt_rotate_key "genome-dev"
```
> **macOS:** `gcrypt_rotate_key` uses `compgen -G` (bash 4+). The stock macOS bash 3.2 is not
> enough — run rotation under Homebrew bash (`brew install bash`).
`gcrypt_rotate_key` performs:
1. Unlocks repo with existing key
2. Removes old key material
3. Generates new symmetric key via `git-crypt init`
4. Re-stages and commits private files (encrypted with new key)
5. Exports new key to `keys/`
6. Prints Vaultwarden update instructions
> **Limitation:** git history still contains blobs encrypted with the old key.
> Anyone with the old key and git history access can decrypt them. To purge old
> encrypted blobs from history:
>
> ```bash
> git filter-repo --invert-paths --path raw/private --path wiki/private
> git push --force origin main
> ```
>
> This rewrites all commit hashes — coordinate with any collaborators first.
After rotation:
- Upload new key to Vaultwarden (replace existing note)
- Delete both `keys/genome-dev.key` and `keys/genome-dev-rotated-*.key` from disk
- Revoke access from previous key holders
---
## Agent Sessions
### Prerequisites for every session
Before starting an LLM agent session on a genome:
1. The host (AI server) runs `git-crypt unlock` for the required genomes
2. The orchestrator prepares context: `tail -n 20 wiki/log.md`
3. Declare `PRIVATE_CONTEXT` state explicitly in the opening prompt
### Session start protocol
The agent executes in this order at the start of every session:
1. Read `wiki/index.md` — primary catalog of all pages and maturity
2. Read last 20 log entries (injected by orchestrator — does NOT open `wiki/log.md` directly)
3. For tasks involving related pages: if the optional `qmd` extension is installed,
`qmd search "<query>"` before opening files; otherwise navigate from `wiki/index.md`
4. Operate on individual files — never scan entire directories
### One source per session
With a 14B model and ~6GB KV cache budget, long sessions degrade.
As the session extends, the context fills with pages already created,
attention dilutes, and later entities receive worse cross-references than earlier ones.
**Hard rule: one source per session.**
If multiple sources are queued in `raw/`, process only the first.
Commit, close the session. The orchestrator (n8n or script) starts a new session
for the next source with a clean KV cache.
For automated pipelines: if 5 files arrive in `raw/`, trigger 5 agent sessions
sequentially — not one session with 5 files.
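
The sequencing rule can be sketched as a plain loop. `run_session` is a hypothetical placeholder for whatever starts the agent, and the loop leaves out the human review gate described under Workflows.

```bash
#!/usr/bin/env bash
# Sketch: one agent session per queued source, strictly sequential.
# run_session is a hypothetical placeholder for whatever starts the agent.
run_session() {
  echo "session for: $1"
}

process_queue() {
  local source
  for source in "$@"; do
    run_session "$source" || return 1   # stop the queue if a session fails
  done
}

process_queue raw/articles/a.md raw/articles/b.md
```

Each iteration stands for a fresh process with a clean KV cache, so nothing from one source leaks into the context of the next.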
### n8n automation
For Forgejo webhook → automated ingest:
1. Forgejo sends webhook on push to `raw/`
2. n8n receives webhook, identifies new files
3. n8n starts one agent session per new file (sequential, not parallel)
4. Each session: realign the checkout to the base (`git switch <base> && git reset --hard origin/<base>`), then inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
5. Phase 1 agent (`/skill:ingest`) writes the manifest; Phase 2 `run-ingest.sh` opens the PR, then **stops**
6. Human reviews — **merge to accept**, or close the PR + delete the `feat` branch to reject
---
## Workflows
### Ingest
Triggered by a new file in `raw/` (manual or via webhook). Ingest is split into two
phases so that the small local model spends its limited context only on judgement, and
all the deterministic bookkeeping happens outside the model's loop.
**Phase 1 — agent (semantic only).** The `ingest` skill gives the agent read/edit tools
only (no shell). It:
1. Reads the source once
2. Creates `wiki/sources/<slug>.md` — summary and key points
3. Per entity (person, tool, organisation): creates or updates `wiki/entities/<name>.md`
4. Per concept (pattern, theory, decision): creates or updates `wiki/concepts/<name>.md`
5. Checks each touched page for contradictions → applies Conflict Resolution if found
6. Writes `.ingest-manifest.json` (the list of pages it created/modified, the model name,
a one-line reasoning, the PR summary, and any contradictions) — then **stops**
**Phase 2 — `run-ingest.sh` (deterministic, outside the agent).** The post-processor first
**validates the manifest** — well-formed JSON, expected shape, and every page path confined to
`wiki/` with no `..` (see [Security Model](#security-model)) — then does the mechanical work the
model must not waste context on:
7. Inserts each page into the correct `wiki/index.md` section **in alphabetical order**,
deduplicated by wikilink (a re-ingest updates the entry, never duplicates it), and bumps the
index `last_updated` (`index-append.py`)
8. Appends the `INGEST | <slug>` entry to `wiki/log.md` (the model name comes from the
orchestrator via `INGEST_MODEL` — the agent cannot reliably know its own tag)
9. Runs scoped lint on exactly the pages touched this run (`scoped-lint.sh`, reusing
`lib/lint.sh`), including a **duplicate-slug advisory**: a slug created this run that is
highly similar to an entity/concept already in `wiki/index.md` is flagged in the PR so a
human can merge them. It is advisory only — it never fails the lint or blocks the PR
(threshold tunable via `KG_DUP_THRESHOLD`, default 70)
10. Commits **only `wiki/`** on `feat/ai-ingest-<slug>` and opens a PR against the integration
base (`INGEST_BASE`, default `main`); the body matches the `templates/pr-description.md`
structure (Summary / Pages / Contradictions / Scoped Lint)
11. Emits a single compact JSON line (status, slug, PR url, lint_clean, conflict) for n8n
The agent never runs git, never edits the index/log mechanically, and never lints — those
are deterministic and tested (see [Testing](#testing)). Invocation on the AI node:
```bash
pi --mode json -p "/skill:ingest raw/articles/<file>.md" # phase 1 → writes manifest
run-ingest.sh <genome> # phase 2 → index/log/lint/PR
```
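The manifest check that opens phase 2 can be pictured with a minimal Python sketch. It is an illustration only: the `pages` key and the `validate_manifest` name are assumptions (the text above says only that the manifest lists the pages created or modified), and the real `run-ingest.sh` may validate differently.

```python
import json
from pathlib import PurePosixPath


def validate_manifest(text: str) -> list[str]:
    """Return the page paths from a manifest, or raise ValueError.

    Assumes a hypothetical shape {"pages": ["wiki/...", ...]}.
    """
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"manifest is not well-formed JSON: {exc}") from exc

    if not isinstance(manifest, dict) or not isinstance(manifest.get("pages"), list):
        raise ValueError("manifest must be an object with a 'pages' list")

    for page in manifest["pages"]:
        if not isinstance(page, str):
            raise ValueError(f"page path is not a string: {page!r}")
        parts = PurePosixPath(page).parts
        # Confine every path to wiki/ and reject any '..' component.
        if len(parts) < 2 or parts[0] != "wiki" or ".." in parts:
            raise ValueError(f"page path escapes wiki/: {page!r}")
    return manifest["pages"]
```

Rejecting `..` on the path components, rather than on the raw string, keeps legitimate file names that merely contain two dots from being refused.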
For private sources (`PRIVATE_CONTEXT: enabled` required):
- All output goes to `wiki/private/<slug>.md` only
- PR title: `[PRIVATE] ingest: <slug>`
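Step 7 of phase 2 (alphabetical insertion, deduplicated by wikilink) might look roughly like the sketch below. The index line format (`- [[path]] description`) and the `upsert_entry` helper are hypothetical; this is not the actual `index-append.py`.

```python
import bisect
import re

# Captures the link target, stopping at an alias (|) or heading (#) separator.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def _key(line: str) -> str:
    """Sort and dedup key: the wikilink target, lower-cased."""
    match = WIKILINK.search(line)
    return (match.group(1) if match else line).strip().lower()


def upsert_entry(section: list[str], entry: str) -> list[str]:
    """Insert `entry` into an already-sorted list of index lines.

    A line pointing at the same wikilink is replaced, never duplicated.
    """
    kept = [line for line in section if _key(line) != _key(entry)]
    keys = [_key(line) for line in kept]
    kept.insert(bisect.bisect_left(keys, _key(entry)), entry)
    return kept
```

Because an existing line with the same target is dropped before the insert, re-ingesting a source updates its entry in place.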
**Branch lifecycle & the manual gate.** `run-ingest.sh` / `open-pr.sh` are deliberately
"dumb": they create the `feat/ai-ingest-<slug>` branch, commit only `wiki/`, open the PR, and
stop. They never reset, revert, or touch the integration branch — that lifecycle belongs to
the orchestrator, around the human gate:
- **Before each session** the orchestrator realigns the checkout to the base
(`git fetch && git switch <base> && git reset --hard origin/<base>`) — a reset of the _local_
checkout to match the remote, never a force-push to the shared branch.
- **After the PR opens, everything stops** until a human approves: one source per session,
sequential, no new ingest until the pending PR is closed.
- **Approve = merge. Reject = close the PR and delete the remote `feat` branch.** To undo an
already-merged ingest, open a _revert PR_ against the base — never rewrite history on a
shared branch.
The PR base is configurable via `INGEST_BASE` (default `main`). Per-page `maturity` already
encodes stability and tags/releases mark versioned snapshots, so `main` is the integration
branch today. If a linked project later _consumes_ a genome, set `INGEST_BASE=develop` to
buffer ingests on `develop` and cut manual `develop → main` releases — no code change.
### Query
Triggered by an operator question.
1. `qmd search "<query>"` (if the optional qmd extension is installed) → identify
candidate pages; otherwise start from `wiki/index.md`
2. Read candidate pages directly (qmd already returns file paths — no intermediate index lookup)
3. Synthesise answer with `[[wikilink]]` citations
4. If answer is non-trivial: save as `wiki/queries/<slug>.md` and append to index
5. Append log entry: `QUERY | <subject>`
For general orientation without a specific query: read `wiki/index.md` directly.
### Lint
The lint workflow is split between deterministic bash checks and semantic LLM judgment.
**Step 1 — operator runs bash linter:**
```bash
make lint
```
The bash linter checks automatically:
- YAML frontmatter validity (all mandatory fields present)
- Domain consistency (domain field matches genome name)
- Type validity (value from allowed list)
- Privacy consistency (`private/` directories have `private: true`)
- Page size (warn at 400 lines, error at 800 lines)
- Knowledge decay (stable > 180 days, draft > 90 days)
- Broken internal wikilinks (warnings only — cross-type links produce expected false positives)
**Step 2 — operator provides bash output to LLM agent:**
The agent applies the semantic judgment that the bash linter cannot:
- **Orphan pages** (from bash list): for each orphan, identify 1-3 existing pages
that should link to it; propose specific additions
- **Implicit concepts** (from bash term frequency list): determine if a candidate
term warrants a dedicated page; draft stub if yes
- **Duplicate concepts**: `qmd search "<concept>"` for suspected duplicates;
propose merge if confirmed
- **Maturity promotion**: pages with 2+ sources still marked `draft` → propose `stable`
The agent reports all findings as a structured list and does not modify files
without operator approval. It then appends a `LINT | <summary>` log entry.
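For intuition, the orphan list handed to the agent amounts to "pages with no inbound wikilink". The sketch below is a Python illustration, not the project's bash linter. It assumes page ids such as `concepts/foo`, and it assumes that links from `index` and `log` do not count as inbound (otherwise every indexed page would look connected). `find_orphans` is a hypothetical name.

```python
import re

WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def find_orphans(pages: dict[str, str], hubs: tuple[str, ...] = ("index", "log")) -> list[str]:
    """Return ids of pages that no other content page links to.

    `pages` maps a page id to its Markdown text.
    """
    inbound = {page_id: 0 for page_id in pages if page_id not in hubs}
    for source, text in pages.items():
        if source in hubs:
            continue  # catalog pages link to everything; ignore them
        targets = {m.group(1).strip() for m in WIKILINK.finditer(text)}
        for target in targets:
            if target != source and target in inbound:
                inbound[target] += 1
    return sorted(page_id for page_id, count in inbound.items() if count == 0)
```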
---
## Knowledge Quality
### PR review workflow
Every agent session that modifies wiki pages opens a PR.
The PR description uses `templates/pr-description.md`:
```markdown
## Summary
One sentence: goal of this session and source processed.
## Pages Created
| Path | Type | Maturity |
## Pages Modified
| Path | Change |
## Contradictions Found
[ ] None / [ ] n conflict file(s) created
## Private Data Accessed
[ ] No (PRIVATE_CONTEXT: disabled) / [ ] Yes
## Scoped Lint (post-ingest)
[ ] Frontmatter valid [ ] No broken links [ ] No issues found
```
This makes human review fast and structured: read the table, scan the diff,
approve or request changes. No exploration required to understand what the agent did.
### Conflict resolution
When new evidence contradicts an existing wiki claim:
1. Keep the existing page unchanged
2. Create `wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md` with:
- The existing claim and its source
- The contradicting evidence and its source
- Agent confidence assessment for each
- Recommendation: `accept_b` | `keep_a` | `requires_human_review`
3. Add entry to `wiki/index.md` → Conflicts Pending Review section
4. Log entry: `CONFLICT | <concept>`
5. Open PR: `[CONFLICT] <concept> — human review required`
The operator resolves the conflict, updates the relevant pages, and closes the PR.
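The conflict file name in step 2 can be derived mechanically. A minimal sketch, assuming the concept name is slugified to lowercase words joined by hyphens (the slug rule and the `conflict_path` helper are assumptions, not taken from the project's scripts):

```python
import re
from datetime import date

# Recommendation values listed in step 2 above.
RECOMMENDATIONS = {"accept_b", "keep_a", "requires_human_review"}


def conflict_path(concept: str, day: date) -> str:
    """Build wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md for a concept name."""
    slug = re.sub(r"[^a-z0-9]+", "-", concept.lower()).strip("-")
    if not slug:
        raise ValueError("concept name produces an empty slug")
    return f"wiki/queries/conflict-{slug}-{day.isoformat()}.md"
```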
### Knowledge decay
Pages have a `last_updated` field in frontmatter. During lint passes:
| Maturity | Threshold | Action |
| -------- | --------- | -------------------------------------- |
| `stable` | 180 days | Flag as stale — add `⚠️ STALE` callout |
| `draft` | 90 days | Flag as stale — add `⚠️ STALE` callout |
The agent proposes re-validation but does not change `maturity` without new source evidence.
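The decay rule reduces to a date comparison. A minimal sketch, assuming `last_updated` is an ISO `YYYY-MM-DD` string as in the frontmatter schema and that the comparison is strict (older than the threshold); `is_stale` is a hypothetical name, not part of `lib/lint.sh`:

```python
from datetime import date

# Thresholds from the table above, in days.
STALE_AFTER_DAYS = {"stable": 180, "draft": 90}


def is_stale(maturity: str, last_updated: str, today: date) -> bool:
    """True when a page is older than the threshold for its maturity."""
    limit = STALE_AFTER_DAYS.get(maturity)
    if limit is None:
        # The table defines no threshold for other values (e.g. 'deprecated').
        return False
    age_days = (today - date.fromisoformat(last_updated)).days
    return age_days > limit
```

Passing `today` explicitly keeps the check deterministic and easy to test.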
### Cross-genome references
> **Status: planned.** The cross-genome collector and **navigation skill** described in this
> section are specified but **not yet implemented** in this release — only the `ingest` skill
> ships today. What follows documents the intended design and the boundary contract it will honour.
Cross-domain knowledge moves by **pull, never push**: the genome you are working in draws
material _in_; nothing is ever written into another genome. There are **no cross-genome
wikilinks** — submodule pointers make relative paths brittle.
When the working genome needs a concept that lives elsewhere, the **navigation skill** handles
it in the same two-phase shape as ingest:
1. A deterministic collector clones the relevant genomes **read-only at HEAD** (fresh — never the
pinned submodule state) and assembles a dossier of excerpts with provenance.
2. A semantic pass reads only that dossier; the skill then deposits **one** abstract, non-private
raw into the working genome at `raw/articles/crossgen-<topic>-<date>.md`.
3. That raw goes through the working genome's normal ingest → PR → human gate, like any source.
Which genomes may be read as **sources** is gated by a per-genome `cross_source: yes|no` flag: a
confidential genome (e.g. a client file) is marked `no` and is never read as a source — the wall
is structural, not a matter of the agent's discipline. The master `AGENTS.md` holds the full
boundary contract.
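As a sketch of the intended gate (the feature is not implemented yet), the source filter is a default-deny lookup. Where the `cross_source` flag is stored is not specified here, so the example takes it as a plain mapping. The `readable_sources` name and the exclusion of the working genome itself are assumptions.

```python
def readable_sources(flags: dict[str, str], working: str) -> list[str]:
    """Return the genomes the collector may read as sources.

    `flags` maps a genome name to its cross_source value ('yes' or 'no').
    Anything other than an explicit 'yes' is treated as 'no'.
    """
    return sorted(
        name
        for name, flag in flags.items()
        if flag == "yes" and name != working
    )
```

Treating a missing or malformed flag as `no` means a misconfigured confidential genome fails closed.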
---
## Knowledge Schema
### Frontmatter
Every wiki page must start with valid YAML frontmatter:
```yaml
---
title: "Strict String Title"
type: source | entity | concept | query | conflict | private | index | log
domain: genome-name
tags: [lowercase, hyphen-separated]
maturity: draft | stable | deprecated
last_updated: YYYY-MM-DD
private: true | false
---
```
| Field | Rules |
| ---------------------- | ------------------------------------------------------------------------ |
| `type` | Must be one of: `source entity concept query conflict private index log` |
| `maturity: draft` | Single source or unvalidated |
| `maturity: stable` | Confirmed by 2+ independent sources |
| `maturity: deprecated` | Superseded — add `> **DEPRECATED:** <reason>` callout at top |
| `private: true` | Required on all pages in `wiki/private/` and `raw/private/` |
Do not use semantic versioning for content. Git history tracks every change.
`maturity` captures epistemic state; `last_updated` tracks recency.
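The frontmatter rules above translate into a handful of checks. The sketch below is a deliberately naive Python illustration that parses flat `key: value` lines with the standard library only; a real linter would use a proper YAML parser. It assumes all seven fields shown are mandatory, and `frontmatter_errors` is a hypothetical name.

```python
MANDATORY = ("title", "type", "domain", "tags", "maturity", "last_updated", "private")
TYPES = {"source", "entity", "concept", "query", "conflict", "private", "index", "log"}
MATURITIES = {"draft", "stable", "deprecated"}


def frontmatter_errors(page: str, genome: str) -> list[str]:
    """Return a list of problems found in a page's frontmatter (empty if none)."""
    lines = [line.rstrip() for line in page.splitlines()]
    if not lines or lines[0] != "---":
        return ["missing frontmatter"]
    try:
        end = lines.index("---", 1)  # closing delimiter
    except ValueError:
        return ["unterminated frontmatter"]

    fields = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')

    errors = [f"missing field: {name}" for name in MANDATORY if name not in fields]
    if "type" in fields and fields["type"] not in TYPES:
        errors.append(f"invalid type: {fields['type']}")
    if "maturity" in fields and fields["maturity"] not in MATURITIES:
        errors.append(f"invalid maturity: {fields['maturity']}")
    if "domain" in fields and fields["domain"] != genome:
        errors.append(f"domain mismatch: {fields['domain']} != {genome}")
    return errors
```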
### Page types and directories
| Type | Directory | Description |
| ---------- | ---------------------------- | -------------------------------------------- |
| `source` | `wiki/sources/` | One page per processed raw source |
| `entity` | `wiki/entities/` | People, tools, organisations, projects |
| `concept` | `wiki/concepts/` | Patterns, theories, architectural decisions |
| `query` | `wiki/queries/` | Preserved answers and analyses |
| `conflict` | `wiki/queries/conflict-*.md` | Unresolved contradictions |
| `private` | `wiki/private/` | Private synthesis (PRIVATE_CONTEXT: enabled) |
| `index` | `wiki/index.md` | Primary navigation catalog (singleton) |
| `log` | `wiki/log.md` | Operations ledger (singleton) |
### Page size limits
| Limit | Lines | Action |
| -------- | ----- | ----------------------------------- |
| Soft cap | 400 | Bash linter warns |
| Hard cap | 800 | Bash linter errors — split the page |
These limits ensure pages fit within the LLM context window without attention degradation
and keep the wiki atomically navigable.
### Linking conventions
- **Intra-genome:** `[[folder/file]]` — Obsidian wikilinks only.
- **Cross-genome:** NOT supported via wikilink — submodule pointers make relative paths brittle. When the working genome needs a concept that lives elsewhere, the navigation skill **pulls it in** as one abstract raw under _this_ genome's `raw/articles/`, which then goes through normal ingest. See [Cross-genome references](#cross-genome-references).
- **External:** `[text](https://...)` — standard Markdown.
### Log format
Every operation appends one entry to `wiki/log.md`:
```markdown
## [YYYY-MM-DD] TYPE | Subject
- run_id: `<uuid>`
- model: `<model-name>`
- context_read: `[[path/A]]`, `[[path/B]]`
- output_written: `[[path/C]]`
- reasoning: One sentence — what changed and why.
```
Valid TYPEs: `INGEST` `LINT` `QUERY` `CONFLICT` `CONFIG` `SECURITY`
Parse examples:
```bash
grep "^## \[" wiki/log.md | tail -5 # Last 5 entries
grep "^## \[" wiki/log.md | grep "CONFLICT" # All conflicts
grep "^## \[2026-05" wiki/log.md # Entries from a specific month
```
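When the entries need to be consumed by a script rather than read in a terminal, the same header line can be parsed into fields. A minimal Python sketch, assuming headers follow the `## [YYYY-MM-DD] TYPE | Subject` format exactly; `parse_headers` and `LogEntry` are hypothetical names:

```python
import re
from typing import NamedTuple

HEADER = re.compile(r"^## \[(\d{4}-\d{2}-\d{2})\] ([A-Z]+) \| (.+)$")
VALID_TYPES = {"INGEST", "LINT", "QUERY", "CONFLICT", "CONFIG", "SECURITY"}


class LogEntry(NamedTuple):
    date: str
    type: str
    subject: str


def parse_headers(log_text: str) -> list[LogEntry]:
    """Extract (date, type, subject) from every well-formed entry header."""
    entries = []
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match and match.group(2) in VALID_TYPES:
            entries.append(LogEntry(*match.groups()))
    return entries
```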
`ingest-semantic.py` receives the source text plus the existing entity and concept names
(from the index) as prompt context. The LLM never loads the full log.
---
## Collaboration Model
| Role | Key access | Permitted operations |
| -------------- | ----------------- | ----------------------------------------------------------------------------- |
| Owner | Full — key holder | Read/write everywhere |
| Collaborator | None | Push to `raw/articles/`, `raw/transcripts/`, `raw/code-packs/`, `raw/assets/` |
| Local AI agent | Conditional | `private/` only when `PRIVATE_CONTEXT: enabled` |
| Cloud AI model | Never | `PRIVATE_CONTEXT` must be `disabled`; private data stays on local network |
To grant collaborator access, add the user as a Forgejo contributor with the Write role.
Never share the git-crypt key — collaborators operate exclusively in public directories.
---
## Optional Extensions
### qmd — local Markdown search
[qmd](https://github.com/tobi/qmd) is a local, on-device BM25 + vector search
engine for Markdown files. It has both a CLI (for shell scripts and agent tool calls)
and an MCP server (for native LLM tool use).
Recommended at scale: once a genome exceeds ~150 pages, `qmd search` is significantly
faster and more accurate than navigating `wiki/index.md` manually.
```bash
# Index a genome
qmd index genome-dev/wiki/
# Search
qmd search "graph-based state management"
# Start MCP server (for Claude Code / Codex integration)
qmd serve --port 3333
```
### Obsidian integration
Obsidian is the recommended wiki browser. Open any genome directory as an Obsidian vault.
Recommended setup:
- **Graph view** — visualise page connections; spot orphans and hubs instantly
- **Obsidian Web Clipper** — browser extension to clip articles directly to `raw/articles/`
as Markdown
- **Download attachments** — Settings → Hotkeys → "Download attachments for current file".
Bind it to a hotkey (e.g. Ctrl+Shift+D). After clipping, it downloads all images to `raw/assets/`
- **Dataview plugin** — query YAML frontmatter across the wiki;
`TABLE maturity, last_updated WHERE domain = "genome-dev"` generates dynamic tables
- **Marp plugin** — render Markdown as slide decks directly from wiki content
Note: `.obsidian/` is in `.gitignore`. Workspace and plugin settings are local — not synced.
### n8n automation
Call chain: n8n → SSH → `ingest-semantic.py <genome> <raw>` → `run-ingest.sh <genome>`.
n8n (running on the storage node) can automate the ingest pipeline:
1. Forgejo webhook fires on push to a genome's `raw/` directory
2. n8n flow identifies new files
3. For each new file: starts one agent session (sequential — never parallel)
4. Each session receives: `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
5. Phase 1 — agent runs `/skill:ingest` (semantic → writes manifest); Phase 2 —
`run-ingest.sh` does index/log/lint and opens the PR, returning one JSON line to n8n
6. Human reviews the PR
Key constraint: one source per session, sessions sequential.
Never batch multiple sources into one agent session.
### Intel NPU offloading
If the AI compute node has an Intel NPU (e.g. Core Ultra series):
- Background/auxiliary tasks (OCR of `raw/assets/`, async summarisation, or qmd
re-indexing **if** the optional qmd extension is in use) → Intel NPU via OpenVINO
- Active reasoning sessions (ingest, query, synthesis) → GPU
Note: the core system has no embedding pipeline (see [Core Philosophy](#core-philosophy)),
so there is nothing to embed here — the NPU is only for auxiliary work. This keeps the
GPU's KV cache free for interactive sessions and lowers power draw for background jobs.
---
## Troubleshooting
### `git-crypt: command not found`
```bash
# Ubuntu/Debian
sudo apt install git-crypt
# macOS
brew install git-crypt
```
### `make setup` fails with "MISSING: jq"
```bash
make doctor # identifies all missing tools
sudo apt install git git-crypt curl jq
```
### Pre-commit hook blocks a commit with "PLAINTEXT LEAK DETECTED"
The staged file is in a path matching `**/private/**` but is not encrypted.
Fix options:
1. Verify `.gitattributes` contains `**/private/** filter=git-crypt diff=git-crypt -text`
2. Run `git-crypt init` if git-crypt is not initialised in this repo
3. Run `git-crypt status` to check the encryption state of all files
Never use `git commit --no-verify` to bypass this check.
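git-crypt ciphertext begins with the 10-byte header `\0GITCRYPT\0`, so a hook can tell an encrypted blob from plaintext by inspecting the first bytes of the staged content. How the shipped pre-commit hook actually detects leaks is not shown here; the sketch below is one possible check, and `is_plaintext_leak` is a hypothetical name.

```python
from pathlib import PurePosixPath

# Header that git-crypt prepends to every encrypted blob.
GITCRYPT_MAGIC = b"\x00GITCRYPT\x00"


def is_plaintext_leak(path: str, staged_blob: bytes) -> bool:
    """True if a blob under a private/ directory is not git-crypt ciphertext.

    `staged_blob` is the content as stored in the index, i.e. after the
    git-crypt clean filter has (or has not) run.
    """
    directories = PurePosixPath(path).parts[:-1]  # drop the file name
    in_private = "private" in directories
    return in_private and not staged_blob.startswith(GITCRYPT_MAGIC)
```

In a hook, the staged content would typically come from `git show :<path>`, which prints the blob as recorded in the index.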
### `git-crypt status` shows files as "not encrypted" after init
The `.gitattributes` rule must be committed before files in `private/` are staged.
If files were staged before `.gitattributes` was committed:
```bash
git rm -r --cached raw/private/ wiki/private/
git add raw/private/ wiki/private/
git commit -m "fix: re-stage private files for encryption"
```
### Agent returns stale or missing cross-references
Likely causes:
1. Session was too long and the model's recall over the long context degraded. Use one source per session.
2. `wiki/index.md` was not read at session start — agent lacked the page catalog.
3. qmd index is stale — re-index: `qmd index <genome>/wiki/`
### Submodules show as "modified" after `make sync`
This is normal if genome repos have new commits. Update master's pointers:
```bash
cd master-knowledge-genome
git add .
git commit -m "chore: update submodule pointers"
git push
```
### bw unlock fails
Verify you are using `bw` (standard Bitwarden CLI), not `bws` (Secrets Manager CLI).
`bws` does not work with self-hosted Vaultwarden.
```bash
bw --version # should print e.g. "2024.x.x"
bw config server https://vault.yourserver.com
bw login
```