knowledge-genome-orchestrator/README.md

# Knowledge Genome System

> A distributed, encrypted, multi-domain personal knowledge base.
> No vector database. No embedding pipeline. No external retrieval server.

Built on the [LLM Wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
by Andrej Karpathy — extended with a multi-domain submodule architecture,
AES-256-CTR encryption via git-crypt, Vaultwarden runtime key injection,
and a human-in-the-loop Git Flow for quality control.

---

## Table of Contents

1. [Core Philosophy](#core-philosophy)
2. [Architecture](#architecture)
3. [System Requirements](#system-requirements)
4. [Prerequisites](#prerequisites)
5. [Configuration](#configuration)
6. [Quick Start](#quick-start)
7. [Makefile Reference](#makefile-reference)
8. [Testing](#testing)
9. [Genome Lifecycle](#genome-lifecycle)
10. [Security Model](#security-model)
11. [Key Management](#key-management)
12. [Agent Sessions](#agent-sessions)
13. [Workflows](#workflows)
14. [Knowledge Quality](#knowledge-quality)
15. [Knowledge Schema](#knowledge-schema)
16. [Collaboration Model](#collaboration-model)
17. [Optional Extensions](#optional-extensions)
18. [Troubleshooting](#troubleshooting)

---

## Core Philosophy

Most RAG systems make the LLM rediscover knowledge from scratch on every query.
A document is indexed; at query time, relevant chunks are retrieved; an answer is generated.
Nothing accumulates. Ask a question requiring synthesis across five documents and the LLM
pieces it together from fragments every single time.

This system is different. Instead of retrieval at query time, the LLM
**incrementally builds and maintains a persistent wiki** that sits between you and the raw
sources. When a new source arrives, the LLM reads it, extracts key information, updates
entity and concept pages, flags contradictions with existing claims, and strengthens the
evolving synthesis. Knowledge is compiled once and kept current.

**The wiki is a compounding artifact.** Cross-references are already there.
Contradictions have been flagged. The synthesis already reflects everything ingested.

This means:

- No vector database.
- No embedding pipeline.
- No external retrieval infrastructure.

The `wiki/index.md` of each genome is the retrieval layer. At moderate scale
(~100 sources, hundreds of pages) this is expected to serve synthesis-heavy questions better
than RAG, because cross-references, contradictions, and syntheses are already resolved
rather than re-derived per query.

The human's job: curate sources, direct analysis, ask good questions, review PRs.
The LLM's job: everything else — summarising, cross-referencing, filing, maintaining consistency.

---

## Architecture

### Repository structure

```text
master-knowledge-genome/              ← Root orchestrator (submodule registry)
├── core-karpathy/                    ← LLM Wiki reference pattern (read-only submodule)
├── genome-dev/                       ← Submodule: web development, Angular, TUI
├── genome-finance/                   ← Submodule: personal finance, investments
├── genome-homelab/                   ← Submodule: Keru infrastructure, network configs
└── AGENTS.md                         ← Global coordination schema (cross-genome rules)
```

Each genome is an independent git repository:

```text
genome-{name}/
├── .gitattributes                    ← Encryption rules — **/private/** wildcard
├── .gitignore
├── .git/hooks/pre-commit             ← Security hook (dynamic git check-attr)
├── AGENTS.md                         ← Per-genome agent contract and workflow rules
│
├── raw/                              ← Immutable sources — LLM reads, never writes
│   ├── articles/                     ← Web clips, saved articles
│   ├── transcripts/                  ← Audio/video transcripts
│   ├── code-packs/                   ← Code snippets and repositories
│   ├── assets/                       ← Images, PDFs, binary files
│   └── private/                      ← AES-256-CTR encrypted — owner only
│
└── wiki/                             ← LLM-owned — agent creates and maintains
    ├── index.md                      ← Primary catalog (read first every session)
    ├── log.md                        ← Append-only operations ledger
    ├── sources/                      ← One page per processed raw source
    ├── entities/                     ← People, tools, organisations, projects
    ├── concepts/                     ← Patterns, theories, architectural decisions
    ├── queries/                      ← Preserved answers and conflict notes
    └── private/                      ← AES-256-CTR encrypted — owner only
```

### Three layers

| Layer       | Path        | Owner       | Rule                                                  |
| ----------- | ----------- | ----------- | ----------------------------------------------------- |
| Raw sources | `raw/`      | Human       | Immutable. LLM reads only. Never modified.            |
| Wiki        | `wiki/`     | LLM         | Agent creates, updates, cross-links, maintains.       |
| Schema      | `AGENTS.md` | Human + LLM | Co-evolved contract defining structure and workflows. |

### Linked projects (optional)

A genome can optionally declare a **linked project repository** — a separate repo where
the knowledge in that genome is meant to be applied (e.g. `genome-dev` linked to an app
repo). The link is recorded as a third field in the registry and rendered into the
genome's `AGENTS.md` (`## Linked Project`). A genome with no link is _knowledge-only_ and
behaves exactly as before. See [Configuration](#configuration).

### Framework structure

```text
knowledge-genome-orchestrator/        ← This repository (setup tooling)
├── globals.env                       ← Static KEY=VALUE config (Make-includable)
├── registry.sh                       ← Bash-only: GENOMES array + dynamic paths
├── Makefile                          ← Entry point for all operations
├── lib/
│   ├── output.sh                     ← Terminal helpers (colors, log levels)
│   ├── deps.sh                       ← Dependency validation
│   ├── scaffold.sh                   ← Template rendering engine
│   ├── structure.sh                  ← Canonical genome layout (single source of truth)
│   ├── lint.sh                       ← Per-file validation functions
│   └── git-crypt.sh                  ← git-crypt lifecycle (init, export, verify, rotate)
├── providers/
│   ├── forgejo.sh                    ← Forgejo REST API provider
│   └── github.sh                     ← GitHub REST API provider
├── scripts/
│   ├── setup.sh                      ← Main entry point
│   ├── setup-master.sh               ← Master repo initialisation
│   ├── setup-genomes.sh              ← Genome provisioning loop
│   ├── add-genome.sh                 ← Add a single new genome
│   ├── lint-genomes.sh               ← Quality control across all genomes
│   └── verify-genomes.sh             ← Structure verify / --sync across all genomes
├── templates/
│   ├── agents-genome.md              ← Per-genome agent contract template
│   ├── agents-master.md              ← Master coordination schema template
│   ├── readme-master.md              ← Master repo README template
│   ├── wiki-index.md                 ← Index template (rendered per genome)
│   ├── wiki-log.md                   ← Log template (rendered per genome)
│   ├── pr-description.md             ← PR review checklist template
│   ├── pre-commit.sh                 ← Security hook template
│   ├── gitattributes                 ← Git encryption rules template
│   └── gitignore                     ← Git ignore template
├── skills/
│   └── ingest/                       ← pi skill: deployed to the AI node (vm101)
│       ├── SKILL.md                  ← Semantic-only contract (read/edit, emits manifest)
│       ├── references/               ← On-demand reference docs for the agent
│       └── scripts/                  ← Deterministic post-processor (runs outside the agent)
│           ├── run-ingest.sh         ← Orchestrator: consumes the manifest, emits one JSON line
│           ├── slug.sh               ← Slug normalisation
│           ├── index-append.py       ← Sorted insert into wiki/index.md + last_updated bump
│           ├── log-append.sh         ← Append a wiki/log.md entry
│           ├── scoped-lint.sh        ← Lint only the pages touched this run (reuses lib/lint.sh)
│           └── open-pr.sh            ← Branch / commit / push / open PR (DRY_RUN seam for tests)
└── tests/                            ← bats suite — deterministic, no LLM/GPU (see Testing)
    ├── helpers.bash
    ├── scripts.bats
    ├── lint.bats
    ├── structure.bats
    └── run-ingest.bats
```

> The `skills/ingest/` directory is version-controlled here but **deployed** to the AI
> node (vm101) under `~/.pi/agent/skills/ingest`. The agent (`pi`) does only semantic work
> and writes a manifest; `run-ingest.sh` does the mechanical steps. See [Workflows → Ingest](#ingest).

---

## System Requirements

### Linux — full support (primary target)

All scripts are written for GNU/bash on Linux. Tested on Ubuntu 22.04+.
All tools (git-crypt, bw, qmd) have native Linux binaries.

### macOS — full support

All scripts are compatible with macOS. Requirements:

- bash 3.2+ (macOS default) — supported for the **setup scripts** (`make` targets, scaffolding).
  Two things need bash 4+: the `ingest` skill (`mapfile`), which runs on the Linux AI node (not a
  constraint on the macOS setup machine); and `gcrypt_rotate_key` (`compgen -G`), which **does**
  run on the laptop. For key rotation on macOS, use Homebrew bash (`brew install bash`).
- GNU coreutils not required — BSD variants of `date`, `grep`, `sed` all handled.
- `git-crypt`: install via Homebrew — `brew install git-crypt`
- `jq`, `curl`: pre-installed or via Homebrew

If you use Homebrew bash (`brew install bash`), the scripts work identically to Linux.

### Windows — WSL2 only

**Git Bash and native Windows are not supported.**

Reasons:

- `git-crypt` has no native Windows binary.
- Process substitution `<(...)` used for runtime key injection is not available
  in Git Bash or PowerShell.
- Several features used throughout (`compgen`, `BASH_SOURCE`, arrays) are bash-specific
  and do not exist in PowerShell or `cmd.exe`.

**WSL2 (Windows Subsystem for Linux)** with Ubuntu gives full compatibility.
All setup and runtime operations work identically to native Linux inside WSL2.

### Hardware recommendations

The system is designed for a homelab architecture:

| Component       | Recommended               | Role                                                            |
| --------------- | ------------------------- | --------------------------------------------------------------- |
| Storage node    | Any Linux server with NFS | Hosts Forgejo, stores genome repos                              |
| AI compute node | GPU server (16GB+ VRAM)   | Runs local LLM agent sessions                                   |
| VRAM            | 16GB minimum              | 14B model at Q5_K_M ≈ 10GB weights; ~6GB for KV cache           |
| Local LLM       | 14B–32B quantised         | Active wiki maintenance sessions                                |
| Large LLM       | 70B (async)               | Deep reflection, complex synthesis (scheduled, not interactive) |

> **On VRAM constraints:** with a 16GB card and a 14B model, the KV cache budget
> is ~6GB — approximately 32k tokens of effective context. Every token in `AGENTS.md`,
> the index, and the log tail is a cost. This is why all agent files are token-optimised
> and sessions are kept to one source at a time.

> **Reference deployment:** the table above is a target profile, not a hard requirement.
> The current setup runs a single 16GB GPU (RTX 5060 Ti) with a ~9B model for interactive
> ingest, and offloads heavy/async synthesis to a cloud model. Smaller models work — they
> just make the "one source per session" discipline and the token budget matter more.

---

## Prerequisites

### Required

| Tool        | Purpose                          |
| ----------- | -------------------------------- |
| `git`       | Version control                  |
| `git-crypt` | Transparent file encryption      |
| `curl`      | REST API calls to Forgejo/GitHub |
| `jq`        | JSON parsing                     |

### Optional

| Tool  | Purpose                                                                 |
| ----- | ----------------------------------------------------------------------- |
| `bw`  | Bitwarden CLI — runtime key injection from Vaultwarden (no key on disk) |
| `qmd` | Local BM25 + vector search for Markdown files with MCP server interface |

> **`bw` vs `bws`:** Use `bw` (standard Bitwarden CLI). `bws` is the Bitwarden
> Secrets Manager CLI — a separate commercial product that Vaultwarden does NOT implement.

### Install on Ubuntu/Debian

```bash
sudo apt update && sudo apt install -y git git-crypt curl jq
```

### Install on macOS

```bash
brew install git git-crypt curl jq
```

### Install Bitwarden CLI

```bash
# Linux
npm install -g @bitwarden/cli

# macOS
brew install bitwarden-cli
```

### Verify all tools

```bash
make doctor
```

---

## Configuration

Configuration is split into two files with distinct purposes:

### `globals.env` — static KEY=VALUE

Safe for Make's `include` directive, `docker-compose`, shell `source`, and any standard env parser.
Contains only simple scalar values: no bash syntax, no arrays. Comments sit on their own
lines, because Make keeps the whitespace before a trailing `#` as part of the value and some
env parsers do not strip inline comments at all.

```bash
# Provider selection: forgejo | github
PROVIDER=forgejo

# Forgejo (active when PROVIDER=forgejo)
FORGEJO_URL=https://git.yourserver.com
FORGEJO_USER=yourusername
# SSH port: 222 is common in homelab Forgejo setups; use 22 for a standard install
FORGEJO_SSH_PORT=222

# GitHub (active when PROVIDER=github; uncomment to use)
# GITHUB_USER=your-username
# Optional: for org repos, overrides GITHUB_USER
# GITHUB_ORG=your-org

# Vaultwarden
VAULTWARDEN_URL=https://vault.yourserver.com

# Master repository
MASTER_REPO=master-knowledge-genome
GIST_URL=https://gist.github.com/442a6bf555914893e9891c11519de94f.git
```

### `registry.sh` — bash runtime config

Sourced by shell scripts only. Contains the genome registry array and dynamic path
resolution. Never included by Make.

```bash
# Dynamic paths (resolved at source time)
WORK_DIR="${HOME}/knowledge-genome-orchestrator"
KEYS_DIR="${WORK_DIR}/keys"

# Genome registry. Format: "name|description|linked_repo|cross_source"
# The third and fourth fields are OPTIONAL:
#   - linked_repo empty       → knowledge-only genome (no linked project)
#   - linked_repo owner/repo  → genome is linked to that project repository (rendered into AGENTS.md)
#   - cross_source yes|no     → default no: whether the cross-genome collector may read this genome as a source
GENOMES=(
  "genome-dev|Web development, TUI, Angular, software architecture|myorg/my-app|no"
  "genome-finance|Personal finance, investments, market analysis||no"
  "genome-homelab|Infrastructure, network configs, architecture logs||no"
)
```

To add a genome to the registry before running setup, append a line to `GENOMES`.
After initial setup, use `make add-genome` instead.
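
As a minimal sketch of how a script might split one registry entry into its fields: the helper name `parse_genome_entry` and the fallback values are assumptions for illustration, not the orchestrator's actual code.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split a "name|description|linked_repo|cross_source"
# entry on '|'. Empty or missing optional fields fall back to defaults.
parse_genome_entry() {
  local entry="$1" name desc linked cross
  IFS='|' read -r name desc linked cross <<< "$entry"
  printf 'name=%s\n' "$name"
  printf 'desc=%s\n' "$desc"
  printf 'linked=%s\n' "${linked:-none}"      # empty means knowledge-only genome
  printf 'cross_source=%s\n' "${cross:-no}"   # assumed default: not a cross-genome source
}

parse_genome_entry "genome-finance|Personal finance, investments||no"
```

Because `|` is a non-whitespace separator, `read` keeps the empty third field instead of collapsing it, which is what lets a knowledge-only genome still carry a `cross_source` value.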

### Tokens

Tokens are never stored in config files. Export them in your shell before running setup:

```bash
export FORGEJO_TOKEN="your_forgejo_token"
# or
export GITHUB_TOKEN="your_github_token"
```

---

## Quick Start

```bash
# 1. Clone the setup framework
git clone <setup-repo-url> knowledge-genome-orchestrator
cd knowledge-genome-orchestrator

# 2. Configure your environment
cp globals.env.example globals.env   # edit with your values
# Edit registry.sh to define your genomes

# 3. Export your provider token
export FORGEJO_TOKEN="your_token_here"

# 4. Verify dependencies
make doctor

# 5. Run full setup
make setup
```

`make setup` executes in order:

1. **Dependency check** — verifies all required tools are installed
2. **Git identity check** — warns if `user.name` / `user.email` are not configured
3. **Master repo** — creates `master-knowledge-genome` on Forgejo, scaffolds with
   `AGENTS.md` and `README.md`, initialises git, adds `core-karpathy` as submodule, pushes
4. **Genome provisioning** — for each genome in `registry.sh`:
   - Creates remote repository on Forgejo
   - Adds it as a submodule in the master repo
   - Initialises git-crypt (**before any files are created**)
   - Scaffolds directory structure and renders all templates
   - Installs pre-commit security hook
   - Commits, pushes genome to remote
   - Exports symmetric key to `keys/<genome>.key`
   - Prints Vaultwarden upload instructions
   - Commits submodule pointer in master repo

After setup completes:

- Upload all files in `keys/` to Vaultwarden (see [Key Management](#key-management))
- Delete key files from disk: `rm keys/*.key`

---

## Makefile Reference

| Target                                                | Description                                                                           |
| ----------------------------------------------------- | ------------------------------------------------------------------------------------- |
| `make setup`                                          | Full system initialisation — master repo + all genomes in `registry.sh`               |
| `make add-genome NAME=x DESC="y" [LINKED=owner/repo]` | Scaffold and register a single new genome (optional linked project)                   |
| `make lint`                                           | Run quality checks across all genomes (schema, privacy, decay, page size)             |
| `make verify-structure`                               | Report directory drift of each genome vs the canonical layout (`lib/structure.sh`)    |
| `make sync-structure`                                 | Create any missing canonical directories across all genomes (safe, idempotent)        |
| `make test`                                           | Run the bats test suite (deterministic; no LLM/GPU/network) — see [Testing](#testing) |
| `make status`                                         | Show submodule status and per-genome git-crypt encryption state                       |
| `make lock`                                           | Lock all encrypted repos (master + all genome submodules)                             |
| `make doctor`                                         | Verify required tools: git, git-crypt, curl, jq; warn if bw missing                   |
| `make sync`                                           | `git submodule update --init --recursive` + report unpushed commits per genome        |
| `make help`                                           | Print all available targets                                                           |

### Examples

```bash
# Check system health
make doctor

# Add a new genome after initial setup
make add-genome NAME=genome-research DESC="Academic papers and deep research"

# Add a genome linked to a project repository
make add-genome NAME=genome-dev DESC="Web development" LINKED=myorg/my-app

# Check every genome against the canonical directory layout
make verify-structure

# Run full lint pass (bash deterministic checks)
make lint

# Sync all nodes after pulling on another machine
make sync

# Emergency lock — secures all repos before leaving a session
make lock
```

---

## Testing

The mechanical layer (slug, index, log, lint, structure, the ingest orchestrator) is
covered by a [bats](https://github.com/bats-core/bats-core) suite. The tests are
**deterministic and have zero dependency on the LLM, the GPU, or the network** — they
simulate the agent's output with fixtures and exercise the scripts directly, so they run
anywhere git + bash live (laptop, CI, a git hook). They are **not** meant to run on the AI
node or via n8n.

```bash
sudo apt install bats        # once
make test                    # or: bats tests/
```

| File              | Covers                                                                         |
| ----------------- | ------------------------------------------------------------------------------ |
| `scripts.bats`    | `slug.sh`, `log-append.sh`, `index-append.py` (insert, sort, bump, idempotent) |
| `lint.bats`       | `lib/lint.sh` validators + `scoped-lint.sh`                                    |
| `structure.bats`  | `lib/structure.sh` report / sync                                               |
| `run-ingest.bats` | `run-ingest.sh` end-to-end (DRY_RUN, local bare remote) — needs `jq`           |

Each test builds its own throwaway genome with a local bare remote, configured to ignore
the operator's global git settings (signing, global hooks) so the suite is hermetic. The
`run-ingest` tests auto-`skip` if `jq` is absent. If you change the canonical layout in
`lib/structure.sh`, update `FIXTURE_DIRS` in `tests/helpers.bash` to match.

> Why this matters: the only non-deterministic part of the system is the model. Pinning
> the mechanical layer with tests means that when an ingest misbehaves, you know it's the
> model or the prompt — not the plumbing.

---

## Genome Lifecycle

### Initial setup

All genomes defined in `registry.sh` are provisioned by `make setup`.

### Adding a genome after initial setup

```bash
make add-genome NAME=genome-newname DESC="Domain description"
```

This: creates the remote repo, adds it as a submodule, initialises git-crypt,
scaffolds the directory structure, installs the pre-commit hook, commits and pushes,
exports the key, and commits the submodule pointer in master.

After adding: upload the new key to Vaultwarden and delete the key file.

### Removing a genome

Manual process:

```bash
# In master repo
git submodule deinit genome-name
git rm genome-name
git commit -m "chore: remove genome-name submodule"
git push
# Archive or delete the remote repository on Forgejo
```

### Template rendering

When a genome is scaffolded, `render_template` replaces these placeholders in all
template files:

| Placeholder             | Source      | Example                        |
| ----------------------- | ----------- | ------------------------------ |
| `{{GENOME_NAME}}`       | registry.sh | `genome-dev`                   |
| `{{GENOME_NAME_UPPER}}` | derived     | `GENOME-DEV`                   |
| `{{GENOME_DESC}}`       | registry.sh | `Web development...`           |
| `{{LINKED_PROJECT}}`    | registry.sh | `myorg/my-app` (or `none`)     |
| `{{FORGEJO_URL}}`       | globals.env | `https://git.yourserver.com`   |
| `{{FORGEJO_USER}}`      | globals.env | `yourusername`                 |
| `{{VAULTWARDEN_URL}}`   | globals.env | `https://vault.yourserver.com` |
| `{{MASTER_REPO}}`       | globals.env | `master-knowledge-genome`      |
| `{{DATE}}`              | runtime     | `2026-05-11`                   |
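
The substitution itself can be sketched in plain bash. This is an illustration only: `render_sketch` is a hypothetical name, it handles three of the placeholders above, and the real `render_template` may be implemented differently.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of placeholder substitution (not the real render_template).
shopt -u patsub_replacement 2>/dev/null || true   # bash 5.2+: keep '&' literal in values

render_sketch() {
  local text="$1" name="$2" desc="$3"
  local upper pair key val
  # Derive the upper-case name with tr so this also works on bash 3.2
  upper=$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]')
  for pair in "GENOME_NAME=$name" "GENOME_NAME_UPPER=$upper" "GENOME_DESC=$desc"; do
    key="{{${pair%%=*}}}"        # e.g. {{GENOME_NAME}}
    val="${pair#*=}"
    text=${text//"$key"/$val}    # quoted pattern: matched literally, not as a glob
  done
  printf '%s\n' "$text"
}

render_sketch '# {{GENOME_NAME_UPPER}}: {{GENOME_DESC}} ({{GENOME_NAME}})' \
  genome-dev 'Web development'
```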

---

## Security Model

### Encryption architecture

Each genome uses a unique symmetric AES-256-CTR key managed by git-crypt.
Two directories in every genome are always encrypted:

| Directory       | Contents                    | On remote          |
| --------------- | --------------------------- | ------------------ |
| `raw/private/`  | Sensitive source material   | Opaque binary blob |
| `wiki/private/` | Private synthesis and notes | Opaque binary blob |

All other directories (`raw/articles/`, `wiki/sources/`, etc.) are plaintext.
Collaborators without the key can contribute to public directories normally —
git handles encrypted files transparently.

### `.gitattributes` — dynamic encryption rules

Encryption rules use a glob wildcard that catches any `private/` directory at
any depth in the repository — including directories created at runtime by the LLM:

```gitattributes
# Text rules first
*.md     text eol=lf
*.sh     text eol=lf

# Encryption rules LAST (later rules override per-attribute)
# **/private/** ensures -text overrides *.md text eol=lf, preventing EOL corruption
**/private/**   filter=git-crypt diff=git-crypt -text
```

> Rule ordering matters: in `.gitattributes`, the last matching rule wins per attribute.
> Encryption rules must come after text rules so `-text` overrides `text eol=lf`
> for encrypted markdown files.
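
One way to confirm the ordering behaves as described is to ask git directly. The sketch below builds a throwaway repository with the two rules above and queries `git check-attr`; the helper name `show_attrs` and the page paths are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Sketch: report which attributes git resolves for a path under the rules above.
# Runs in a temporary repository; the queried paths do not need to exist.
show_attrs() {
  local repo
  repo=$(mktemp -d) || return 1
  git init -q "$repo" || return 1
  printf '%s\n' \
    '*.md text eol=lf' \
    '**/private/** filter=git-crypt diff=git-crypt -text' \
    > "$repo/.gitattributes"
  git -C "$repo" check-attr text filter -- "$@"
  rm -rf "$repo"
}

show_attrs wiki/private/notes.md wiki/concepts/caching.md
```

The private page should report `text: unset` and `filter: git-crypt`, while the public page keeps `text: set` with no filter. Swapping the two rule lines is a quick way to see the private page fall back to `text: set`.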

### Pre-commit hook — dynamic validation

The security hook installed at `.git/hooks/pre-commit` validates every staged file
dynamically — it reads encryption requirements from `.gitattributes` at runtime
rather than checking hardcoded paths:

```bash
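# Excerpt: runs once per staged file, with "$file" set by the enclosing loop
filter=$(git check-attr filter -- "$file" | sed 's/.*: //')
if [[ "$filter" == "git-crypt" ]]; then
    # Verify the file is actually encrypted (case-insensitive match on the status text)
    if git-crypt status "$file" | grep -qi "not encrypted"; then
        echo "BLOCKED: $file matches an encryption rule but is not encrypted" >&2
        exit 1   # a non-zero exit from pre-commit aborts the commit
    fi
fi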
```

This means: any file matching `**/private/**` in `.gitattributes` is protected,
including future `private/` directories created anywhere in the repo.
The hook never needs updating when the encryption rules change.

### Untrusted agent output — manifest validation

The ingest agent's output is stochastic: a hallucinated manifest could carry a missing field,
a wrong type, or a malicious path such as `wiki/../../etc/passwd`. `run-ingest.sh` therefore
**validates the manifest before trusting any field** — it must be well-formed JSON with a
string `raw_source` and an array `pages`, and **every `path` must be a string under `wiki/`
with no `..`**. Anything else fails fast with a structured `{"status":"error"}` and no
filesystem access outside the wiki, so a bad path can't drive a read or a lint outside the
knowledge tree. This is the trust boundary between the (stochastic) model and the
(deterministic, tested) post-processor.
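
The path rule can be sketched as a small predicate. This covers only the path check, under the assumption that the JSON shape has already been verified; `is_safe_wiki_path` is a hypothetical name, not a function from `run-ingest.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the path rule only; the real post-processor also
# checks the JSON shape (string raw_source, array pages) before this step.
is_safe_wiki_path() {
  local path="$1"
  case "$path" in
    *..*)    return 1 ;;   # any ".." is rejected outright
    wiki/?*) return 0 ;;   # must live under wiki/
    *)       return 1 ;;   # everything else, including absolute paths
  esac
}

for candidate in "wiki/sources/example.md" "wiki/../../etc/passwd" "/etc/passwd"; do
  if is_safe_wiki_path "$candidate"; then
    echo "accept: $candidate"
  else
    echo "reject: $candidate"
  fi
done
```

Rejecting every `..` is deliberately stricter than resolving the path: it needs no filesystem access, so the check itself cannot be steered outside the knowledge tree.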

### PRIVATE_CONTEXT toggle

The `PRIVATE_CONTEXT` toggle in `AGENTS.md` controls whether the LLM agent
accesses encrypted directories. It must be declared explicitly by the operator
at the start of every session:

```text
PRIVATE_CONTEXT: disabled   ← Default. private/ directories are treated as non-existent.
PRIVATE_CONTEXT: enabled    ← Agent may read/write private/. Requires git-crypt unlock.
```

Rules:

- Never inferred. Never carried over from a previous session.
- `enabled` requires the operator to confirm that `git-crypt unlock` has run on the host.
- Per-genome, per-session: enabling for `genome-finance` does NOT enable for `genome-dev`.
- Cloud LLM models: `PRIVATE_CONTEXT` must always be `disabled`. Private data never leaves the local network.
- All outputs derived from private data are prefixed `[PRIVATE DATA INCLUDED]`.
- Private synthesis goes exclusively to `wiki/private/` — never to public wiki paths.
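
An orchestrator could enforce the first and fourth rules before a session starts. The following is a sketch under assumptions: the function name and the `local`/`cloud` labels are hypothetical and do not come from `AGENTS.md`.

```bash
#!/usr/bin/env bash
# Hypothetical guard applied before launching a session.
# $1: the operator's explicit declaration for this session (may be empty)
# $2: where the model runs: "local" or "cloud" (assumed labels)
effective_private_context() {
  local declared="$1" model_location="$2"
  if [ "$declared" = "enabled" ] && [ "$model_location" = "local" ]; then
    echo "enabled"
  else
    echo "disabled"   # the default, and always forced for cloud models
  fi
}

effective_private_context ""        local   # nothing declared
effective_private_context "enabled" cloud   # cloud model
effective_private_context "enabled" local   # explicit declaration, local model
```

Only the last call yields `enabled`: anything short of an explicit declaration on a local model resolves to `disabled`.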

### Runtime key injection — zero disk policy

Encryption keys are never stored as persistent files on the AI server.
They are injected at session start via the Bitwarden CLI (`bw`) against
your self-hosted Vaultwarden instance, using process substitution:

```bash
# Step 1: authenticate (run `bw login` once per machine after setting the server)
bw config server https://vault.yourserver.com
export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw)

# Step 2: unlock genome (key lives only in a kernel file descriptor — never touches disk)
git-crypt unlock <(
  bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d
)
```

The key flows: Vaultwarden → `bw get notes` → `base64 -d` → kernel pipe → `git-crypt`.
At no point is the key written to any file on disk.
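
The mechanism can be tried without any real secret. In this sketch `wc -c` stands in for `git-crypt`, the string is a placeholder rather than a key, and `read_key_length` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch with a stand-in consumer: wc -c plays the role of git-crypt and
# the string below is a placeholder, not a real key.
encoded=$(printf 'placeholder-key-bytes' | base64)

read_key_length() {
  # "$1" is the pipe-backed path bash hands over for the substituted
  # process (typically /dev/fd/N); no regular file is created for it.
  wc -c < "$1" | tr -d '[:space:]'
}

echo "path seen by the consumer: $(echo <(true))"
echo "decoded length: $(read_key_length <(printf '%s' "$encoded" | base64 -d))"
```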

Lock a genome when the session ends:

```bash
git-crypt lock
```

---

## Key Management

> This section is for the operator. These commands are never issued by the LLM agent.

### Vaultwarden Secure Notes

Each genome key is stored as a base64-encoded Secure Note in Vaultwarden:

| Genome           | Vaultwarden Note Name |
| ---------------- | --------------------- |
| `genome-dev`     | `genome-dev key`      |
| `genome-finance` | `genome-finance key`  |
| `genome-homelab` | `genome-homelab key`  |

After `make setup` or `make add-genome`, key files are exported to `keys/`.
Upload procedure:

```bash
# Encode the key
base64 < keys/genome-dev.key

# Paste the output into a Vaultwarden Secure Note named "genome-dev key"
# Then delete the key file
rm keys/genome-dev.key
```

### Cloning on a new machine

```bash
# Full clone with all submodules
git clone --recurse-submodules \
  https://git.yourserver.com/yourusername/master-knowledge-genome.git

# Unlock a specific genome (with key file — development only)
cd master-knowledge-genome/genome-dev
git-crypt unlock /path/to/genome-dev.key

# Unlock via Vaultwarden (recommended — no key on disk)
export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw)
git-crypt unlock <(bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d)

# Single-genome clone: for a collaborator who only needs one genome
git clone https://git.yourserver.com/yourusername/genome-dev.git
```

### Key rotation (emergency)

If a key is lost or compromised:

```bash
# From the knowledge-genome-orchestrator/ directory
source lib/git-crypt.sh
# If gcrypt_rotate_key operates on the CWD: cd into .../master-knowledge-genome/genome-dev
# If it navigates by name instead:          cd into .../master-knowledge-genome
cd ~/knowledge-genome-orchestrator/master-knowledge-genome
gcrypt_rotate_key "genome-dev"
```

> **macOS:** `gcrypt_rotate_key` uses `compgen -G` (bash 4+). The stock macOS bash 3.2 is not
> enough — run rotation under Homebrew bash (`brew install bash`).

`gcrypt_rotate_key` performs:

1. Unlocks repo with existing key
2. Removes old key material
3. Generates new symmetric key via `git-crypt init`
4. Re-stages and commits private files (encrypted with new key)
5. Exports new key to `keys/`
6. Prints Vaultwarden update instructions

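> **Limitation:** git history still contains blobs encrypted with the old key.
> Anyone with the old key and git history access can decrypt them. To purge old
> encrypted blobs from history:
>
> ```bash
> git filter-repo --invert-paths --path raw/private --path wiki/private
> # filter-repo removes the origin remote by default; re-add it before pushing
> git remote add origin <genome-remote-url>
> git push --force origin main
> ```
>
> This rewrites all commit hashes and also drops `raw/private` and `wiki/private` from the
> current tree, so back up the decrypted files first and re-commit them afterwards.
> Coordinate with any collaborators before force-pushing.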

After rotation:

- Upload new key to Vaultwarden (replace existing note)
- Delete both `keys/genome-dev.key` and `keys/genome-dev-rotated-*.key` from disk
- Revoke access from previous key holders

---

## Agent Sessions

### Prerequisites for every session

Before starting an LLM agent session on a genome:

1. The host (AI server) runs `git-crypt unlock` for the required genomes
2. The orchestrator prepares context: `tail -n 20 wiki/log.md`
3. Declare `PRIVATE_CONTEXT` state explicitly in the opening prompt

### Session start protocol

The agent executes in this order at the start of every session:

1. Read `wiki/index.md` — primary catalog of all pages and maturity
2. Read last 20 log entries (injected by orchestrator — does NOT open `wiki/log.md` directly)
3. For tasks involving related pages: if the optional `qmd` extension is installed,
   `qmd search "<query>"` before opening files; otherwise navigate from `wiki/index.md`
4. Operate on individual files — never scan entire directories

### One source per session

With a 14B model and ~6GB KV cache budget, long sessions degrade.
As the session extends, the context fills with pages already created,
attention dilutes, and later entities receive worse cross-references than earlier ones.

**Hard rule: one source per session.**
If multiple sources are queued in `raw/`, process only the first.
Commit, close the session. The orchestrator (n8n or script) starts a new session
for the next source with a clean KV cache.

For automated pipelines: if 5 files arrive in `raw/`, trigger 5 agent sessions
sequentially — not one session with 5 files.
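
A driver for this rule might look like the sketch below. `start_agent_session` is a stand-in for whatever actually launches the agent, and stopping at the first failure is an assumed policy, not one stated here.

```bash
#!/usr/bin/env bash
# Hypothetical driver: one fresh agent session per queued source, in sequence.
# start_agent_session is a stand-in; replace it with the real launcher.
start_agent_session() {
  echo "new session for: $1"
}

process_queue() {
  local queue_dir="$1" source_file
  for source_file in "$queue_dir"/*; do
    [ -f "$source_file" ] || continue                 # skip directories and empty globs
    start_agent_session "$source_file" || return 1    # assumed policy: stop on first failure
  done
}

# Demo with three placeholder files in a temporary directory
demo=$(mktemp -d)
touch "$demo/a.md" "$demo/b.md" "$demo/c.md"
process_queue "$demo"
rm -rf "$demo"
```

Each iteration hands exactly one path to a new session, so no session ever sees the pages produced for a previous source in its context.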

### n8n automation

For Forgejo webhook → automated ingest:

1. Forgejo sends webhook on push to `raw/`
2. n8n receives webhook, identifies new files
3. n8n starts one agent session per new file (sequential, not parallel)
4. Each session: realign the checkout to the base (`git switch <base> && git reset --hard origin/<base>`), then inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
5. Phase 1 agent (`/skill:ingest`) writes the manifest; Phase 2 `run-ingest.sh` opens the PR, then **stops**
6. Human reviews — **merge to accept**, or close the PR + delete the `feat` branch to reject
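
As a rough sketch of step 4, the injected context could be assembled like this. The function name and the layout of the returned text are assumptions for illustration; only the three ingredients (log tail, `PRIVATE_CONTEXT` state, source path) come from the steps above.

```python
def build_session_context(
    log_text: str,
    private_context: bool,
    source_path: str,
    tail_lines: int = 20,
) -> str:
    """Assemble the opening context for one ingest session.

    Mirrors `tail -n 20 wiki/log.md`: only the last `tail_lines` lines of the
    log are included, never the whole file. The layout of the returned text is
    illustrative, not a format defined by this README.
    """
    lines = log_text.splitlines()
    log_tail = "\n".join(lines[-tail_lines:]) if tail_lines > 0 else ""
    state = "enabled" if private_context else "disabled"
    return (
        f"PRIVATE_CONTEXT: {state}\n"
        f"SOURCE: {source_path}\n"
        f"--- wiki/log.md (last {tail_lines} lines) ---\n"
        f"{log_tail}\n"
    )
```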

---

## Workflows

### Ingest

Triggered by a new file in `raw/` (manual or via webhook). Ingest is split into two
phases so that the small local model spends its limited context only on judgement, and
all the deterministic bookkeeping happens outside the model's loop.

**Phase 1 — agent (semantic only).** The `ingest` skill gives the agent read/edit tools
only (no shell). It:

1. Reads the source once
2. Creates `wiki/sources/<slug>.md` — summary and key points
3. Per entity (person, tool, organisation): creates or updates `wiki/entities/<name>.md`
4. Per concept (pattern, theory, decision): creates or updates `wiki/concepts/<name>.md`
5. Checks each touched page for contradictions → applies Conflict Resolution if found
6. Writes `.ingest-manifest.json` (the list of pages it created/modified, the model name,
   a one-line reasoning, the PR summary, and any contradictions) — then **stops**

**Phase 2 — `run-ingest.sh` (deterministic, outside the agent).** The post-processor first
**validates the manifest** — well-formed JSON, expected shape, and every page path confined to
`wiki/` with no `..` (see [Security Model](#security-model)) — then does the mechanical work the
model must not waste context on:

7. Inserts each page into the correct `wiki/index.md` section **in alphabetical order**,
   deduplicated by wikilink (a re-ingest updates the entry, never duplicates it), and bumps the
   index `last_updated` (`index-append.py`)
8. Appends the `INGEST | <slug>` entry to `wiki/log.md` (the model name comes from the
   orchestrator via `INGEST_MODEL` — the agent cannot reliably know its own tag)
9. Runs scoped lint on exactly the pages touched this run (`scoped-lint.sh`, reusing
   `lib/lint.sh`)
10. Commits **only `wiki/`** on `feat/ai-ingest-<slug>` and opens a PR against the integration
    base (`INGEST_BASE`, default `main`); the body matches the `templates/pr-description.md`
    structure (Summary / Pages / Contradictions / Scoped Lint)
11. Emits a single compact JSON line (status, slug, PR url, lint_clean, conflict) for n8n
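
The manifest validation that opens Phase 2 could look roughly like the sketch below. The key names (`pages`, `model`, `reasoning`, `summary`, `contradictions`) are assumptions derived from the prose description of the manifest; the real `run-ingest.sh` may use different names and checks.

```python
import json
from pathlib import PurePosixPath

# Assumed key names; the README describes the contents, not the exact schema.
REQUIRED_KEYS = {"pages", "model", "reasoning", "summary", "contradictions"}


def validate_manifest(raw: str) -> dict:
    """Return the parsed manifest, or raise ValueError describing the problem."""
    try:
        manifest = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"manifest is not well-formed JSON: {exc}") from exc
    if not isinstance(manifest, dict):
        raise ValueError("manifest must be a JSON object")
    missing = REQUIRED_KEYS - set(manifest)
    if missing:
        raise ValueError(f"manifest is missing keys: {sorted(missing)}")
    pages = manifest["pages"]
    if not isinstance(pages, list) or not all(isinstance(p, str) for p in pages):
        raise ValueError("'pages' must be a list of path strings")
    for page in pages:
        parts = PurePosixPath(page).parts
        # Reject absolute paths, any '..' component, and anything not under wiki/.
        if page.startswith("/") or ".." in parts or parts[:1] != ("wiki",) or len(parts) < 2:
            raise ValueError(f"page path is not confined to wiki/: {page!r}")
    return manifest
```

Failing closed matters here: the manifest is model output, so it is treated as untrusted input before any file is touched.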

The agent never runs git, never edits the index/log mechanically, and never lints — those
are deterministic and tested (see [Testing](#testing)). Invocation on the AI node:

```bash
pi --mode json -p "/skill:ingest raw/articles/<file>.md"   # phase 1 → writes manifest
run-ingest.sh <genome>                                     # phase 2 → index/log/lint/PR
```
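
For step 7, the alphabetical, deduplicated index insert can be sketched as follows. This is not the code of `index-append.py`; it assumes each index section is a list of bullet lines that each contain one `[[wikilink]]`, and that ordering is case-insensitive.

```python
import re
from typing import List

# Captures the link target, stopping before an alias (|) or heading anchor (#).
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def upsert_index_entry(entries: List[str], new_entry: str) -> List[str]:
    """Insert `new_entry` into one index section, sorted and deduplicated by wikilink."""

    def key(line: str) -> str:
        match = WIKILINK.search(line)
        if match is None:
            raise ValueError(f"entry has no wikilink: {line!r}")
        return match.group(1).strip().lower()

    by_link = {key(line): line for line in entries}
    by_link[key(new_entry)] = new_entry  # a re-ingest replaces the old entry
    return [by_link[k] for k in sorted(by_link)]
```

Keying on the wikilink rather than the whole line is what makes a re-ingest update an entry instead of duplicating it.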

For private sources (`PRIVATE_CONTEXT: enabled` required):

- All output goes to `wiki/private/<slug>.md` only
- PR title: `[PRIVATE] ingest: <slug>`

**Branch lifecycle & the manual gate.** `run-ingest.sh` / `open-pr.sh` are deliberately
"dumb": they create the `feat/ai-ingest-<slug>` branch, commit only `wiki/`, open the PR, and
stop. They never reset, revert, or touch the integration branch — that lifecycle belongs to
the orchestrator, around the human gate:

- **Before each session** the orchestrator realigns the checkout to the base
  (`git fetch && git switch <base> && git reset --hard origin/<base>`) — a reset of the _local_
  checkout to match the remote, never a force-push to the shared branch.
- **After the PR opens, everything stops** until a human approves: one source per session,
  sequential, no new ingest until the pending PR is closed.
- **Approve = merge. Reject = close the PR and delete the remote `feat` branch.** To undo an
  already-merged ingest, open a _revert PR_ against the base — never rewrite history on a
  shared branch.

The PR base is configurable via `INGEST_BASE` (default `main`). Per-page `maturity` already
encodes stability and tags/releases mark versioned snapshots, so `main` is the integration
branch today. If a linked project later _consumes_ a genome, set `INGEST_BASE=develop` to
buffer ingests on `develop` and cut manual `develop → main` releases — no code change.

### Query

Triggered by an operator question.

1. `qmd search "<query>"` (if the optional qmd extension is installed) → identify
   candidate pages; otherwise start from `wiki/index.md`
2. Read candidate pages directly (qmd already returns file paths — no intermediate index lookup)
3. Synthesise answer with `[[wikilink]]` citations
4. If answer is non-trivial: save as `wiki/queries/<slug>.md` and append to index
5. Append log entry: `QUERY | <subject>`

For general orientation without a specific query: read `wiki/index.md` directly.

### Lint

The lint workflow is split between deterministic bash checks and semantic LLM judgment.

**Step 1 — operator runs bash linter:**

```bash
make lint
```

The bash linter checks automatically:

- YAML frontmatter validity (all mandatory fields present)
- Domain consistency (domain field matches genome name)
- Type validity (value from allowed list)
- Privacy consistency (pages under `private/` directories have `private: true`)
- Page size (warn at 400 lines, error at 800 lines)
- Knowledge decay (stable > 180 days, draft > 90 days)
- Broken internal wikilinks (warnings only — cross-type links produce expected false positives)

**Step 2 — operator provides bash output to LLM agent:**

The agent applies semantic judgment to findings the bash linter cannot make:

- **Orphan pages** (from bash list): for each orphan, identify 1-3 existing pages
  that should link to it; propose specific additions
- **Implicit concepts** (from bash term frequency list): determine if a candidate
  term warrants a dedicated page; draft stub if yes
- **Duplicate concepts**: `qmd search "<concept>"` for suspected duplicates;
  propose merge if confirmed
- **Maturity promotion**: pages with 2+ sources still marked `draft` → propose `stable`

The agent reports all findings as a structured list. It does not modify files
without operator approval. It then appends a `LINT | <summary>` log entry.
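
For reference, the orphan list handed to the agent can be derived deterministically. The sketch below is an illustration in Python, not the project's bash linter; it assumes pages are keyed by their wikilink target (for example `concepts/caching`) and that the caller leaves `wiki/index.md` out, since the index links to every page.

```python
import re
from typing import Dict, List

# Captures the link target, stopping before an alias (|) or heading anchor (#).
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def find_orphans(pages: Dict[str, str]) -> List[str]:
    """Return page keys that no other page links to.

    `pages` maps a wikilink target such as "concepts/caching" to the page's
    Markdown body.
    """
    linked = set()
    for name, body in pages.items():
        for target in WIKILINK.findall(body):
            target = target.strip()
            if target != name:  # a self-link does not rescue a page
                linked.add(target)
    return sorted(name for name in pages if name not in linked)
```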

---

## Knowledge Quality

### PR review workflow

Every agent session that modifies wiki pages ends in a PR (for ingest, `run-ingest.sh` opens it; the agent itself never runs git).
The PR description uses `templates/pr-description.md`:

```markdown
## Summary

One sentence: goal of this session and source processed.

## Pages Created

| Path | Type | Maturity |

## Pages Modified

| Path | Change |

## Contradictions Found

[ ] None / [ ] n conflict file(s) created

## Private Data Accessed

[ ] No (PRIVATE_CONTEXT: disabled) / [ ] Yes

## Scoped Lint (post-ingest)

[ ] Frontmatter valid [ ] No broken links [ ] No issues found
```

This makes human review fast and structured: read the table, scan the diff,
approve or request changes. No exploration required to understand what the agent did.

### Conflict resolution

When new evidence contradicts an existing wiki claim:

1. Keep the existing page unchanged
2. Create `wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md` with:
   - The existing claim and its source
   - The contradicting evidence and its source
   - Agent confidence assessment for each
   - Recommendation: `accept_b` | `keep_a` | `requires_human_review`
3. Add entry to `wiki/index.md` → Conflicts Pending Review section
4. Log entry: `CONFLICT | <concept>`
5. Open PR: `[CONFLICT] <concept> — human review required`

The operator resolves the conflict, updates relevant pages, closes the PR.
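
The naming conventions in steps 2 and 4 are mechanical enough to sketch. The helper names below are hypothetical, and `concept` is assumed to already be a filesystem-safe slug.

```python
from datetime import date

# The three recommendation values listed in step 2.
RECOMMENDATIONS = ("accept_b", "keep_a", "requires_human_review")


def conflict_page_path(concept: str, day: date) -> str:
    """Path of the conflict page for `concept`, dated `day`."""
    return f"wiki/queries/conflict-{concept}-{day.isoformat()}.md"


def conflict_log_heading(concept: str, day: date) -> str:
    """Heading line of the matching wiki/log.md entry."""
    return f"## [{day.isoformat()}] CONFLICT | {concept}"


def check_recommendation(value: str) -> str:
    """Accept only one of the three documented recommendation values."""
    if value not in RECOMMENDATIONS:
        raise ValueError(f"unknown recommendation: {value!r}")
    return value
```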

### Knowledge decay

Pages have a `last_updated` field in frontmatter. During lint passes:

| Maturity | Threshold | Action                                 |
| -------- | --------- | -------------------------------------- |
| `stable` | 180 days  | Flag as stale — add `⚠️ STALE` callout |
| `draft`  | 90 days   | Flag as stale — add `⚠️ STALE` callout |

The agent proposes re-validation but does not change `maturity` without new source evidence.
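
The thresholds in the table reduce to a small date comparison. This sketch assumes "stale" means strictly more than the threshold number of days since `last_updated`, and that maturities without a row in the table (such as `deprecated`) are never flagged.

```python
from datetime import date

# Thresholds from the table above, in days.
STALE_AFTER_DAYS = {"stable": 180, "draft": 90}


def is_stale(maturity: str, last_updated: date, today: date) -> bool:
    """True when a page is older than the threshold for its maturity."""
    threshold = STALE_AFTER_DAYS.get(maturity)
    if threshold is None:  # no threshold defined for this maturity
        return False
    return (today - last_updated).days > threshold
```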

### Cross-genome references

Cross-domain knowledge moves by **pull, never push**: the genome you are working in draws
material _in_; nothing is ever written into another genome. There are **no cross-genome
wikilinks** — submodule pointers make relative paths brittle.

When the working genome needs a concept that lives elsewhere, the **navigation skill** handles
it in the same two-phase shape as ingest:

1. A deterministic collector clones the relevant genomes **read-only at HEAD** (fresh — never the
   pinned submodule state) and assembles a dossier of excerpts with provenance.
2. A semantic pass reads only that dossier; the skill then deposits **one** abstract, non-private
   raw into the working genome at `raw/articles/crossgen-<topic>-<date>.md`.
3. That raw goes through the working genome's normal ingest → PR → human gate, like any source.

Which genomes may be read as **sources** is gated by a per-genome `cross_source: yes|no` flag: a
confidential genome (e.g. a client file) is marked `no` and is never read as a source — the wall
is structural, not a matter of the agent's discipline. The master `AGENTS.md` holds the full
boundary contract.
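
A minimal sketch of the `cross_source` gate and the deposit path follows. The function names are hypothetical, the `<date>` part is assumed to be an ISO date, and excluding the working genome from its own source list is an assumption of this sketch. Any flag value other than `yes` is treated as `no`, so a typo fails closed.

```python
from datetime import date
from typing import Dict, List


def readable_sources(cross_source_flags: Dict[str, str], working: str) -> List[str]:
    """Genomes the collector may clone read-only as sources."""
    return sorted(
        name
        for name, flag in cross_source_flags.items()
        if flag == "yes" and name != working
    )


def crossgen_raw_path(topic: str, day: date) -> str:
    """Where the single abstract raw lands inside the working genome."""
    return f"raw/articles/crossgen-{topic}-{day.isoformat()}.md"
```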

---

## Knowledge Schema

### Frontmatter

Every wiki page must start with valid YAML frontmatter:

```yaml
---
title: "Strict String Title"
type: source | entity | concept | query | conflict | private | index | log
domain: genome-name
tags: [lowercase, hyphen-separated]
maturity: draft | stable | deprecated
last_updated: YYYY-MM-DD
private: true | false
---
```

| Field                  | Rules                                                                    |
| ---------------------- | ------------------------------------------------------------------------ |
| `type`                 | Must be one of: `source entity concept query conflict private index log` |
| `maturity: draft`      | Single source or unvalidated                                             |
| `maturity: stable`     | Confirmed by 2+ independent sources                                      |
| `maturity: deprecated` | Superseded — add `> **DEPRECATED:** <reason>` callout at top             |
| `private: true`        | Required on all pages in `wiki/private/` and `raw/private/`              |

Do not use semantic versioning for content. Git history tracks every change.
`maturity` captures epistemic state; `last_updated` tracks recency.
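
The rules above can be checked mechanically once the YAML has been parsed. The sketch below is an illustration, not the project's `lib/lint.sh`; it works on an already-parsed dict (so it needs no YAML library) and assumes all seven fields from the example are mandatory.

```python
import re
from typing import List

ALLOWED_TYPES = {"source", "entity", "concept", "query", "conflict", "private", "index", "log"}
ALLOWED_MATURITY = {"draft", "stable", "deprecated"}
# Assumption: every field shown in the example block is mandatory.
MANDATORY_FIELDS = ("title", "type", "domain", "tags", "maturity", "last_updated", "private")
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_frontmatter(meta: dict, path: str, genome: str) -> List[str]:
    """Return a list of problems for one page's already-parsed frontmatter."""
    problems = [f"missing field: {name}" for name in MANDATORY_FIELDS if name not in meta]
    if "type" in meta and meta["type"] not in ALLOWED_TYPES:
        problems.append(f"invalid type: {meta['type']!r}")
    if "maturity" in meta and meta["maturity"] not in ALLOWED_MATURITY:
        problems.append(f"invalid maturity: {meta['maturity']!r}")
    if "domain" in meta and meta["domain"] != genome:
        problems.append(f"domain {meta['domain']!r} does not match genome {genome!r}")
    if "last_updated" in meta and not ISO_DATE.match(str(meta["last_updated"])):
        problems.append("last_updated must be YYYY-MM-DD")
    in_private_dir = "private" in path.split("/")[:-1]
    if in_private_dir and meta.get("private") is not True:
        problems.append("pages under private/ must set private: true")
    return problems
```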

### Page types and directories

| Type       | Directory                    | Description                                  |
| ---------- | ---------------------------- | -------------------------------------------- |
| `source`   | `wiki/sources/`              | One page per processed raw source            |
| `entity`   | `wiki/entities/`             | People, tools, organisations, projects       |
| `concept`  | `wiki/concepts/`             | Patterns, theories, architectural decisions  |
| `query`    | `wiki/queries/`              | Preserved answers and analyses               |
| `conflict` | `wiki/queries/conflict-*.md` | Unresolved contradictions                    |
| `private`  | `wiki/private/`              | Private synthesis (PRIVATE_CONTEXT: enabled) |
| `index`    | `wiki/index.md`              | Primary navigation catalog (singleton)       |
| `log`      | `wiki/log.md`                | Operations ledger (singleton)                |

### Page size limits

| Limit    | Lines | Action                              |
| -------- | ----- | ----------------------------------- |
| Soft cap | 400   | Bash linter warns                   |
| Hard cap | 800   | Bash linter errors — split the page |

These limits ensure pages fit within the LLM context window without attention degradation
and keep the wiki atomically navigable.
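
As a small illustration of the two caps, assuming a page may contain up to the cap and only exceeding it triggers the action (the README does not say whether the boundary is inclusive):

```python
def classify_page_size(text: str, soft_cap: int = 400, hard_cap: int = 800) -> str:
    """Return "ok", "warn" (over the soft cap) or "error" (over the hard cap)."""
    line_count = len(text.splitlines())
    if line_count > hard_cap:
        return "error"
    if line_count > soft_cap:
        return "warn"
    return "ok"
```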

### Linking conventions

- **Intra-genome:** `[[folder/file]]` — Obsidian wikilinks only.
- **Cross-genome:** NOT supported via wikilink — submodule pointers make relative paths brittle. When the working genome needs a concept that lives elsewhere, the navigation skill **pulls it in** as one abstract raw under _this_ genome's `raw/articles/`, which then goes through normal ingest. See [Cross-genome references](#cross-genome-references).
- **External:** `[text](https://...)` — standard Markdown.

### Log format

Every operation appends one entry to `wiki/log.md`:

```markdown
## [YYYY-MM-DD] TYPE | Subject

- run_id: `<uuid>`
- model: `<model-name>`
- context_read: `[[path/A]]`, `[[path/B]]`
- output_written: `[[path/C]]`
- reasoning: One sentence — what changed and why.
```

Valid TYPEs: `INGEST` `LINT` `QUERY` `CONFLICT` `CONFIG` `SECURITY`

Parse examples:

```bash
grep "^## \[" wiki/log.md | tail -5          # Last 5 entries
grep "^## \[" wiki/log.md | grep "CONFLICT"  # All conflicts
grep "^## \[2026-05" wiki/log.md             # Entries from a specific month
```

The orchestrator always injects only `tail -n 20 wiki/log.md` into agent context.
The LLM never loads the full log.
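
For tooling that needs more than `grep`, the heading line can be parsed with one regular expression. A minimal sketch, assuming headings follow the format above exactly; entries whose TYPE is not in the valid list are skipped.

```python
import re
from typing import List, NamedTuple

VALID_TYPES = {"INGEST", "LINT", "QUERY", "CONFLICT", "CONFIG", "SECURITY"}
HEADING = re.compile(r"^## \[(\d{4}-\d{2}-\d{2})\] ([A-Z]+) \| (.+)$")


class LogEntry(NamedTuple):
    date: str
    kind: str
    subject: str


def parse_log_headings(log_text: str) -> List[LogEntry]:
    """Extract (date, kind, subject) from every valid entry heading."""
    entries = []
    for line in log_text.splitlines():
        match = HEADING.match(line)
        if match and match.group(2) in VALID_TYPES:
            entries.append(LogEntry(*match.groups()))
    return entries
```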

---

## Collaboration Model

| Role           | Key access        | Permitted operations                                                          |
| -------------- | ----------------- | ----------------------------------------------------------------------------- |
| Owner          | Full — key holder | Read/write everywhere                                                         |
| Collaborator   | None              | Push to `raw/articles/`, `raw/transcripts/`, `raw/code-packs/`, `raw/assets/` |
| Local AI agent | Conditional       | `private/` only when `PRIVATE_CONTEXT: enabled`                               |
| Cloud AI model | Never             | `PRIVATE_CONTEXT` must be `disabled`; private data stays on local network     |

Grant collaborator access: add as Forgejo contributor with Write role.
Never share the git-crypt key — collaborators operate exclusively in public directories.
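
The README does not say how the collaborator boundary is enforced. If a server-side or CI check is wanted, the path rule from the table could be expressed like this (a hypothetical helper, not part of the project):

```python
from pathlib import PurePosixPath

# Directories a collaborator may push to, per the table above.
COLLABORATOR_DIRS = ("raw/articles", "raw/transcripts", "raw/code-packs", "raw/assets")


def collaborator_may_push(path: str) -> bool:
    """True only for files strictly inside one of the public raw/ directories."""
    parts = PurePosixPath(path).parts
    if path.startswith("/") or ".." in parts:
        return False
    return any(
        parts[:2] == tuple(allowed.split("/")) and len(parts) > 2
        for allowed in COLLABORATOR_DIRS
    )
```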

---

## Optional Extensions

### qmd — local Markdown search

[qmd](https://github.com/tobi/qmd) is a local, on-device BM25 + vector search
engine for Markdown files. It has both a CLI (for shell scripts and agent tool calls)
and an MCP server (for native LLM tool use).

Recommended at scale: once a genome exceeds ~150 pages, `qmd search` is significantly
faster and more accurate than navigating `wiki/index.md` manually.

```bash
# Index a genome
qmd index genome-dev/wiki/

# Search
qmd search "graph-based state management"

# Start MCP server (for Claude Code / Codex integration)
qmd serve --port 3333
```

### Obsidian integration

Obsidian is the recommended wiki browser. Open any genome directory as an Obsidian vault.

Recommended setup:

- **Graph view** — visualise page connections; spot orphans and hubs instantly
- **Obsidian Web Clipper** — browser extension to clip articles directly to `raw/articles/`
  as Markdown
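- **Download attachments** — in Settings → Hotkeys, bind "Download attachments for current file" to a hotkey (e.g. Ctrl+Shift+D).
  After clipping, press it to download all images into the vault's attachment folder (set Settings → Files and links → Attachment folder path to `raw/assets/`)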
- **Dataview plugin** — query YAML frontmatter across the wiki;
  `TABLE maturity, last_updated WHERE domain = "genome-dev"` generates dynamic tables
- **Marp plugin** — render Markdown as slide decks directly from wiki content

Note: `.obsidian/` is in `.gitignore`. Workspace and plugin settings are local — not synced.

### n8n automation

n8n (running on the storage node) can automate the ingest pipeline:

1. Forgejo webhook fires on push to a genome's `raw/` directory
2. n8n flow identifies new files
3. For each new file: starts one agent session (sequential — never parallel)
4. Each session receives: `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
5. Phase 1 — agent runs `/skill:ingest` (semantic → writes manifest); Phase 2 —
   `run-ingest.sh` does index/log/lint and opens the PR, returning one JSON line to n8n
6. Human reviews the PR

Key constraint: one source per session, sessions sequential.
Never batch multiple sources into one agent session.

### Intel NPU offloading

If the AI compute node has an Intel NPU (e.g. Core Ultra series):

- Background/auxiliary tasks (OCR of `raw/assets/`, async summarisation, or qmd
  re-indexing **if** the optional qmd extension is in use) → Intel NPU via OpenVINO
- Active reasoning sessions (ingest, query, synthesis) → GPU

Note: the core system has no embedding pipeline (see [Core Philosophy](#core-philosophy)),
so there is nothing to embed here — the NPU is only for auxiliary work. This keeps the
GPU's KV cache free for interactive sessions and lowers power draw for background jobs.

---

## Troubleshooting

### `git-crypt: command not found`

```bash
# Ubuntu/Debian
sudo apt install git-crypt

# macOS
brew install git-crypt
```

### `make setup` fails with "MISSING: jq"

```bash
make doctor   # identifies all missing tools
sudo apt install git git-crypt curl jq
```

### Pre-commit hook blocks a commit with "PLAINTEXT LEAK DETECTED"

The staged file is in a path matching `**/private/**` but is not encrypted.

Fix options:

1. Verify `.gitattributes` contains `**/private/** filter=git-crypt diff=git-crypt -text`
2. Run `git-crypt init` if git-crypt is not initialised in this repo
3. Run `git-crypt status` to check the encryption state of all files

Never use `git commit --no-verify` to bypass this check.
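
For background, one way a hook can tell ciphertext from plaintext: a blob encrypted by git-crypt starts with the 10-byte header `\0GITCRYPT\0`. The sketch below shows that check in Python; it illustrates the idea and is not necessarily how this repository's pre-commit hook is implemented.

```python
# Header that git-crypt prepends to every encrypted blob.
GIT_CRYPT_MAGIC = b"\x00GITCRYPT\x00"


def is_git_crypt_blob(data: bytes) -> bool:
    """True if the staged blob content starts with the git-crypt header."""
    return data.startswith(GIT_CRYPT_MAGIC)
```

A staged file under `**/private/**` whose blob fails this check is plaintext and should not be committed.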

### `git-crypt status` shows files as "not encrypted" after init

The `.gitattributes` rule must be committed before files in `private/` are staged.
If files were staged before `.gitattributes` was committed:

```bash
git rm -r --cached raw/private/ wiki/private/
git add raw/private/ wiki/private/
git commit -m "fix: re-stage private files for encryption"
```

### Agent returns stale or missing cross-references

Likely causes:

1. Session was too long — KV cache degraded. Use one source per session.
2. `wiki/index.md` was not read at session start — agent lacked the page catalog.
3. qmd index is stale — re-index: `qmd index <genome>/wiki/`

### Submodules show as "modified" after `make sync`

This is normal if genome repos have new commits. Update master's pointers:

```bash
cd master-knowledge-genome
git add .
git commit -m "chore: update submodule pointers"
git push
```

### `bw unlock` fails

Verify you are using `bw` (standard Bitwarden CLI), not `bws` (Secrets Manager CLI).
`bws` does not work with self-hosted Vaultwarden.

```bash
bw --version                                   # should print e.g. "2024.x.x"
bw config server https://vault.yourserver.com  # must be set while logged out
bw login
bw unlock                                      # then export the printed BW_SESSION value
```