docs: Update README

2026-06-05 09:27:14 +02:00 · 2026-06-05 09:27:14 +02:00 · 13d34b4906
commit 13d34b4906
parent 42c1302035
1 changed files with 170 additions and 58 deletions
--- a/README.md
+++ b/README.md
@ -19,16 +19,17 @@ and a human-in-the-loop Git Flow for quality control.
 5. [Configuration](#configuration)
 6. [Quick Start](#quick-start)
 7. [Makefile Reference](#makefile-reference)
-8. [Genome Lifecycle](#genome-lifecycle)
-9. [Security Model](#security-model)
-10. [Key Management](#key-management)
-11. [Agent Sessions](#agent-sessions)
-12. [Workflows](#workflows)
-13. [Knowledge Quality](#knowledge-quality)
-14. [Knowledge Schema](#knowledge-schema)
-15. [Collaboration Model](#collaboration-model)
-16. [Optional Extensions](#optional-extensions)
-17. [Troubleshooting](#troubleshooting)
+8. [Testing](#testing)
+9. [Genome Lifecycle](#genome-lifecycle)
+10. [Security Model](#security-model)
+11. [Key Management](#key-management)
+12. [Agent Sessions](#agent-sessions)
+13. [Workflows](#workflows)
+14. [Knowledge Quality](#knowledge-quality)
+15. [Knowledge Schema](#knowledge-schema)
+16. [Collaboration Model](#collaboration-model)
+17. [Optional Extensions](#optional-extensions)
+18. [Troubleshooting](#troubleshooting)

 ---

@ -110,10 +111,18 @@ genome-{name}/
 | Wiki        | `wiki/`     | LLM         | Agent creates, updates, cross-links, maintains.       |
 | Schema      | `AGENTS.md` | Human + LLM | Co-evolved contract defining structure and workflows. |

+### Linked projects (optional)
+
+A genome can declare a **linked project repository**: a separate repo where the
+knowledge in that genome is meant to be applied (e.g. `genome-dev` linked to an app
+repo). The link is recorded as the third field of the genome's `registry.sh` entry and
+rendered into the genome's `AGENTS.md` under `## Linked Project`. A genome with no link
+is _knowledge-only_ and behaves exactly as before. See [Configuration](#configuration).
+
 ### Framework structure

 ```text
-knowledge-genome-setup/               ← This repository (setup tooling)
+knowledge-genome-orchestrator/        ← This repository (setup tooling)
 ├── globals.env                       ← Static KEY=VALUE config (Make-includable)
 ├── registry.sh                       ← Bash-only: GENOMES array + dynamic paths
 ├── Makefile                          ← Entry point for all operations
@ -121,6 +130,7 @@ knowledge-genome-setup/               ← This repository (setup tooling)
 │   ├── output.sh                     ← Terminal helpers (colors, log levels)
 │   ├── deps.sh                       ← Dependency validation
 │   ├── scaffold.sh                   ← Template rendering engine
+│   ├── structure.sh                  ← Canonical genome layout (single source of truth)
 │   ├── lint.sh                       ← Per-file validation functions
 │   └── git-crypt.sh                  ← git-crypt lifecycle (init, export, verify, rotate)
 ├── providers/
@ -131,18 +141,41 @@ knowledge-genome-setup/               ← This repository (setup tooling)
 │   ├── setup-master.sh               ← Master repo initialisation
 │   ├── setup-genomes.sh              ← Genome provisioning loop
 │   ├── add-genome.sh                 ← Add a single new genome
-│   └── lint-genomes.sh               ← Quality control across all genomes
-└── templates/
-    ├── agents-genome.md              ← Per-genome agent contract template
-    ├── agents-master.md              ← Master coordination schema template
-    ├── wiki-index.md                 ← Index template (rendered per genome)
-    ├── wiki-log.md                   ← Log template (rendered per genome)
-    ├── pr-description.md             ← PR review checklist template
-    ├── pre-commit.sh                 ← Security hook template
-    ├── gitattributes                 ← Git encryption rules template
-    └── gitignore                     ← Git ignore template
+│   ├── lint-genomes.sh               ← Quality control across all genomes
+│   └── verify-genomes.sh             ← Structure verify / --sync across all genomes
+├── templates/
+│   ├── agents-genome.md              ← Per-genome agent contract template
+│   ├── agents-master.md              ← Master coordination schema template
+│   ├── readme-master.md              ← Master repo README template
+│   ├── wiki-index.md                 ← Index template (rendered per genome)
+│   ├── wiki-log.md                   ← Log template (rendered per genome)
+│   ├── pr-description.md             ← PR review checklist template
+│   ├── pre-commit.sh                 ← Security hook template
+│   ├── gitattributes                 ← Git encryption rules template
+│   └── gitignore                     ← Git ignore template
+├── skills/
+│   └── ingest/                       ← pi skill: deployed to the AI node (vm101)
+│       ├── SKILL.md                  ← Semantic-only contract (read/edit, emits manifest)
+│       ├── references/               ← On-demand reference docs for the agent
+│       └── scripts/                  ← Deterministic post-processor (runs outside the agent)
+│           ├── run-ingest.sh         ← Orchestrator: consumes the manifest, emits one JSON line
+│           ├── slug.sh               ← Slug normalisation
+│           ├── index-append.py       ← Sorted insert into wiki/index.md + last_updated bump
+│           ├── log-append.sh         ← Append a wiki/log.md entry
+│           ├── scoped-lint.sh        ← Lint only the pages touched this run (reuses lib/lint.sh)
+│           └── open-pr.sh            ← Branch / commit / push / open PR (DRY_RUN seam for tests)
+└── tests/                            ← bats suite — deterministic, no LLM/GPU (see Testing)
+    ├── helpers.bash
+    ├── scripts.bats
+    ├── lint.bats
+    ├── structure.bats
+    └── run-ingest.bats
 ```

+> The `skills/ingest/` directory is version-controlled here but **deployed** to the AI
+> node (vm101) under `~/.pi/agent/skills/ingest`. The agent (`pi`) does only semantic work
+> and writes a manifest; `run-ingest.sh` does the mechanical steps. See [Workflows → Ingest](#ingest).
+
 ---

 ## System Requirements
@ -156,7 +189,9 @@ All tools (git-crypt, bw, qmd) have native Linux binaries.

 All scripts are compatible with macOS. Requirements:

- bash 3.2+ (macOS default) — fully supported. All `bash 4+` constructs removed.
+- bash 3.2+ (macOS default) — supported for the **setup scripts** (`make` targets, scaffolding).
+  The `ingest` skill uses bash 4+ constructs (`mapfile`), but it is deployed and run on the
+  Linux AI node, not on the macOS setup machine — so this is not a constraint in practice.
 - GNU coreutils not required — BSD variants of `date`, `grep`, `sed` all handled.
 - `git-crypt`: install via Homebrew — `brew install git-crypt`
 - `jq`, `curl`: pre-installed or via Homebrew
@ -195,6 +230,11 @@ The system is designed for a homelab architecture:
 > the index, and the log tail is a cost. This is why all agent files are token-optimised
 > and sessions are kept to one source at a time.

+> **Reference deployment:** the table above is a target profile, not a hard requirement.
+> The current setup runs a single 16GB GPU (RTX 5060 Ti) with a ~9B model for interactive
+> ingest, and offloads heavy/async synthesis to a cloud model. Smaller models work — they
+> just make the "one source per session" discipline and the token budget matter more.
+
 ---

 ## Prerequisites
@ -285,14 +325,17 @@ resolution. Never included by Make.

 ```bash
 # Dynamic paths (resolved at source time)
-WORK_DIR="${HOME}/knowledge-genome-setup"
+WORK_DIR="${HOME}/knowledge-genome-orchestrator"
 KEYS_DIR="${WORK_DIR}/keys"

-# Genome registry — format: "name|description"
+# Genome registry — format: "name|description|linked_repo"
+# The third field is OPTIONAL:
+#   - leave it empty  → knowledge-only genome (no linked project)
+#   - owner/repo      → genome is linked to that project repository (rendered into AGENTS.md)
 GENOMES=(
-  "genome-dev|Web development, TUI, Angular, software architecture"
-  "genome-finance|Personal finance, investments, market analysis"
-  "genome-homelab|Infrastructure, network configs, architecture logs"
+  "genome-dev|Web development, TUI, Angular, software architecture|myorg/my-app"
+  "genome-finance|Personal finance, investments, market analysis|"
+  "genome-homelab|Infrastructure, network configs, architecture logs|"
 )
 ```
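
Review note: a minimal sketch of how a three-field registry entry could be split, assuming the `name|description|linked_repo` format shown above. The helper name `describe_genome` is hypothetical and does not come from the repository. The sketch uses only constructs available in bash 3.2.

```bash
#!/usr/bin/env bash
# Sketch only: splits registry entries of the form "name|description|linked_repo".
# describe_genome is a hypothetical helper, not part of the repository.
GENOMES=(
  "genome-dev|Web development, TUI, Angular, software architecture|myorg/my-app"
  "genome-finance|Personal finance, investments, market analysis|"
)

describe_genome() {
  local entry="$1" name desc linked
  # IFS='|' splits on the pipe; an empty or absent third field leaves $linked empty.
  IFS='|' read -r name desc linked <<< "$entry"
  if [ -n "$linked" ]; then
    printf '%s -> linked to %s\n' "$name" "$linked"
  else
    printf '%s -> knowledge-only\n' "$name"
  fi
}

for entry in "${GENOMES[@]}"; do
  describe_genome "$entry"
done
```

Because `read` leaves the third variable empty when the field is missing, older two-field entries would also fall through to the knowledge-only branch in this sketch.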

@ -315,8 +358,8 @@ export GITHUB_TOKEN="your_github_token"

 ```bash
 # 1. Clone the setup framework
-git clone <setup-repo-url> knowledge-genome-setup
-cd knowledge-genome-setup
+git clone <setup-repo-url> knowledge-genome-orchestrator
+cd knowledge-genome-orchestrator

 # 2. Configure your environment
 cp globals.env.example globals.env   # edit with your values
@ -358,16 +401,19 @@ After setup completes:

 ## Makefile Reference

-| Target                            | Description                                                                    |
-| --------------------------------- | ------------------------------------------------------------------------------ |
-| `make setup`                      | Full system initialisation — master repo + all genomes in `registry.sh`        |
-| `make add-genome NAME=x DESC="y"` | Scaffold and register a single new genome                                      |
-| `make lint`                       | Run quality checks across all genomes (schema, privacy, decay, page size)      |
-| `make status`                     | Show submodule status and first 10 git-crypt encryption states                 |
-| `make lock`                       | Lock all encrypted repos (master + all genome submodules)                      |
-| `make doctor`                     | Verify required tools: git, git-crypt, curl, jq; warn if bw missing            |
-| `make sync`                       | `git submodule update --init --recursive` + report unpushed commits per genome |
-| `make help`                       | Print all available targets                                                    |
+| Target                                                | Description                                                                           |
+| ----------------------------------------------------- | ------------------------------------------------------------------------------------- |
+| `make setup`                                          | Full system initialisation — master repo + all genomes in `registry.sh`               |
+| `make add-genome NAME=x DESC="y" [LINKED=owner/repo]` | Scaffold and register a single new genome (optional linked project)                   |
+| `make lint`                                           | Run quality checks across all genomes (schema, privacy, decay, page size)             |
+| `make verify-structure`                               | Report directory drift of each genome vs the canonical layout (`lib/structure.sh`)    |
+| `make sync-structure`                                 | Create any missing canonical directories across all genomes (safe, idempotent)        |
+| `make test`                                           | Run the bats test suite (deterministic; no LLM/GPU/network) — see [Testing](#testing) |
+| `make status`                                         | Show submodule status and per-genome git-crypt encryption state                       |
+| `make lock`                                           | Lock all encrypted repos (master + all genome submodules)                             |
+| `make doctor`                                         | Verify required tools: git, git-crypt, curl, jq; warn if bw missing                   |
+| `make sync`                                           | `git submodule update --init --recursive` + report unpushed commits per genome        |
+| `make help`                                           | Print all available targets                                                           |
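
Review note: the `verify-structure` and `sync-structure` rows above describe a report step and an idempotent create step. A rough sketch of that split follows. The directory list and both function names are assumptions made for illustration; the real canonical layout lives in `lib/structure.sh` and may differ.

```bash
#!/usr/bin/env bash
# Sketch only: CANONICAL_DIRS and both function names are hypothetical.
CANONICAL_DIRS="raw wiki wiki/sources wiki/entities wiki/concepts wiki/queries"

# Print every canonical directory missing under the genome root; return 1 on drift.
structure_report() {
  local root="$1" dir drift=0
  for dir in $CANONICAL_DIRS; do
    if [ ! -d "$root/$dir" ]; then
      printf 'MISSING %s\n' "$dir"
      drift=1
    fi
  done
  return "$drift"
}

# Create whatever is missing; mkdir -p makes repeated runs harmless.
structure_sync() {
  local root="$1" dir
  for dir in $CANONICAL_DIRS; do
    mkdir -p "$root/$dir"
  done
}

genome="$(mktemp -d)"                  # throwaway genome root for the demo
mkdir -p "$genome/wiki/sources"
structure_report "$genome" || true     # reports the directories not yet created
structure_sync "$genome"
structure_report "$genome" && echo "no drift"
rm -rf "$genome"
```

Keeping the report read-only and putting all writes in the sync step is what allows the check to run safely in CI or a git hook.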

 ### Examples

@ -378,6 +424,12 @@ make doctor
 # Add a new genome after initial setup
 make add-genome NAME=genome-research DESC="Academic papers and deep research"

+# Add a genome linked to a project repository
+make add-genome NAME=genome-dev DESC="Web development" LINKED=myorg/my-app
+
+# Check every genome against the canonical directory layout
+make verify-structure
+
 # Run full lint pass (bash deterministic checks)
 make lint

@ -390,6 +442,38 @@ make lock

 ---

+## Testing
+
+The mechanical layer (slug, index, log, lint, structure, the ingest orchestrator) is
+covered by a [bats](https://github.com/bats-core/bats-core) suite. The tests are
+**deterministic and have zero dependency on the LLM, the GPU, or the network**: they
+simulate the agent's output with fixtures and exercise the scripts directly, so they run
+anywhere git and bash are available (laptop, CI, a git hook). They are **not** meant to
+run on the AI node or via n8n.
+
+```bash
+sudo apt install bats        # once (Debian/Ubuntu)
+make test                    # or: bats tests/
+```
+
+| File              | Covers                                                                         |
+| ----------------- | ------------------------------------------------------------------------------ |
+| `scripts.bats`    | `slug.sh`, `log-append.sh`, `index-append.py` (insert, sort, bump, idempotent) |
+| `lint.bats`       | `lib/lint.sh` validators + `scoped-lint.sh`                                    |
+| `structure.bats`  | `lib/structure.sh` report / sync                                               |
+| `run-ingest.bats` | `run-ingest.sh` end-to-end (DRY_RUN, local bare remote) — needs `jq`           |
+
+Each test builds its own throwaway genome with a local bare remote, configured to ignore
+the operator's global git settings (signing, global hooks) so the suite is hermetic. The
+`run-ingest` tests auto-`skip` if `jq` is absent. If you change the canonical layout in
+`lib/structure.sh`, update `FIXTURE_DIRS` in `tests/helpers.bash` to match.
+
+> Why this matters: the only non-deterministic part of the system is the model. Pinning
+> the mechanical layer with tests means that when an ingest misbehaves, you know it's the
+> model or the prompt — not the plumbing.
+
+---
+
 ## Genome Lifecycle

 ### Initial setup
@ -431,6 +515,7 @@ template files:
 | `{{GENOME_NAME}}`       | registry.sh | `genome-dev`                   |
 | `{{GENOME_NAME_UPPER}}` | derived     | `GENOME-DEV`                   |
 | `{{GENOME_DESC}}`       | registry.sh | `Web development...`           |
+| `{{LINKED_PROJECT}}`    | registry.sh | `myorg/my-app` (or `none`)     |
 | `{{FORGEJO_URL}}`       | globals.env | `https://git.yourserver.com`   |
 | `{{FORGEJO_USER}}`      | globals.env | `yourusername`                 |
 | `{{VAULTWARDEN_URL}}`   | globals.env | `https://vault.yourserver.com` |
@ -593,9 +678,9 @@ git clone https://git.yourserver.com/yourusername/genome-dev.git
 If a key is lost or compromised:

 ```bash
-# From the knowledge-genome-setup/ directory
+# From the knowledge-genome-orchestrator/ directory
 source lib/git-crypt.sh
-cd ~/knowledge-genome-setup/genome-dev
+cd ~/knowledge-genome-orchestrator/genome-dev
 gcrypt_rotate_key "genome-dev"
 ```

@ -643,7 +728,8 @@ The agent executes in this order at the start of every session:

 1. Read `wiki/index.md` — primary catalog of all pages and maturity
 2. Read last 20 log entries (injected by orchestrator — does NOT open `wiki/log.md` directly)
-3. For tasks involving related pages: `qmd search "<query>"` before opening any files
+3. For tasks involving related pages: if the optional `qmd` extension is installed, run
+   `qmd search "<query>"` before opening any files; otherwise navigate from `wiki/index.md`
 4. Operate on individual files — never scan entire directories

 ### One source per session
@ -668,7 +754,7 @@ For Forgejo webhook → automated ingest:
 2. n8n receives webhook, identifies new files
 3. n8n starts one agent session per new file (sequential, not parallel)
 4. Each session: inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
-5. Agent ingest workflow runs, opens PR
+5. Phase 1 agent (`/skill:ingest`) writes the manifest; Phase 2 `run-ingest.sh` opens the PR
 6. Human reviews and merges PR

 ---
@ -677,17 +763,39 @@ For Forgejo webhook → automated ingest:

 ### Ingest

-Triggered by a new file in `raw/` (manual or via webhook).
+Triggered by a new file in `raw/` (manual or via webhook). Ingest is split into two
+phases so that the small local model spends its limited context only on judgement, and
+all the deterministic bookkeeping happens outside the model's loop.

-1. Read source once
-2. Create `wiki/sources/<slug>.md` — summary and key points
-3. Per entity (person, tool, organisation): create or update `wiki/entities/<name>.md`
-4. Per concept (pattern, theory, decision): create or update `wiki/concepts/<name>.md`
-5. Check each touched page for contradictions → apply Conflict Resolution if found
-6. Append entry to `wiki/index.md` (bottom of relevant section — do not reorder)
-7. Append log entry: `INGEST | <slug>`
-8. Run scoped lint on pages created or modified in this session; report in PR
-9. Commit on `feat/ai-ingest-<slug>`; open PR using `templates/pr-description.md`
+**Phase 1 — agent (semantic only).** The `ingest` skill gives the agent read/edit tools
+only (no shell). It:
+
+1. Reads the source once
+2. Creates `wiki/sources/<slug>.md` — summary and key points
+3. Per entity (person, tool, organisation): creates or updates `wiki/entities/<name>.md`
+4. Per concept (pattern, theory, decision): creates or updates `wiki/concepts/<name>.md`
+5. Checks each touched page for contradictions → applies Conflict Resolution if found
+6. Writes `.ingest-manifest.json` (the list of pages it created/modified, the model name,
+   a one-line reasoning, the PR summary, and any contradictions) — then **stops**
+
+**Phase 2 — `run-ingest.sh` (deterministic, outside the agent).** The post-processor
+consumes the manifest and does the mechanical work the model must not waste context on:
+
+7. Inserts each page into the correct `wiki/index.md` section **in alphabetical order**
+   (`index-append.py`) and bumps the index `last_updated`
+8. Appends the `INGEST | <slug>` entry to `wiki/log.md`
+9. Runs scoped lint on exactly the pages touched this run (`scoped-lint.sh`, reusing
+   `lib/lint.sh`)
+10. Commits on `feat/ai-ingest-<slug>` and opens the PR using `templates/pr-description.md`
+11. Emits a single compact JSON line (status, slug, PR url, lint_clean, conflict) for n8n
+
+The agent never runs git, never edits the index/log mechanically, and never lints. Those
+steps are deterministic and covered by tests (see [Testing](#testing)). Invocation on the AI node:
+
+```bash
+pi --mode json -p "/skill:ingest raw/articles/<file>.md"   # phase 1 → writes manifest
+run-ingest.sh <genome>                                     # phase 2 → index/log/lint/PR
+```
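
Review note: step 11 says Phase 2 ends by emitting one compact JSON line for n8n. The sketch below shows one way such a line could be produced in plain bash. The function names and the JSON key spellings are assumptions: the README names the fields (status, slug, PR url, lint_clean, conflict) but not their exact keys, and the real `run-ingest.sh` may build its output differently.

```bash
#!/usr/bin/env bash
# Sketch only: json_escape, emit_result and the key names are hypothetical.

# Escape backslashes and double quotes so a value is safe inside a JSON string.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  printf '%s' "$s"
}

# Usage: emit_result <status> <slug> <pr_url> <lint_clean: true|false> <conflict: true|false>
emit_result() {
  printf '{"status":"%s","slug":"%s","pr_url":"%s","lint_clean":%s,"conflict":%s}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")" "$4" "$5"
}

# Illustrative values; the PR URL is left empty here on purpose.
emit_result "ok" "example-article" "" true false
```

A single line with a fixed key set keeps the n8n side simple: the workflow can parse the last line of output and branch on it without scraping log text.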

 For private sources (`PRIVATE_CONTEXT: enabled` required):

@ -698,7 +806,8 @@ For private sources (`PRIVATE_CONTEXT: enabled` required):

 Triggered by an operator question.

-1. `qmd search "<query>"` → identify candidate pages
+1. `qmd search "<query>"` (if the optional qmd extension is installed) → identify
+   candidate pages; otherwise start from `wiki/index.md`
 2. Read candidate pages directly (qmd already returns file paths — no intermediate index lookup)
 3. Synthesise answer with `[[wikilink]]` citations
 4. If answer is non-trivial: save as `wiki/queries/<slug>.md` and append to index
@ -974,7 +1083,8 @@ n8n (running on the storage node) can automate the ingest pipeline:
 2. n8n flow identifies new files
 3. For each new file: starts one agent session (sequential — never parallel)
 4. Each session receives: `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
-5. Agent runs ingest workflow and opens PR
+5. Phase 1 — agent runs `/skill:ingest` (semantic → writes manifest); Phase 2 —
+   `run-ingest.sh` does index/log/lint and opens the PR, returning one JSON line to n8n
 6. Human reviews the PR

 Key constraint: one source per session, sessions sequential.
@ -984,11 +1094,13 @@ Never batch multiple sources into one agent session.

 If the AI compute node has an Intel NPU (e.g. Core Ultra series):

- Background tasks (embedding updates, index refresh) → Intel NPU via OpenVINO
+- Background/auxiliary tasks (OCR of `raw/assets/`, async summarisation, or qmd
+  re-indexing **if** the optional qmd extension is in use) → Intel NPU via OpenVINO
 - Active reasoning sessions (ingest, query, synthesis) → GPU

-This keeps the GPU's KV cache free for interactive work and reduces power consumption
-for background operations.
+Note: the core system has no embedding pipeline (see [Core Philosophy](#core-philosophy)),
+so there is nothing to embed here — the NPU is only for auxiliary work. This keeps the
+GPU's KV cache free for interactive sessions and lowers power draw for background jobs.

 ---