docs: Update README

Matteo Cherubini 2026-06-05 09:27:14 +02:00
parent 42c1302035
commit 13d34b4906

README.md

@ -19,16 +19,17 @@ and a human-in-the-loop Git Flow for quality control.
5. [Configuration](#configuration)
6. [Quick Start](#quick-start)
7. [Makefile Reference](#makefile-reference)
8. [Genome Lifecycle](#genome-lifecycle)
9. [Security Model](#security-model)
10. [Key Management](#key-management)
11. [Agent Sessions](#agent-sessions)
12. [Workflows](#workflows)
13. [Knowledge Quality](#knowledge-quality)
14. [Knowledge Schema](#knowledge-schema)
15. [Collaboration Model](#collaboration-model)
16. [Optional Extensions](#optional-extensions)
17. [Troubleshooting](#troubleshooting)
8. [Testing](#testing)
9. [Genome Lifecycle](#genome-lifecycle)
10. [Security Model](#security-model)
11. [Key Management](#key-management)
12. [Agent Sessions](#agent-sessions)
13. [Workflows](#workflows)
14. [Knowledge Quality](#knowledge-quality)
15. [Knowledge Schema](#knowledge-schema)
16. [Collaboration Model](#collaboration-model)
17. [Optional Extensions](#optional-extensions)
18. [Troubleshooting](#troubleshooting)
---
@ -110,10 +111,18 @@ genome-{name}/
| Wiki | `wiki/` | LLM | Agent creates, updates, cross-links, maintains. |
| Schema | `AGENTS.md` | Human + LLM | Co-evolved contract defining structure and workflows. |
### Linked projects (optional)
A genome can optionally declare a **linked project repository** — a separate repo where
the knowledge in that genome is meant to be applied (e.g. `genome-dev` linked to an app
repo). The link is recorded as a third field in the registry and rendered into the
genome's `AGENTS.md` (`## Linked Project`). A genome with no link is _knowledge-only_ and
behaves exactly as before. See [Configuration](#configuration).
### Framework structure
```text
knowledge-genome-setup/ ← This repository (setup tooling)
knowledge-genome-orchestrator/ ← This repository (setup tooling)
├── globals.env ← Static KEY=VALUE config (Make-includable)
├── registry.sh ← Bash-only: GENOMES array + dynamic paths
├── Makefile ← Entry point for all operations
@ -121,6 +130,7 @@ knowledge-genome-setup/ ← This repository (setup tooling)
│ ├── output.sh ← Terminal helpers (colors, log levels)
│ ├── deps.sh ← Dependency validation
│ ├── scaffold.sh ← Template rendering engine
│ ├── structure.sh ← Canonical genome layout (single source of truth)
│ ├── lint.sh ← Per-file validation functions
│ └── git-crypt.sh ← git-crypt lifecycle (init, export, verify, rotate)
├── providers/
@ -131,18 +141,41 @@ knowledge-genome-setup/ ← This repository (setup tooling)
│ ├── setup-master.sh ← Master repo initialisation
│ ├── setup-genomes.sh ← Genome provisioning loop
│ ├── add-genome.sh ← Add a single new genome
│ └── lint-genomes.sh ← Quality control across all genomes
└── templates/
├── agents-genome.md ← Per-genome agent contract template
├── agents-master.md ← Master coordination schema template
├── wiki-index.md ← Index template (rendered per genome)
├── wiki-log.md ← Log template (rendered per genome)
├── pr-description.md ← PR review checklist template
├── pre-commit.sh ← Security hook template
├── gitattributes ← Git encryption rules template
└── gitignore ← Git ignore template
│ ├── lint-genomes.sh ← Quality control across all genomes
│ └── verify-genomes.sh ← Structure verify / --sync across all genomes
├── templates/
│ ├── agents-genome.md ← Per-genome agent contract template
│ ├── agents-master.md ← Master coordination schema template
│ ├── readme-master.md ← Master repo README template
│ ├── wiki-index.md ← Index template (rendered per genome)
│ ├── wiki-log.md ← Log template (rendered per genome)
│ ├── pr-description.md ← PR review checklist template
│ ├── pre-commit.sh ← Security hook template
│ ├── gitattributes ← Git encryption rules template
│ └── gitignore ← Git ignore template
├── skills/
│ └── ingest/ ← pi skill: deployed to the AI node (vm101)
│ ├── SKILL.md ← Semantic-only contract (read/edit, emits manifest)
│ ├── references/ ← On-demand reference docs for the agent
│ └── scripts/ ← Deterministic post-processor (runs outside the agent)
│ ├── run-ingest.sh ← Orchestrator: consumes the manifest, emits one JSON line
│ ├── slug.sh ← Slug normalisation
│ ├── index-append.py ← Sorted insert into wiki/index.md + last_updated bump
│ ├── log-append.sh ← Append a wiki/log.md entry
│ ├── scoped-lint.sh ← Lint only the pages touched this run (reuses lib/lint.sh)
│ └── open-pr.sh ← Branch / commit / push / open PR (DRY_RUN seam for tests)
└── tests/ ← bats suite — deterministic, no LLM/GPU (see Testing)
├── helpers.bash
├── scripts.bats
├── lint.bats
├── structure.bats
└── run-ingest.bats
```
> The `skills/ingest/` directory is version-controlled here but **deployed** to the AI
> node (vm101) under `~/.pi/agent/skills/ingest`. The agent (`pi`) does only semantic work
> and writes a manifest; `run-ingest.sh` does the mechanical steps. See [Workflows → Ingest](#ingest).
---
## System Requirements
@ -156,7 +189,9 @@ All tools (git-crypt, bw, qmd) have native Linux binaries.
All scripts are compatible with macOS. Requirements:
- bash 3.2+ (macOS default) — fully supported. All `bash 4+` constructs removed.
- bash 3.2+ (macOS default) — supported for the **setup scripts** (`make` targets, scaffolding).
The `ingest` skill uses bash 4+ constructs (`mapfile`), but it is deployed and run on the
Linux AI node, not on the macOS setup machine — so this is not a constraint in practice.
- GNU coreutils not required — BSD variants of `date`, `grep`, `sed` all handled.
- `git-crypt`: install via Homebrew — `brew install git-crypt`
- `jq`, `curl`: pre-installed or via Homebrew
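
If a script ever needs to run under bash 3.2 as well, `mapfile` can be replaced by a plain `read` loop. The following is a minimal sketch of that substitution; `read_lines` and `READ_LINES` are hypothetical names chosen for illustration and do not come from the repository.

```bash
#!/usr/bin/env bash
# Bash 3.2-compatible stand-in for `mapfile -t arr < file`.
# read_lines / READ_LINES are hypothetical names, not part of the repository.
read_lines() {
  # Fills the global array READ_LINES with one element per line read from stdin.
  READ_LINES=()
  local line
  # The `|| [ -n "$line" ]` part keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    READ_LINES+=("$line")
  done
}

read_lines < <(printf 'wiki/index.md\nwiki/log.md\nwiki/sources/example.md\n')
echo "read ${#READ_LINES[@]} lines; first is ${READ_LINES[0]}"
```

The loop is slower than `mapfile` on large inputs, which rarely matters for file lists of this size.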
@ -195,6 +230,11 @@ The system is designed for a homelab architecture:
> the index, and the log tail is a cost. This is why all agent files are token-optimised
> and sessions are kept to one source at a time.
> **Reference deployment:** the table above is a target profile, not a hard requirement.
> The current setup runs a single 16 GB GPU (RTX 5060 Ti) with a ~9B model for interactive
> ingest, and offloads heavy/async synthesis to a cloud model. Smaller models work, but
> they make the "one source per session" discipline and the token budget matter more.
---
## Prerequisites
@ -285,14 +325,17 @@ resolution. Never included by Make.
```bash
# Dynamic paths (resolved at source time)
WORK_DIR="${HOME}/knowledge-genome-setup"
WORK_DIR="${HOME}/knowledge-genome-orchestrator"
KEYS_DIR="${WORK_DIR}/keys"
# Genome registry — format: "name|description"
# Genome registry — format: "name|description|linked_repo"
# The third field is OPTIONAL:
# - leave it empty → knowledge-only genome (no linked project)
# - owner/repo → genome is linked to that project repository (rendered into AGENTS.md)
GENOMES=(
"genome-dev|Web development, TUI, Angular, software architecture"
"genome-finance|Personal finance, investments, market analysis"
"genome-homelab|Infrastructure, network configs, architecture logs"
"genome-dev|Web development, TUI, Angular, software architecture|myorg/my-app"
"genome-finance|Personal finance, investments, market analysis|"
"genome-homelab|Infrastructure, network configs, architecture logs|"
)
```
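
To illustrate how the optional third field can be consumed, here is a minimal sketch that splits each entry on `|` and falls back to `none` when no linked repository is given. `parse_genome_entry` is a hypothetical helper, not a function from `registry.sh`, and the real scripts may parse entries differently.

```bash
#!/usr/bin/env bash
# Sketch: split "name|description|linked_repo" registry entries.
# parse_genome_entry is a hypothetical helper, not taken from registry.sh.
GENOMES=(
  "genome-dev|Web development, TUI, Angular, software architecture|myorg/my-app"
  "genome-finance|Personal finance, investments, market analysis|"
  "genome-homelab|Infrastructure, network configs, architecture logs"
)

parse_genome_entry() {
  # Sets GENOME_NAME, GENOME_DESC and LINKED_PROJECT from one registry entry.
  local entry="$1"
  IFS='|' read -r GENOME_NAME GENOME_DESC LINKED_PROJECT <<< "$entry"
  # An empty or missing third field means a knowledge-only genome.
  LINKED_PROJECT="${LINKED_PROJECT:-none}"
}

for entry in "${GENOMES[@]}"; do
  parse_genome_entry "$entry"
  printf '%s -> %s\n' "$GENOME_NAME" "$LINKED_PROJECT"
done
```

Both the trailing-pipe form and the older two-field form resolve to `none` here, so existing registries would keep working under this assumption.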
@ -315,8 +358,8 @@ export GITHUB_TOKEN="your_github_token"
```bash
# 1. Clone the setup framework
git clone <setup-repo-url> knowledge-genome-setup
cd knowledge-genome-setup
git clone <setup-repo-url> knowledge-genome-orchestrator
cd knowledge-genome-orchestrator
# 2. Configure your environment
cp globals.env.example globals.env # edit with your values
@ -358,16 +401,19 @@ After setup completes:
## Makefile Reference
| Target | Description |
| --------------------------------- | ------------------------------------------------------------------------------ |
| `make setup` | Full system initialisation — master repo + all genomes in `registry.sh` |
| `make add-genome NAME=x DESC="y"` | Scaffold and register a single new genome |
| `make lint` | Run quality checks across all genomes (schema, privacy, decay, page size) |
| `make status` | Show submodule status and first 10 git-crypt encryption states |
| `make lock` | Lock all encrypted repos (master + all genome submodules) |
| `make doctor` | Verify required tools: git, git-crypt, curl, jq; warn if bw missing |
| `make sync` | `git submodule update --init --recursive` + report unpushed commits per genome |
| `make help` | Print all available targets |
| Target | Description |
| ----------------------------------------------------- | ------------------------------------------------------------------------------------- |
| `make setup` | Full system initialisation — master repo + all genomes in `registry.sh` |
| `make add-genome NAME=x DESC="y" [LINKED=owner/repo]` | Scaffold and register a single new genome (optional linked project) |
| `make lint` | Run quality checks across all genomes (schema, privacy, decay, page size) |
| `make verify-structure` | Report directory drift of each genome vs the canonical layout (`lib/structure.sh`) |
| `make sync-structure` | Create any missing canonical directories across all genomes (safe, idempotent) |
| `make test` | Run the bats test suite (deterministic; no LLM/GPU/network) — see [Testing](#testing) |
| `make status` | Show submodule status and per-genome git-crypt encryption state |
| `make lock` | Lock all encrypted repos (master + all genome submodules) |
| `make doctor` | Verify required tools: git, git-crypt, curl, jq; warn if bw missing |
| `make sync` | `git submodule update --init --recursive` + report unpushed commits per genome |
| `make help` | Print all available targets |
### Examples
@ -378,6 +424,12 @@ make doctor
# Add a new genome after initial setup
make add-genome NAME=genome-research DESC="Academic papers and deep research"
# Add a genome linked to a project repository
make add-genome NAME=genome-dev DESC="Web development" LINKED=myorg/my-app
# Check every genome against the canonical directory layout
make verify-structure
# Run full lint pass (bash deterministic checks)
make lint
@ -390,6 +442,38 @@ make lock
---
## Testing
The mechanical layer (slug, index, log, lint, structure, the ingest orchestrator) is
covered by a [bats](https://github.com/bats-core/bats-core) suite. The tests are
**deterministic and have zero dependency on the LLM, the GPU, or the network**: they
simulate the agent's output with fixtures and exercise the scripts directly, so they run
anywhere git and bash are available (laptop, CI, a git hook). They are **not** meant to
run on the AI node or via n8n.
```bash
sudo apt install bats # once
make test # or: bats tests/
```
| File | Covers |
| ----------------- | ------------------------------------------------------------------------------ |
| `scripts.bats` | `slug.sh`, `log-append.sh`, `index-append.py` (insert, sort, bump, idempotent) |
| `lint.bats` | `lib/lint.sh` validators + `scoped-lint.sh` |
| `structure.bats` | `lib/structure.sh` report / sync |
| `run-ingest.bats` | `run-ingest.sh` end-to-end (DRY_RUN, local bare remote) — needs `jq` |
Each test builds its own throwaway genome with a local bare remote, configured to ignore
the operator's global git settings (signing, global hooks) so the suite is hermetic. The
`run-ingest` tests auto-`skip` if `jq` is absent. If you change the canonical layout in
`lib/structure.sh`, update `FIXTURE_DIRS` in `tests/helpers.bash` to match.
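
A hermetic fixture of this kind can be sketched as follows. This is an illustration under stated assumptions, not the contents of `tests/helpers.bash`: `make_fixture_genome` is a hypothetical name, and the directory layout is reduced to two folders.

```bash
#!/usr/bin/env bash
# Sketch of a hermetic throwaway genome with a local bare remote.
# make_fixture_genome is a hypothetical name; tests/helpers.bash may differ.
set -euo pipefail

make_fixture_genome() {
  local root
  root="$(mktemp -d)"
  # Ignore the operator's global and system git config (signing, global hooks).
  # GIT_CONFIG_GLOBAL needs git 2.32+; older versions silently ignore it.
  export GIT_CONFIG_GLOBAL=/dev/null GIT_CONFIG_NOSYSTEM=1
  export GIT_AUTHOR_NAME=test GIT_AUTHOR_EMAIL=test@example.invalid
  export GIT_COMMITTER_NAME=test GIT_COMMITTER_EMAIL=test@example.invalid

  git init -q --bare "$root/remote.git"
  git init -q "$root/genome"
  # Pin the branch name so the fixture does not depend on init.defaultBranch.
  git -C "$root/genome" symbolic-ref HEAD refs/heads/main
  mkdir -p "$root/genome/wiki" "$root/genome/raw"
  : > "$root/genome/wiki/index.md"
  git -C "$root/genome" add -A
  git -C "$root/genome" commit -q -m "fixture: initial layout"
  git -C "$root/genome" remote add origin "$root/remote.git"
  git -C "$root/genome" push -q origin main
  printf '%s\n' "$root"
}

FIXTURE_ROOT="$(make_fixture_genome)"
echo "fixture created under ${FIXTURE_ROOT}"
```

A teardown step would typically remove `$FIXTURE_ROOT` again with `rm -rf` once the test has finished.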
> Why this matters: the only non-deterministic part of the system is the model. Pinning
> the mechanical layer with tests means that when an ingest misbehaves, you know it's the
> model or the prompt — not the plumbing.
---
## Genome Lifecycle
### Initial setup
@ -431,6 +515,7 @@ template files:
| `{{GENOME_NAME}}` | registry.sh | `genome-dev` |
| `{{GENOME_NAME_UPPER}}` | derived | `GENOME-DEV` |
| `{{GENOME_DESC}}` | registry.sh | `Web development...` |
| `{{LINKED_PROJECT}}` | registry.sh | `myorg/my-app` (or `none`) |
| `{{FORGEJO_URL}}` | globals.env | `https://git.yourserver.com` |
| `{{FORGEJO_USER}}` | globals.env | `yourusername` |
| `{{VAULTWARDEN_URL}}` | globals.env | `https://vault.yourserver.com` |
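
The substitution itself can be sketched in plain bash. The example below is an assumption about how such a renderer could look, not the code in `lib/scaffold.sh`; `render_template` is a hypothetical name, and only four of the placeholders are handled.

```bash
#!/usr/bin/env bash
# Sketch of {{PLACEHOLDER}} substitution; render_template is a hypothetical
# name, not the actual function in lib/scaffold.sh.

# bash 5.2+ treats "&" in a replacement as "the matched text"; switch that off
# where the option exists so values are always inserted literally.
shopt -u patsub_replacement 2>/dev/null || true

render_template() {
  # $1 = template text, $2 = genome name, $3 = description, $4 = linked repo
  local text="$1" name="$2" desc="$3" linked="${4:-none}"
  local upper
  # Derive the upper-case name with tr, since ${var^^} needs bash 4+.
  upper="$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]')"
  text="${text//\{\{GENOME_NAME\}\}/$name}"
  text="${text//\{\{GENOME_NAME_UPPER\}\}/$upper}"
  text="${text//\{\{GENOME_DESC\}\}/$desc}"
  text="${text//\{\{LINKED_PROJECT\}\}/$linked}"
  printf '%s\n' "$text"
}

render_template '# {{GENOME_NAME_UPPER}}: {{GENOME_DESC}} (linked: {{LINKED_PROJECT}})' \
  'genome-dev' 'Web development' 'myorg/my-app'
```

Parameter expansion is used instead of `sed` so that values containing `/`, such as `myorg/my-app` or a URL, need no escaping.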
@ -593,9 +678,9 @@ git clone https://git.yourserver.com/yourusername/genome-dev.git
If a key is lost or compromised:
```bash
# From the knowledge-genome-setup/ directory
# From the knowledge-genome-orchestrator/ directory
source lib/git-crypt.sh
cd ~/knowledge-genome-setup/genome-dev
cd ~/knowledge-genome-orchestrator/genome-dev
gcrypt_rotate_key "genome-dev"
```
@ -643,7 +728,8 @@ The agent executes in this order at the start of every session:
1. Read `wiki/index.md` — primary catalog of all pages and maturity
2. Read last 20 log entries (injected by orchestrator — does NOT open `wiki/log.md` directly)
3. For tasks involving related pages: `qmd search "<query>"` before opening any files
3. For tasks involving related pages: if the optional `qmd` extension is installed,
`qmd search "<query>"` before opening files; otherwise navigate from `wiki/index.md`
4. Operate on individual files — never scan entire directories
### One source per session
@ -668,7 +754,7 @@ For Forgejo webhook → automated ingest:
2. n8n receives webhook, identifies new files
3. n8n starts one agent session per new file (sequential, not parallel)
4. Each session: inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
5. Agent ingest workflow runs, opens PR
5. Phase 1 agent (`/skill:ingest`) writes the manifest; Phase 2 `run-ingest.sh` opens the PR
6. Human reviews and merges PR
---
@ -677,17 +763,39 @@ For Forgejo webhook → automated ingest:
### Ingest
Triggered by a new file in `raw/` (manual or via webhook).
Triggered by a new file in `raw/` (manual or via webhook). Ingest is split into two
phases so that the small local model spends its limited context only on judgement, and
all the deterministic bookkeeping happens outside the model's loop.
1. Read source once
2. Create `wiki/sources/<slug>.md` — summary and key points
3. Per entity (person, tool, organisation): create or update `wiki/entities/<name>.md`
4. Per concept (pattern, theory, decision): create or update `wiki/concepts/<name>.md`
5. Check each touched page for contradictions → apply Conflict Resolution if found
6. Append entry to `wiki/index.md` (bottom of relevant section — do not reorder)
7. Append log entry: `INGEST | <slug>`
8. Run scoped lint on pages created or modified in this session; report in PR
9. Commit on `feat/ai-ingest-<slug>`; open PR using `templates/pr-description.md`
**Phase 1 — agent (semantic only).** The `ingest` skill gives the agent read/edit tools
only (no shell). It:
1. Reads the source once
2. Creates `wiki/sources/<slug>.md` — summary and key points
3. Per entity (person, tool, organisation): creates or updates `wiki/entities/<name>.md`
4. Per concept (pattern, theory, decision): creates or updates `wiki/concepts/<name>.md`
5. Checks each touched page for contradictions → applies Conflict Resolution if found
6. Writes `.ingest-manifest.json` (the list of pages it created/modified, the model name,
a one-line reasoning, the PR summary, and any contradictions) — then **stops**
**Phase 2 — `run-ingest.sh` (deterministic, outside the agent).** The post-processor
consumes the manifest and does the mechanical work the model must not waste context on:
7. Inserts each page into the correct `wiki/index.md` section **in alphabetical order**
(`index-append.py`) and bumps the index `last_updated`
8. Appends the `INGEST | <slug>` entry to `wiki/log.md`
9. Runs scoped lint on exactly the pages touched this run (`scoped-lint.sh`, reusing
`lib/lint.sh`)
10. Commits on `feat/ai-ingest-<slug>` and opens the PR using `templates/pr-description.md`
11. Emits a single compact JSON line (status, slug, PR url, lint_clean, conflict) for n8n
The agent never runs git, never edits the index/log mechanically, and never lints; those
steps are deterministic and tested (see [Testing](#testing)). Invocation on the AI node:
```bash
pi --mode json -p "/skill:ingest raw/articles/<file>.md" # phase 1 → writes manifest
run-ingest.sh <genome> # phase 2 → index/log/lint/PR
```
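
The sorted, idempotent insert from step 7 can be illustrated with a small shell sketch. It is a simplification under stated assumptions: the repository performs this step in `index-append.py`, whose real logic may differ, `insert_sorted` is a hypothetical helper, and the `- [[page]]` entry format and the single flat section are assumed for illustration only.

```bash
#!/usr/bin/env bash
# Sketch of a sorted, idempotent insert into one index section.
# insert_sorted is a hypothetical helper; the repository does this step in
# index-append.py, and the entry format below is an assumption.
insert_sorted() {
  # $1 = file holding one section's entries (one per line), $2 = new entry
  local file="$1" entry="$2" tmp
  tmp="$(mktemp)"
  # sort -u both orders the entries and drops an entry that is already present.
  { cat "$file"; printf '%s\n' "$entry"; } | LC_ALL=C sort -u > "$tmp"
  mv "$tmp" "$file"
}

section="$(mktemp)"
printf '%s\n' '- [[angular-signals]]' '- [[tui-layout]]' > "$section"
insert_sorted "$section" '- [[git-flow]]'
insert_sorted "$section" '- [[git-flow]]'   # second call changes nothing
cat "$section"
```

Because the second call leaves the file unchanged, re-running phase 2 after a partial failure would not duplicate index entries under this scheme.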
For private sources (`PRIVATE_CONTEXT: enabled` required):
@ -698,7 +806,8 @@ For private sources (`PRIVATE_CONTEXT: enabled` required):
Triggered by an operator question.
1. `qmd search "<query>"` → identify candidate pages
1. `qmd search "<query>"` (if the optional qmd extension is installed) → identify
candidate pages; otherwise start from `wiki/index.md`
2. Read candidate pages directly (when qmd is used, it already returns file paths, so no intermediate index lookup is needed)
3. Synthesise answer with `[[wikilink]]` citations
4. If answer is non-trivial: save as `wiki/queries/<slug>.md` and append to index
@ -974,7 +1083,8 @@ n8n (running on the storage node) can automate the ingest pipeline:
2. n8n flow identifies new files
3. For each new file: starts one agent session (sequential — never parallel)
4. Each session receives: `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
5. Agent runs ingest workflow and opens PR
5. Phase 1 — agent runs `/skill:ingest` (semantic → writes manifest); Phase 2 —
`run-ingest.sh` does index/log/lint and opens the PR, returning one JSON line to n8n
6. Human reviews the PR
Key constraint: one source per session, sessions sequential.
@ -984,11 +1094,13 @@ Never batch multiple sources into one agent session.
If the AI compute node has an Intel NPU (e.g. Core Ultra series):
- Background tasks (embedding updates, index refresh) → Intel NPU via OpenVINO
- Background/auxiliary tasks (OCR of `raw/assets/`, async summarisation, or qmd
re-indexing **if** the optional qmd extension is in use) → Intel NPU via OpenVINO
- Active reasoning sessions (ingest, query, synthesis) → GPU
This keeps the GPU's KV cache free for interactive work and reduces power consumption
for background operations.
Note: the core system has no embedding pipeline (see [Core Philosophy](#core-philosophy)),
so there is nothing to embed here — the NPU is only for auxiliary work. This keeps the
GPU's KV cache free for interactive sessions and lowers power draw for background jobs.
---