diff --git a/README.md b/README.md index 5c75647..96fe9c2 100644 --- a/README.md +++ b/README.md @@ -19,16 +19,17 @@ and a human-in-the-loop Git Flow for quality control. 5. [Configuration](#configuration) 6. [Quick Start](#quick-start) 7. [Makefile Reference](#makefile-reference) -8. [Genome Lifecycle](#genome-lifecycle) -9. [Security Model](#security-model) -10. [Key Management](#key-management) -11. [Agent Sessions](#agent-sessions) -12. [Workflows](#workflows) -13. [Knowledge Quality](#knowledge-quality) -14. [Knowledge Schema](#knowledge-schema) -15. [Collaboration Model](#collaboration-model) -16. [Optional Extensions](#optional-extensions) -17. [Troubleshooting](#troubleshooting) +8. [Testing](#testing) +9. [Genome Lifecycle](#genome-lifecycle) +10. [Security Model](#security-model) +11. [Key Management](#key-management) +12. [Agent Sessions](#agent-sessions) +13. [Workflows](#workflows) +14. [Knowledge Quality](#knowledge-quality) +15. [Knowledge Schema](#knowledge-schema) +16. [Collaboration Model](#collaboration-model) +17. [Optional Extensions](#optional-extensions) +18. [Troubleshooting](#troubleshooting) --- @@ -110,10 +111,18 @@ genome-{name}/ | Wiki | `wiki/` | LLM | Agent creates, updates, cross-links, maintains. | | Schema | `AGENTS.md` | Human + LLM | Co-evolved contract defining structure and workflows. | +### Linked projects (optional) + +A genome can optionally declare a **linked project repository** — a separate repo where +the knowledge in that genome is meant to be applied (e.g. `genome-dev` linked to an app +repo). The link is recorded as a third field in the registry and rendered into the +genome's `AGENTS.md` (`## Linked Project`). A genome with no link is _knowledge-only_ and +behaves exactly as before. See [Configuration](#configuration). 
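A minimal sketch of how the optional third registry field could be read, assuming entries follow the `name|description|linked_repo` format described in Configuration. The helper name `parse_genome_entry` is hypothetical and is not taken from the repository; the `none` fallback mirrors the value documented for `{{LINKED_PROJECT}}`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): split one registry entry
# of the form "name|description|linked_repo" into its three fields.
# An empty or missing third field is reported as "none" (knowledge-only genome).
parse_genome_entry() {
  local entry="$1" name desc linked
  # Here-strings and `read -r` work on bash 3.2, so this stays macOS-compatible.
  IFS='|' read -r name desc linked <<< "$entry"
  printf '%s\n%s\n%s\n' "$name" "$desc" "${linked:-none}"
}

# Linked genome: third line printed is the owner/repo value.
parse_genome_entry "genome-dev|Web development|myorg/my-app"
# Knowledge-only genome: third line printed is "none".
parse_genome_entry "genome-finance|Personal finance|"
```

Because the fallback is applied at read time, an older two-field entry (`name|description`) would be treated the same way as one with an empty third field.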
+ ### Framework structure ```text -knowledge-genome-setup/ ← This repository (setup tooling) +knowledge-genome-orchestrator/ ← This repository (setup tooling) ├── globals.env ← Static KEY=VALUE config (Make-includable) ├── registry.sh ← Bash-only: GENOMES array + dynamic paths ├── Makefile ← Entry point for all operations @@ -121,6 +130,7 @@ knowledge-genome-setup/ ← This repository (setup tooling) │ ├── output.sh ← Terminal helpers (colors, log levels) │ ├── deps.sh ← Dependency validation │ ├── scaffold.sh ← Template rendering engine +│ ├── structure.sh ← Canonical genome layout (single source of truth) │ ├── lint.sh ← Per-file validation functions │ └── git-crypt.sh ← git-crypt lifecycle (init, export, verify, rotate) ├── providers/ @@ -131,18 +141,41 @@ knowledge-genome-setup/ ← This repository (setup tooling) │ ├── setup-master.sh ← Master repo initialisation │ ├── setup-genomes.sh ← Genome provisioning loop │ ├── add-genome.sh ← Add a single new genome -│ └── lint-genomes.sh ← Quality control across all genomes -└── templates/ - ├── agents-genome.md ← Per-genome agent contract template - ├── agents-master.md ← Master coordination schema template - ├── wiki-index.md ← Index template (rendered per genome) - ├── wiki-log.md ← Log template (rendered per genome) - ├── pr-description.md ← PR review checklist template - ├── pre-commit.sh ← Security hook template - ├── gitattributes ← Git encryption rules template - └── gitignore ← Git ignore template +│ ├── lint-genomes.sh ← Quality control across all genomes +│ └── verify-genomes.sh ← Structure verify / --sync across all genomes +├── templates/ +│ ├── agents-genome.md ← Per-genome agent contract template +│ ├── agents-master.md ← Master coordination schema template +│ ├── readme-master.md ← Master repo README template +│ ├── wiki-index.md ← Index template (rendered per genome) +│ ├── wiki-log.md ← Log template (rendered per genome) +│ ├── pr-description.md ← PR review checklist template +│ ├── pre-commit.sh ← 
Security hook template +│ ├── gitattributes ← Git encryption rules template +│ └── gitignore ← Git ignore template +├── skills/ +│ └── ingest/ ← pi skill: deployed to the AI node (vm101) +│ ├── SKILL.md ← Semantic-only contract (read/edit, emits manifest) +│ ├── references/ ← On-demand reference docs for the agent +│ └── scripts/ ← Deterministic post-processor (runs outside the agent) +│ ├── run-ingest.sh ← Orchestrator: consumes the manifest, emits one JSON line +│ ├── slug.sh ← Slug normalisation +│ ├── index-append.py ← Sorted insert into wiki/index.md + last_updated bump +│ ├── log-append.sh ← Append a wiki/log.md entry +│ ├── scoped-lint.sh ← Lint only the pages touched this run (reuses lib/lint.sh) +│ └── open-pr.sh ← Branch / commit / push / open PR (DRY_RUN seam for tests) +└── tests/ ← bats suite — deterministic, no LLM/GPU (see Testing) + ├── helpers.bash + ├── scripts.bats + ├── lint.bats + ├── structure.bats + └── run-ingest.bats ``` +> The `skills/ingest/` directory is version-controlled here but **deployed** to the AI +> node (vm101) under `~/.pi/agent/skills/ingest`. The agent (`pi`) does only semantic work +> and writes a manifest; `run-ingest.sh` does the mechanical steps. See [Workflows → Ingest](#ingest). + --- ## System Requirements @@ -156,7 +189,9 @@ All tools (git-crypt, bw, qmd) have native Linux binaries. All scripts are compatible with macOS. Requirements: -- bash 3.2+ (macOS default) — fully supported. All `bash 4+` constructs removed. +- bash 3.2+ (macOS default) — supported for the **setup scripts** (`make` targets, scaffolding). + The `ingest` skill uses bash 4+ constructs (`mapfile`), but it is deployed and run on the + Linux AI node, not on the macOS setup machine — so this is not a constraint in practice. - GNU coreutils not required — BSD variants of `date`, `grep`, `sed` all handled. 
- `git-crypt`: install via Homebrew — `brew install git-crypt` - `jq`, `curl`: pre-installed or via Homebrew @@ -195,6 +230,11 @@ The system is designed for a homelab architecture: > the index, and the log tail is a cost. This is why all agent files are token-optimised > and sessions are kept to one source at a time. +> **Reference deployment:** the table above is a target profile, not a hard requirement. +> The current setup runs a single 16GB GPU (RTX 5060 Ti) with a ~9B model for interactive +> ingest, and offloads heavy/async synthesis to a cloud model. Smaller models work — they +> just make the "one source per session" discipline and the token budget matter more. + --- ## Prerequisites @@ -285,14 +325,17 @@ resolution. Never included by Make. ```bash # Dynamic paths (resolved at source time) -WORK_DIR="${HOME}/knowledge-genome-setup" +WORK_DIR="${HOME}/knowledge-genome-orchestrator" KEYS_DIR="${WORK_DIR}/keys" -# Genome registry — format: "name|description" +# Genome registry — format: "name|description|linked_repo" +# The third field is OPTIONAL: +# - leave it empty → knowledge-only genome (no linked project) +# - owner/repo → genome is linked to that project repository (rendered into AGENTS.md) GENOMES=( - "genome-dev|Web development, TUI, Angular, software architecture" - "genome-finance|Personal finance, investments, market analysis" - "genome-homelab|Infrastructure, network configs, architecture logs" + "genome-dev|Web development, TUI, Angular, software architecture|myorg/my-app" + "genome-finance|Personal finance, investments, market analysis|" + "genome-homelab|Infrastructure, network configs, architecture logs|" ) ``` @@ -315,8 +358,8 @@ export GITHUB_TOKEN="your_github_token" ```bash # 1. Clone the setup framework -git clone knowledge-genome-setup -cd knowledge-genome-setup +git clone knowledge-genome-orchestrator +cd knowledge-genome-orchestrator # 2. 
Configure your environment cp globals.env.example globals.env # edit with your values @@ -358,16 +401,19 @@ After setup completes: ## Makefile Reference -| Target | Description | -| --------------------------------- | ------------------------------------------------------------------------------ | -| `make setup` | Full system initialisation — master repo + all genomes in `registry.sh` | -| `make add-genome NAME=x DESC="y"` | Scaffold and register a single new genome | -| `make lint` | Run quality checks across all genomes (schema, privacy, decay, page size) | -| `make status` | Show submodule status and first 10 git-crypt encryption states | -| `make lock` | Lock all encrypted repos (master + all genome submodules) | -| `make doctor` | Verify required tools: git, git-crypt, curl, jq; warn if bw missing | -| `make sync` | `git submodule update --init --recursive` + report unpushed commits per genome | -| `make help` | Print all available targets | +| Target | Description | +| ----------------------------------------------------- | ------------------------------------------------------------------------------------- | +| `make setup` | Full system initialisation — master repo + all genomes in `registry.sh` | +| `make add-genome NAME=x DESC="y" [LINKED=owner/repo]` | Scaffold and register a single new genome (optional linked project) | +| `make lint` | Run quality checks across all genomes (schema, privacy, decay, page size) | +| `make verify-structure` | Report directory drift of each genome vs the canonical layout (`lib/structure.sh`) | +| `make sync-structure` | Create any missing canonical directories across all genomes (safe, idempotent) | +| `make test` | Run the bats test suite (deterministic; no LLM/GPU/network) — see [Testing](#testing) | +| `make status` | Show submodule status and per-genome git-crypt encryption state | +| `make lock` | Lock all encrypted repos (master + all genome submodules) | +| `make doctor` | Verify required tools: git, git-crypt, 
curl, jq; warn if bw missing | +| `make sync` | `git submodule update --init --recursive` + report unpushed commits per genome | +| `make help` | Print all available targets | ### Examples @@ -378,6 +424,12 @@ make doctor # Add a new genome after initial setup make add-genome NAME=genome-research DESC="Academic papers and deep research" +# Add a genome linked to a project repository +make add-genome NAME=genome-dev DESC="Web development" LINKED=myorg/my-app + +# Check every genome against the canonical directory layout +make verify-structure + # Run full lint pass (bash deterministic checks) make lint @@ -390,6 +442,38 @@ make lock --- +## Testing + +The mechanical layer (slug, index, log, lint, structure, the ingest orchestrator) is +covered by a [bats](https://github.com/bats-core/bats-core) suite. The tests are +**deterministic and have zero dependency on the LLM, the GPU, or the network** — they +simulate the agent's output with fixtures and exercise the scripts directly, so they run +anywhere git + bash live (laptop, CI, a git hook). They are **not** meant to run on the AI +node or via n8n. + +```bash +sudo apt install bats # once +make test # or: bats tests/ +``` + +| File | Covers | +| ----------------- | ------------------------------------------------------------------------------ | +| `scripts.bats` | `slug.sh`, `log-append.sh`, `index-append.py` (insert, sort, bump, idempotent) | +| `lint.bats` | `lib/lint.sh` validators + `scoped-lint.sh` | +| `structure.bats` | `lib/structure.sh` report / sync | +| `run-ingest.bats` | `run-ingest.sh` end-to-end (DRY_RUN, local bare remote) — needs `jq` | + +Each test builds its own throwaway genome with a local bare remote, configured to ignore +the operator's global git settings (signing, global hooks) so the suite is hermetic. The +`run-ingest` tests auto-`skip` if `jq` is absent. If you change the canonical layout in +`lib/structure.sh`, update `FIXTURE_DIRS` in `tests/helpers.bash` to match. 
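The hermetic fixture described above could be built roughly as follows. This is a sketch under stated assumptions, not the contents of `tests/helpers.bash`: the function name `make_fixture_genome` and the directory names are hypothetical, and `GIT_CONFIG_GLOBAL` is only honoured by git 2.32 or newer (older versions ignore it).

```bash
#!/usr/bin/env bash
# Hypothetical fixture builder (names are illustrative, not from the repo).
# Creates a throwaway genome working tree plus a local bare remote under $1,
# and shields both from the operator's global git configuration.
make_fixture_genome() {
  local root="$1"

  # Ignore ~/.gitconfig and /etc/gitconfig for every git call that follows.
  # Exported on purpose so later commands in the same test inherit it.
  export GIT_CONFIG_GLOBAL=/dev/null
  export GIT_CONFIG_NOSYSTEM=1

  git init --quiet --bare "$root/remote.git"
  git init --quiet "$root/genome"

  # Repo-local identity so commits work without any global settings.
  git -C "$root/genome" config user.name "fixture"
  git -C "$root/genome" config user.email "fixture@example.invalid"
  # Belt and braces: no commit signing, no hooks, even if the env vars are ignored.
  git -C "$root/genome" config commit.gpgsign false
  git -C "$root/genome" config core.hooksPath /dev/null

  git -C "$root/genome" remote add origin "$root/remote.git"
}

# Usage (for example from a bats setup function):
#   tmp="$(mktemp -d)"; make_fixture_genome "$tmp"
```

Setting the options both through the environment and in the repo-local config is deliberate redundancy: the local config still applies on git versions that do not recognise `GIT_CONFIG_GLOBAL`.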
+ +> Why this matters: the only non-deterministic part of the system is the model. Pinning +> the mechanical layer with tests means that when an ingest misbehaves, you know it's the +> model or the prompt — not the plumbing. + +--- + ## Genome Lifecycle ### Initial setup @@ -431,6 +515,7 @@ template files: | `{{GENOME_NAME}}` | registry.sh | `genome-dev` | | `{{GENOME_NAME_UPPER}}` | derived | `GENOME-DEV` | | `{{GENOME_DESC}}` | registry.sh | `Web development...` | +| `{{LINKED_PROJECT}}` | registry.sh | `myorg/my-app` (or `none`) | | `{{FORGEJO_URL}}` | globals.env | `https://git.yourserver.com` | | `{{FORGEJO_USER}}` | globals.env | `yourusername` | | `{{VAULTWARDEN_URL}}` | globals.env | `https://vault.yourserver.com` | @@ -593,9 +678,9 @@ git clone https://git.yourserver.com/yourusername/genome-dev.git If a key is lost or compromised: ```bash -# From the knowledge-genome-setup/ directory +# From the knowledge-genome-orchestrator/ directory source lib/git-crypt.sh -cd ~/knowledge-genome-setup/genome-dev +cd ~/knowledge-genome-orchestrator/genome-dev gcrypt_rotate_key "genome-dev" ``` @@ -643,7 +728,8 @@ The agent executes in this order at the start of every session: 1. Read `wiki/index.md` — primary catalog of all pages and maturity 2. Read last 20 log entries (injected by orchestrator — does NOT open `wiki/log.md` directly) -3. For tasks involving related pages: `qmd search ""` before opening any files +3. For tasks involving related pages: if the optional `qmd` extension is installed, + `qmd search ""` before opening files; otherwise navigate from `wiki/index.md` 4. Operate on individual files — never scan entire directories ### One source per session @@ -668,7 +754,7 @@ For Forgejo webhook → automated ingest: 2. n8n receives webhook, identifies new files 3. n8n starts one agent session per new file (sequential, not parallel) 4. Each session: inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path -5. Agent ingest workflow runs, opens PR +5. 
Phase 1 agent (`/skill:ingest`) writes the manifest; Phase 2 `run-ingest.sh` opens the PR 6. Human reviews and merges PR --- @@ -677,17 +763,39 @@ For Forgejo webhook → automated ingest: ### Ingest -Triggered by a new file in `raw/` (manual or via webhook). +Triggered by a new file in `raw/` (manual or via webhook). Ingest is split into two +phases so that the small local model spends its limited context only on judgement, and +all the deterministic bookkeeping happens outside the model's loop. -1. Read source once -2. Create `wiki/sources/.md` — summary and key points -3. Per entity (person, tool, organisation): create or update `wiki/entities/.md` -4. Per concept (pattern, theory, decision): create or update `wiki/concepts/.md` -5. Check each touched page for contradictions → apply Conflict Resolution if found -6. Append entry to `wiki/index.md` (bottom of relevant section — do not reorder) -7. Append log entry: `INGEST | ` -8. Run scoped lint on pages created or modified in this session; report in PR -9. Commit on `feat/ai-ingest-`; open PR using `templates/pr-description.md` +**Phase 1 — agent (semantic only).** The `ingest` skill gives the agent read/edit tools +only (no shell). It: + +1. Reads the source once +2. Creates `wiki/sources/.md` — summary and key points +3. Per entity (person, tool, organisation): creates or updates `wiki/entities/.md` +4. Per concept (pattern, theory, decision): creates or updates `wiki/concepts/.md` +5. Checks each touched page for contradictions → applies Conflict Resolution if found +6. Writes `.ingest-manifest.json` (the list of pages it created/modified, the model name, + a one-line reasoning, the PR summary, and any contradictions) — then **stops** + +**Phase 2 — `run-ingest.sh` (deterministic, outside the agent).** The post-processor +consumes the manifest and does the mechanical work the model must not waste context on: + +7. 
Inserts each page into the correct `wiki/index.md` section **in alphabetical order** + (`index-append.py`) and bumps the index `last_updated` +8. Appends the `INGEST | ` entry to `wiki/log.md` +9. Runs scoped lint on exactly the pages touched this run (`scoped-lint.sh`, reusing + `lib/lint.sh`) +10. Commits on `feat/ai-ingest-` and opens the PR using `templates/pr-description.md` +11. Emits a single compact JSON line (status, slug, PR url, lint_clean, conflict) for n8n + +The agent never runs git, never edits the index/log mechanically, and never lints — those +are deterministic and tested (see [Testing](#testing)). Invocation on the AI node: + +```bash +pi --mode json -p "/skill:ingest raw/articles/.md" # phase 1 → writes manifest +run-ingest.sh # phase 2 → index/log/lint/PR +``` For private sources (`PRIVATE_CONTEXT: enabled` required): @@ -698,7 +806,8 @@ For private sources (`PRIVATE_CONTEXT: enabled` required): Triggered by an operator question. -1. `qmd search ""` → identify candidate pages +1. `qmd search ""` (if the optional qmd extension is installed) → identify + candidate pages; otherwise start from `wiki/index.md` 2. Read candidate pages directly (qmd already returns file paths — no intermediate index lookup) 3. Synthesise answer with `[[wikilink]]` citations 4. If answer is non-trivial: save as `wiki/queries/.md` and append to index @@ -974,7 +1083,8 @@ n8n (running on the storage node) can automate the ingest pipeline: 2. n8n flow identifies new files 3. For each new file: starts one agent session (sequential — never parallel) 4. Each session receives: `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path -5. Agent runs ingest workflow and opens PR +5. Phase 1 — agent runs `/skill:ingest` (semantic → writes manifest); Phase 2 — + `run-ingest.sh` does index/log/lint and opens the PR, returning one JSON line to n8n 6. Human reviews the PR Key constraint: one source per session, sessions sequential. 
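The sequential constraint can be sketched as a plain loop that finishes both phases for one source before starting the next. This is an illustration only, assuming the two commands shown under Workflows → Ingest; `ingest_sequentially`, `AGENT_CMD`, and `POST_CMD` are hypothetical names introduced here as dry-run seams and do not exist in the repository.

```bash
#!/usr/bin/env bash
# Sketch: one agent session per source, strictly in order, never in parallel.
# AGENT_CMD and POST_CMD are hypothetical overrides so the loop can be
# exercised without the model; the defaults are the commands from the README.
ingest_sequentially() {
  local src
  for src in "$@"; do
    # Phase 1: the agent reads one source and writes the manifest.
    # The variable is intentionally unquoted so "pi --mode json -p" splits into words.
    ${AGENT_CMD:-pi --mode json -p} "/skill:ingest ${src}" || return 1
    # Phase 2: the deterministic post-processor (index, log, lint, PR).
    ${POST_CMD:-run-ingest.sh} || return 1
  done
}

# Dry run with stand-in commands:
#   AGENT_CMD="echo agent" POST_CMD="echo post" ingest_sequentially a.md b.md
```

Stopping at the first failure keeps a broken session from being followed by another one that would read a stale manifest or log tail.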
@@ -984,11 +1094,13 @@ Never batch multiple sources into one agent session. If the AI compute node has an Intel NPU (e.g. Core Ultra series): -- Background tasks (embedding updates, index refresh) → Intel NPU via OpenVINO +- Background/auxiliary tasks (OCR of `raw/assets/`, async summarisation, or qmd + re-indexing **if** the optional qmd extension is in use) → Intel NPU via OpenVINO - Active reasoning sessions (ingest, query, synthesis) → GPU -This keeps the GPU's KV cache free for interactive work and reduces power consumption -for background operations. +Note: the core system has no embedding pipeline (see [Core Philosophy](#core-philosophy)), +so there is nothing to embed here — the NPU is only for auxiliary work. This keeps the +GPU's KV cache free for interactive sessions and lowers power draw for background jobs. ---