From c66ff636ec9787c453894804ae6d87ddba41741e Mon Sep 17 00:00:00 2001 From: Matteo Cherubini Date: Mon, 11 May 2026 10:16:15 +0200 Subject: [PATCH 1/4] docs: Overhaul README and consolidate operational docs --- README.md | 1040 ++++++++++++++++++++++++++++++++---- templates/agents-master.md | 45 -- 2 files changed, 939 insertions(+), 146 deletions(-) diff --git a/README.md b/README.md index c304199..59e38d9 100644 --- a/README.md +++ b/README.md @@ -1,200 +1,1038 @@ # Knowledge Genome System -> A distributed, modular, and secure personal knowledge base — no vector database required. +> A distributed, encrypted, multi-domain personal knowledge base. +> No vector database. No embedding pipeline. No external retrieval server. -The **Knowledge Genome System** implements the [LLM Wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f) -by Andrej Karpathy, extended with a multi-domain submodule architecture, git-crypt -encryption for sensitive data, and a human-in-the-loop Git Flow for quality control. +Built on the [LLM Wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f) +by Andrej Karpathy — extended with a multi-domain submodule architecture, +AES-256-CTR encryption via git-crypt, Vaultwarden runtime key injection, +and a human-in-the-loop Git Flow for quality control. + +--- + +## Table of Contents + +1. [Core Philosophy](#core-philosophy) +2. [Architecture](#architecture) +3. [System Requirements](#system-requirements) +4. [Prerequisites](#prerequisites) +5. [Configuration](#configuration) +6. [Quick Start](#quick-start) +7. [Makefile Reference](#makefile-reference) +8. [Genome Lifecycle](#genome-lifecycle) +9. [Security Model](#security-model) +10. [Key Management](#key-management) +11. [Agent Sessions](#agent-sessions) +12. [Workflows](#workflows) +13. [Knowledge Quality](#knowledge-quality) +14. [Knowledge Schema](#knowledge-schema) +15. [Collaboration Model](#collaboration-model) +16. 
[Optional Extensions](#optional-extensions) +17. [Troubleshooting](#troubleshooting) --- ## Core Philosophy Most RAG systems make the LLM rediscover knowledge from scratch on every query. -This system is different: the LLM **incrementally builds and maintains a persistent wiki** -that sits between you and the raw sources. Knowledge is compiled once and kept current — -not re-derived on every question. +A document is indexed; at query time, relevant chunks are retrieved; an answer is generated. +Nothing accumulates. Ask a question requiring synthesis across five documents and the LLM +pieces it together from fragments every single time. + +This system is different. Instead of retrieval at query time, the LLM +**incrementally builds and maintains a persistent wiki** that sits between you and the raw +sources. When a new source arrives, the LLM reads it, extracts key information, updates +entity and concept pages, flags contradictions with existing claims, and strengthens the +evolving synthesis. Knowledge is compiled once and kept current. + +**The wiki is a compounding artifact.** Cross-references are already there. +Contradictions have been flagged. The synthesis already reflects everything ingested. + +This means: +- No vector database. +- No embedding pipeline. +- No external retrieval infrastructure. -**This means: no vector database, no embedding pipeline, no external retrieval server.** The `wiki/index.md` of each genome is the retrieval layer. At moderate scale -(~100 sources, hundreds of pages) this works better than RAG because cross-references, -contradictions, and syntheses are already resolved — the LLM doesn't have to piece -them together at query time. +(~100 sources, hundreds of pages) this performs better than RAG because cross-references, +contradictions, and syntheses are already resolved — not re-derived per query. 
-If the wiki grows beyond what the index can navigate efficiently, the only recommended -search extension is [`qmd`](https://github.com/tobi/qmd) — a local, on-device -BM25 + vector search engine for markdown files with an MCP server interface. -No external infrastructure required. +The human's job: curate sources, direct analysis, ask good questions, review PRs. +The LLM's job: everything else — summarising, cross-referencing, filing, maintaining consistency. --- ## Architecture +### Repository structure + ```text -master-knowledge-genome/ ← Root orchestrator -├── core-karpathy/ ← LLM Wiki reference pattern (read-only submodule) -├── genome-dev/ ← Submodule: web dev, Angular, TUI -├── genome-finance/ ← Submodule: personal finance -├── genome-homelab/ ← Submodule: Keru infrastructure -└── AGENTS.md ← Global coordination schema +master-knowledge-genome/ ← Root orchestrator (submodule registry) +├── core-karpathy/ ← LLM Wiki reference pattern (read-only submodule) +├── genome-dev/ ← Submodule: web development, Angular, TUI +├── genome-finance/ ← Submodule: personal finance, investments +├── genome-homelab/ ← Submodule: Keru infrastructure, network configs +└── AGENTS.md ← Global coordination schema (cross-genome rules) ``` -Each genome is an independent repository with this structure: +Each genome is an independent git repository: + ```text genome-{name}/ -├── raw/ -│ ├── articles/ transcripts/ code-packs/ assets/ ← Plaintext, open to collaborators -│ └── private/ ← AES-256-CTR encrypted (git-crypt) -├── wiki/ -│ ├── index.md log.md ← Navigation and audit trail -│ ├── sources/ entities/ concepts/ queries/ ← Agent-maintained knowledge -│ └── private/ ← AES-256-CTR encrypted (git-crypt) -└── AGENTS.md ← Per-genome agent contract +├── .gitattributes ← Encryption rules — **/private/** wildcard +├── .gitignore +├── .git/hooks/pre-commit ← Security hook (dynamic git check-attr) +├── AGENTS.md ← Per-genome agent contract and workflow rules +│ +├── raw/ ← Immutable sources — 
LLM reads, never writes +│ ├── articles/ ← Web clips, saved articles +│ ├── transcripts/ ← Audio/video transcripts +│ ├── code-packs/ ← Code snippets and repositories +│ ├── assets/ ← Images, PDFs, binary files +│ └── private/ ← AES-256-CTR encrypted — owner only +│ +└── wiki/ ← LLM-owned — agent creates and maintains + ├── index.md ← Primary catalog (read first every session) + ├── log.md ← Append-only operations ledger + ├── sources/ ← One page per processed raw source + ├── entities/ ← People, tools, organisations, projects + ├── concepts/ ← Patterns, theories, architectural decisions + ├── queries/ ← Preserved answers and conflict notes + └── private/ ← AES-256-CTR encrypted — owner only ``` +### Three layers + +| Layer | Path | Owner | Rule | +|-------|------|-------|------| +| Raw sources | `raw/` | Human | Immutable. LLM reads only. Never modified. | +| Wiki | `wiki/` | LLM | Agent creates, updates, cross-links, maintains. | +| Schema | `AGENTS.md` | Human + LLM | Co-evolved contract defining structure and workflows. 
| + +### Framework structure + +```text +knowledge-genome-setup/ ← This repository (setup tooling) +├── globals.env ← Static KEY=VALUE config (Make-includable) +├── registry.sh ← Bash-only: GENOMES array + dynamic paths +├── Makefile ← Entry point for all operations +├── lib/ +│ ├── output.sh ← Terminal helpers (colors, log levels) +│ ├── deps.sh ← Dependency validation +│ ├── scaffold.sh ← Template rendering engine +│ ├── lint.sh ← Per-file validation functions +│ └── git-crypt.sh ← git-crypt lifecycle (init, export, verify, rotate) +├── providers/ +│ ├── forgejo.sh ← Forgejo REST API provider +│ └── github.sh ← GitHub REST API provider +├── scripts/ +│ ├── setup.sh ← Main entry point +│ ├── setup-master.sh ← Master repo initialisation +│ ├── setup-genomes.sh ← Genome provisioning loop +│ ├── add-genome.sh ← Add a single new genome +│ └── lint-genomes.sh ← Quality control across all genomes +└── templates/ + ├── agents-genome.md ← Per-genome agent contract template + ├── agents-master.md ← Master coordination schema template + ├── wiki-index.md ← Index template (rendered per genome) + ├── wiki-log.md ← Log template (rendered per genome) + ├── pr-description.md ← PR review checklist template + ├── pre-commit.sh ← Security hook template + ├── gitattributes ← Git encryption rules template + └── gitignore ← Git ignore template +``` + +--- + +## System Requirements + +### Linux — full support (primary target) + +All scripts are written for GNU/bash on Linux. Tested on Ubuntu 22.04+. +All tools (git-crypt, bw, qmd) have native Linux binaries. + +### macOS — full support + +All scripts are compatible with macOS. Requirements: +- bash 3.2+ (macOS default) — fully supported. All `bash 4+` constructs removed. +- GNU coreutils not required — BSD variants of `date`, `grep`, `sed` all handled. 
+- `git-crypt`: install via Homebrew — `brew install git-crypt` +- `jq`, `curl`: pre-installed or via Homebrew + +If you use Homebrew bash (`brew install bash`), the scripts work identically to Linux. + +### Windows — WSL2 only + +**Git Bash and native Windows are not supported.** + +Reasons: +- `git-crypt` has no native Windows binary. +- Process substitution `<(...)` used for runtime key injection is not available + in Git Bash or PowerShell. +- Several bash builtins used throughout (`compgen`, `BASH_SOURCE`, arrays) are not + available outside a POSIX-compliant shell. + +**WSL2 (Windows Subsystem for Linux)** with Ubuntu gives full compatibility. +All setup and runtime operations work identically to native Linux inside WSL2. + +### Hardware recommendations + +The system is designed for a homelab architecture: + +| Component | Recommended | Role | +|-----------|-------------|------| +| Storage node | Any Linux server with NFS | Hosts Forgejo, stores genome repos | +| AI compute node | GPU server (16GB+ VRAM) | Runs local LLM agent sessions | +| VRAM | 16GB minimum | 14B model at Q5_K_M ≈ 10GB weights; ~6GB for KV cache | +| Local LLM | 14B–32B quantised | Active wiki maintenance sessions | +| Large LLM | 70B (async) | Deep reflection, complex synthesis (scheduled, not interactive) | + +> **On VRAM constraints:** with a 16GB card and a 14B model, the KV cache budget +> is ~6GB — approximately 32k tokens of effective context. Every token in `AGENTS.md`, +> the index, and the log tail is a cost. This is why all agent files are token-optimised +> and sessions are kept to one source at a time. 
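As a rough check of the VRAM note above, the arithmetic can be sketched in bash. The architecture values used here (48 layers, 8 KV heads, head dimension 128, 2 bytes per element) are assumptions chosen for illustration and are not taken from the README. The helper names are hypothetical, and the 6GB budget is treated as 6 GiB.

```bash
#!/usr/bin/env bash
# Sketch only: estimate how many context tokens fit in a KV cache budget.
# All model parameters below are illustrative assumptions, not project values.

# Bytes of KV cache consumed per token: keys + values for every layer.
kv_bytes_per_token() {
  local layers=$1 kv_heads=$2 head_dim=$3 bytes_per_elem=$4
  echo $(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
}

# Whole tokens that fit in a budget given in GiB.
kv_token_budget() {
  local budget_gib=$1 per_token=$2
  echo $(( budget_gib * 1024 * 1024 * 1024 / per_token ))
}

per_token=$(kv_bytes_per_token 48 8 128 2)
echo "bytes per token: ${per_token}"
echo "tokens in 6 GiB: $(kv_token_budget 6 "${per_token}")"
```

Under these assumptions the budget comes out at 32768 tokens, which is consistent with the "approximately 32k" figure quoted above. A model with more layers or KV heads, or a wider cache dtype, would shrink it proportionally.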
+ --- ## Prerequisites -**Required:** -- `git` -- `git-crypt` -- `curl` -- `jq` +### Required -**Optional:** -- `bw` (Bitwarden CLI) — for runtime key injection from Vaultwarden without writing keys to disk +| Tool | Purpose | +|------|---------| +| `git` | Version control | +| `git-crypt` | Transparent file encryption | +| `curl` | REST API calls to Forgejo/GitHub | +| `jq` | JSON parsing | + +### Optional + +| Tool | Purpose | +|------|---------| +| `bw` | Bitwarden CLI — runtime key injection from Vaultwarden (no key on disk) | +| `qmd` | Local BM25 + vector search for Markdown files with MCP server interface | + +> **`bw` vs `bws`:** Use `bw` (standard Bitwarden CLI). `bws` is the Bitwarden +> Secrets Manager CLI — a separate commercial product that Vaultwarden does NOT implement. + +### Install on Ubuntu/Debian -Install on Ubuntu/Debian: ```bash sudo apt update && sudo apt install -y git git-crypt curl jq ``` +### Install on macOS + +```bash +brew install git git-crypt curl jq +``` + +### Install Bitwarden CLI + +```bash +# Linux +npm install -g @bitwarden/cli + +# macOS +brew install bitwarden-cli +``` + +### Verify all tools + +```bash +make doctor +``` + +--- + +## Configuration + +Configuration is split into two files with distinct purposes: + +### `globals.env` — static KEY=VALUE + +Safe for `make include`, `docker-compose`, shell `source`, and any standard env parser. +Contains only simple scalar values — no bash syntax, no arrays. 
+ +```bash +# Provider selection +PROVIDER=forgejo # forgejo | github + +# Forgejo (active when PROVIDER=forgejo) +FORGEJO_URL=https://git.yourserver.com +FORGEJO_USER=yourusername +FORGEJO_SSH_PORT=222 # Default for many homelab Forgejo setups; 22 for standard + +# GitHub (active when PROVIDER=github — uncomment to use) +# GITHUB_USER=your-username +# GITHUB_ORG=your-org # Optional: for org repos, overrides GITHUB_USER + +# Vaultwarden +VAULTWARDEN_URL=https://vault.yourserver.com + +# Master repository +MASTER_REPO=master-knowledge-genome +GIST_URL=https://gist.github.com/442a6bf555914893e9891c11519de94f.git +``` + +### `registry.sh` — bash runtime config + +Sourced by shell scripts only. Contains the genome registry array and dynamic path +resolution. Never included by Make. + +```bash +# Dynamic paths (resolved at source time) +WORK_DIR="${HOME}/knowledge-genome-setup" +KEYS_DIR="${WORK_DIR}/keys" + +# Genome registry — format: "name|description" +GENOMES=( + "genome-dev|Web development, TUI, Angular, software architecture" + "genome-finance|Personal finance, investments, market analysis" + "genome-homelab|Infrastructure, network configs, architecture logs" +) +``` + +To add a genome to the registry before running setup, append a line to `GENOMES`. +After initial setup, use `make add-genome` instead. + +### Tokens + +Tokens are never stored in config files. Export them in your shell before running setup: + +```bash +export FORGEJO_TOKEN="your_forgejo_token" +# or +export GITHUB_TOKEN="your_github_token" +``` + --- ## Quick Start ```bash -# 1. Clone this setup repository +# 1. Clone the setup framework git clone knowledge-genome-setup cd knowledge-genome-setup -# 2. Export your Forgejo token +# 2. Configure your environment +cp globals.env.example globals.env # edit with your values +# Edit registry.sh to define your genomes + +# 3. Export your provider token export FORGEJO_TOKEN="your_token_here" -# 3. Run full setup +# 4. 
Verify dependencies +make doctor + +# 5. Run full setup make setup ``` -`make setup` will: -- Check all dependencies -- Create the master and genome repositories on Forgejo -- Scaffold the local directory structure with git-crypt active on `private/` -- Install the pre-commit security hook in each genome -- Export the symmetric git-crypt keys to `keys/` +`make setup` executes in order: + +1. **Dependency check** — verifies all required tools are installed +2. **Git identity check** — warns if `user.name` / `user.email` are not configured +3. **Master repo** — creates `master-knowledge-genome` on Forgejo, scaffolds with + `AGENTS.md` and `README.md`, initialises git, adds `core-karpathy` as submodule, pushes +4. **Genome provisioning** — for each genome in `registry.sh`: + - Creates remote repository on Forgejo + - Adds it as a submodule in the master repo + - Initialises git-crypt (**before any files are created**) + - Scaffolds directory structure and renders all templates + - Installs pre-commit security hook + - Commits, pushes genome to remote + - Exports symmetric key to `keys/.key` + - Prints Vaultwarden upload instructions + - Commits submodule pointer in master repo + +After setup completes: +- Upload all files in `keys/` to Vaultwarden (see Key Management) +- Delete key files from disk: `rm keys/*.key` --- -## Management Commands +## Makefile Reference -| Command | Description | -|---------|-------------| -| `make setup` | Full system initialisation (master + all genomes defined in `config.env`) | -| `make add-genome NAME=x DESC="y"` | Scaffold and register a new genome | -| `make lint` | Validate schema, privacy flags, and metadata across all genomes | -| `make status` | Show git submodule status and first 10 git-crypt encryption states | -| `make help` | Show all available targets | +| Target | Description | +|--------|-------------| +| `make setup` | Full system initialisation — master repo + all genomes in `registry.sh` | +| `make add-genome NAME=x 
DESC="y"` | Scaffold and register a single new genome | +| `make lint` | Run quality checks across all genomes (schema, privacy, decay, page size) | +| `make status` | Show submodule status and first 10 git-crypt encryption states | +| `make lock` | Lock all encrypted repos (master + all genome submodules) | +| `make doctor` | Verify required tools: git, git-crypt, curl, jq; warn if bw missing | +| `make sync` | `git submodule update --init --recursive` + report unpushed commits per genome | +| `make help` | Print all available targets | + +### Examples -**Adding a new genome example:** ```bash -make add-genome NAME=genome-research DESC="Academic papers, deep-dives, open research" +# Check system health +make doctor + +# Add a new genome after initial setup +make add-genome NAME=genome-research DESC="Academic papers and deep research" + +# Run full lint pass (bash deterministic checks) +make lint + +# Sync all nodes after pulling on another machine +make sync + +# Emergency lock — secures all repos before leaving a session +make lock ``` --- +## Genome Lifecycle + +### Initial setup + +All genomes defined in `registry.sh` are provisioned by `make setup`. + +### Adding a genome after initial setup + +```bash +make add-genome NAME=genome-newname DESC="Domain description" +``` + +This: creates the remote repo, adds it as a submodule, initialises git-crypt, +scaffolds the directory structure, installs the pre-commit hook, commits and pushes, +exports the key, and commits the submodule pointer in master. + +After adding: upload the new key to Vaultwarden and delete the key file. 
+ +### Removing a genome + +Manual process: +```bash +# In master repo +git submodule deinit genome-name +git rm genome-name +git commit -m "chore: remove genome-name submodule" +git push +# Archive or delete the remote repository on Forgejo +``` + +### Template rendering + +When a genome is scaffolded, `render_template` replaces these placeholders in all +template files: + +| Placeholder | Source | Example | +|-------------|--------|---------| +| `{{GENOME_NAME}}` | registry.sh | `genome-dev` | +| `{{GENOME_NAME_UPPER}}` | derived | `GENOME-DEV` | +| `{{GENOME_DESC}}` | registry.sh | `Web development...` | +| `{{FORGEJO_URL}}` | globals.env | `https://git.yourserver.com` | +| `{{FORGEJO_USER}}` | globals.env | `yourusername` | +| `{{VAULTWARDEN_URL}}` | globals.env | `https://vault.yourserver.com` | +| `{{MASTER_REPO}}` | globals.env | `master-knowledge-genome` | +| `{{DATE}}` | runtime | `2026-05-11` | + +--- + ## Security Model -### Hybrid Privacy Architecture +### Encryption architecture -Each genome has two layers: +Each genome uses a unique symmetric AES-256-CTR key managed by git-crypt. +Two directories in every genome are always encrypted: -| Layer | Directories | Access | -|-------|-------------|--------| -| Public | `raw/articles/`, `raw/transcripts/`, `wiki/sources/`, `wiki/concepts/` | Plaintext — safe for collaborators | -| Private | `raw/private/`, `wiki/private/` | AES-256-CTR via git-crypt — owner only | +| Directory | Contents | On remote | +|-----------|----------|-----------| +| `raw/private/` | Sensitive source material | Opaque binary blob | +| `wiki/private/` | Private synthesis and notes | Opaque binary blob | -On the remote (Forgejo), private files are opaque binary blobs. -Collaborators without the key can contribute normally to public directories -— git handles the encrypted files transparently with no errors. +All other directories (`raw/articles/`, `wiki/sources/`, etc.) are plaintext. 
+Collaborators without the key can contribute to public directories normally — +git handles encrypted files transparently. -### Runtime Key Injection +### `.gitattributes` — dynamic encryption rules + +Encryption rules use a glob wildcard that catches any `private/` directory at +any depth in the repository — including directories created at runtime by the LLM: + +```gitattributes +# Text rules first +*.md text eol=lf +*.sh text eol=lf + +# Encryption rules LAST (later rules override per-attribute) +# **/private/** ensures -text overrides *.md text eol=lf, preventing EOL corruption +**/private/** filter=git-crypt diff=git-crypt -text +``` + +> Rule ordering matters: in `.gitattributes`, the last matching rule wins per attribute. +> Encryption rules must come after text rules so `-text` overrides `text eol=lf` +> for encrypted markdown files. + +### Pre-commit hook — dynamic validation + +The security hook installed at `.git/hooks/pre-commit` validates every staged file +dynamically — it reads encryption requirements from `.gitattributes` at runtime +rather than checking hardcoded paths: + +```bash +# For each staged file, check if git-crypt encryption is required +filter=$(git check-attr filter -- "$file" | sed 's/.*: //') +if [[ "$filter" == "git-crypt" ]]; then + # Verify the file is actually encrypted + if git-crypt status "$file" | grep -q "not encrypted"; then + exit 1 # block the commit + fi +fi +``` + +This means: any file matching `**/private/**` in `.gitattributes` is protected, +including future `private/` directories created anywhere in the repo. +The hook never needs updating when the encryption rules change. + +### PRIVATE_CONTEXT toggle + +The `PRIVATE_CONTEXT` toggle in `AGENTS.md` controls whether the LLM agent +accesses encrypted directories. It must be declared explicitly by the operator +at the start of every session: + +```text +PRIVATE_CONTEXT: disabled ← Default. private/ directories are treated as non-existent. 
+PRIVATE_CONTEXT: enabled ← Agent may read/write private/. Requires git-crypt unlock. +``` + +Rules: +- Never inferred. Never carried over from a previous session. +- `enabled` requires the operator to confirm that `git-crypt unlock` has run on the host. +- Per-genome, per-session: enabling for `genome-finance` does NOT enable for `genome-dev`. +- Cloud LLM models: `PRIVATE_CONTEXT` must always be `disabled`. Private data never leaves the local network. +- All outputs derived from private data are prefixed `[PRIVATE DATA INCLUDED]`. +- Private synthesis goes exclusively to `wiki/private/` — never to public wiki paths. + +### Runtime key injection — zero disk policy Encryption keys are never stored as persistent files on the AI server. They are injected at session start via the Bitwarden CLI (`bw`) against your self-hosted Vaultwarden instance, using process substitution: ```bash -# Key lives only in a kernel file descriptor — never touches disk +# Step 1: authenticate +bw config server https://vault.yourserver.com +export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw) + +# Step 2: unlock genome (key lives only in a kernel file descriptor — never touches disk) git-crypt unlock <( bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d ) ``` -**Use `bw` (standard Bitwarden CLI), not `bws`.** -`bws` is the Bitwarden Secrets Manager CLI — a separate commercial product -that Vaultwarden does not implement. +The key flows: Vaultwarden → `bw get notes` → `base64 -d` → kernel pipe → `git-crypt`. +At no point is the key written to any file on disk. -### Pre-commit Hook +Lock a genome when the session ends: +```bash +git-crypt lock +``` -A security hook is installed in every genome's `.git/hooks/pre-commit`. -It inspects every staged file: if any file in `raw/private/` or `wiki/private/` -is not encrypted by git-crypt, the commit is blocked with a clear error message -explaining how to fix the issue. 
+--- -### Key Rotation +## Key Management + +> This section is for the operator. These commands are never issued by the LLM agent. + +### Vaultwarden Secure Notes + +Each genome key is stored as a base64-encoded Secure Note in Vaultwarden: + +| Genome | Vaultwarden Note Name | +|--------|----------------------| +| `genome-dev` | `genome-dev key` | +| `genome-finance` | `genome-finance key` | +| `genome-homelab` | `genome-homelab key` | + +After `make setup` or `make add-genome`, key files are exported to `keys/`. +Upload procedure: + +```bash +# Encode the key +base64 < keys/genome-dev.key + +# Paste the output into a Vaultwarden Secure Note named "genome-dev key" +# Then delete the key file +rm keys/genome-dev.key +``` + +### Cloning on a new machine + +```bash +# Full clone with all submodules +git clone --recurse-submodules \ + https://git.yourserver.com/yourusername/master-knowledge-genome.git + +# Unlock a specific genome (with key file — development only) +cd master-knowledge-genome/genome-dev +git-crypt unlock /path/to/genome-dev.key + +# Unlock via Vaultwarden (recommended — no key on disk) +export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw) +git-crypt unlock <(bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d) + +# Sparse clone — collaborator who only needs one genome +git clone https://git.yourserver.com/yourusername/genome-dev.git +``` + +### Key rotation (emergency) If a key is lost or compromised: + ```bash +# From the knowledge-genome-setup/ directory source lib/git-crypt.sh cd ~/knowledge-genome-setup/genome-dev gcrypt_rotate_key "genome-dev" ``` -The function decrypts all private files, generates a new key, re-encrypts, -and prints instructions for updating Vaultwarden. + +`gcrypt_rotate_key` performs: +1. Unlocks repo with existing key +2. Removes old key material +3. Generates new symmetric key via `git-crypt init` +4. Re-stages and commits private files (encrypted with new key) +5. Exports new key to `keys/` +6. 
Prints Vaultwarden update instructions + +> **Limitation:** git history still contains blobs encrypted with the old key. +> Anyone with the old key and git history access can decrypt them. To purge old +> encrypted blobs from history: +> ```bash +> git filter-repo --invert-paths --path raw/private --path wiki/private +> git push --force origin main +> ``` +> This rewrites all commit hashes — coordinate with any collaborators first. + +After rotation: +- Upload new key to Vaultwarden (replace existing note) +- Delete both `keys/genome-dev.key` and `keys/genome-dev-rotated-*.key` from disk +- Revoke access from previous key holders --- -## Agent Interaction +## Agent Sessions -At the start of every AI session, declare the privacy context explicitly: +### Prerequisites for every session -```text -PRIVATE_CONTEXT: disabled +Before starting an LLM agent session on a genome: +1. The host (AI server) runs `git-crypt unlock` for the required genomes +2. The orchestrator prepares context: `tail -n 20 wiki/log.md` +3. Declare `PRIVATE_CONTEXT` state explicitly in the opening prompt + +### Session start protocol + +The agent executes in this order at the start of every session: + +1. Read `wiki/index.md` — primary catalog of all pages and maturity +2. Read last 20 log entries (injected by orchestrator — does NOT open `wiki/log.md` directly) +3. For tasks involving related pages: `qmd search ""` before opening any files +4. Operate on individual files — never scan entire directories + +### One source per session + +With a 14B model and ~6GB KV cache budget, long sessions degrade. +As the session extends, the context fills with pages already created, +attention dilutes, and later entities receive worse cross-references than earlier ones. + +**Hard rule: one source per session.** +If multiple sources are queued in `raw/`, process only the first. +Commit, close the session. The orchestrator (n8n or script) starts a new session +for the next source with a clean KV cache. 
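The one-source-per-session rule could be driven by a small wrapper along these lines. This is a minimal sketch, not the project's orchestrator: `process_queue` and `run_agent_session` are hypothetical names, and the stub body stands in for however the agent is actually launched. It assumes the caller (n8n or a script) already knows which files under `raw/` are new.

```bash
#!/usr/bin/env bash
# Sketch only: one agent session per queued source, strictly sequential.

# Hypothetical stub. Replace the body with the real agent launch command.
run_agent_session() {
  local source_path=$1 private_context=$2 log_tail=$3
  printf 'session source=%s private_context=%s log_bytes=%s\n' \
    "$source_path" "$private_context" "${#log_tail}"
}

# Usage: process_queue <genome-dir> <enabled|disabled> <source>...
process_queue() {
  local genome_dir=$1 private_context=$2
  shift 2
  local source_path log_tail
  for source_path in "$@"; do
    # Re-read the log tail for every session so each run sees the entry
    # appended by the previous one.
    log_tail=$(tail -n 20 "$genome_dir/wiki/log.md")
    run_agent_session "$source_path" "$private_context" "$log_tail"
  done
}
```

Passing the sources as arguments keeps the loop free of any guess about how "new" files are detected. Each iteration is a separate process invocation, so no context carries over between sources.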
+ +For automated pipelines: if 5 files arrive in `raw/`, trigger 5 agent sessions +sequentially — not one session with 5 files. + +### n8n automation + +For Forgejo webhook → automated ingest: +1. Forgejo sends webhook on push to `raw/` +2. n8n receives webhook, identifies new files +3. n8n starts one agent session per new file (sequential, not parallel) +4. Each session: inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path +5. Agent ingest workflow runs, opens PR +6. Human reviews and merges PR + +--- + +## Workflows + +### Ingest + +Triggered by a new file in `raw/` (manual or via webhook). + +1. Read source once +2. Create `wiki/sources/.md` — summary and key points +3. Per entity (person, tool, organisation): create or update `wiki/entities/.md` +4. Per concept (pattern, theory, decision): create or update `wiki/concepts/.md` +5. Check each touched page for contradictions → apply Conflict Resolution if found +6. Append entry to `wiki/index.md` (bottom of relevant section — do not reorder) +7. Append log entry: `INGEST | ` +8. Run scoped lint on pages created or modified in this session; report in PR +9. Commit on `feat/ai-ingest-`; open PR using `templates/pr-description.md` + +For private sources (`PRIVATE_CONTEXT: enabled` required): +- All output goes to `wiki/private/.md` only +- PR title: `[PRIVATE] ingest: ` + +### Query + +Triggered by an operator question. + +1. `qmd search ""` → identify candidate pages +2. Read candidate pages directly (qmd already returns file paths — no intermediate index lookup) +3. Synthesise answer with `[[wikilink]]` citations +4. If answer is non-trivial: save as `wiki/queries/.md` and append to index +5. Append log entry: `QUERY | ` + +For general orientation without a specific query: read `wiki/index.md` directly. + +### Lint + +The lint workflow is split between deterministic bash checks and semantic LLM judgment. 
+ +**Step 1 — operator runs bash linter:** +```bash +make lint ``` -The agent ignores all `private/` directories. Outputs are safe to share. -```text -PRIVATE_CONTEXT: enabled -``` -The agent processes encrypted data. Requires the genome to be unlocked. -All outputs referencing private data are prefixed with `[PRIVATE DATA INCLUDED]`. +The bash linter checks automatically: +- YAML frontmatter validity (all mandatory fields present) +- Domain consistency (domain field matches genome name) +- Type validity (value from allowed list) +- Privacy consistency (`private/` directories have `private: true`) +- Page size (warn at 400 lines, error at 800 lines) +- Knowledge decay (stable > 180 days, draft > 90 days) +- Broken internal wikilinks (warnings only — cross-type links produce expected false positives) + +**Step 2 — operator provides bash output to LLM agent:** + +The agent applies semantic judgment to findings the bash linter cannot make: +- **Orphan pages** (from bash list): for each orphan, identify 1-3 existing pages + that should link to it; propose specific additions +- **Implicit concepts** (from bash term frequency list): determine if a candidate + term warrants a dedicated page; draft stub if yes +- **Duplicate concepts**: `qmd search ""` for suspected duplicates; + propose merge if confirmed +- **Maturity promotion**: pages with 2+ sources still marked `draft` → propose `stable` + +The agent reports all findings as a structured list. It does not modify files +without operator approval. Appends `LINT | ` log entry. --- ## Knowledge Quality -The system includes three quality mechanisms drawn directly from the LLM Wiki pattern: +### PR review workflow -**Conflict Resolution** — when new evidence contradicts existing wiki content, -the agent creates a `wiki/queries/conflict-*.md` node instead of silently overwriting. -Human review required before merging. +Every agent session that modifies wiki pages opens a PR. 
+The PR description uses `templates/pr-description.md`:

-**Knowledge Decay** — pages with `maturity: stable` not updated in 6 months,
-and `maturity: draft` pages not updated in 3 months, are flagged during lint passes
-with a `⚠️ STALE` callout. The agent proposes re-validation but does not change
-maturity without new source evidence.
+```markdown
+## Summary
+One sentence: goal of this session and source processed.

-**Cross-Genome Lint** — once a month, a manual session passes the aggregated index
-of all genomes to the agent to detect concept duplication and missing cross-references.
-No automated LLM controller in CI/CD — the cost in tokens and complexity is not
-justified at this scale.
+## Pages Created
+| Path | Type | Maturity |
+
+## Pages Modified
+| Path | Change |
+
+## Contradictions Found
+[ ] None / [ ] n conflict file(s) created
+
+## Private Data Accessed
+[ ] No (PRIVATE_CONTEXT: disabled) / [ ] Yes
+
+## Scoped Lint (post-ingest)
+[ ] Frontmatter valid [ ] No broken links [ ] No issues found
+```
+
+This makes human review fast and structured: read the table, scan the diff,
+approve or request changes. No exploration required to understand what the agent did.
+
+### Conflict resolution
+
+When new evidence contradicts an existing wiki claim:
+
+1. Keep the existing page unchanged
+2. Create `wiki/queries/conflict--.md` with:
+   - The existing claim and its source
+   - The contradicting evidence and its source
+   - Agent confidence assessment for each
+   - Recommendation: `accept_b` | `keep_a` | `requires_human_review`
+3. Add entry to `wiki/index.md` → Conflicts Pending Review section
+4. Log entry: `CONFLICT | `
+5. Open PR: `[CONFLICT] — human review required`
+
+The operator resolves the conflict, updates relevant pages, closes the PR.
+
+### Knowledge decay
+
+Pages have a `last_updated` field in frontmatter. During lint passes:
+
+| Maturity | Threshold | Action |
+|----------|-----------|--------|
+| `stable` | 180 days | Flag as stale — add `⚠️ STALE` callout |
+| `draft` | 90 days | Flag as stale — add `⚠️ STALE` callout |
+
+The agent proposes re-validation but does not change `maturity` without new source evidence.
+
+### Cross-genome lint
+
+A manual, monthly operation. Not automated in CI/CD — the token cost and coordination
+complexity are not justified at this scale.
+
+1. Operator initiates a master-repo agent session
+2. Agent uses `qmd search ""` across the multi-genome index to find:
+   - Concepts defined in 2+ genomes with potentially conflicting definitions
+   - Entities referenced cross-genome without canonical cross-genome wikilinks
+   - Concepts in genome-X that should link to genome-Y
+3. Agent reports findings — does not modify files
+4. For each finding: create conflict note in the genome where resolution belongs
+
+---
+
+## Knowledge Schema
+
+### Frontmatter
+
+Every wiki page must start with valid YAML frontmatter:
+
+```yaml
+---
+title: "Strict String Title"
+type: source | entity | concept | query | conflict | private
+domain: genome-name
+tags: [lowercase, hyphen-separated]
+maturity: draft | stable | deprecated
+last_updated: YYYY-MM-DD
+private: true | false
+---
+```
+
+| Field | Rules |
+|-------|-------|
+| `type` | Must be one of: `source entity concept query conflict private index log` |
+| `maturity: draft` | Single source or unvalidated |
+| `maturity: stable` | Confirmed by 2+ independent sources |
+| `maturity: deprecated` | Superseded — add `> **DEPRECATED:** ` callout at top |
+| `private: true` | Required on all pages in `wiki/private/` and `raw/private/` |
+
+Do not use semantic versioning for content. Git history tracks every change.
+`maturity` captures epistemic state; `last_updated` tracks recency.
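
The frontmatter rules above can be sketched as a small shell check. This is a minimal sketch under stated assumptions, not the project's actual `scripts/lint-genomes.sh`: the function names `frontmatter_field` and `check_page` are hypothetical, and it assumes each field sits on a single unquoted `key: value` line between the first two `---` markers.

```bash
#!/usr/bin/env bash
# Illustrative sketch only, not the project's real linter.
# frontmatter_field and check_page are hypothetical helper names.

# Print the value of a simple "key: value" line from the YAML frontmatter,
# i.e. from the lines between the first and second --- markers of a page.
frontmatter_field() {
  local file="$1" key="$2"
  awk -v key="$key" '
    /^---[ \t]*$/ { fence++; next }   # count the --- delimiters
    fence >= 2 { exit }               # past the frontmatter: stop reading
    fence == 1 && index($0, key ":") == 1 {
      sub("^" key ":[ \t]*", "")      # strip "key:" and any padding
      print
      exit
    }
  ' "$file"
}

# Check type, maturity and domain of one page against the rules in the table.
# Prints one line per violation and returns non-zero if any rule fails.
check_page() {
  local file="$1" genome="$2" status=0
  local page_type maturity domain
  page_type="$(frontmatter_field "$file" type)"
  maturity="$(frontmatter_field "$file" maturity)"
  domain="$(frontmatter_field "$file" domain)"

  case "$page_type" in
    source|entity|concept|query|conflict|private|index|log) ;;
    *) echo "ERROR: $file: invalid type '$page_type'"; status=1 ;;
  esac
  case "$maturity" in
    draft|stable|deprecated) ;;
    *) echo "ERROR: $file: invalid maturity '$maturity'"; status=1 ;;
  esac
  if [ "$domain" != "$genome" ]; then
    echo "ERROR: $file: domain '$domain' does not match genome '$genome'"
    status=1
  fi
  return "$status"
}
```

A call such as `check_page genome-dev/wiki/concepts/example.md genome-dev` (the path is a made-up example) would print nothing and return 0 for a valid page. Quoted values, multi-line values, and the `private`, page size, and decay rules are deliberately left out to keep the sketch short.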
+
+### Page types and directories
+
+| Type | Directory | Description |
+|------|-----------|-------------|
+| `source` | `wiki/sources/` | One page per processed raw source |
+| `entity` | `wiki/entities/` | People, tools, organisations, projects |
+| `concept` | `wiki/concepts/` | Patterns, theories, architectural decisions |
+| `query` | `wiki/queries/` | Preserved answers and analyses |
+| `conflict` | `wiki/queries/conflict-*.md` | Unresolved contradictions |
+| `private` | `wiki/private/` | Private synthesis (PRIVATE_CONTEXT: enabled) |
+| `index` | `wiki/index.md` | Primary navigation catalog (singleton) |
+| `log` | `wiki/log.md` | Operations ledger (singleton) |
+
+### Page size limits
+
+| Limit | Lines | Action |
+|-------|-------|--------|
+| Soft cap | 400 | Bash linter warns |
+| Hard cap | 800 | Bash linter errors — split the page |
+
+These limits ensure pages fit within the LLM context window without attention degradation
+and keep the wiki atomically navigable.
+
+### Linking conventions
+
+| Type | Format |
+|------|--------|
+| Internal (same genome) | `[[folder/slug]]` — Obsidian wikilinks only |
+| Cross-genome | `[[../genome-target/wiki/folder/slug]]` |
+| External | `[text](https://url)` — standard Markdown |
+
+Never use `[text](relative/path)` for internal references. Obsidian wikilinks are
+bidirectional and appear in the graph view.
+
+### Log format
+
+Every operation appends one entry to `wiki/log.md`:
+
+```markdown
+## [YYYY-MM-DD] TYPE | Subject
+
+- run_id: ``
+- model: ``
+- context_read: `[[path/A]]`, `[[path/B]]`
+- output_written: `[[path/C]]`
+- reasoning: One sentence — what changed and why.
+```
+
+Valid TYPEs: `INGEST` `LINT` `QUERY` `CONFLICT` `CONFIG` `SECURITY`
+
+Parse examples:
+```bash
+grep "^## \[" wiki/log.md | tail -5            # Last 5 entries
+grep "^## \[" wiki/log.md | grep "CONFLICT"    # All conflicts
+grep "^## \[2026-05" wiki/log.md               # Entries from a specific month
+```
+
+The orchestrator always injects only `tail -n 20 wiki/log.md` into agent context.
+The LLM never loads the full log.
+
+---
+
+## Collaboration Model
+
+| Role | Key access | Permitted operations |
+|------|-----------|----------------------|
+| Owner | Full — key holder | Read/write everywhere |
+| Collaborator | None | Push to `raw/articles/`, `raw/transcripts/`, `raw/code-packs/`, `raw/assets/` |
+| Local AI agent | Conditional | `private/` only when `PRIVATE_CONTEXT: enabled` |
+| Cloud AI model | Never | `PRIVATE_CONTEXT` must be `disabled`; private data stays on local network |
+
+Grant collaborator access: add as Forgejo contributor with Write role.
+Never share the git-crypt key — collaborators operate exclusively in public directories.
+
+---
+
+## Optional Extensions
+
+### qmd — local Markdown search
+
+[qmd](https://github.com/tobi/qmd) is a local, on-device BM25 + vector search
+engine for Markdown files. It has both a CLI (for shell scripts and agent tool calls)
+and an MCP server (for native LLM tool use).
+
+Recommended at scale: once a genome exceeds ~150 pages, `qmd search` is significantly
+faster and more accurate than navigating `wiki/index.md` manually.
+
+```bash
+# Index a genome
+qmd index genome-dev/wiki/
+
+# Search
+qmd search "graph-based state management"
+
+# Start MCP server (for Claude Code / Codex integration)
+qmd serve --port 3333
+```
+
+### Obsidian integration
+
+Obsidian is the recommended wiki browser. Open any genome directory as an Obsidian vault.
+
+Recommended setup:
+- **Graph view** — visualise page connections; spot orphans and hubs instantly
+- **Obsidian Web Clipper** — browser extension to clip articles directly to `raw/articles/`
+  as Markdown
+- **Download attachments** — Settings → Hotkeys → "Download attachments for current file".
+  Binds to a hotkey (e.g. Ctrl+Shift+D). After clipping, downloads all images to `raw/assets/`
+- **Dataview plugin** — query YAML frontmatter across the wiki;
+  `TABLE maturity, last_updated WHERE domain = "genome-dev"` generates dynamic tables
+- **Marp plugin** — render Markdown as slide decks directly from wiki content
+
+Note: `.obsidian/` is in `.gitignore`. Workspace and plugin settings are local — not synced.
+
+### n8n automation
+
+n8n (running on the storage node) can automate the ingest pipeline:
+
+1. Forgejo webhook fires on push to a genome's `raw/` directory
+2. n8n flow identifies new files
+3. For each new file: starts one agent session (sequential — never parallel)
+4. Each session receives: `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
+5. Agent runs ingest workflow and opens PR
+6. Human reviews the PR
+
+Key constraint: one source per session, sessions sequential.
+Never batch multiple sources into one agent session.
+
+### Intel NPU offloading
+
+If the AI compute node has an Intel NPU (e.g. Core Ultra series):
+
+- Background tasks (embedding updates, index refresh) → Intel NPU via OpenVINO
+- Active reasoning sessions (ingest, query, synthesis) → GPU
+
+This keeps the GPU's KV cache free for interactive work and reduces power consumption
+for background operations.
+
+---
+
+## Troubleshooting
+
+### `git-crypt: command not found`
+
+```bash
+# Ubuntu/Debian
+sudo apt install git-crypt
+
+# macOS
+brew install git-crypt
+```
+
+### `make setup` fails with "MISSING: jq"
+
+```bash
+make doctor # identifies all missing tools
+sudo apt install git git-crypt curl jq
+```
+
+### Pre-commit hook blocks a commit with "PLAINTEXT LEAK DETECTED"
+
+The staged file is in a path matching `**/private/**` but is not encrypted.
+
+Fix options:
+1. Verify `.gitattributes` contains `**/private/** filter=git-crypt diff=git-crypt -text`
+2. Run `git-crypt init` if git-crypt is not initialised in this repo
+3. Run `git-crypt status` to check the encryption state of all files
+
+Never use `git commit --no-verify` to bypass this check.
+
+### `git-crypt status` shows files as "not encrypted" after init
+
+The `.gitattributes` rule must be committed before files in `private/` are staged.
+If files were staged before `.gitattributes` was committed:
+
+```bash
+git rm -r --cached raw/private/ wiki/private/
+git add raw/private/ wiki/private/
+git commit -m "fix: re-stage private files for encryption"
+```
+
+### Agent returns stale or missing cross-references
+
+Likely causes:
+1. Session was too long — KV cache degraded. Use one source per session.
+2. `wiki/index.md` was not read at session start — agent lacked the page catalog.
+3. qmd index is stale — re-index: `qmd index /wiki/`
+
+### Submodules show as "modified" after `make sync`
+
+This is normal if genome repos have new commits. Update master's pointers:
+
+```bash
+cd master-knowledge-genome
+git add .
+git commit -m "chore: update submodule pointers"
+git push
+```
+
+### bw unlock fails
+
+Verify you are using `bw` (standard Bitwarden CLI), not `bws` (Secrets Manager CLI).
+`bws` does not work with self-hosted Vaultwarden.
+
+```bash
+bw --version # should print e.g. "2024.x.x"
+bw config server https://vault.yourserver.com
+bw login
+```
diff --git a/templates/agents-master.md b/templates/agents-master.md
index 2d4ca4d..8b6015d 100644
--- a/templates/agents-master.md
+++ b/templates/agents-master.md
@@ -86,48 +86,3 @@ Genome-level operations are governed by the genome's `AGENTS.md`, not this file.
    - Concepts in genome-X that should link to genome-Y but don't.
 3. Report findings. Do not modify any files.
 4. For each finding: create a conflict note in the genome where resolution belongs, following that genome's §Conflict procedure.
-
----
-
-## Reference Operations
-
-### Add a genome
-```bash
-make add-genome NAME=genome-newname DESC="Domain description"
-```
-Then update the architecture diagram in this file.
-
-### Sync submodules
-```bash
-make sync
-```
-
-### Update core-karpathy reference
-```bash
-git submodule update --remote core-karpathy
-git add core-karpathy
-git commit -m "chore: update core-karpathy to latest gist"
-git push
-```
-
-### Clone (full)
-```bash
-git clone --recurse-submodules \
-  {{FORGEJO_URL}}/{{FORGEJO_USER}}/{{MASTER_REPO}}.git
-```
-After cloning, unlock each genome on the host before starting an agent session.
-
-### Key rotation (emergency)
-If a key is compromised: `gcrypt_rotate_key ""` from project root.
-Update the Vaultwarden Secure Note with the new base64-encoded key.
-Revoke access from previous key holders.
-
-### Key registry
-
-| Genome | Vaultwarden Secure Note | Temp key file |
-|--------|------------------------|---------------|
-| genome-dev | `genome-dev key` | `keys/genome-dev.key` |
-| genome-finance | `genome-finance key` | `keys/genome-finance.key` |
-| genome-homelab | `genome-homelab key` | `keys/genome-homelab.key` |
-
-Temp key files in `keys/` are post-export only. Delete after upload to Vaultwarden.
From b44022151447583c0056c7317d320ecf29f43293 Mon Sep 17 00:00:00 2001
From: Matteo Cherubini
Date: Mon, 11 May 2026 10:16:15 +0200
Subject: [PATCH 2/4] feat: Implement structured PR review workflow

---
 templates/agents-genome.md  |  1 +
 templates/pr-description.md | 26 ++++++++++++++++++++++++++
 2 files changed, 27 insertions(+)
 create mode 100644 templates/pr-description.md

diff --git a/templates/agents-genome.md b/templates/agents-genome.md
index 2d9ec36..f561291 100644
--- a/templates/agents-genome.md
+++ b/templates/agents-genome.md
@@ -38,6 +38,7 @@ Session end or return to `disabled`: remind operator to run `git-crypt lock` on
 5. Never commit to `main`. Branch per task; PR required; no self-merge.
 6. Contradict, don't overwrite. New evidence contradicts existing claim → §Conflict.
 7. Never commit plaintext to any path marked for encryption in `.gitattributes`.
+8. Every PR must use `templates/pr-description.md`. Do not omit the tabular summary.

 ### NEVER
 - Load `wiki/log.md` in full — read only the tail injected by the orchestrator.
diff --git a/templates/pr-description.md b/templates/pr-description.md
new file mode 100644
index 0000000..59cb25f
--- /dev/null
+++ b/templates/pr-description.md
@@ -0,0 +1,26 @@
+## Summary
+
+
+## Pages Created
+| Path | Type | Maturity |
+|------|------|----------|
+| `[[folder/slug]]` | entity / concept / source / query | draft |
+
+## Pages Modified
+| Path | Change |
+|------|--------|
+| `[[folder/slug]]` | Added cross-reference to `[[other/page]]` |
+
+## Contradictions Found
+- [ ] None
+- [ ] `n` conflict file(s) created — listed below
+
+## Private Data Accessed
+- [ ] No — `PRIVATE_CONTEXT: disabled`
+- [ ] Yes — `PRIVATE_CONTEXT: enabled` · outputs in `wiki/private/` only
+
+## Scoped Lint (post-ingest)
+- [ ] Frontmatter valid on all touched pages
+- [ ] No broken wikilinks on touched pages
+- [ ] No issues found
+- [ ] Issues found (list):
From d406a0a554e29b07fe00d9418570a8f8dd348e7f Mon Sep 17 00:00:00 2001
From: Matteo Cherubini
Date: Mon, 11 May 2026 10:16:15 +0200
Subject: [PATCH 3/4] refactor(agent): Refine ingest, query, and lint workflows

---
 templates/agents-genome.md | 34 ++++++++++++++++++++--------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/templates/agents-genome.md b/templates/agents-genome.md
index f561291..a5f7f86 100644
--- a/templates/agents-genome.md
+++ b/templates/agents-genome.md
@@ -59,10 +59,11 @@ Session end or return to `disabled`: remind operator to run `git-crypt lock` on

 Execute in this order before any file operation:

-1. Read `wiki/index.md` — full catalog of all pages and their maturity.
-2. Read the last 20 log entries injected by orchestrator — do not open `wiki/log.md` directly.
-3. For any task involving related pages: `qmd search ""` before opening files.
-4. Operate on individual target files. Never scan entire directories.
+1. **One source per session.** If multiple sources are queued in `raw/`, process only the first. Commit, close session. The orchestrator starts a new session for the next source.
+2. Read `wiki/index.md` — full catalog of all pages and their maturity.
+3. Read the last 20 log entries injected by orchestrator — do not open `wiki/log.md` directly.
+4. For any task involving related pages: `qmd search ""` before opening files.
+5. Operate on individual target files. Never scan entire directories.

 ---

@@ -78,7 +79,8 @@ Execute in this order before any file operation:
 5. Check each touched page for contradictions → apply §Conflict if found.
 6. Append entry to `wiki/index.md` (bottom of relevant section).
 7. Append log entry: `INGEST | `.
-8. Commit on `feat/ai-ingest-`. Open PR.
+8. Run scoped lint on pages created or modified in this session. Report issues in PR description. Do not auto-fix.
+9. Commit on `feat/ai-ingest-`. Open PR using `templates/pr-description.md`.

 *Private source* (`PRIVATE_CONTEXT: enabled` required):
 - All output → `wiki/private/.md` only.
@@ -88,24 +90,28 @@ Execute in this order before any file operation:
 *Triggered by operator question.*

 1. `qmd search ""` → identify candidate pages.
-2. Read relevant pages via `wiki/index.md` catalog.
+2. Read candidate pages directly.
 3. Synthesize answer with `[[wikilink]]` citations.
 4. If answer is non-trivial: save as `wiki/queries/.md`.
 5. Append entry to `wiki/index.md` under Queries.
 6. Append log entry: `QUERY | `.

+*For general orientation without a specific query: read `wiki/index.md` directly.*
+
 ### Lint

-*Triggered by operator or schedule.*
+*Triggered by operator with bash pre-scan output.*

-Find and report — do not auto-fix without operator approval:
+Pre-requisite: operator runs `bash scripts/lint-genomes.sh` and provides output to this session.
+The script handles deterministically: broken links, knowledge decay, page size, frontmatter validation.

-1. Orphan pages — no inbound `[[wikilink]]`.
-2. Duplicate concepts — two pages covering same topic → propose merge.
-3. Implicit concepts — term in 3+ pages with no dedicated page.
-4. `maturity: draft` with 2+ sources → propose promote to `stable`.
-5. Broken internal links.
-6. Knowledge decay violations (§Decay).
+Agent tasks — apply semantic judgment to bash findings + independent semantic checks:
+1. **Orphan pages** (list from bash): for each orphan, identify 1-3 existing pages that should link to it. Propose specific link additions.
+2. **Implicit concepts** (term list from bash): for each candidate term, determine if a dedicated page is warranted. If yes, draft stub.
+3. **Duplicate concepts**: `qmd search ""` for suspected duplicates → propose merge if confirmed.
+4. **`maturity: draft`** pages with 2+ sources cited → propose promote to `stable`.
+
+Report all findings as structured list. Do not modify files without operator approval. Append log entry: `LINT | `.

 ---

From dccfe478a07f49dcf4ce2cfed6f5b28e13e12f69 Mon Sep 17 00:00:00 2001
From: Matteo Cherubini
Date: Mon, 11 May 2026 18:57:54 +0200
Subject: [PATCH 4/4] Version update

---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 5a60e4d..09243a7 100644
--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # =============================================================================
-# Knowledge Genome - Makefile v. 0.3.0
+# Knowledge Genome - Makefile v. 1.0.0
 # Orchestrates the setup and management of the knowledge base.
 # =============================================================================