From 16a10decf324c59f862a7c159354576acce76bfb Mon Sep 17 00:00:00 2001
From: Matteo Cherubini
Date: Fri, 8 May 2026 22:10:25 +0200
Subject: [PATCH] feat: Revamp README with new core philosophy and architecture

---
 README.md | 277 +++++++++++++++++++++++++++---------------------------
 1 file changed, 138 insertions(+), 139 deletions(-)

diff --git a/README.md b/README.md
index eedcf3f..c304199 100644
--- a/README.md
+++ b/README.md
@@ -1,201 +1,200 @@
 # Knowledge Genome System
-> A distributed, modular, and secure personal knowledge base architecture.
+> A distributed, modular, and secure personal knowledge base — no vector database required.
-The **Knowledge Genome System** is a framework designed to manage personal knowledge using a "Master-Genome" architecture. It follows the LLM-Wiki patterns (Karpathy-style) while adding a robust security layer for sensitive data and automated quality control.
+The **Knowledge Genome System** implements the [LLM Wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
+by Andrej Karpathy, extended with a multi-domain submodule architecture, git-crypt
+encryption for sensitive data, and a human-in-the-loop Git Flow for quality control.
 ---
-# Architecture
+## Core Philosophy
-This project is structured as a **Master Orchestrator** that manages multiple independent **Genomes** via Git Submodules.
+Most RAG systems make the LLM rediscover knowledge from scratch on every query.
+This system is different: the LLM **incrementally builds and maintains a persistent wiki**
+that sits between you and the raw sources. Knowledge is compiled once and kept current —
+not re-derived on every question.
-## Core Components
+**This means: no vector database, no embedding pipeline, no external retrieval server.**
+The `wiki/index.md` of each genome is the retrieval layer. At moderate scale
+(~100 sources, hundreds of pages) this works better than RAG because cross-references,
+contradictions, and syntheses are already resolved — the LLM doesn't have to piece
+them together at query time.
-### Master Repository
-
-Contains:
-
-* Orchestration scripts
-* Global configuration (`config.env`)
-* Security templates
-
-### Genomes
-
-Individual specialized repositories (e.g. `genome-dev`, `genome-finance`) that act as standalone units of knowledge.
-
-### Security Layers
-
-#### Physical Security
-
-`git-crypt` encrypts `private/` directories at rest.
-
-#### Logical Security
-
-YAML frontmatter (`private: true`) prevents AI agents from leaking sensitive data during public sessions.
-
-#### Validation Layer
-
-A custom linting engine ensures metadata consistency.
+If the wiki grows beyond what the index can navigate efficiently, the only recommended
+search extension is [`qmd`](https://github.com/tobi/qmd) — a local, on-device
+BM25 + vector search engine for markdown files with an MCP server interface.
+No external infrastructure required.
 ---
-# Quick Start
+## Architecture
+
+```text
+master-knowledge-genome/          ← Root orchestrator
+├── core-karpathy/                ← LLM Wiki reference pattern (read-only submodule)
+├── genome-dev/                   ← Submodule: web dev, Angular, TUI
+├── genome-finance/               ← Submodule: personal finance
+├── genome-homelab/               ← Submodule: Keru infrastructure
+└── AGENTS.md                     ← Global coordination schema
+```
+
+Each genome is an independent repository with this structure:
+```text
+genome-{name}/
+├── raw/
+│   ├── articles/ transcripts/ code-packs/ assets/   ← Plaintext, open to collaborators
+│   └── private/                                     ← AES-256-CTR encrypted (git-crypt)
+├── wiki/
+│   ├── index.md log.md                              ← Navigation and audit trail
+│   ├── sources/ entities/ concepts/ queries/        ← Agent-maintained knowledge
+│   └── private/                                     ← AES-256-CTR encrypted (git-crypt)
+└── AGENTS.md                                        ← Per-genome agent contract
+```
+
+---
 ## Prerequisites
-Required dependencies:
+**Required:**
+- `git`
+- `git-crypt`
+- `curl`
+- `jq`
-* `git`
-* `git-crypt`
-* `curl`
-* `jq`
+**Optional:**
+- `bw` (Bitwarden CLI) — for runtime key injection from Vaultwarden without writing keys to disk
-Optional:
-
-* `bw` (Bitwarden CLI) — used for runtime key injection
+Install on Ubuntu/Debian:
+```bash
+sudo apt update && sudo apt install -y git git-crypt curl jq
+```
 ---
-## Initialization
+## Quick Start
 ```bash
-# 1. Clone the master repository
-git clone && cd master-knowledge-genome
+# 1. Clone this setup repository
+git clone knowledge-genome-setup
+cd knowledge-genome-setup
-# 2. Run the full setup
-# (checks dependencies, creates master scaffold,
-# initializes genomes)
+# 2. Export your Forgejo token
+export FORGEJO_TOKEN="your_token_here"
+
+# 3. Run full setup
 make setup
 ```
-# Management Commands
+`make setup` will:
+- Check all dependencies
+- Create the master and genome repositories on Forgejo
+- Scaffold the local directory structure with git-crypt active on `private/`
+- Install the pre-commit security hook in each genome
+- Export the symmetric git-crypt keys to `keys/`
-The system is controlled through a centralized Makefile.
+---
-| Command | Description |
-| ----------------- | -------------------------------------------------------------- |
-| `make setup` | Full system initialization (Master + Registry Genomes). |
-| `make add-genome` | Scaffolds and registers a new genome (requires NAME and DESC). |
-| `make lint` | Runs the validation suite across all genomes. |
-| `make status` | Checks Git status and encryption state for all submodules. |
+## Management Commands
-# Validation & Linting (`make lint`)
+| Command | Description |
+|---------|-------------|
+| `make setup` | Full system initialisation (master + all genomes defined in `config.env`) |
+| `make add-genome NAME=x DESC="y"` | Scaffold and register a new genome |
+| `make lint` | Validate schema, privacy flags, and metadata across all genomes |
+| `make status` | Show git submodule status and first 10 git-crypt encryption states |
+| `make help` | Show all available targets |
-The built-in linter ensures that the knowledge base remains machine-readable and secure.
-
-It automatically validates:
-
-## Frontmatter Integrity
-
-Every `.md` file must contain valid YAML headers.
-
-## Domain Consistency
-
-Ensures that a file's domain metadata matches its parent genome.
-
-## Privacy Leak Detection
-
-Critical validation step.
-
-Verifies that any file located in a `/private/` directory contains the flag:
-
-```yaml
-private: true
+**Adding a new genome example:**
+```bash
+make add-genome NAME=genome-research DESC="Academic papers, deep-dives, open research"
 ```
-This prevents accidental exposure during AI sessions.
+---
-## Broken Wiki-Links
+## Security Model
-Detects dead `[[internal-links]]`.
+### Hybrid Privacy Architecture
-# Security Model
+Each genome has two layers:
-## Hybrid Privacy Architecture
+| Layer | Directories | Access |
+|-------|-------------|--------|
+| Public | `raw/articles/`, `raw/transcripts/`, `wiki/sources/`, `wiki/concepts/` | Plaintext — safe for collaborators |
+| Private | `raw/private/`, `wiki/private/` | AES-256-CTR via git-crypt — owner only |
-Each genome is divided into two layers.
+On the remote (Forgejo), private files are opaque binary blobs.
+Collaborators without the key can contribute normally to public directories
+— git handles the encrypted files transparently with no errors.
-### Public Layer
+### Runtime Key Injection
-Directories:
-
-```text
-raw/public/
-wiki/public/
-```
-
-Characteristics:
-
-* Plaintext
-* Shareable with collaborators
-
-### Private Layer
-
-Directories:
-
-```text
-raw/private/
-wiki/private/
-```
-
-Characteristics:
-
-* Encrypted using AES-256 via `git-crypt`
-
-## Runtime Key Injection
-
-To keep the AI environment secure, encryption keys are never stored on the VM disk.
-
-Instead, the system uses Bitwarden (`bw`) / Vaultwarden for runtime injection.
-
-### Example
+Encryption keys are never stored as persistent files on the AI server.
+They are injected at session start via the Bitwarden CLI (`bw`) against
+your self-hosted Vaultwarden instance, using process substitution:
 ```bash
-# Unlock a genome using a key stored in Vaultwarden
+# Key lives only in a kernel file descriptor — never touches disk
 git-crypt unlock <(
-  bw get notes "genome-dev key" \
-  --session "$BW_SESSION" | base64 -d
+  bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d
 )
 ```
-# Genome Schema
+**Use `bw` (standard Bitwarden CLI), not `bws`.**
+`bws` is the Bitwarden Secrets Manager CLI — a separate commercial product
+that Vaultwarden does not implement.
-All wiki documents follow a strict schema to support AI ingestion.
+### Pre-commit Hook
-## YAML Frontmatter Schema
+A security hook is installed in every genome's `.git/hooks/pre-commit`.
+It inspects every staged file: if any file in `raw/private/` or `wiki/private/`
+is not encrypted by git-crypt, the commit is blocked with a clear error message
+explaining how to fix the issue.
-```yaml
----
-title: "Document Title"
-type: entity | concept | source | log
-domain: genome-name
-private: true/false
-last_updated: YYYY-MM-DD
----
+### Key Rotation
+
+If a key is lost or compromised:
+```bash
+source lib/git-crypt.sh
+cd ~/knowledge-genome-setup/genome-dev
+gcrypt_rotate_key "genome-dev"
 ```
+The function decrypts all private files, generates a new key, re-encrypts,
+and prints instructions for updating Vaultwarden.
-# Agent Interaction
+---
-When starting a session with an AI agent, always declare the privacy context.
+## Agent Interaction
-## Public Context
+At the start of every AI session, declare the privacy context explicitly:
 ```text
 PRIVATE_CONTEXT: disabled
 ```
-
-Behavior:
-
-* The agent ignores all private folders.
-
-## Private Context
+The agent ignores all `private/` directories. Outputs are safe to share.
 ```text
 PRIVATE_CONTEXT: enabled
 ```
+The agent processes encrypted data. Requires the genome to be unlocked.
+All outputs referencing private data are prefixed with `[PRIVATE DATA INCLUDED]`.
-Behavior:
+---
-* The agent processes encrypted data.
-* Requires the repository to be unlocked.
+## Knowledge Quality
+
+The system includes three quality mechanisms drawn directly from the LLM Wiki pattern:
+
+**Conflict Resolution** — when new evidence contradicts existing wiki content,
+the agent creates a `wiki/queries/conflict-*.md` node instead of silently overwriting.
+Human review required before merging.
+
+**Knowledge Decay** — pages with `maturity: stable` not updated in 6 months,
+and `maturity: draft` pages not updated in 3 months, are flagged during lint passes
+with a `⚠️ STALE` callout. The agent proposes re-validation but does not change
+maturity without new source evidence.
+
+**Cross-Genome Lint** — once a month, a manual session passes the aggregated index
+of all genomes to the agent to detect concept duplication and missing cross-references.
+No automated LLM controller in CI/CD — the cost in tokens and complexity is not
+justified at this scale.
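
The Knowledge Decay thresholds introduced by this patch can be sketched as a small shell helper. This is a minimal illustration, not the project's lint code: the function name `is_stale`, its argument order, and the `YYYY-MM-DD` date format for the last update are assumptions, and a page is treated as stale once the full number of months has elapsed.

```bash
# Hypothetical helper (not part of the patch): apply the Knowledge Decay rule.
# Usage: is_stale <maturity> <last_updated YYYY-MM-DD> <today YYYY-MM-DD>
# Returns 0 when the page should be flagged as stale, 1 otherwise.
is_stale() {
  local maturity=$1 updated=$2 today=$3
  local limit rest uy um ud ty tm td months

  case "$maturity" in
    stable) limit=6 ;;   # stable pages: flagged after 6 months without updates
    draft)  limit=3 ;;   # draft pages: flagged after 3 months without updates
    *)      return 1 ;;  # other maturity values are not covered by the rule
  esac

  # Split YYYY-MM-DD using parameter expansion only (no external commands).
  uy=${updated%%-*}; rest=${updated#*-}; um=${rest%%-*}; ud=${rest#*-}
  ty=${today%%-*};   rest=${today#*-};   tm=${rest%%-*}; td=${rest#*-}

  # Strip one leading zero so "08" and "09" are not parsed as invalid octal.
  um=${um#0}; ud=${ud#0}; tm=${tm#0}; td=${td#0}

  # Whole months elapsed; a partial month only counts once the day is reached.
  months=$(( (ty - uy) * 12 + (tm - um) ))
  if [ "$td" -lt "$ud" ]; then
    months=$(( months - 1 ))
  fi

  [ "$months" -ge "$limit" ]
}
```

A lint pass could call `is_stale stable 2025-11-08 2026-05-08` for each page and emit the `⚠️ STALE` callout when the function returns 0. How the maturity and last-update values are read from each page is left out here, since the patch does not specify it.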