feat: Revamp README with new core philosophy and architecture

2026-05-08 22:10:25 +02:00 · 2026-05-08 22:10:25 +02:00 · 16a10decf3
commit 16a10decf3
parent a797fb2f10
1 changed file with 138 additions and 139 deletions
--- a/README.md
+++ b/README.md
@@ -1,201 +1,200 @@
 # Knowledge Genome System

-> A distributed, modular, and secure personal knowledge base architecture.
+> A distributed, modular, and secure personal knowledge base — no vector database required.

-The **Knowledge Genome System** is a framework designed to manage personal knowledge using a "Master-Genome" architecture. It follows the LLM-Wiki patterns (Karpathy-style) while adding a robust security layer for sensitive data and automated quality control.
+The **Knowledge Genome System** implements the [LLM Wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
+by Andrej Karpathy, extended with a multi-domain submodule architecture, git-crypt
+encryption for sensitive data, and a human-in-the-loop Git Flow for quality control.

 ---

-# Architecture
+## Core Philosophy

-This project is structured as a **Master Orchestrator** that manages multiple independent **Genomes** via Git Submodules.
+Most RAG systems make the LLM rediscover knowledge from scratch on every query.
+This system is different: the LLM **incrementally builds and maintains a persistent wiki**
+that sits between you and the raw sources. Knowledge is compiled once and kept current —
+not re-derived on every question.

-## Core Components
+**This means: no vector database, no embedding pipeline, no external retrieval server.**
+The `wiki/index.md` of each genome is the retrieval layer. At moderate scale
+(~100 sources, hundreds of pages) this works better than RAG because cross-references,
+contradictions, and syntheses are already resolved — the LLM doesn't have to piece
+them together at query time.

-### Master Repository
-
-Contains:
-
-* Orchestration scripts
-* Global configuration (`config.env`)
-* Security templates
-
-### Genomes
-
-Individual specialized repositories (e.g. `genome-dev`, `genome-finance`) that act as standalone units of knowledge.
-
-### Security Layers
-
-#### Physical Security
-
-`git-crypt` encrypts `private/` directories at rest.
-
-#### Logical Security
-
-YAML frontmatter (`private: true`) prevents AI agents from leaking sensitive data during public sessions.
-
-#### Validation Layer
-
-A custom linting engine ensures metadata consistency.
+If the wiki grows beyond what the index can navigate efficiently, the only recommended
+search extension is [`qmd`](https://github.com/tobi/qmd) — a local, on-device
+BM25 + vector search engine for markdown files with an MCP server interface.
+No external infrastructure required.

 ---

-# Quick Start
+## Architecture
+
+```text
+master-knowledge-genome/          ← Root orchestrator
+├── core-karpathy/                ← LLM Wiki reference pattern (read-only submodule)
+├── genome-dev/                   ← Submodule: web dev, Angular, TUI
+├── genome-finance/               ← Submodule: personal finance
+├── genome-homelab/               ← Submodule: Keru infrastructure
+└── AGENTS.md                     ← Global coordination schema
+```
+
+Each genome is an independent repository with this structure:
+```text
+genome-{name}/
+├── raw/
+│   ├── articles/ transcripts/ code-packs/ assets/   ← Plaintext, open to collaborators
+│   └── private/                                      ← AES-256-CTR encrypted (git-crypt)
+├── wiki/
+│   ├── index.md  log.md                              ← Navigation and audit trail
+│   ├── sources/ entities/ concepts/ queries/         ← Agent-maintained knowledge
+│   └── private/                                      ← AES-256-CTR encrypted (git-crypt)
+└── AGENTS.md                                         ← Per-genome agent contract
+```
+
+---

 ## Prerequisites

-Required dependencies:
+**Required:**
+- `git`
+- `git-crypt`
+- `curl`
+- `jq`

-* `git`
-* `git-crypt`
-* `curl`
-* `jq`
+**Optional:**
+- `bw` (Bitwarden CLI) — for runtime key injection from Vaultwarden without writing keys to disk

-Optional:
-
-* `bw` (Bitwarden CLI) — used for runtime key injection
+Install on Ubuntu/Debian:
+```bash
+sudo apt update && sudo apt install -y git git-crypt curl jq
+```

 ---

-## Initialization
+## Quick Start

 ```bash
-# 1. Clone the master repository
-git clone <master-repo-url> && cd master-knowledge-genome
+# 1. Clone this setup repository
+git clone <setup-repo-url> knowledge-genome-setup
+cd knowledge-genome-setup

-# 2. Run the full setup
-#    (checks dependencies, creates master scaffold,
-#    initializes genomes)
+# 2. Export your Forgejo token
+export FORGEJO_TOKEN="your_token_here"
+
+# 3. Run full setup
 make setup
 ```

-# Management Commands
+`make setup` will:
+- Check all dependencies
+- Create the master and genome repositories on Forgejo
+- Scaffold the local directory structure with git-crypt active on `private/`
+- Install the pre-commit security hook in each genome
+- Export the symmetric git-crypt keys to `keys/`

-The system is controlled through a centralized Makefile.
+---

-| Command           | Description                                                    |
-| ----------------- | -------------------------------------------------------------- |
-| `make setup`      | Full system initialization (Master + Registry Genomes).        |
-| `make add-genome` | Scaffolds and registers a new genome (requires NAME and DESC). |
-| `make lint`       | Runs the validation suite across all genomes.                  |
-| `make status`     | Checks Git status and encryption state for all submodules.     |
+## Management Commands

-# Validation & Linting (`make lint`)
+| Command | Description |
+|---------|-------------|
+| `make setup` | Full system initialisation (master + all genomes defined in `config.env`) |
+| `make add-genome NAME=x DESC="y"` | Scaffold and register a new genome |
+| `make lint` | Validate schema, privacy flags, and metadata across all genomes |
+| `make status` | Show git submodule status and the first 10 entries of the git-crypt encryption status |
+| `make help` | Show all available targets |

-The built-in linter ensures that the knowledge base remains machine-readable and secure.
-
-It automatically validates:
-
-## Frontmatter Integrity
-
-Every `.md` file must contain valid YAML headers.
-
-## Domain Consistency
-
-Ensures that a file's domain metadata matches its parent genome.
-
-## Privacy Leak Detection
-
-Critical validation step.
-
-Verifies that any file located in a `/private/` directory contains the flag:
-
-```yaml
-private: true
+**Example (adding a new genome):**
+```bash
+make add-genome NAME=genome-research DESC="Academic papers, deep-dives, open research"
 ```

-This prevents accidental exposure during AI sessions.
+---

-## Broken Wiki-Links
+## Security Model

-Detects dead `[[internal-links]]`.
+### Hybrid Privacy Architecture

-# Security Model
+Each genome has two layers:

-## Hybrid Privacy Architecture
+| Layer | Directories | Access |
+|-------|-------------|--------|
+| Public | `raw/articles/`, `raw/transcripts/`, `wiki/sources/`, `wiki/concepts/` | Plaintext — safe for collaborators |
+| Private | `raw/private/`, `wiki/private/` | AES-256-CTR via git-crypt — owner only |

-Each genome is divided into two layers.
+On the remote (Forgejo), private files are opaque binary blobs.
+Collaborators without the key can contribute normally to public directories
+— git handles the encrypted files transparently with no errors.

-### Public Layer
+### Runtime Key Injection

-Directories:
-
-```text
-raw/public/
-wiki/public/
-```
-
-Characteristics:
-
-* Plaintext
-* Shareable with collaborators
-
-### Private Layer
-
-Directories:
-
-```text
-raw/private/
-wiki/private/
-```
-
-Characteristics:
-
-* Encrypted using AES-256 via `git-crypt`
-
-## Runtime Key Injection
-
-To keep the AI environment secure, encryption keys are never stored on the VM disk.
-
-Instead, the system uses Bitwarden (`bw`) / Vaultwarden for runtime injection.
-
-### Example
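+Encryption keys are not kept as standalone key files on the AI server.
+They are injected at session start via the Bitwarden CLI (`bw`) against
+your self-hosted Vaultwarden instance, using process substitution.
+Note that `git-crypt unlock` keeps its own copy of the key under
+`.git/git-crypt/keys/` while the genome is unlocked; run `git-crypt lock`
+at session end to remove it: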

 ```bash
-# Unlock a genome using a key stored in Vaultwarden
+# Process substitution hands the key over through a file descriptor,
+# so this command writes no separate key file to disk
 git-crypt unlock <(
-  bw get notes "genome-dev key" \
-    --session "$BW_SESSION" | base64 -d
+  bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d
 )
 ```

-# Genome Schema
+**Use `bw` (standard Bitwarden CLI), not `bws`.**
+`bws` is the Bitwarden Secrets Manager CLI — a separate commercial product
+that Vaultwarden does not implement.

-All wiki documents follow a strict schema to support AI ingestion.
+### Pre-commit Hook

-## YAML Frontmatter Schema
+A security hook is installed in every genome's `.git/hooks/pre-commit`.
+It inspects every staged file: if any file in `raw/private/` or `wiki/private/`
+is not encrypted by git-crypt, the commit is blocked with a clear error message
+explaining how to fix the issue.

-```yaml
---
-title: "Document Title"
-type: entity | concept | source | log
-domain: genome-name
-private: true/false
-last_updated: YYYY-MM-DD
---
+### Key Rotation
+
+If a key is lost or compromised:
+```bash
+source lib/git-crypt.sh
+cd ~/knowledge-genome-setup/genome-dev
+gcrypt_rotate_key "genome-dev"
 ```
+The function decrypts all private files, generates a new key, re-encrypts,
+and prints instructions for updating Vaultwarden.

-# Agent Interaction
+---

-When starting a session with an AI agent, always declare the privacy context.
+## Agent Interaction

-## Public Context
+At the start of every AI session, declare the privacy context explicitly:

 ```text
 PRIVATE_CONTEXT: disabled
 ```
-
-Behavior:
-
-* The agent ignores all private folders.
-
-## Private Context
+The agent ignores all `private/` directories. Outputs are safe to share.

 ```text
 PRIVATE_CONTEXT: enabled
 ```
+The agent may read the decrypted contents of `private/` directories. Requires the genome to be unlocked.
+All outputs referencing private data are prefixed with `[PRIVATE DATA INCLUDED]`.

-Behavior:
+---

-* The agent processes encrypted data.
-* Requires the repository to be unlocked.
+## Knowledge Quality
+
+The system includes three quality mechanisms drawn directly from the LLM Wiki pattern:
+
+**Conflict Resolution** — when new evidence contradicts existing wiki content,
+the agent creates a `wiki/queries/conflict-*.md` node instead of silently overwriting.
+Human review is required before merging.
+
+**Knowledge Decay** — pages with `maturity: stable` not updated in 6 months,
+and `maturity: draft` pages not updated in 3 months, are flagged during lint passes
+with a `⚠️ STALE` callout. The agent proposes re-validation but does not change
+maturity without new source evidence.
+
+**Cross-Genome Lint** — once a month, a manual session passes the aggregated index
+of all genomes to the agent to detect concept duplication and missing cross-references.
+There is no automated LLM controller in CI/CD: the cost in tokens and complexity
+is not justified at this scale.
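
The monthly cross-genome session described above needs the per-genome indexes gathered into one document. The following is a minimal sketch of that step, assuming the master checkout contains `genome-*/wiki/index.md` as shown in the architecture tree. `aggregate_indexes` is a hypothetical helper written for illustration; it is not one of the repository's scripts or Makefile targets.

```bash
# Hypothetical helper (assumption): not part of the repository's lib/ scripts.
# Prints every genome's wiki/index.md under a heading named after the genome,
# so the combined output can be handed to the agent in a single session.
aggregate_indexes() {
  local root="${1:-.}"
  local index genome
  for index in "$root"/genome-*/wiki/index.md; do
    # When no genome matches, the glob stays unexpanded; skip it.
    [ -f "$index" ] || continue
    # genome-dev/wiki/index.md -> genome-dev
    genome="$(basename "$(dirname "$(dirname "$index")")")"
    printf '## %s\n\n' "$genome"
    cat "$index"
    printf '\n'
  done
}
```

Running `aggregate_indexes . > all-indexes.md` from the master directory would produce one file to paste or attach at the start of the session. Only `index.md` files are read, so nothing under `private/` is included unless an index itself references it.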