# Knowledge Genome System

> A distributed, modular, and secure personal knowledge base. No vector database required.

The **Knowledge Genome System** implements the [LLM Wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f) by Andrej Karpathy, extended with a multi-domain submodule architecture, git-crypt encryption for sensitive data, and a human-in-the-loop Git Flow for quality control.

---

## Core Philosophy

Most RAG systems make the LLM rediscover knowledge from scratch on every query. This system is different: the LLM **incrementally builds and maintains a persistent wiki** that sits between you and the raw sources. Knowledge is compiled once and kept current, not re-derived on every question.

**This means: no vector database, no embedding pipeline, no external retrieval server.**

The `wiki/index.md` of each genome is the retrieval layer. At moderate scale (~100 sources, hundreds of pages) this works better than RAG because cross-references, contradictions, and syntheses are already resolved: the LLM doesn't have to piece them together at query time.

If the wiki grows beyond what the index can navigate efficiently, the only recommended search extension is [`qmd`](https://github.com/tobi/qmd), a local, on-device BM25 + vector search engine for markdown files with an MCP server interface. No external infrastructure required.

---
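As a rough illustration of the index acting as the retrieval layer, the sketch below searches every genome's `wiki/index.md` for a term. The `index_lookup` helper is hypothetical and not part of this repository; it assumes it is run from the master repository root (or is given that root as a second argument) and that genome directories are named `genome-*`. In practice the agent reads the index directly, so this only shows why no separate retrieval service is needed.

```bash
# Hypothetical helper (not shipped with this repository): list index entries
# that mention a term, across every genome under the given root.
# Usage: index_lookup TERM [ROOT]
index_lookup() (
    term=$1
    root=${2:-.}
    for index in "$root"/genome-*/wiki/index.md; do
        # Skip the literal pattern when no genome matches the glob
        [ -f "$index" ] || continue
        # -i: case-insensitive, -F: literal string, -H/-n: print file and line number
        grep -iFHn -- "$term" "$index" || true
    done
)
```

For example, `index_lookup "angular"` prints each matching index line as `path:line:text`, which is enough to decide which wiki pages to open next.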
## Architecture

```text
master-knowledge-genome/        ← Root orchestrator
├── core-karpathy/              ← LLM Wiki reference pattern (read-only submodule)
├── genome-dev/                 ← Submodule: web dev, Angular, TUI
├── genome-finance/             ← Submodule: personal finance
├── genome-homelab/             ← Submodule: Keru infrastructure
└── AGENTS.md                   ← Global coordination schema
```

Each genome is an independent repository with this structure:

```text
genome-{name}/
├── raw/
│   ├── articles/ transcripts/ code-packs/ assets/   ← Plaintext, open to collaborators
│   └── private/                                      ← AES-256-CTR encrypted (git-crypt)
├── wiki/
│   ├── index.md log.md                               ← Navigation and audit trail
│   ├── sources/ entities/ concepts/ queries/         ← Agent-maintained knowledge
│   └── private/                                      ← AES-256-CTR encrypted (git-crypt)
└── AGENTS.md                                         ← Per-genome agent contract
```

---

## Prerequisites

**Required:**

- `git`
- `git-crypt`
- `curl`
- `jq`

**Optional:**

- `bw` (Bitwarden CLI): for runtime key injection from Vaultwarden without writing keys to disk

Install on Ubuntu/Debian:

```bash
sudo apt update && sudo apt install -y git git-crypt curl jq
```

---

## Quick Start

```bash
# 1. Clone this setup repository
git clone knowledge-genome-setup
cd knowledge-genome-setup

# 2. Export your Forgejo token
export FORGEJO_TOKEN="your_token_here"

# 3. Run full setup
make setup
```

`make setup` will:

- Check all dependencies
- Create the master and genome repositories on Forgejo
- Scaffold the local directory structure with git-crypt active on `private/`
- Install the pre-commit security hook in each genome
- Export the symmetric git-crypt keys to `keys/`

---

## Management Commands

| Command | Description |
|---------|-------------|
| `make setup` | Full system initialisation (master + all genomes defined in `config.env`) |
| `make add-genome NAME=x DESC="y"` | Scaffold and register a new genome |
| `make lint` | Validate schema, privacy flags, and metadata across all genomes |
| `make status` | Show git submodule status and first 10 git-crypt encryption states |
| `make help` | Show all available targets |

**Adding a new genome example:**

```bash
make add-genome NAME=genome-research DESC="Academic papers, deep-dives, open research"
```

---

## Security Model

### Hybrid Privacy Architecture

Each genome has two layers:

| Layer | Directories | Access |
|-------|-------------|--------|
| Public | `raw/articles/`, `raw/transcripts/`, `wiki/sources/`, `wiki/concepts/` | Plaintext, safe for collaborators |
| Private | `raw/private/`, `wiki/private/` | AES-256-CTR via git-crypt, owner only |

On the remote (Forgejo), private files are opaque binary blobs. Collaborators without the key can contribute normally to public directories; git handles the encrypted files transparently with no errors.

### Runtime Key Injection

Encryption keys are never stored as persistent files on the AI server.
They are injected at session start via the Bitwarden CLI (`bw`) against your self-hosted Vaultwarden instance, using process substitution:

```bash
# Key lives only in a kernel file descriptor; it never touches disk
git-crypt unlock <(
  bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d
)
```

**Use `bw` (standard Bitwarden CLI), not `bws`.** `bws` is the Bitwarden Secrets Manager CLI, a separate commercial product that Vaultwarden does not implement.

### Pre-commit Hook

A security hook is installed in every genome's `.git/hooks/pre-commit`. It inspects every staged file: if any file in `raw/private/` or `wiki/private/` is not encrypted by git-crypt, the commit is blocked with a clear error message explaining how to fix the issue.

### Key Rotation

If a key is lost or compromised:

```bash
source lib/git-crypt.sh
cd ~/knowledge-genome-setup/genome-dev
gcrypt_rotate_key "genome-dev"
```

The function decrypts all private files, generates a new key, re-encrypts, and prints instructions for updating Vaultwarden.

---

## Agent Interaction

At the start of every AI session, declare the privacy context explicitly:

```text
PRIVATE_CONTEXT: disabled
```

The agent ignores all `private/` directories. Outputs are safe to share.

```text
PRIVATE_CONTEXT: enabled
```

The agent processes encrypted data. Requires the genome to be unlocked. All outputs referencing private data are prefixed with `[PRIVATE DATA INCLUDED]`.

---

## Knowledge Quality

The system includes three quality mechanisms drawn directly from the LLM Wiki pattern:

**Conflict Resolution:** when new evidence contradicts existing wiki content, the agent creates a `wiki/queries/conflict-*.md` node instead of silently overwriting. Human review required before merging.

**Knowledge Decay:** pages with `maturity: stable` not updated in 6 months, and `maturity: draft` pages not updated in 3 months, are flagged during lint passes with a `⚠️ STALE` callout.
The agent proposes re-validation but does not change maturity without new source evidence.

**Cross-Genome Lint:** once a month, a manual session passes the aggregated index of all genomes to the agent to detect concept duplication and missing cross-references. No automated LLM controller in CI/CD: the cost in tokens and complexity is not justified at this scale.
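One way to prepare the aggregated index for that monthly session is sketched below. The `aggregate_indexes` function is a hypothetical helper, not something provided by this repository; it assumes the submodules are checked out under the master repository root and named `genome-*`, and it simply concatenates each `wiki/index.md` under a heading naming its genome.

```bash
# Hypothetical helper (not shipped with this repository): print every genome's
# wiki/index.md in one stream, each preceded by a heading naming the genome.
# Usage: aggregate_indexes [ROOT] > all-indexes.md
aggregate_indexes() (
    root=${1:-.}
    for index in "$root"/genome-*/wiki/index.md; do
        # Skip the literal pattern when no genome matches the glob
        [ -f "$index" ] || continue
        # genome-x/wiki/index.md -> genome-x
        genome=$(basename "$(dirname "$(dirname "$index")")")
        printf '## %s\n\n' "$genome"
        cat "$index"
        printf '\n'
    done
)
```

The resulting file can be pasted into the session by hand, which keeps the step manual and avoids putting an LLM call into CI/CD.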