Knowledge Genome System
A distributed, modular, and secure personal knowledge base — no vector database required.
The Knowledge Genome System implements the LLM Wiki pattern by Andrej Karpathy, extended with a multi-domain submodule architecture, git-crypt encryption for sensitive data, and a human-in-the-loop Git Flow for quality control.
Core Philosophy
Most RAG systems make the LLM rediscover knowledge from scratch on every query. This system is different: the LLM incrementally builds and maintains a persistent wiki that sits between you and the raw sources. Knowledge is compiled once and kept current — not re-derived on every question.
This means: no vector database, no embedding pipeline, no external retrieval server.
The wiki/index.md of each genome is the retrieval layer. At moderate scale
(~100 sources, hundreds of pages) this works better than RAG because cross-references,
contradictions, and syntheses are already resolved — the LLM doesn't have to piece
them together at query time.
If the wiki grows beyond what the index can navigate efficiently, the only recommended
search extension is qmd — a local, on-device
BM25 + vector search engine for markdown files with an MCP server interface.
No external infrastructure required.
Architecture
master-knowledge-genome/ ← Root orchestrator
├── core-karpathy/ ← LLM Wiki reference pattern (read-only submodule)
├── genome-dev/ ← Submodule: web dev, Angular, TUI
├── genome-finance/ ← Submodule: personal finance
├── genome-homelab/ ← Submodule: Keru infrastructure
└── AGENTS.md ← Global coordination schema
Each genome is an independent repository with this structure:
genome-{name}/
├── raw/
│ ├── articles/ transcripts/ code-packs/ assets/ ← Plaintext, open to collaborators
│ └── private/ ← AES-256-CTR encrypted (git-crypt)
├── wiki/
│ ├── index.md log.md ← Navigation and audit trail
│ ├── sources/ entities/ concepts/ queries/ ← Agent-maintained knowledge
│ └── private/ ← AES-256-CTR encrypted (git-crypt)
└── AGENTS.md ← Per-genome agent contract
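The per-genome layout above could be scaffolded by hand roughly as follows. This is a minimal sketch, not the logic behind make add-genome: the function name scaffold_genome is hypothetical, and the .gitattributes patterns are an assumption about how git-crypt is scoped to the private/ directories.

```shell
# Hypothetical helper: create the genome directory layout under a target directory.
scaffold_genome() {
    genome_dir=$1

    # Plaintext source folders plus the encrypted private folder under raw/
    mkdir -p "$genome_dir/raw/articles" "$genome_dir/raw/transcripts" \
             "$genome_dir/raw/code-packs" "$genome_dir/raw/assets" \
             "$genome_dir/raw/private" || return 1

    # Agent-maintained wiki folders
    mkdir -p "$genome_dir/wiki/sources" "$genome_dir/wiki/entities" \
             "$genome_dir/wiki/concepts" "$genome_dir/wiki/queries" \
             "$genome_dir/wiki/private" || return 1

    # Navigation, audit trail, and agent contract start out as empty files
    touch "$genome_dir/wiki/index.md" "$genome_dir/wiki/log.md" \
          "$genome_dir/AGENTS.md" || return 1

    # Assumed git-crypt scoping: everything under private/ goes through the filter
    cat > "$genome_dir/.gitattributes" <<'EOF'
raw/private/** filter=git-crypt diff=git-crypt
wiki/private/** filter=git-crypt diff=git-crypt
EOF
}
```

After a call such as scaffold_genome genome-research, running git init and git-crypt init inside the new directory would activate encryption for the two private/ trees.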
Prerequisites
Required:
- git
- git-crypt
- curl
- jq
Optional:
- bw (Bitwarden CLI): for runtime key injection from Vaultwarden without writing keys to disk
Install on Ubuntu/Debian:
sudo apt update && sudo apt install -y git git-crypt curl jq
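To confirm the required tools are on the PATH before running anything else, a check along these lines may help. It is only a sketch: check_deps is a hypothetical name, and the dependency check performed by make setup may work differently.

```shell
# Report every missing command on stderr; return non-zero if any is absent.
check_deps() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing dependency: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A typical call would be check_deps git git-crypt curl jq, with bw added to the list when runtime key injection is used.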
Quick Start
# 1. Clone this setup repository
git clone <setup-repo-url> knowledge-genome-setup
cd knowledge-genome-setup
# 2. Export your Forgejo token
export FORGEJO_TOKEN="your_token_here"
# 3. Run full setup
make setup
make setup will:
- Check all dependencies
- Create the master and genome repositories on Forgejo
- Scaffold the local directory structure with git-crypt active on private/
- Install the pre-commit security hook in each genome
- Export the symmetric git-crypt keys to keys/
Management Commands
| Command | Description |
|---|---|
| make setup | Full system initialisation (master + all genomes defined in config.env) |
| make add-genome NAME=x DESC="y" | Scaffold and register a new genome |
| make lint | Validate schema, privacy flags, and metadata across all genomes |
| make status | Show git submodule status and the first 10 git-crypt encryption states |
| make help | Show all available targets |
Example of adding a new genome:
make add-genome NAME=genome-research DESC="Academic papers, deep-dives, open research"
Security Model
Hybrid Privacy Architecture
Each genome has two layers:
| Layer | Directories | Access |
|---|---|---|
| Public | raw/articles/, raw/transcripts/, wiki/sources/, wiki/concepts/ | Plaintext, safe for collaborators |
| Private | raw/private/, wiki/private/ | AES-256-CTR via git-crypt, owner only |
On the remote (Forgejo), private files are opaque binary blobs. Collaborators without the key can contribute normally to public directories — git handles the encrypted files transparently with no errors.
Runtime Key Injection
Encryption keys are never stored as persistent files on the AI server.
They are injected at session start via the Bitwarden CLI (bw) against
your self-hosted Vaultwarden instance, using process substitution:
# Key lives only in a kernel file descriptor — never touches disk
git-crypt unlock <(
bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d
)
Use bw (standard Bitwarden CLI), not bws.
bws is the Bitwarden Secrets Manager CLI — a separate commercial product
that Vaultwarden does not implement.
Pre-commit Hook
A security hook is installed in every genome's .git/hooks/pre-commit.
It inspects every staged file: if any file in raw/private/ or wiki/private/
is not encrypted by git-crypt, the commit is blocked with a clear error message
explaining how to fix the issue.
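One way such a check could work is to look at the staged blob itself: git-crypt ciphertext begins with the 10-byte header \0GITCRYPT\0, so a staged private file without that header is still plaintext. The sketch below is an assumption about the approach, not the hook shipped with this repository, and the function names are hypothetical.

```shell
# Hex form of the 10-byte git-crypt header: \0 G I T C R Y P T \0
GITCRYPT_MAGIC_HEX="00474954435259505400"

# Read a blob on stdin; succeed only if it starts with the git-crypt header.
is_gitcrypt_blob() {
    header=$(head -c 10 | od -An -tx1 | tr -d ' \n')
    [ "$header" = "$GITCRYPT_MAGIC_HEX" ]
}

# Check every staged file under raw/private/ or wiki/private/.
# Returns non-zero if any of them is staged without encryption.
check_staged_private() {
    status=0
    staged=$(git diff --cached --name-only --diff-filter=ACM) || return 1
    # Feed the list through a here-document so $status survives the loop
    while IFS= read -r path; do
        case "$path" in
            raw/private/*|wiki/private/*)
                # ":path" names the staged (index) version of the file
                if ! git cat-file blob ":$path" | is_gitcrypt_blob; then
                    echo "BLOCKED: $path is staged in plaintext" >&2
                    status=1
                fi
                ;;
        esac
    done <<EOF
$staged
EOF
    return "$status"
}
```

A pre-commit script built on this would call check_staged_private and exit with a non-zero status when it fails, which makes git abort the commit.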
Key Rotation
If a key is lost or compromised:
source lib/git-crypt.sh
cd ~/knowledge-genome-setup/genome-dev
gcrypt_rotate_key "genome-dev"
The function decrypts all private files, generates a new key, re-encrypts, and prints instructions for updating Vaultwarden. Because decryption needs the old key, run it from a working copy that is still unlocked.
Agent Interaction
At the start of every AI session, declare the privacy context explicitly:
PRIVATE_CONTEXT: disabled
The agent ignores all private/ directories. Outputs are safe to share.
PRIVATE_CONTEXT: enabled
The agent processes encrypted data. Requires the genome to be unlocked.
All outputs referencing private data are prefixed with [PRIVATE DATA INCLUDED].
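The declaration can also be backed up on the tooling side by never handing private paths to the agent while the context is disabled. The following is a sketch under that assumption; list_agent_files is a hypothetical helper and not part of this repository.

```shell
# List the markdown files an agent may read for a given privacy context.
# $1 = genome directory
# $2 = "enabled" or "disabled" (anything else is treated as disabled)
list_agent_files() {
    genome_dir=$1
    context=${2:-disabled}

    if [ "$context" = "enabled" ]; then
        # Private context: every markdown file, skipping git metadata
        find "$genome_dir" -name .git -prune -o -type f -name '*.md' -print
    else
        # Prune every directory named "private" so its contents are never listed
        find "$genome_dir" \( -name .git -o -name private \) -prune \
            -o -type f -name '*.md' -print
    fi
}
```

With the context disabled, list_agent_files genome-dev disabled prints only public pages, so an encrypted or unlocked private/ tree stays out of the session either way.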
Knowledge Quality
The system includes three quality mechanisms drawn directly from the LLM Wiki pattern:
Conflict Resolution — when new evidence contradicts existing wiki content,
the agent creates a wiki/queries/conflict-*.md node instead of silently overwriting.
Human review is required before merging.
Knowledge Decay — pages with maturity: stable not updated in 6 months,
and maturity: draft pages not updated in 3 months, are flagged during lint passes
with a ⚠️ STALE callout. The agent proposes re-validation but does not change
maturity without new source evidence.
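The decay rule can be sketched as a small check. Several things here are assumptions rather than facts from this repository: that each page carries a line starting with "maturity:", that 6 and 3 months are approximated as 180 and 90 days, and that the last commit touching a page counts as its last update. The function names are hypothetical.

```shell
# Succeed if a page of the given maturity and age (in days) counts as stale.
is_stale() {
    maturity=$1
    age_days=$2
    case "$maturity" in
        stable) limit=180 ;;   # roughly 6 months
        draft)  limit=90 ;;    # roughly 3 months
        *)      return 1 ;;    # unknown maturity is never flagged
    esac
    [ "$age_days" -gt "$limit" ]
}

# Print a line for every stale page under a wiki directory.
flag_stale_pages() {
    wiki_dir=$1
    now=$(date +%s)
    find "$wiki_dir" -type f -name '*.md' | while IFS= read -r page; do
        # First "maturity:" line of the page, value only
        maturity=$(sed -n 's/^maturity:[[:space:]]*//p' "$page" | head -n 1)
        # Unix timestamp of the last commit that touched the page
        last=$(git -C "$(dirname "$page")" log -1 --format=%ct -- \
               "$(basename "$page")" 2>/dev/null)
        [ -n "$last" ] || continue   # skip pages that were never committed
        age_days=$(( (now - last) / 86400 ))
        if is_stale "$maturity" "$age_days"; then
            echo "STALE: $page ($maturity, $age_days days since last commit)"
        fi
    done
}
```

Running flag_stale_pages genome-dev/wiki only reports candidates; adding the callout and proposing re-validation remain the agent's job during the lint pass.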
Cross-Genome Lint — once a month, a manual session passes the aggregated index of all genomes to the agent to detect concept duplication and missing cross-references. No automated LLM controller in CI/CD — the cost in tokens and complexity is not justified at this scale.
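Preparing the aggregated index for that monthly session can be as simple as concatenating each genome's wiki/index.md under a header. This is a sketch only: aggregate_indexes is a hypothetical helper, and it assumes the genome-* naming shown in the architecture tree.

```shell
# Concatenate each genome's wiki/index.md under a header naming the genome.
# $1 = root directory that contains the genome-* submodules
aggregate_indexes() {
    root=$1
    for index in "$root"/genome-*/wiki/index.md; do
        [ -f "$index" ] || continue   # skip when the glob matches nothing
        genome=${index#"$root"/}      # strip the root prefix
        genome=${genome%%/*}          # keep only the genome directory name
        printf '===== %s =====\n' "$genome"
        cat "$index"
        printf '\n'
    done
}
```

For example, aggregate_indexes . > all-indexes.md from the master repository produces a single file that can be pasted into the lint session.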