Matteo Cherubini 7dcca2c807 Merge branch 'feature/wiki-structure-hardening' into develop		2026-05-10 11:26:45 +02:00
lib	docs: Strengthen git-crypt key export security warning	2026-05-10 10:10:58 +02:00
providers	refactor: Improve script robustness by returning from functions	2026-05-09 17:03:20 +02:00
scripts	refactor: Improve linting error accumulation and link checking	2026-05-10 10:11:14 +02:00
templates	chore: Update genome templates with new frontmatter fields	2026-05-09 11:34:26 +02:00
globals.env	refactor: Separate static and runtime configuration	2026-05-09 17:03:20 +02:00
Makefile	feat: Add 'doctor' and 'sync' Makefile targets	2026-05-09 17:03:20 +02:00
README.md	feat: Revamp README with new core philosophy and architecture	2026-05-08 22:10:30 +02:00
registry.sh	refactor: Improve script loading and project root resolution	2026-05-10 10:10:43 +02:00

Knowledge Genome System

A distributed, modular, and secure personal knowledge base — no vector database required.

The Knowledge Genome System implements the LLM Wiki pattern by Andrej Karpathy, extended with a multi-domain submodule architecture, git-crypt encryption for sensitive data, and a human-in-the-loop Git Flow for quality control.

Core Philosophy

Most RAG systems make the LLM rediscover knowledge from scratch on every query. This system is different: the LLM incrementally builds and maintains a persistent wiki that sits between you and the raw sources. Knowledge is compiled once and kept current — not re-derived on every question.

This means: no vector database, no embedding pipeline, no external retrieval server. The wiki/index.md of each genome is the retrieval layer. At moderate scale (~100 sources, hundreds of pages) this works better than RAG because cross-references, contradictions, and syntheses are already resolved — the LLM doesn't have to piece them together at query time.

If the wiki grows beyond what the index can navigate efficiently, the only recommended search extension is qmd — a local, on-device BM25 + vector search engine for markdown files with an MCP server interface. No external infrastructure required.
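Since the index is the retrieval layer, a lookup can be as plain as a text search over `wiki/index.md`. The sketch below is illustrative only: the helper name `genome_lookup` is hypothetical, and it assumes each page is referenced on a single line of the index, which the README does not specify.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): print the index
# entries of one genome that mention a term.
# Assumption: every page is referenced on one line of wiki/index.md.
genome_lookup() {
  local genome_dir="$1" term="$2"
  local index="$genome_dir/wiki/index.md"
  if [ ! -f "$index" ]; then
    echo "no index at $index" >&2
    return 1
  fi
  # -i: case-insensitive match, -F: treat the term as a literal string
  grep -iF -- "$term" "$index"
}

# Usage: genome_lookup genome-dev "angular"
```

In practice the agent reads the index itself; a helper like this is mostly useful for a quick manual check of what the index already covers.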

Architecture

master-knowledge-genome/          ← Root orchestrator
├── core-karpathy/                ← LLM Wiki reference pattern (read-only submodule)
├── genome-dev/                   ← Submodule: web dev, Angular, TUI
├── genome-finance/               ← Submodule: personal finance
├── genome-homelab/               ← Submodule: Keru infrastructure
└── AGENTS.md                     ← Global coordination schema

Each genome is an independent repository with this structure:

genome-{name}/
├── raw/
│   ├── articles/ transcripts/ code-packs/ assets/   ← Plaintext, open to collaborators
│   └── private/                                      ← AES-256-CTR encrypted (git-crypt)
├── wiki/
│   ├── index.md  log.md                              ← Navigation and audit trail
│   ├── sources/ entities/ concepts/ queries/         ← Agent-maintained knowledge
│   └── private/                                      ← AES-256-CTR encrypted (git-crypt)
└── AGENTS.md                                         ← Per-genome agent contract
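The layout above could be scaffolded roughly as follows. This is a sketch, not the repository's actual setup code: `scaffold_genome` is a hypothetical name, and the `.gitattributes` rules are the usual way to tell git-crypt which paths to encrypt, assumed here to match the two `private/` directories.

```shell
#!/usr/bin/env bash
# Hypothetical scaffold for one genome; directory names follow the tree above.
scaffold_genome() {
  local root="$1" dir
  for dir in raw/articles raw/transcripts raw/code-packs raw/assets raw/private \
             wiki/sources wiki/entities wiki/concepts wiki/queries wiki/private; do
    mkdir -p "$root/$dir"
  done
  touch "$root/wiki/index.md" "$root/wiki/log.md" "$root/AGENTS.md"
  # git-crypt selects the files it encrypts through .gitattributes filter rules.
  cat > "$root/.gitattributes" <<'EOF'
raw/private/** filter=git-crypt diff=git-crypt
wiki/private/** filter=git-crypt diff=git-crypt
EOF
}

# Usage: scaffold_genome genome-research
```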

Prerequisites

Required:

git
git-crypt
curl
jq

Optional:

bw (Bitwarden CLI) — for runtime key injection from Vaultwarden without writing keys to disk

Install on Ubuntu/Debian:

sudo apt update && sudo apt install -y git git-crypt curl jq
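To confirm the required tools are on the `PATH` before running setup, a check along these lines is enough. `check_deps` is a hypothetical helper shown for illustration; the dependency check that `make setup` performs may be implemented differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every tool from the argument list that is
# not found on PATH, and fail if any is missing.
check_deps() {
  local missing="" tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing required tools:$missing" >&2
    return 1
  fi
}

# Usage: check_deps git git-crypt curl jq
```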

Quick Start

# 1. Clone this setup repository
git clone <setup-repo-url> knowledge-genome-setup
cd knowledge-genome-setup

# 2. Export your Forgejo token
export FORGEJO_TOKEN="your_token_here"

# 3. Run full setup
make setup

make setup will:

Check all dependencies
Create the master and genome repositories on Forgejo
Scaffold the local directory structure with git-crypt active on private/
Install the pre-commit security hook in each genome
Export the symmetric git-crypt keys to keys/

Management Commands

| Command | Description |
|---|---|
| `make setup` | Full system initialisation (master + all genomes defined in `config.env`) |
| `make add-genome NAME=x DESC="y"` | Scaffold and register a new genome |
| `make lint` | Validate schema, privacy flags, and metadata across all genomes |
| `make status` | Show git submodule status and the first 10 entries of the git-crypt encryption status |
| `make help` | Show all available targets |

For example, to add a new genome:

make add-genome NAME=genome-research DESC="Academic papers, deep-dives, open research"

Security Model

Hybrid Privacy Architecture

Each genome has two layers:

| Layer | Directories | Access |
|---|---|---|
| Public | `raw/articles/`, `raw/transcripts/`, `wiki/sources/`, `wiki/concepts/` | Plaintext, safe for collaborators |
| Private | `raw/private/`, `wiki/private/` | AES-256-CTR via git-crypt, owner only |

On the remote (Forgejo), private files are opaque binary blobs. Collaborators without the key can contribute normally to public directories — git handles the encrypted files transparently with no errors.

Runtime Key Injection

Encryption keys are never stored as persistent files on the AI server. They are injected at session start via the Bitwarden CLI (bw) against your self-hosted Vaultwarden instance, using process substitution:

# The key is streamed through a pipe exposed as a file descriptor path; it is never written to disk
git-crypt unlock <(
  bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d
)

Use bw (standard Bitwarden CLI), not bws. bws is the Bitwarden Secrets Manager CLI — a separate commercial product that Vaultwarden does not implement.

Pre-commit Hook

A security hook is installed in every genome's .git/hooks/pre-commit. It inspects every staged file: if any file in raw/private/ or wiki/private/ is not encrypted by git-crypt, the commit is blocked with a clear error message explaining how to fix the issue.
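A minimal sketch of such a check is shown below. It is not the hook shipped by this repository. It relies on the fact that git-crypt output starts with the 10-byte header `\0GITCRYPT\0`, and inspects the staged blob rather than the working-tree file. The function names are hypothetical, and paths with unusual characters (which git quotes in its output) are not handled.

```shell
#!/usr/bin/env bash
# Succeeds if the staged blob for a path starts with the git-crypt header.
is_gitcrypt_blob() {
  local magic
  # The header is \0GITCRYPT\0; tr drops the NUL bytes so the value can be
  # compared as an ordinary shell string.
  magic=$(git cat-file blob ":$1" 2>/dev/null | head -c 10 | tr -d '\0')
  [ "$magic" = "GITCRYPT" ]
}

# Fails if any added, copied, or modified file staged under a private/
# directory is not encrypted.
check_staged_private() {
  local bad
  bad=$(git diff --cached --name-only --diff-filter=ACM -- raw/private wiki/private |
    while IFS= read -r path; do
      is_gitcrypt_blob "$path" || printf '%s\n' "$path"
    done)
  if [ -n "$bad" ]; then
    printf 'Unencrypted files staged in private/:\n%s\n' "$bad" >&2
    echo "Check that git-crypt is initialised and unlocked, then re-stage them." >&2
    return 1
  fi
}

# In .git/hooks/pre-commit:  check_staged_private || exit 1
```

Checking the staged blob matters because the working-tree copy of a private file is always plaintext while the genome is unlocked; only the content git is about to commit shows whether the git-crypt filter actually ran.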

Key Rotation

If a key is lost or compromised:

source lib/git-crypt.sh
cd ~/knowledge-genome-setup/genome-dev
gcrypt_rotate_key "genome-dev"

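The function decrypts all private files, generates a new key, re-encrypts, and prints instructions for updating Vaultwarden. Because the first step is decryption, run it from a working copy that is still unlocked; a key that is lost with no unlocked clone remaining cannot be rotated.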

Agent Interaction

At the start of every AI session, declare the privacy context explicitly:

PRIVATE_CONTEXT: disabled

The agent ignores all private/ directories. Outputs are safe to share.

PRIVATE_CONTEXT: enabled

The agent processes encrypted data. Requires the genome to be unlocked. All outputs referencing private data are prefixed with [PRIVATE DATA INCLUDED].
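One way to enforce the disabled mode mechanically is to build the agent's file list with `private/` pruned. The sketch below is an assumption on two counts: the README describes `PRIVATE_CONTEXT` as a session declaration, not an environment variable, and `list_agent_files` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the files an agent may read in one genome.
# Assumption: the privacy declaration is mirrored in the PRIVATE_CONTEXT
# environment variable; anything other than "enabled" is treated as disabled.
list_agent_files() {
  local genome_dir="$1" ctx="${PRIVATE_CONTEXT:-disabled}"
  if [ "$ctx" = "enabled" ]; then
    find "$genome_dir/raw" "$genome_dir/wiki" -type f
  else
    # Prune every directory named private so its contents are never listed.
    find "$genome_dir/raw" "$genome_dir/wiki" \
      -type d -name private -prune -o -type f -print
  fi
}

# Usage: PRIVATE_CONTEXT=disabled list_agent_files genome-dev
```

Defaulting to disabled means a missing or misspelled declaration fails closed instead of exposing private data.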

Knowledge Quality

The system includes three quality mechanisms drawn directly from the LLM Wiki pattern:

Conflict Resolution — when new evidence contradicts existing wiki content, the agent creates a wiki/queries/conflict-*.md node instead of silently overwriting. Human review required before merging.

Knowledge Decay — pages with maturity: stable not updated in 6 months, and maturity: draft pages not updated in 3 months, are flagged during lint passes with a ⚠️ STALE callout. The agent proposes re-validation but does not change maturity without new source evidence.
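The decay rule can be approximated with standard tools. The sketch below makes several assumptions: `maturity` appears in the frontmatter as a line such as `maturity: stable`, "not updated" is judged by file modification time (the README does not say how updates are tracked), and 6 and 3 months are rounded to 180 and 90 days. `find_stale_pages` is a hypothetical name, not the project's lint implementation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list wiki pages with a given maturity whose file
# modification time is older than a threshold.
find_stale_pages() {
  local wiki_dir="$1" maturity="$2" max_age_days="$3"
  # -mtime +N selects files last modified more than N full days ago;
  # grep -l then keeps only those carrying the requested maturity line.
  find "$wiki_dir" -type f -name '*.md' -mtime +"$max_age_days" \
    -exec grep -l "^maturity: $maturity\$" {} +
}

# Usage:
#   find_stale_pages genome-dev/wiki stable 180
#   find_stale_pages genome-dev/wiki draft 90
```

Modification time resets on a fresh clone, so a date field in the frontmatter or the last commit date of the file would be a sturdier signal if one is available.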

Cross-Genome Lint — once a month, a manual session passes the aggregated index of all genomes to the agent to detect concept duplication and missing cross-references. No automated LLM controller in CI/CD — the cost in tokens and complexity is not justified at this scale.
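The aggregated index for that monthly session can be assembled by concatenating each genome's `wiki/index.md` under a heading. This is a sketch only: `aggregate_indexes` is a hypothetical helper, and it assumes the genomes sit as `genome-*` directories under the master repository, as in the architecture tree.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every genome index under one root, each
# preceded by a heading with the genome's directory name.
aggregate_indexes() {
  local root="$1" index genome
  for index in "$root"/genome-*/wiki/index.md; do
    [ -f "$index" ] || continue   # skips the literal pattern when nothing matches
    genome=$(basename "$(dirname "$(dirname "$index")")")
    printf '\n## %s\n\n' "$genome"
    cat "$index"
  done
}

# Usage: aggregate_indexes ~/master-knowledge-genome > /tmp/all-indexes.md
```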