
# Knowledge Genome System

> A distributed, modular, and secure personal knowledge base — no vector database required.

The **Knowledge Genome System** implements the [LLM Wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
by Andrej Karpathy, extended with a multi-domain submodule architecture, git-crypt
encryption for sensitive data, and a human-in-the-loop Git Flow for quality control.

---

## Core Philosophy

Most RAG systems make the LLM rediscover knowledge from scratch on every query.
This system is different: the LLM **incrementally builds and maintains a persistent wiki**
that sits between you and the raw sources. Knowledge is compiled once and kept current —
not re-derived on every question.

**This means: no vector database, no embedding pipeline, no external retrieval server.**
The `wiki/index.md` of each genome is the retrieval layer. At moderate scale
(~100 sources, hundreds of pages) this tends to work better than query-time RAG because
cross-references, contradictions, and syntheses are already resolved in the wiki, so the
LLM doesn't have to piece them together on every query.
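
As a rough illustration of the index acting as the retrieval layer, the sketch below filters an index file for a term and prints the pages it links to. The entry format (`- [Title](path.md): summary`) and the function name `index_lookup` are assumptions made for this example; in practice the agent reads the index directly rather than grepping it.

```bash
# Hypothetical helper: print the link targets of index entries that mention a term.
# Assumes one entry per line in the form "- [Title](relative/path.md): summary".
index_lookup() {
  local term="$1" index_file="$2"
  # -F treats the term as a literal string, -i ignores case.
  grep -i -F -- "$term" "$index_file" \
    | sed -n 's/.*](\([^)]*\)).*/\1/p'
}

# Example: index_lookup "encryption" genome-dev/wiki/index.md
```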

If the wiki grows beyond what the index can navigate efficiently, the only recommended
search extension is [`qmd`](https://github.com/tobi/qmd) — a local, on-device
BM25 + vector search engine for markdown files with an MCP server interface.
No external infrastructure required.

---

## Architecture

```text
master-knowledge-genome/          ← Root orchestrator
├── core-karpathy/                ← LLM Wiki reference pattern (read-only submodule)
├── genome-dev/                   ← Submodule: web dev, Angular, TUI
├── genome-finance/               ← Submodule: personal finance
├── genome-homelab/               ← Submodule: Keru infrastructure
└── AGENTS.md                     ← Global coordination schema
```

Each genome is an independent repository with this structure:
```text
genome-{name}/
├── raw/
│   ├── articles/ transcripts/ code-packs/ assets/   ← Plaintext, open to collaborators
│   └── private/                                      ← AES-256-CTR encrypted (git-crypt)
├── wiki/
│   ├── index.md  log.md                              ← Navigation and audit trail
│   ├── sources/ entities/ concepts/ queries/         ← Agent-maintained knowledge
│   └── private/                                      ← AES-256-CTR encrypted (git-crypt)
└── AGENTS.md                                         ← Per-genome agent contract
```
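
The layout above can be scaffolded with a few shell commands. The sketch below is an assumption about what the setup scripts do, not a copy of them: `scaffold_genome` is a hypothetical name, and the `.gitattributes` rules use git-crypt's standard filter attributes to mark both `private/` trees for encryption.

```bash
# Hypothetical scaffold for one genome; the real setup scripts may differ.
scaffold_genome() {
  local root="$1" dir
  for dir in raw/articles raw/transcripts raw/code-packs raw/assets raw/private \
             wiki/sources wiki/entities wiki/concepts wiki/queries wiki/private; do
    mkdir -p "$root/$dir"
  done
  touch "$root/wiki/index.md" "$root/wiki/log.md" "$root/AGENTS.md"

  # git-crypt encrypts every path that carries the git-crypt filter attribute.
  printf '%s\n' \
    'raw/private/** filter=git-crypt diff=git-crypt' \
    'wiki/private/** filter=git-crypt diff=git-crypt' \
    > "$root/.gitattributes"
}
```

After `git init` and `git-crypt init` inside the new directory, files added under either `private/` directory are encrypted when they are staged.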

---

## Prerequisites

**Required:**
- `git`
- `git-crypt`
- `curl`
- `jq`

**Optional:**
- `bw` (Bitwarden CLI) — for runtime key injection from Vaultwarden without writing keys to disk

Install on Ubuntu/Debian:
```bash
sudo apt update && sudo apt install -y git git-crypt curl jq
```

---

## Quick Start

```bash
# 1. Clone this setup repository
git clone <setup-repo-url> knowledge-genome-setup
cd knowledge-genome-setup

# 2. Export your Forgejo token
export FORGEJO_TOKEN="your_token_here"

# 3. Run full setup
make setup
```

`make setup` will:
- Check all dependencies
- Create the master and genome repositories on Forgejo
- Scaffold the local directory structure with git-crypt active on `private/`
- Install the pre-commit security hook in each genome
- Export the symmetric git-crypt keys to `keys/`
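
The first two steps can be sketched as follows. This is a minimal illustration, not the project's actual Makefile logic: the function names and the `FORGEJO_URL` variable are assumptions, and creating the repository as private is a choice made for this example. The request uses the Forgejo REST endpoint for creating a repository owned by the authenticated user.

```bash
# Hypothetical helpers; names are illustrative and not taken from the repository.

# Report every missing command and return non-zero if any is absent.
check_deps() {
  local cmd missing=""
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
  done
  if [ -n "$missing" ]; then
    echo "Missing dependencies:$missing" >&2
    return 1
  fi
}

# Create a repository for the authenticated user via the Forgejo API.
# FORGEJO_URL (e.g. https://git.example.org) is an assumed variable.
create_forgejo_repo() {
  local name="$1" desc="$2" payload
  # jq builds the JSON body so that quotes in the description are escaped safely.
  payload=$(jq -n --arg name "$name" --arg desc "$desc" \
    '{name: $name, description: $desc, private: true}')
  curl --fail --silent --show-error \
    -X POST "$FORGEJO_URL/api/v1/user/repos" \
    -H "Authorization: token $FORGEJO_TOKEN" \
    -H "Content-Type: application/json" \
    -d "$payload"
}

# Example: check_deps git git-crypt curl jq && create_forgejo_repo genome-dev "Web dev notes"
```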

---

## Management Commands

| Command | Description |
|---------|-------------|
| `make setup` | Full system initialisation (master + all genomes defined in `config.env`) |
| `make add-genome NAME=x DESC="y"` | Scaffold and register a new genome |
| `make lint` | Validate schema, privacy flags, and metadata across all genomes |
| `make status` | Show git submodule status and the git-crypt encryption state of the first 10 files |
| `make help` | Show all available targets |

**Example of adding a new genome:**
```bash
make add-genome NAME=genome-research DESC="Academic papers, deep-dives, open research"
```

---

## Security Model

### Hybrid Privacy Architecture

Each genome has two layers:

| Layer | Directories | Access |
|-------|-------------|--------|
| Public | `raw/articles/`, `raw/transcripts/`, `wiki/sources/`, `wiki/concepts/` | Plaintext — safe for collaborators |
| Private | `raw/private/`, `wiki/private/` | AES-256-CTR via git-crypt — owner only |

On the remote (Forgejo), private files are opaque binary blobs.
Collaborators without the key can contribute normally to public directories
— git handles the encrypted files transparently with no errors.

### Runtime Key Injection

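Exported key files are never stored persistently on the AI server.
Keys are injected at session start via the Bitwarden CLI (`bw`) against
your self-hosted Vaultwarden instance, using process substitution:

```bash
# The exported key is streamed through a pipe (process substitution),
# so no standalone key file is written to disk
git-crypt unlock <(
  bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d
)
```

While a repository is unlocked, git-crypt keeps its own copy of the key under
`.git/git-crypt/keys/`. Run `git-crypt lock` at the end of the session to remove it
and return the working copy to its encrypted state.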

**Use `bw` (standard Bitwarden CLI), not `bws`.**
`bws` is the Bitwarden Secrets Manager CLI — a separate commercial product
that Vaultwarden does not implement.

### Pre-commit Hook

A security hook is installed in every genome's `.git/hooks/pre-commit`.
It inspects every staged file: if any file in `raw/private/` or `wiki/private/`
is not encrypted by git-crypt, the commit is blocked with a clear error message
explaining how to fix the issue.
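
A check of this kind can be sketched as below. It is an illustration under stated assumptions, not the installed hook: all function names and the error text are hypothetical. It relies on the fact that git-crypt output begins with the 10-byte header `\0GITCRYPT\0`, and that the staged blob (read with `git cat-file blob :path`) already contains the filtered, encrypted content.

```bash
# Hypothetical pre-commit sketch; the installed hook may be implemented differently.

# Hex form of the 10-byte git-crypt header "\0GITCRYPT\0".
GITCRYPT_HEADER_HEX="00474954435259505400"

# Succeeds if the data on stdin begins with the git-crypt header.
is_gitcrypt_blob() {
  local header
  header=$(head -c 10 | od -An -tx1 | tr -d ' \n')
  [ "$header" = "$GITCRYPT_HEADER_HEX" ]
}

# Succeeds for paths inside either private tree.
is_private_path() {
  case "$1" in
    raw/private/*|wiki/private/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints each staged private file whose staged content is still plaintext.
# Paths containing newlines or quoted by Git are not handled in this sketch.
find_plaintext_private_files() {
  git diff --cached --name-only --diff-filter=ACM | while IFS= read -r path; do
    is_private_path "$path" || continue
    git cat-file blob ":$path" | is_gitcrypt_blob || printf '%s\n' "$path"
  done
}

# Hook body: fail if any offending file was found.
run_hook() {
  local offenders
  offenders=$(find_plaintext_private_files)
  if [ -n "$offenders" ]; then
    echo "Commit blocked: these private files are staged unencrypted:" >&2
    echo "$offenders" >&2
    echo "Check .gitattributes and the git-crypt setup, then re-stage the files." >&2
    return 1
  fi
}
```

Installed as `.git/hooks/pre-commit`, such a script would end with `run_hook || exit 1`.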

### Key Rotation

If a key is lost or compromised:
```bash
source lib/git-crypt.sh
cd ~/knowledge-genome-setup/genome-dev
gcrypt_rotate_key "genome-dev"
```
The function decrypts all private files, generates a new key, re-encrypts,
and prints instructions for updating Vaultwarden.

---

## Agent Interaction

At the start of every AI session, declare the privacy context explicitly:

```text
PRIVATE_CONTEXT: disabled
```
The agent ignores all `private/` directories. Outputs are safe to share.

```text
PRIVATE_CONTEXT: enabled
```
The agent processes encrypted data. Requires the genome to be unlocked.
All outputs referencing private data are prefixed with `[PRIVATE DATA INCLUDED]`.
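
How the agent enforces the declaration is defined by each genome's `AGENTS.md`. As one possible illustration, a wrapper could decide which files are handed to the agent at all. The helper below is hypothetical (`list_readable_files` is not part of the project) and treats any value other than `enabled` as `disabled`, so a missing declaration fails safe.

```bash
# Hypothetical helper: list the markdown files an agent may read in a genome.
# context mirrors the PRIVATE_CONTEXT declaration: "enabled" or "disabled".
list_readable_files() {
  local genome_dir="$1" context="$2"
  if [ "$context" = "enabled" ]; then
    # Skip only Git internals.
    find "$genome_dir" -path '*/.git' -prune \
      -o -type f -name '*.md' -print | sort
  else
    # Also prune every private/ directory before descending into it.
    find "$genome_dir" \( -path '*/.git' -o -path '*/private' \) -prune \
      -o -type f -name '*.md' -print | sort
  fi
}

# Example: list_readable_files genome-finance disabled
```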

---

## Knowledge Quality

The system includes three quality mechanisms drawn directly from the LLM Wiki pattern:

**Conflict Resolution** — when new evidence contradicts existing wiki content,
the agent creates a `wiki/queries/conflict-*.md` node instead of silently overwriting.
Human review is required before merging.

**Knowledge Decay** — pages with `maturity: stable` not updated in 6 months,
and `maturity: draft` pages not updated in 3 months, are flagged during lint passes
with a `⚠️ STALE` callout. The agent proposes re-validation but does not change
maturity without new source evidence.
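
The decay thresholds can be expressed as a small check. The sketch below is an assumption about how a lint pass might apply them: the function names are hypothetical, dates are expected as `YYYY-MM-DD`, and where the last-updated date comes from (for example a front-matter field) is left open.

```bash
# Whole months elapsed between two YYYY-MM-DD dates, using only shell arithmetic.
months_between() {
  local from="$1" to="$2" fy fm fd ty tm td rest months
  fy=${from%%-*}; rest=${from#*-}; fm=${rest%%-*}; fd=${rest#*-}
  ty=${to%%-*};   rest=${to#*-};   tm=${rest%%-*}; td=${rest#*-}
  # Strip leading zeros so "08" is not parsed as an invalid octal number.
  fm=${fm#0}; fd=${fd#0}; tm=${tm#0}; td=${td#0}
  months=$(( ($ty - $fy) * 12 + ($tm - $fm) ))
  # The current month only counts once the day of month has been reached.
  if [ "$td" -lt "$fd" ]; then
    months=$(( $months - 1 ))
  fi
  echo "$months"
}

# Succeeds if a page with the given maturity is stale on the given day.
is_stale() {
  local maturity="$1" updated="$2" today="$3" limit
  case "$maturity" in
    stable) limit=6 ;;
    draft)  limit=3 ;;
    *)      return 1 ;;   # other maturity levels are not subject to decay here
  esac
  [ "$(months_between "$updated" "$today")" -ge "$limit" ]
}

# Example: is_stale stable 2024-01-15 "$(date +%F)" && echo "STALE"
```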

**Cross-Genome Lint** — once a month, a manual session passes the aggregated index
of all genomes to the agent to detect concept duplication and missing cross-references.
There is no automated LLM controller in CI/CD: the cost in tokens and complexity is not
justified at this scale.
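
Preparing the aggregated index for that monthly session could look like the sketch below. The function name `aggregate_indexes` and the `## <genome>` separator are assumptions for illustration; the only structure taken from this README is the `genome-*/wiki/index.md` layout under the master repository.

```bash
# Hypothetical helper: concatenate every genome's index into one document.
aggregate_indexes() {
  local master_dir="$1" index genome
  for index in "$master_dir"/genome-*/wiki/index.md; do
    [ -f "$index" ] || continue   # glob matched nothing, or the index is missing
    genome=${index#"$master_dir"/}
    genome=${genome%%/*}
    printf '## %s\n\n' "$genome"
    cat "$index"
    printf '\n'
  done
}

# Example: aggregate_indexes ~/master-knowledge-genome > /tmp/all-indexes.md
```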