feat: Revamp README with new core philosophy and architecture
parent a797fb2f10
commit 16a10decf3
1 changed files with 138 additions and 139 deletions
277
README.md
|
|
@@ -1,201 +1,200 @@
|
||||||
# Knowledge Genome System
|
# Knowledge Genome System
|
||||||
|
|
||||||
> A distributed, modular, and secure personal knowledge base architecture.
|
> A distributed, modular, and secure personal knowledge base — no vector database required.
|
||||||
|
|
||||||
The **Knowledge Genome System** is a framework designed to manage personal knowledge using a "Master-Genome" architecture. It follows the LLM-Wiki patterns (Karpathy-style) while adding a robust security layer for sensitive data and automated quality control.
|
The **Knowledge Genome System** implements the [LLM Wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
|
||||||
|
by Andrej Karpathy, extended with a multi-domain submodule architecture, git-crypt
|
||||||
|
encryption for sensitive data, and a human-in-the-loop Git Flow for quality control.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
# Architecture
|
## Core Philosophy
|
||||||
|
|
||||||
This project is structured as a **Master Orchestrator** that manages multiple independent **Genomes** via Git Submodules.
|
Most RAG systems make the LLM rediscover knowledge from scratch on every query.
|
||||||
|
This system is different: the LLM **incrementally builds and maintains a persistent wiki**
|
||||||
|
that sits between you and the raw sources. Knowledge is compiled once and kept current —
|
||||||
|
not re-derived on every question.
|
||||||
|
|
||||||
## Core Components
|
**This means: no vector database, no embedding pipeline, no external retrieval server.**
|
||||||
|
The `wiki/index.md` of each genome is the retrieval layer. At moderate scale
|
||||||
|
(~100 sources, hundreds of pages) this works better than RAG because cross-references,
|
||||||
|
contradictions, and syntheses are already resolved — the LLM doesn't have to piece
|
||||||
|
them together at query time.
|
||||||
|
|
||||||
### Master Repository
|
If the wiki grows beyond what the index can navigate efficiently, the only recommended
|
||||||
|
search extension is [`qmd`](https://github.com/tobi/qmd) — a local, on-device
|
||||||
Contains:
|
BM25 + vector search engine for markdown files with an MCP server interface.
|
||||||
|
No external infrastructure required.
|
||||||
* Orchestration scripts
|
|
||||||
* Global configuration (`config.env`)
|
|
||||||
* Security templates
|
|
||||||
|
|
||||||
### Genomes
|
|
||||||
|
|
||||||
Individual specialized repositories (e.g. `genome-dev`, `genome-finance`) that act as standalone units of knowledge.
|
|
||||||
|
|
||||||
### Security Layers
|
|
||||||
|
|
||||||
#### Physical Security
|
|
||||||
|
|
||||||
`git-crypt` encrypts `private/` directories at rest.
|
|
||||||
|
|
||||||
#### Logical Security
|
|
||||||
|
|
||||||
YAML frontmatter (`private: true`) prevents AI agents from leaking sensitive data during public sessions.
|
|
||||||
|
|
||||||
#### Validation Layer
|
|
||||||
|
|
||||||
A custom linting engine ensures metadata consistency.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
# Quick Start
|
## Architecture
|
||||||
|
|
||||||
|
```text
|
||||||
|
master-knowledge-genome/ ← Root orchestrator
|
||||||
|
├── core-karpathy/ ← LLM Wiki reference pattern (read-only submodule)
|
||||||
|
├── genome-dev/ ← Submodule: web dev, Angular, TUI
|
||||||
|
├── genome-finance/ ← Submodule: personal finance
|
||||||
|
├── genome-homelab/ ← Submodule: Keru infrastructure
|
||||||
|
└── AGENTS.md ← Global coordination schema
|
||||||
|
```
|
||||||
|
|
||||||
|
Each genome is an independent repository with this structure:
|
||||||
|
```text
|
||||||
|
genome-{name}/
|
||||||
|
├── raw/
|
||||||
|
│ ├── articles/ transcripts/ code-packs/ assets/ ← Plaintext, open to collaborators
|
||||||
|
│ └── private/ ← AES-256-CTR encrypted (git-crypt)
|
||||||
|
├── wiki/
|
||||||
|
│ ├── index.md log.md ← Navigation and audit trail
|
||||||
|
│ ├── sources/ entities/ concepts/ queries/ ← Agent-maintained knowledge
|
||||||
|
│ └── private/ ← AES-256-CTR encrypted (git-crypt)
|
||||||
|
└── AGENTS.md ← Per-genome agent contract
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Prerequisites
|
## Prerequisites
|
||||||
|
|
||||||
Required dependencies:
|
**Required:**
|
||||||
|
- `git`
|
||||||
|
- `git-crypt`
|
||||||
|
- `curl`
|
||||||
|
- `jq`
|
||||||
|
|
||||||
* `git`
|
**Optional:**
|
||||||
* `git-crypt`
|
- `bw` (Bitwarden CLI) — for runtime key injection from Vaultwarden without writing keys to disk
|
||||||
* `curl`
|
|
||||||
* `jq`
|
|
||||||
|
|
||||||
Optional:
|
Install on Ubuntu/Debian:
|
||||||
|
```bash
|
||||||
* `bw` (Bitwarden CLI) — used for runtime key injection
|
sudo apt update && sudo apt install -y git git-crypt curl jq
|
||||||
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Initialization
|
## Quick Start
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# 1. Clone the master repository
|
# 1. Clone this setup repository
|
||||||
git clone <master-repo-url> && cd master-knowledge-genome
|
git clone <setup-repo-url> knowledge-genome-setup
|
||||||
|
cd knowledge-genome-setup
|
||||||
|
|
||||||
# 2. Run the full setup
|
# 2. Export your Forgejo token
|
||||||
# (checks dependencies, creates master scaffold,
|
export FORGEJO_TOKEN="your_token_here"
|
||||||
# initializes genomes)
|
|
||||||
|
# 3. Run full setup
|
||||||
make setup
|
make setup
|
||||||
```
|
```
|
||||||
|
|
||||||
# Management Commands
|
`make setup` will:
|
||||||
|
- Check all dependencies
|
||||||
|
- Create the master and genome repositories on Forgejo
|
||||||
|
- Scaffold the local directory structure with git-crypt active on `private/`
|
||||||
|
- Install the pre-commit security hook in each genome
|
||||||
|
- Export the symmetric git-crypt keys to `keys/`
|
||||||
|
|
||||||
The system is controlled through a centralized Makefile.
|
---
|
||||||
|
|
||||||
| Command | Description |
|
## Management Commands
|
||||||
| ----------------- | -------------------------------------------------------------- |
|
|
||||||
| `make setup` | Full system initialization (Master + Registry Genomes). |
|
|
||||||
| `make add-genome` | Scaffolds and registers a new genome (requires NAME and DESC). |
|
|
||||||
| `make lint` | Runs the validation suite across all genomes. |
|
|
||||||
| `make status` | Checks Git status and encryption state for all submodules. |
|
|
||||||
|
|
||||||
# Validation & Linting (`make lint`)
|
| Command | Description |
|
||||||
|
|---------|-------------|
|
||||||
|
| `make setup` | Full system initialisation (master + all genomes defined in `config.env`) |
|
||||||
|
| `make add-genome NAME=x DESC="y"` | Scaffold and register a new genome |
|
||||||
|
| `make lint` | Validate schema, privacy flags, and metadata across all genomes |
|
||||||
|
| `make status` | Show git submodule status and the first 10 git-crypt encryption states |
|
||||||
|
| `make help` | Show all available targets |
|
||||||
|
|
||||||
The built-in linter ensures that the knowledge base remains machine-readable and secure.
|
**Adding a new genome example:**
|
||||||
|
```bash
|
||||||
It automatically validates:
|
make add-genome NAME=genome-research DESC="Academic papers, deep-dives, open research"
|
||||||
|
|
||||||
## Frontmatter Integrity
|
|
||||||
|
|
||||||
Every `.md` file must contain valid YAML headers.
|
|
||||||
|
|
||||||
## Domain Consistency
|
|
||||||
|
|
||||||
Ensures that a file's domain metadata matches its parent genome.
|
|
||||||
|
|
||||||
## Privacy Leak Detection
|
|
||||||
|
|
||||||
Critical validation step.
|
|
||||||
|
|
||||||
Verifies that any file located in a `/private/` directory contains the flag:
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
private: true
|
|
||||||
```
|
```
|
||||||
|
|
||||||
This prevents accidental exposure during AI sessions.
|
---
|
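The privacy-flag check that `make lint` performs could look roughly like the sketch below. It is an illustration only, not the project's actual linter: the function names `has_private_flag` and `lint_private_flags` are hypothetical, and the sketch assumes that frontmatter opens on the first line, is delimited by `---`, and carries `private: true` on a line of its own.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a privacy-flag lint; not the repository's real linter.

# Succeeds only if the leading YAML frontmatter of file $1 contains `private: true`.
has_private_flag() {
  awk '
    NR == 1 { if ($0 != "---") exit; next }   # frontmatter must open on line 1
    $0 == "---" { exit }                      # closing delimiter: stop scanning
    /^private:[ \t]*true[ \t]*$/ { found = 1; exit }
    END { exit (found ? 0 : 1) }
  ' "$1"
}

# Checks every Markdown file below a private/ directory under root $1.
# Prints offending paths to stderr and returns 1 if any file lacks the flag.
# Assumes file names do not contain newlines.
lint_private_flags() {
  local root="$1" offenders
  offenders=$(
    find "$root" -type f -name '*.md' -path '*/private/*' |
    while IFS= read -r file; do
      has_private_flag "$file" || printf '%s\n' "$file"
    done
  )
  if [ -n "$offenders" ]; then
    printf 'Missing private: true flag:\n%s\n' "$offenders" >&2
    return 1
  fi
}
```

A call such as `lint_private_flags genome-dev` would then fail as soon as one file under `raw/private/` or `wiki/private/` is missing the flag.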
||||||
|
|
||||||
## Broken Wiki-Links
|
## Security Model
|
||||||
|
|
||||||
Detects dead `[[internal-links]]`.
|
### Hybrid Privacy Architecture
|
||||||
|
|
||||||
# Security Model
|
Each genome has two layers:
|
||||||
|
|
||||||
## Hybrid Privacy Architecture
|
| Layer | Directories | Access |
|
||||||
|
|-------|-------------|--------|
|
||||||
|
| Public | `raw/articles/`, `raw/transcripts/`, `wiki/sources/`, `wiki/concepts/` | Plaintext — safe for collaborators |
|
||||||
|
| Private | `raw/private/`, `wiki/private/` | AES-256-CTR via git-crypt — owner only |
|
||||||
|
|
||||||
Each genome is divided into two layers.
|
On the remote (Forgejo), private files are opaque binary blobs.
|
||||||
|
Collaborators without the key can contribute normally to public directories
|
||||||
|
— git handles the encrypted files transparently with no errors.
|
||||||
|
|
||||||
### Public Layer
|
### Runtime Key Injection
|
||||||
|
|
||||||
Directories:
|
Encryption keys are never stored as persistent files on the AI server.
|
||||||
|
They are injected at session start via the Bitwarden CLI (`bw`) against
|
||||||
```text
|
your self-hosted Vaultwarden instance, using process substitution:
|
||||||
raw/public/
|
|
||||||
wiki/public/
|
|
||||||
```
|
|
||||||
|
|
||||||
Characteristics:
|
|
||||||
|
|
||||||
* Plaintext
|
|
||||||
* Shareable with collaborators
|
|
||||||
|
|
||||||
### Private Layer
|
|
||||||
|
|
||||||
Directories:
|
|
||||||
|
|
||||||
```text
|
|
||||||
raw/private/
|
|
||||||
wiki/private/
|
|
||||||
```
|
|
||||||
|
|
||||||
Characteristics:
|
|
||||||
|
|
||||||
* Encrypted using AES-256 via `git-crypt`
|
|
||||||
|
|
||||||
## Runtime Key Injection
|
|
||||||
|
|
||||||
To keep the AI environment secure, encryption keys are never stored on the VM disk.
|
|
||||||
|
|
||||||
Instead, the system uses Bitwarden (`bw`) / Vaultwarden for runtime injection.
|
|
||||||
|
|
||||||
### Example
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Unlock a genome using a key stored in Vaultwarden
|
# Key lives only in a kernel file descriptor — never touches disk
|
||||||
git-crypt unlock <(
|
git-crypt unlock <(
|
||||||
bw get notes "genome-dev key" \
|
bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d
|
||||||
--session "$BW_SESSION" | base64 -d
|
|
||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
# Genome Schema
|
**Use `bw` (standard Bitwarden CLI), not `bws`.**
|
||||||
|
`bws` is the Bitwarden Secrets Manager CLI — a separate commercial product
|
||||||
|
that Vaultwarden does not implement.
|
||||||
|
|
||||||
All wiki documents follow a strict schema to support AI ingestion.
|
### Pre-commit Hook
|
||||||
|
|
||||||
## YAML Frontmatter Schema
|
A security hook is installed in every genome's `.git/hooks/pre-commit`.
|
||||||
|
It inspects every staged file: if any file in `raw/private/` or `wiki/private/`
|
||||||
|
is not encrypted by git-crypt, the commit is blocked with a clear error message
|
||||||
|
explaining how to fix the issue.
|
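As a rough illustration of the hook described above, the sketch below inspects the staged blobs instead of the working tree. It relies on the fact that git-crypt output starts with the 10-byte header `\0GITCRYPT\0`. The function names `is_gitcrypt_blob` and `check_staged_private_files` are hypothetical, and the hook that `make setup` actually installs may work differently.

```bash
#!/usr/bin/env bash
# Hypothetical pre-commit sketch; the installed hook may differ.

# Succeeds if stdin starts with the git-crypt magic header "\0GITCRYPT\0".
is_gitcrypt_blob() {
  local header
  # Hex-dump the first 10 bytes so the NUL bytes survive shell handling
  header=$(head -c 10 | od -An -tx1 | tr -d ' \n')
  [ "$header" = "00474954435259505400" ]
}

# Fails if any staged file under raw/private/ or wiki/private/ is plaintext.
# Assumes file names do not contain newlines.
check_staged_private_files() {
  local offenders
  offenders=$(
    git -c core.quotePath=false diff --cached --name-only --diff-filter=ACM |
    while IFS= read -r path; do
      case "$path" in
        raw/private/*|wiki/private/*)
          # ":path" names the staged (index) version of the file
          git cat-file blob ":$path" | is_gitcrypt_blob || printf '%s\n' "$path"
          ;;
      esac
    done
  )
  if [ -n "$offenders" ]; then
    printf 'Commit blocked, unencrypted private files staged:\n%s\n' "$offenders" >&2
    printf 'Hint: run git-crypt status and check .gitattributes.\n' >&2
    return 1
  fi
}
```

Calling `check_staged_private_files || exit 1` from `.git/hooks/pre-commit` would abort the commit whenever a private file is staged without the git-crypt filter applied.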
||||||
|
|
||||||
```yaml
|
### Key Rotation
|
||||||
---
|
|
||||||
title: "Document Title"
|
If a key is lost or compromised:
|
||||||
type: entity | concept | source | log
|
```bash
|
||||||
domain: genome-name
|
source lib/git-crypt.sh
|
||||||
private: true/false
|
cd ~/knowledge-genome-setup/genome-dev
|
||||||
last_updated: YYYY-MM-DD
|
gcrypt_rotate_key "genome-dev"
|
||||||
---
|
|
||||||
```
|
```
|
||||||
|
The function decrypts all private files, generates a new key, re-encrypts,
|
||||||
|
and prints instructions for updating Vaultwarden.
|
||||||
|
|
||||||
# Agent Interaction
|
---
|
||||||
|
|
||||||
When starting a session with an AI agent, always declare the privacy context.
|
## Agent Interaction
|
||||||
|
|
||||||
## Public Context
|
At the start of every AI session, declare the privacy context explicitly:
|
||||||
|
|
||||||
```text
|
```text
|
||||||
PRIVATE_CONTEXT: disabled
|
PRIVATE_CONTEXT: disabled
|
||||||
```
|
```
|
||||||
|
The agent ignores all `private/` directories. Outputs are safe to share.
|
||||||
Behavior:
|
|
||||||
|
|
||||||
* The agent ignores all private folders.
|
|
||||||
|
|
||||||
## Private Context
|
|
||||||
|
|
||||||
```text
|
```text
|
||||||
PRIVATE_CONTEXT: enabled
|
PRIVATE_CONTEXT: enabled
|
||||||
```
|
```
|
||||||
|
The agent processes encrypted data. Requires the genome to be unlocked.
|
||||||
|
All outputs referencing private data are prefixed with `[PRIVATE DATA INCLUDED]`.
|
||||||
|
|
||||||
Behavior:
|
---
|
||||||
|
|
||||||
* The agent processes encrypted data.
|
## Knowledge Quality
|
||||||
* Requires the repository to be unlocked.
|
|
||||||
|
The system includes three quality mechanisms drawn directly from the LLM Wiki pattern:
|
||||||
|
|
||||||
|
**Conflict Resolution** — when new evidence contradicts existing wiki content,
|
||||||
|
the agent creates a `wiki/queries/conflict-*.md` node instead of silently overwriting.
|
||||||
|
Human review required before merging.
|
||||||
|
|
||||||
|
**Knowledge Decay** — pages with `maturity: stable` not updated in 6 months,
|
||||||
|
and `maturity: draft` pages not updated in 3 months, are flagged during lint passes
|
||||||
|
with a `⚠️ STALE` callout. The agent proposes re-validation but does not change
|
||||||
|
maturity without new source evidence.
|
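The decay thresholds above can be expressed as a small date check. This is a sketch under stated assumptions: it presumes each page still carries a `last_updated: YYYY-MM-DD` frontmatter field next to `maturity`, it treats the 6-month and 3-month limits as inclusive, and the helper names `months_between` and `is_stale` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical staleness check; thresholds taken from the text above.

# Prints the number of completed months between two YYYY-MM-DD dates.
months_between() {
  local from="$1" to="$2" fy fm fd ty tm td months
  fy=${from%%-*}; fm=${from#*-}; fm=${fm%%-*}; fd=${from##*-}
  ty=${to%%-*};   tm=${to#*-};   tm=${tm%%-*}; td=${to##*-}
  # Strip a leading zero so "08" and "09" are not parsed as invalid octal
  fm=${fm#0}; fd=${fd#0}; tm=${tm#0}; td=${td#0}
  months=$(( (ty - fy) * 12 + (tm - fm) ))
  # The current month only counts once the day of month has been reached
  if [ "$td" -lt "$fd" ]; then months=$(( months - 1 )); fi
  echo "$months"
}

# is_stale MATURITY LAST_UPDATED TODAY
# Succeeds when a stable page is 6+ months old or a draft page is 3+ months old.
is_stale() {
  local maturity="$1" last_updated="$2" today="$3" limit
  case "$maturity" in
    stable) limit=6 ;;
    draft)  limit=3 ;;
    *)      return 1 ;;   # other maturity values are never flagged here
  esac
  [ "$(months_between "$last_updated" "$today")" -ge "$limit" ]
}
```

A lint pass could call `is_stale stable 2024-01-15 "$(date +%F)"` for each page and add the `⚠️ STALE` callout when it succeeds.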
||||||
|
|
||||||
|
**Cross-Genome Lint** — once a month, a manual session passes the aggregated index
|
||||||
|
of all genomes to the agent to detect concept duplication and missing cross-references.
|
||||||
|
No automated LLM controller in CI/CD — the cost in tokens and complexity is not
|
||||||
|
justified at this scale.
|
||||||
|
|
|
||||||