# Knowledge Genome System

> A distributed, encrypted, multi-domain personal knowledge base.
> No vector database. No embedding pipeline. No external retrieval server.

Built on the [LLM Wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
by Andrej Karpathy — extended with a multi-domain submodule architecture,
AES-256-CTR encryption via git-crypt, Vaultwarden runtime key injection,
and a human-in-the-loop Git Flow for quality control.

---

## Table of Contents

1. [Core Philosophy](#core-philosophy)
2. [Architecture](#architecture)
3. [System Requirements](#system-requirements)
4. [Prerequisites](#prerequisites)
5. [Configuration](#configuration)
6. [Quick Start](#quick-start)
7. [Makefile Reference](#makefile-reference)
8. [Genome Lifecycle](#genome-lifecycle)
9. [Security Model](#security-model)
10. [Key Management](#key-management)
11. [Agent Sessions](#agent-sessions)
12. [Workflows](#workflows)
13. [Knowledge Quality](#knowledge-quality)
14. [Knowledge Schema](#knowledge-schema)
15. [Collaboration Model](#collaboration-model)
16. [Optional Extensions](#optional-extensions)
17. [Troubleshooting](#troubleshooting)

---

## Core Philosophy

Most RAG systems make the LLM rediscover knowledge from scratch on every query.
A document is indexed; at query time, relevant chunks are retrieved; an answer is generated.
Nothing accumulates. Ask a question requiring synthesis across five documents and the LLM
pieces it together from fragments every single time.

This system is different. Instead of retrieval at query time, the LLM
**incrementally builds and maintains a persistent wiki** that sits between you and the raw
sources. When a new source arrives, the LLM reads it, extracts key information, updates
entity and concept pages, flags contradictions with existing claims, and strengthens the
evolving synthesis. Knowledge is compiled once and kept current.

**The wiki is a compounding artifact.** Cross-references are already there.
Contradictions have been flagged. The synthesis already reflects everything ingested.

This means:
- No vector database.
- No embedding pipeline.
- No external retrieval infrastructure.

The `wiki/index.md` of each genome is the retrieval layer. At moderate scale
(~100 sources, hundreds of pages) this performs better than RAG because cross-references,
contradictions, and syntheses are already resolved — not re-derived per query.

The human's job: curate sources, direct analysis, ask good questions, review PRs.
The LLM's job: everything else — summarising, cross-referencing, filing, maintaining consistency.

---

## Architecture

### Repository structure

```text
master-knowledge-genome/        ← Root orchestrator (submodule registry)
├── core-karpathy/              ← LLM Wiki reference pattern (read-only submodule)
├── genome-dev/                 ← Submodule: web development, Angular, TUI
├── genome-finance/             ← Submodule: personal finance, investments
├── genome-homelab/             ← Submodule: Keru infrastructure, network configs
└── AGENTS.md                   ← Global coordination schema (cross-genome rules)
```

Each genome is an independent git repository:

```text
genome-{name}/
├── .gitattributes              ← Encryption rules — **/private/** wildcard
├── .gitignore
├── .git/hooks/pre-commit       ← Security hook (dynamic git check-attr)
├── AGENTS.md                   ← Per-genome agent contract and workflow rules
│
├── raw/                        ← Immutable sources — LLM reads, never writes
│   ├── articles/               ← Web clips, saved articles
│   ├── transcripts/            ← Audio/video transcripts
│   ├── code-packs/             ← Code snippets and repositories
│   ├── assets/                 ← Images, PDFs, binary files
│   └── private/                ← AES-256-CTR encrypted — owner only
│
└── wiki/                       ← LLM-owned — agent creates and maintains
    ├── index.md                ← Primary catalog (read first every session)
    ├── log.md                  ← Append-only operations ledger
    ├── sources/                ← One page per processed raw source
    ├── entities/               ← People, tools, organisations, projects
    ├── concepts/               ← Patterns, theories, architectural decisions
    ├── queries/                ← Preserved answers and conflict notes
    └── private/                ← AES-256-CTR encrypted — owner only
```

### Three layers

| Layer | Path | Owner | Rule |
|-------|------|-------|------|
| Raw sources | `raw/` | Human | Immutable. LLM reads only. Never modified. |
| Wiki | `wiki/` | LLM | Agent creates, updates, cross-links, maintains. |
| Schema | `AGENTS.md` | Human + LLM | Co-evolved contract defining structure and workflows. |

### Framework structure

```text
knowledge-genome-setup/         ← This repository (setup tooling)
├── globals.env                 ← Static KEY=VALUE config (Make-includable)
├── registry.sh                 ← Bash-only: GENOMES array + dynamic paths
├── Makefile                    ← Entry point for all operations
├── lib/
│   ├── output.sh               ← Terminal helpers (colors, log levels)
│   ├── deps.sh                 ← Dependency validation
│   ├── scaffold.sh             ← Template rendering engine
│   ├── lint.sh                 ← Per-file validation functions
│   └── git-crypt.sh            ← git-crypt lifecycle (init, export, verify, rotate)
├── providers/
│   ├── forgejo.sh              ← Forgejo REST API provider
│   └── github.sh               ← GitHub REST API provider
├── scripts/
│   ├── setup.sh                ← Main entry point
│   ├── setup-master.sh         ← Master repo initialisation
│   ├── setup-genomes.sh        ← Genome provisioning loop
│   ├── add-genome.sh           ← Add a single new genome
│   └── lint-genomes.sh         ← Quality control across all genomes
└── templates/
    ├── agents-genome.md        ← Per-genome agent contract template
    ├── agents-master.md        ← Master coordination schema template
    ├── wiki-index.md           ← Index template (rendered per genome)
    ├── wiki-log.md             ← Log template (rendered per genome)
    ├── pr-description.md       ← PR review checklist template
    ├── pre-commit.sh           ← Security hook template
    ├── gitattributes           ← Git encryption rules template
    └── gitignore               ← Git ignore template
```

---

## System Requirements

### Linux — full support (primary target)

All scripts are written for GNU bash on Linux. Tested on Ubuntu 22.04+.
All tools (git-crypt, bw, qmd) have native Linux binaries.

### macOS — full support

All scripts are compatible with macOS. Requirements:
- bash 3.2+ (macOS default) — fully supported. All `bash 4+` constructs removed.
- GNU coreutils not required — BSD variants of `date`, `grep`, `sed` all handled.
- `git-crypt`: install via Homebrew — `brew install git-crypt`
- `jq`, `curl`: pre-installed or via Homebrew

If you use Homebrew bash (`brew install bash`), the scripts work identically to Linux.

### Windows — WSL2 only

**Git Bash and native Windows are not supported.**

Reasons:
- `git-crypt` has no native Windows binary.
- Process substitution `<(...)` used for runtime key injection is not available
  in Git Bash or PowerShell.
- Several bash features used throughout (`compgen`, `BASH_SOURCE`, arrays) are
  bash-specific and do not exist in PowerShell or `cmd.exe`.

**WSL2 (Windows Subsystem for Linux)** with Ubuntu gives full compatibility.
All setup and runtime operations work identically to native Linux inside WSL2.

### Hardware recommendations

The system is designed for a homelab architecture:

| Component | Recommended | Role |
|-----------|-------------|------|
| Storage node | Any Linux server with NFS | Hosts Forgejo, stores genome repos |
| AI compute node | GPU server (16GB+ VRAM) | Runs local LLM agent sessions |
| VRAM | 16GB minimum | 14B model at Q5_K_M ≈ 10GB weights; ~6GB for KV cache |
| Local LLM | 14B–32B quantised | Active wiki maintenance sessions |
| Large LLM | 70B (async) | Deep reflection, complex synthesis (scheduled, not interactive) |

> **On VRAM constraints:** with a 16GB card and a 14B model, the KV cache budget
> is ~6GB — approximately 32k tokens of effective context. Every token in `AGENTS.md`,
> the index, and the log tail is a cost. This is why all agent files are token-optimised
> and sessions are kept to one source at a time.

---

## Prerequisites

### Required

| Tool | Purpose |
|------|---------|
| `git` | Version control |
| `git-crypt` | Transparent file encryption |
| `curl` | REST API calls to Forgejo/GitHub |
| `jq` | JSON parsing |

### Optional

| Tool | Purpose |
|------|---------|
| `bw` | Bitwarden CLI — runtime key injection from Vaultwarden (no key on disk) |
| `qmd` | Local BM25 + vector search for Markdown files with MCP server interface |

> **`bw` vs `bws`:** Use `bw` (standard Bitwarden CLI). `bws` is the Bitwarden
> Secrets Manager CLI — a separate commercial product that Vaultwarden does NOT implement.

### Install on Ubuntu/Debian

```bash
sudo apt update && sudo apt install -y git git-crypt curl jq
```

### Install on macOS

```bash
brew install git git-crypt curl jq
```

### Install Bitwarden CLI

```bash
# Linux
npm install -g @bitwarden/cli

# macOS
brew install bitwarden-cli
```

### Verify all tools

```bash
make doctor
```

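A check of the kind `make doctor` performs can be sketched as follows. This is only an illustration under assumptions: `check_deps` is a hypothetical helper name, and the actual logic in `lib/deps.sh` may differ.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a dependency check; not the actual lib/deps.sh.

# Prints every missing tool to stderr and returns non-zero if any is absent.
check_deps() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: required tools fail the check, optional tools only warn.
#   check_deps git git-crypt curl jq || exit 1
#   check_deps bw || echo "warning: bw not found (no runtime key injection)"
```
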
---

## Configuration

Configuration is split into two files with distinct purposes:

### `globals.env` — static KEY=VALUE

Safe for Make's `include` directive, `docker-compose`, shell `source`, and any standard env parser.
Contains only simple scalar values — no bash syntax, no arrays.

```bash
# Provider selection
PROVIDER=forgejo                    # forgejo | github

# Forgejo (active when PROVIDER=forgejo)
FORGEJO_URL=https://git.yourserver.com
FORGEJO_USER=yourusername
FORGEJO_SSH_PORT=222                # Default for many homelab Forgejo setups; 22 for standard

# GitHub (active when PROVIDER=github — uncomment to use)
# GITHUB_USER=your-username
# GITHUB_ORG=your-org               # Optional: for org repos, overrides GITHUB_USER

# Vaultwarden
VAULTWARDEN_URL=https://vault.yourserver.com

# Master repository
MASTER_REPO=master-knowledge-genome
GIST_URL=https://gist.github.com/442a6bf555914893e9891c11519de94f.git
```

### `registry.sh` — bash runtime config

Sourced by shell scripts only. Contains the genome registry array and dynamic path
resolution. Never included by Make.

```bash
# Dynamic paths (resolved at source time)
WORK_DIR="${HOME}/knowledge-genome-setup"
KEYS_DIR="${WORK_DIR}/keys"

# Genome registry — format: "name|description"
GENOMES=(
  "genome-dev|Web development, TUI, Angular, software architecture"
  "genome-finance|Personal finance, investments, market analysis"
  "genome-homelab|Infrastructure, network configs, architecture logs"
)
```

To add a genome to the registry before running setup, append a line to `GENOMES`.
After initial setup, use `make add-genome` instead.

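The `"name|description"` entries can be split with plain parameter expansion. The sketch below is illustrative only; the helper names `genome_name`, `genome_desc`, and `list_genomes` are assumptions and not part of the framework, and the array is shortened for the example.

```bash
# Sketch: iterate over "name|description" entries like those in registry.sh.
GENOMES=(
  "genome-dev|Web development, TUI, Angular, software architecture"
  "genome-finance|Personal finance, investments, market analysis"
)

# Hypothetical helpers: split one registry entry at the first "|".
genome_name() { printf '%s\n' "${1%%|*}"; }
genome_desc() { printf '%s\n' "${1#*|}"; }

# Print "name: description" for every registered genome.
list_genomes() {
  local entry
  for entry in "${GENOMES[@]}"; do
    printf '%s: %s\n' "$(genome_name "$entry")" "$(genome_desc "$entry")"
  done
}
```

Splitting at the first `|` keeps any later `|` inside the description, and the expansions used here work on bash 3.2.
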
### Tokens

Tokens are never stored in config files. Export them in your shell before running setup:

```bash
export FORGEJO_TOKEN="your_forgejo_token"
# or
export GITHUB_TOKEN="your_github_token"
```

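A script can fail early when the token for the selected provider has not been exported. The guard below is a minimal sketch; `require_token` is a hypothetical helper, and whether the setup scripts perform an equivalent check is an assumption.

```bash
# Sketch: fail early when the token for the selected provider is not exported.
# PROVIDER comes from globals.env; require_token is a hypothetical helper.
require_token() {
  local provider=$1 var
  case "$provider" in
    forgejo) var=FORGEJO_TOKEN ;;
    github)  var=GITHUB_TOKEN ;;
    *) echo "unknown provider: $provider" >&2; return 2 ;;
  esac
  # ${!var} is indirect expansion: the value of the variable named by $var.
  if [ -z "${!var:-}" ]; then
    echo "error: $var is not set; export it before running setup" >&2
    return 1
  fi
}

# Usage:
#   require_token "$PROVIDER" || exit 1
```
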
---

## Quick Start

```bash
# 1. Clone the setup framework
git clone <setup-repo-url> knowledge-genome-setup
cd knowledge-genome-setup

# 2. Configure your environment
cp globals.env.example globals.env   # edit with your values
# Edit registry.sh to define your genomes

# 3. Export your provider token
export FORGEJO_TOKEN="your_token_here"

# 4. Verify dependencies
make doctor

# 5. Run full setup
make setup
```

`make setup` executes in order:

1. **Dependency check** — verifies all required tools are installed
2. **Git identity check** — warns if `user.name` / `user.email` are not configured
3. **Master repo** — creates `master-knowledge-genome` on Forgejo, scaffolds with
   `AGENTS.md` and `README.md`, initialises git, adds `core-karpathy` as submodule, pushes
4. **Genome provisioning** — for each genome in `registry.sh`:
   - Creates remote repository on Forgejo
   - Adds it as a submodule in the master repo
   - Initialises git-crypt (**before any files are created**)
   - Scaffolds directory structure and renders all templates
   - Installs pre-commit security hook
   - Commits, pushes genome to remote
   - Exports symmetric key to `keys/<genome>.key`
   - Prints Vaultwarden upload instructions
   - Commits submodule pointer in master repo

After setup completes:
- Upload all files in `keys/` to Vaultwarden (see Key Management)
- Delete key files from disk: `rm keys/*.key`

---

## Makefile Reference

| Target | Description |
|--------|-------------|
| `make setup` | Full system initialisation — master repo + all genomes in `registry.sh` |
| `make add-genome NAME=x DESC="y"` | Scaffold and register a single new genome |
| `make lint` | Run quality checks across all genomes (schema, privacy, decay, page size) |
| `make status` | Show submodule status and first 10 git-crypt encryption states |
| `make lock` | Lock all encrypted repos (master + all genome submodules) |
| `make doctor` | Verify required tools: git, git-crypt, curl, jq; warn if bw missing |
| `make sync` | `git submodule update --init --recursive` + report unpushed commits per genome |
| `make help` | Print all available targets |

### Examples

```bash
# Check system health
make doctor

# Add a new genome after initial setup
make add-genome NAME=genome-research DESC="Academic papers and deep research"

# Run full lint pass (bash deterministic checks)
make lint

# Sync all nodes after pulling on another machine
make sync

# Emergency lock — secures all repos before leaving a session
make lock
```

---

## Genome Lifecycle

### Initial setup

All genomes defined in `registry.sh` are provisioned by `make setup`.

### Adding a genome after initial setup

```bash
make add-genome NAME=genome-newname DESC="Domain description"
```

This creates the remote repo, adds it as a submodule, initialises git-crypt,
scaffolds the directory structure, installs the pre-commit hook, commits and pushes,
exports the key, and commits the submodule pointer in master.

After adding: upload the new key to Vaultwarden and delete the key file.

### Removing a genome

Manual process:

```bash
# In master repo
git submodule deinit genome-name
git rm genome-name
rm -rf .git/modules/genome-name   # drop the cached copy of the submodule repository
git commit -m "chore: remove genome-name submodule"
git push
# Archive or delete the remote repository on Forgejo
```

### Template rendering

When a genome is scaffolded, `render_template` replaces these placeholders in all
template files:

| Placeholder | Source | Example |
|-------------|--------|---------|
| `{{GENOME_NAME}}` | registry.sh | `genome-dev` |
| `{{GENOME_NAME_UPPER}}` | derived | `GENOME-DEV` |
| `{{GENOME_DESC}}` | registry.sh | `Web development...` |
| `{{FORGEJO_URL}}` | globals.env | `https://git.yourserver.com` |
| `{{FORGEJO_USER}}` | globals.env | `yourusername` |
| `{{VAULTWARDEN_URL}}` | globals.env | `https://vault.yourserver.com` |
| `{{MASTER_REPO}}` | globals.env | `master-knowledge-genome` |
| `{{DATE}}` | runtime | `2026-05-11` |

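Placeholder substitution of this kind can be sketched with `sed`. The function below shares its name with the framework's `render_template` but is an independent illustration: its argument order, the `escape_sed` helper, and the choice to cover only four of the placeholders are all assumptions, and the real implementation may work differently.

```bash
# Sketch of placeholder substitution; the framework's own render_template
# may work differently. Arguments: template file, genome name, description.

# Escape characters that are special in a sed replacement ("\", "&")
# plus the "|" used as the delimiter below.
escape_sed() { printf '%s' "$1" | sed -e 's/[\\&|]/\\&/g'; }

render_template() {
  local template=$1 name=$2 desc=$3 upper
  # Derive the upper-case name portably (no bash 4 "${var^^}").
  upper=$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]')
  sed \
    -e "s|{{GENOME_NAME_UPPER}}|$(escape_sed "$upper")|g" \
    -e "s|{{GENOME_NAME}}|$(escape_sed "$name")|g" \
    -e "s|{{GENOME_DESC}}|$(escape_sed "$desc")|g" \
    -e "s|{{DATE}}|$(date +%Y-%m-%d)|g" \
    "$template"
}

# Usage (paths are illustrative):
#   render_template templates/wiki-index.md genome-dev "Web development" > wiki/index.md
```
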
---

## Security Model

### Encryption architecture

Each genome uses a unique symmetric AES-256-CTR key managed by git-crypt.
Two directories in every genome are always encrypted:

| Directory | Contents | On remote |
|-----------|----------|-----------|
| `raw/private/` | Sensitive source material | Opaque binary blob |
| `wiki/private/` | Private synthesis and notes | Opaque binary blob |

All other directories (`raw/articles/`, `wiki/sources/`, etc.) are plaintext.
Collaborators without the key can contribute to public directories normally;
for them, files under `private/` simply appear as opaque binary blobs.

### `.gitattributes` — dynamic encryption rules

Encryption rules use a glob wildcard that catches any `private/` directory at
any depth in the repository — including directories created at runtime by the LLM:

```gitattributes
# Text rules first
*.md text eol=lf
*.sh text eol=lf

# Encryption rules LAST (later rules override per-attribute)
# **/private/** ensures -text overrides "*.md text eol=lf", preventing EOL corruption
**/private/** filter=git-crypt diff=git-crypt -text
```

> Rule ordering matters: in `.gitattributes`, the last matching rule wins per attribute.
> Encryption rules must come after text rules so `-text` overrides `text eol=lf`
> for encrypted markdown files.

### Pre-commit hook — dynamic validation

The security hook installed at `.git/hooks/pre-commit` validates every staged file
dynamically — it reads encryption requirements from `.gitattributes` at runtime
rather than checking hardcoded paths:

```bash
# For each staged file, check whether git-crypt encryption is required
while IFS= read -r file; do
  filter=$(git check-attr filter -- "$file" | sed 's/.*: //')
  if [[ "$filter" == "git-crypt" ]]; then
    # Verify the file is actually encrypted. The match is case-insensitive
    # because git-crypt may print its plaintext warning in upper case.
    if git-crypt status "$file" | grep -qi "not encrypted"; then
      echo "BLOCKED: $file must be encrypted but is staged in plaintext" >&2
      exit 1
    fi
  fi
done < <(git diff --cached --name-only --diff-filter=ACM)
```

This means: any file matching `**/private/**` in `.gitattributes` is protected,
including future `private/` directories created anywhere in the repo.
The hook never needs updating when the encryption rules change.

### PRIVATE_CONTEXT toggle

The `PRIVATE_CONTEXT` toggle in `AGENTS.md` controls whether the LLM agent
accesses encrypted directories. It must be declared explicitly by the operator
at the start of every session:

```text
PRIVATE_CONTEXT: disabled   ← Default. private/ directories are treated as non-existent.
PRIVATE_CONTEXT: enabled    ← Agent may read/write private/. Requires git-crypt unlock.
```

Rules:
- Never inferred. Never carried over from a previous session.
- `enabled` requires the operator to confirm that `git-crypt unlock` has run on the host.
- Per-genome, per-session: enabling for `genome-finance` does NOT enable for `genome-dev`.
- Cloud LLM models: `PRIVATE_CONTEXT` must always be `disabled`. Private data never leaves the local network.
- All outputs derived from private data are prefixed `[PRIVATE DATA INCLUDED]`.
- Private synthesis goes exclusively to `wiki/private/` — never to public wiki paths.

### Runtime key injection — zero disk policy

Encryption keys are never stored as persistent files on the AI server.
They are injected at session start via the Bitwarden CLI (`bw`) against
your self-hosted Vaultwarden instance, using process substitution:

```bash
# Step 1: authenticate
bw config server https://vault.yourserver.com
export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw)

# Step 2: unlock genome (key lives only in a kernel file descriptor — never touches disk)
git-crypt unlock <(
  bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d
)
```

The key flows: Vaultwarden → `bw get notes` → `base64 -d` → kernel pipe → `git-crypt`.
At no point is the key written to any file on disk.

Lock a genome when the session ends:

```bash
git-crypt lock
```

---

## Key Management

> This section is for the operator. These commands are never issued by the LLM agent.

### Vaultwarden Secure Notes

Each genome key is stored as a base64-encoded Secure Note in Vaultwarden:

| Genome | Vaultwarden Note Name |
|--------|----------------------|
| `genome-dev` | `genome-dev key` |
| `genome-finance` | `genome-finance key` |
| `genome-homelab` | `genome-homelab key` |

After `make setup` or `make add-genome`, key files are exported to `keys/`.
Upload procedure:

```bash
# Encode the key
base64 < keys/genome-dev.key

# Paste the output into a Vaultwarden Secure Note named "genome-dev key"
# Then delete the key file
rm keys/genome-dev.key
```

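Because the key file is deleted after upload, it can be worth confirming first that the base64 round trip reproduces the key byte for byte. The check below is a sketch; `verify_key_roundtrip` is a hypothetical helper and not part of the framework.

```bash
# Sketch: confirm a key survives the base64 round trip before deleting it.
# verify_key_roundtrip is a hypothetical helper, not part of the framework.
verify_key_roundtrip() {
  local key_file=$1
  # Encode, decode again, and compare byte-for-byte with the original file.
  if base64 < "$key_file" | base64 -d | cmp -s - "$key_file"; then
    echo "ok: $key_file"
  else
    echo "mismatch: $key_file" >&2
    return 1
  fi
}
```

The same comparison can be pointed at the stored note instead of the local encoding, for example `bw get notes "genome-dev key" | base64 -d | cmp - keys/genome-dev.key`, before running `rm`.
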
### Cloning on a new machine

```bash
# Full clone with all submodules
git clone --recurse-submodules \
  https://git.yourserver.com/yourusername/master-knowledge-genome.git

# Unlock a specific genome (with key file — development only)
cd master-knowledge-genome/genome-dev
git-crypt unlock /path/to/genome-dev.key

# Unlock via Vaultwarden (recommended — no key on disk)
export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw)
git-crypt unlock <(bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d)

# Single-genome clone, for a collaborator who only needs one genome
git clone https://git.yourserver.com/yourusername/genome-dev.git
```

### Key rotation (emergency)

If a key is lost or compromised:

```bash
# From the knowledge-genome-setup/ directory
source lib/git-crypt.sh
cd ~/knowledge-genome-setup/genome-dev
gcrypt_rotate_key "genome-dev"
```

`gcrypt_rotate_key` performs:
1. Unlocks repo with existing key
2. Removes old key material
3. Generates new symmetric key via `git-crypt init`
4. Re-stages and commits private files (encrypted with new key)
5. Exports new key to `keys/`
6. Prints Vaultwarden update instructions

> **Limitation:** git history still contains blobs encrypted with the old key.
> Anyone with the old key and git history access can decrypt them. To purge old
> encrypted blobs from history:
> ```bash
> git filter-repo --invert-paths --path raw/private --path wiki/private
> git push --force origin main
> ```
> This rewrites all commit hashes — coordinate with any collaborators first.
> It also removes both directories from every commit, including the current one,
> so keep a decrypted copy and re-add the files afterwards. `git filter-repo`
> removes the `origin` remote as well, so re-add it before pushing.

After rotation:
- Upload new key to Vaultwarden (replace existing note)
- Delete both `keys/genome-dev.key` and `keys/genome-dev-rotated-*.key` from disk
- Revoke access from previous key holders

---

## Agent Sessions

### Prerequisites for every session

Before starting an LLM agent session on a genome:
1. The host (AI server) runs `git-crypt unlock` for the required genomes
2. The orchestrator prepares context: `tail -n 20 wiki/log.md`
3. The operator declares the `PRIVATE_CONTEXT` state explicitly in the opening prompt

### Session start protocol

The agent executes in this order at the start of every session:

1. Read `wiki/index.md` — primary catalog of all pages and maturity
2. Read last 20 log entries (injected by orchestrator — does NOT open `wiki/log.md` directly)
3. For tasks involving related pages: `qmd search "<query>"` before opening any files
4. Operate on individual files — never scan entire directories

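Note that `tail -n 20` selects the last 20 lines, which with the multi-line entry format of `wiki/log.md` covers only a few entries. If the orchestrator should inject whole entries instead, something like the following sketch could be used. It assumes every entry starts with a `## [` heading; `last_log_entries` is a hypothetical helper.

```bash
# Sketch: print the last N complete entries of a log file.
# Assumes each entry begins with a "## [" heading line.
last_log_entries() {
  local file=$1 count=${2:-20}
  awk -v n="$count" '
    /^## \[/ { total++ }                      # a new entry starts here
    { line[NR] = $0; entry[NR] = total }      # remember which entry owns each line
    END {
      first = total - n + 1                   # index of the first entry to keep
      for (i = 1; i <= NR; i++)
        if (entry[i] > 0 && entry[i] >= first) print line[i]
    }
  ' "$file"
}
```

Called as `last_log_entries wiki/log.md 20`, it skips any preamble before the first entry and returns all entries when the log holds fewer than the requested number.
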
### One source per session

With a 14B model and ~6GB KV cache budget, long sessions degrade.
As the session extends, the context fills with pages already created,
attention dilutes, and later entities receive worse cross-references than earlier ones.

**Hard rule: one source per session.**
If multiple sources are queued in `raw/`, process only the first.
Commit, close the session. The orchestrator (n8n or script) starts a new session
for the next source with a clean KV cache.

For automated pipelines: if 5 files arrive in `raw/`, trigger 5 agent sessions
sequentially — not one session with 5 files.

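For a script-based orchestrator, the rule can be sketched as a loop that starts one session per source and waits for each to finish. Both function names are hypothetical, and `start_agent_session` is a placeholder that only echoes its argument; how a session is actually launched is not specified here.

```bash
# Sketch: one agent session per source, strictly sequential.
# start_agent_session is a placeholder; replace its body with whatever
# launches your agent (an n8n sub-workflow, a CLI call, or similar).
start_agent_session() {
  echo "session: $1"
}

run_sessions_sequentially() {
  local src
  for src in "$@"; do
    # Stop at the first failure so a broken session is noticed
    # before the next source is processed.
    start_agent_session "$src" || return 1
  done
}
```

Invoked as `run_sessions_sequentially raw/articles/a.md raw/articles/b.md`, each source gets its own session with a fresh context, and nothing runs in parallel.
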
### n8n automation

For Forgejo webhook → automated ingest:
1. Forgejo sends webhook on push to `raw/`
2. n8n receives webhook, identifies new files
3. n8n starts one agent session per new file (sequential, not parallel)
4. Each session: inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
5. Agent ingest workflow runs, opens PR
6. Human reviews and merges PR

---

## Workflows

### Ingest

Triggered by a new file in `raw/` (manual or via webhook).

1. Read source once
2. Create `wiki/sources/<slug>.md` — summary and key points
3. Per entity (person, tool, organisation): create or update `wiki/entities/<name>.md`
4. Per concept (pattern, theory, decision): create or update `wiki/concepts/<name>.md`
5. Check each touched page for contradictions → apply Conflict Resolution if found
6. Append entry to `wiki/index.md` (bottom of relevant section — do not reorder)
7. Append log entry: `INGEST | <slug>`
8. Run scoped lint on pages created or modified in this session; report in PR
9. Commit on `feat/ai-ingest-<slug>`; open PR using `templates/pr-description.md`

For private sources (`PRIVATE_CONTEXT: enabled` required):
- All output goes to `wiki/private/<slug>.md` only
- PR title: `[PRIVATE] ingest: <slug>`

### Query

Triggered by an operator question.

1. `qmd search "<query>"` → identify candidate pages
2. Read candidate pages directly (qmd already returns file paths — no intermediate index lookup)
3. Synthesise answer with `[[wikilink]]` citations
4. If answer is non-trivial: save as `wiki/queries/<slug>.md` and append to index
5. Append log entry: `QUERY | <subject>`

For general orientation without a specific query: read `wiki/index.md` directly.

### Lint

The lint workflow is split between deterministic bash checks and semantic LLM judgment.

**Step 1 — operator runs bash linter:**

```bash
make lint
```

The bash linter checks automatically:
- YAML frontmatter validity (all mandatory fields present)
- Domain consistency (domain field matches genome name)
- Type validity (value from allowed list)
- Privacy consistency (`private/` directories have `private: true`)
- Page size (warn at 400 lines, error at 800 lines)
- Knowledge decay (stable > 180 days, draft > 90 days)
- Broken internal wikilinks (warnings only — cross-type links produce expected false positives)

**Step 2 — operator provides bash output to LLM agent:**

The agent applies semantic judgment to findings the bash linter cannot make:
- **Orphan pages** (from bash list): for each orphan, identify 1-3 existing pages
  that should link to it; propose specific additions
- **Implicit concepts** (from bash term frequency list): determine if a candidate
  term warrants a dedicated page; draft stub if yes
- **Duplicate concepts**: `qmd search "<concept>"` for suspected duplicates;
  propose merge if confirmed
- **Maturity promotion**: pages with 2+ sources still marked `draft` → propose `stable`

The agent reports all findings as a structured list. It does not modify files
without operator approval. It then appends a `LINT | <summary>` log entry.

---

## Knowledge Quality

### PR review workflow

Every agent session that modifies wiki pages opens a PR.
The PR description uses `templates/pr-description.md`:

```markdown
## Summary
One sentence: goal of this session and source processed.

## Pages Created
| Path | Type | Maturity |

## Pages Modified
| Path | Change |

## Contradictions Found
[ ] None / [ ] n conflict file(s) created

## Private Data Accessed
[ ] No (PRIVATE_CONTEXT: disabled) / [ ] Yes

## Scoped Lint (post-ingest)
[ ] Frontmatter valid [ ] No broken links [ ] No issues found
```

This makes human review fast and structured: read the table, scan the diff,
approve or request changes. No exploration required to understand what the agent did.

### Conflict resolution

When new evidence contradicts an existing wiki claim:

1. Keep the existing page unchanged
2. Create `wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md` with:
   - The existing claim and its source
   - The contradicting evidence and its source
   - Agent confidence assessment for each
   - Recommendation: `accept_b` | `keep_a` | `requires_human_review`
3. Add entry to `wiki/index.md` → Conflicts Pending Review section
4. Log entry: `CONFLICT | <concept>`
5. Open PR: `[CONFLICT] <concept> — human review required`

The operator resolves the conflict, updates relevant pages, closes the PR.

### Knowledge decay

Pages have a `last_updated` field in frontmatter. During lint passes:

| Maturity | Threshold | Action |
|----------|-----------|--------|
| `stable` | 180 days | Flag as stale — add `⚠️ STALE` callout |
| `draft` | 90 days | Flag as stale — add `⚠️ STALE` callout |

The agent proposes re-validation but does not change `maturity` without new source evidence.

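The decay thresholds can be evaluated with a small date calculation. The sketch below is not the actual `lib/lint.sh`; `days_since` and `is_stale` are hypothetical helpers, and the GNU-then-BSD `date` fallback is one possible way to stay portable between Linux and macOS.

```bash
# Sketch of the decay check, using the thresholds from the table above.
# days_since and is_stale are hypothetical helpers.

# Days elapsed since a YYYY-MM-DD date (GNU date first, BSD date as fallback).
days_since() {
  local past now
  past=$(date -d "$1" +%s 2>/dev/null) \
    || past=$(date -j -f "%Y-%m-%d %H:%M:%S" "$1 00:00:00" +%s) \
    || return 1
  now=$(date +%s)
  echo $(( (now - past) / 86400 ))
}

# Returns 0 (stale) when a page is older than the threshold for its maturity.
is_stale() {
  local maturity=$1 last_updated=$2 limit age
  case "$maturity" in
    stable) limit=180 ;;
    draft)  limit=90 ;;
    *) return 1 ;;   # no threshold is listed for other maturities
  esac
  age=$(days_since "$last_updated") || return 1
  [ "$age" -gt "$limit" ]
}
```

For example, `is_stale draft 2026-01-01 && echo "STALE"` prints `STALE` once more than 90 days have passed since that date.
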
### Cross-genome lint

A manual, monthly operation. Not automated in CI/CD — the token cost and coordination
complexity are not justified at this scale.

1. Operator initiates a master-repo agent session
2. Agent uses `qmd search "<concept>"` across the multi-genome index to find:
   - Concepts defined in 2+ genomes with potentially conflicting definitions
   - Entities referenced cross-genome without canonical cross-genome wikilinks
   - Concepts in genome-X that should link to genome-Y
3. Agent reports findings — does not modify files
4. For each finding: create conflict note in the genome where resolution belongs

---

## Knowledge Schema
|
||
|
||
### Frontmatter
|
||
|
||
Every wiki page must start with valid YAML frontmatter:
|
||
|
||
```yaml
|
||
---
|
||
title: "Strict String Title"
|
||
type: source | entity | concept | query | conflict | private
|
||
domain: genome-name
|
||
tags: [lowercase, hyphen-separated]
|
||
maturity: draft | stable | deprecated
|
||
last_updated: YYYY-MM-DD
|
||
private: true | false
|
||
---
|
||
```
|
||
|
||
| Field | Rules |
|-------|-------|
| `type` | Must be one of: `source`, `entity`, `concept`, `query`, `conflict`, `private`, `index`, `log` |
| `maturity: draft` | Single source or unvalidated |
| `maturity: stable` | Confirmed by 2+ independent sources |
| `maturity: deprecated` | Superseded; add a `> **DEPRECATED:** <reason>` callout at the top |
| `private: true` | Required on all pages in `wiki/private/` and `raw/private/` |

Do not use semantic versioning for content. Git history tracks every change.
`maturity` captures epistemic state; `last_updated` tracks recency.

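
A lint step could check these rules mechanically. The following is a minimal sketch, not the project's linter: `fm_value` and `check_frontmatter` are hypothetical names, and the parser assumes simple single-line `key: value` pairs rather than full YAML.

```bash
# fm_value FILE KEY -> prints the value of KEY from the leading frontmatter block
fm_value() {
  awk -v key="$2" '
    NR == 1 && $0 != "---" { exit }       # file does not start with frontmatter
    NR > 1 && $0 == "---"  { exit }       # closing delimiter: stop reading
    NR > 1 && index($0, key ":") == 1 {   # line begins with "key:"
      sub(/^[^:]*:[ \t]*/, ""); print; exit
    }
  ' "$1"
}

# check_frontmatter FILE -> prints one line per problem, nothing if the page passes
check_frontmatter() {
  case "$(fm_value "$1" type)" in
    source|entity|concept|query|conflict|private|index|log) ;;
    *) echo "invalid type" ;;
  esac
  case "$(fm_value "$1" maturity)" in
    draft|stable|deprecated) ;;
    *) echo "invalid maturity" ;;
  esac
  case "$(fm_value "$1" last_updated)" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "invalid last_updated" ;;
  esac
}
```

`check_frontmatter <page>` prints nothing for a valid page and one line per violated rule otherwise, so an empty result can gate a commit or a PR check.
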
### Page types and directories

| Type | Directory | Description |
|------|-----------|-------------|
| `source` | `wiki/sources/` | One page per processed raw source |
| `entity` | `wiki/entities/` | People, tools, organisations, projects |
| `concept` | `wiki/concepts/` | Patterns, theories, architectural decisions |
| `query` | `wiki/queries/` | Preserved answers and analyses |
| `conflict` | `wiki/queries/conflict-*.md` | Unresolved contradictions |
| `private` | `wiki/private/` | Private synthesis (`PRIVATE_CONTEXT: enabled`) |
| `index` | `wiki/index.md` | Primary navigation catalog (singleton) |
| `log` | `wiki/log.md` | Operations ledger (singleton) |

### Page size limits

| Limit | Lines | Action |
|-------|-------|--------|
| Soft cap | 400 | Bash linter warns |
| Hard cap | 800 | Bash linter errors; split the page |

These limits keep each page small enough to fit within the LLM context window without
attention degradation, and keep the wiki atomically navigable.

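
The check behind these caps can be as small as a line count. A sketch, assuming the caps are exclusive (a page of exactly 400 lines still passes) and using a hypothetical function name; the project's actual Bash linter may differ:

```bash
# page_size_status FILE -> prints "ok", "warn" (over 400 lines) or "error" (over 800 lines)
page_size_status() {
  lines=$(( $(wc -l < "$1") ))            # arithmetic expansion drops any padding from wc
  if [ "$lines" -gt 800 ]; then
    echo "error"
  elif [ "$lines" -gt 400 ]; then
    echo "warn"
  else
    echo "ok"
  fi
}
```
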
### Linking conventions

| Type | Format |
|------|--------|
| Internal (same genome) | `[[folder/slug]]` (Obsidian wikilinks only) |
| Cross-genome | `[[../genome-target/wiki/folder/slug]]` |
| External | `[text](https://url)` (standard Markdown) |

Never use `[text](relative/path)` for internal references. Obsidian wikilinks are
bidirectional and appear in the graph view.

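
One way to catch violations is to list Markdown links whose target is not an external URL. A rough sketch with a hypothetical helper name; it works line by line, so it also reports image embeds and links inside code fences, and its output needs a human glance:

```bash
# relative_md_links FILE -> prints "LINE:](target)" for every Markdown link whose
# target is not an http(s) URL, an in-page anchor or a mailto link
relative_md_links() {
  grep -noE '\]\([^)]*\)' "$1" | grep -vE '^[0-9]+:\]\((https?://|#|mailto:)'
}
```

For a file whose second line contains `[notes](concepts/foo.md)`, the helper prints `2:](concepts/foo.md)`. Wikilinks and external links on the same line are ignored.
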
### Log format

Every operation appends one entry to `wiki/log.md`:

```markdown
## [YYYY-MM-DD] TYPE | Subject

- run_id: `<uuid>`
- model: `<model-name>`
- context_read: `[[path/A]]`, `[[path/B]]`
- output_written: `[[path/C]]`
- reasoning: One sentence stating what changed and why.
```

Valid TYPEs: `INGEST` `LINT` `QUERY` `CONFLICT` `CONFIG` `SECURITY`

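
A small helper can keep entries consistent with this template. The sketch below is illustrative only: `log_entry` and the `LOG_DATE` override are hypothetical names, and the caller is assumed to pass the wikilink lists already formatted. Its output can be appended with `>> wiki/log.md`; an unknown TYPE is rejected with a non-zero exit status.

```bash
# log_entry TYPE SUBJECT RUN_ID MODEL CONTEXT_READ OUTPUT_WRITTEN REASONING
# Prints one entry in the log format. LOG_DATE (hypothetical) overrides today's date.
log_entry() {
  case "$1" in
    INGEST|LINT|QUERY|CONFLICT|CONFIG|SECURITY) ;;
    *) echo "invalid log type: $1" >&2; return 1 ;;
  esac
  # Unquoted heredoc: variables expand, escaped backticks stay literal.
  cat <<EOF
## [${LOG_DATE:-$(date +%Y-%m-%d)}] $1 | $2

- run_id: \`$3\`
- model: \`$4\`
- context_read: $5
- output_written: $6
- reasoning: $7
EOF
}
```
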
Parse examples:

```bash
grep "^## \[" wiki/log.md | tail -n 5                  # Last 5 entries
grep -E "^## \[[0-9-]+\] CONFLICT \|" wiki/log.md      # All conflicts (matches the TYPE field only)
grep "^## \[2026-05" wiki/log.md                       # Entries from a specific month
```

The orchestrator injects only the output of `tail -n 20 wiki/log.md` into the agent context.
The LLM never loads the full log.

---

## Collaboration Model

| Role | Key access | Permitted operations |
|------|-----------|----------------------|
| Owner | Full (key holder) | Read/write everywhere |
| Collaborator | None | Push to `raw/articles/`, `raw/transcripts/`, `raw/code-packs/`, `raw/assets/` |
| Local AI agent | Conditional | `private/` only when `PRIVATE_CONTEXT: enabled` |
| Cloud AI model | Never | `PRIVATE_CONTEXT` must be `disabled`; private data stays on local network |

To grant collaborator access, add the user as a Forgejo contributor with the Write role.
Never share the git-crypt key: collaborators operate exclusively in public directories.

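
The collaborator restriction can also be checked mechanically, for example in a CI step on pull requests. The sketch below is an assumption about how that might look, not part of the system: the function names are hypothetical and paths are taken to be relative to the genome root. Feeding it the output of `git diff --name-only <base> <head>` would list every changed path outside the four collaborator directories.

```bash
# collaborator_path_ok PATH -> exit 0 if PATH lies in a collaborator-writable directory
collaborator_path_ok() {
  case "$1" in
    *..*) return 1 ;;                     # reject path traversal such as raw/articles/../private
    raw/articles/*|raw/transcripts/*|raw/code-packs/*|raw/assets/*) return 0 ;;
    *) return 1 ;;
  esac
}

# disallowed_paths -> reads paths on stdin, prints those a collaborator may not touch
disallowed_paths() {
  while IFS= read -r p; do
    collaborator_path_ok "$p" || printf '%s\n' "$p"
  done
}
```
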

---

## Optional Extensions

### qmd: local Markdown search

[qmd](https://github.com/tobi/qmd) is a local, on-device BM25 + vector search
engine for Markdown files. It has both a CLI (for shell scripts and agent tool calls)
and an MCP server (for native LLM tool use).

Recommended at scale: once a genome exceeds ~150 pages, `qmd search` is significantly
faster and more accurate than navigating `wiki/index.md` manually.

```bash
# Index a genome
qmd index genome-dev/wiki/

# Search
qmd search "graph-based state management"

# Start MCP server (for Claude Code / Codex integration)
qmd serve --port 3333
```

### Obsidian integration

Obsidian is the recommended wiki browser. Open any genome directory as an Obsidian vault.

Recommended setup:
- **Graph view**: visualise page connections; spot orphans and hubs instantly
- **Obsidian Web Clipper**: browser extension that clips articles directly to `raw/articles/`
  as Markdown
- **Download attachments**: in Settings → Hotkeys, find "Download attachments for current file"
  and bind it to a hotkey (e.g. Ctrl+Shift+D). After clipping, it downloads all images to the
  vault's attachment folder, which should be set to `raw/assets/`
- **Dataview plugin**: query YAML frontmatter across the wiki;
  `TABLE maturity, last_updated WHERE domain = "genome-dev"` generates dynamic tables
- **Marp plugin**: render Markdown as slide decks directly from wiki content

Note: `.obsidian/` is in `.gitignore`. Workspace and plugin settings are local and are not synced.

### n8n automation

n8n (running on the storage node) can automate the ingest pipeline:

1. Forgejo webhook fires on push to a genome's `raw/` directory
2. n8n flow identifies new files
3. For each new file: starts one agent session (sequential, never parallel)
4. Each session receives: `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
5. Agent runs ingest workflow and opens PR
6. Human reviews the PR

Key constraint: one source per session, sessions sequential.
Never batch multiple sources into one agent session.

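
The sequential constraint in steps 3 and 4 can be sketched as a plain loop. Everything here is illustrative: `run_agent_session` is a hypothetical placeholder for whatever actually launches the agent, `PRIVATE_CONTEXT` is assumed to be available as an environment variable (defaulting to `disabled`), and the list of new files is assumed to arrive on standard input.

```bash
# Hypothetical placeholder: a real implementation would start one agent session here.
# $1 = source path, $2 = PRIVATE_CONTEXT state, $3 = log tail (unused by this stub)
run_agent_session() {
  printf 'session: %s private_context=%s\n' "$1" "$2"
}

# ingest_sequentially GENOME_DIR -> reads new source paths on stdin, one session per path
ingest_sequentially() {
  while IFS= read -r src; do
    # Re-read the log tail for every session so each one sees the previous entry.
    log_tail=$(tail -n 20 "$1/wiki/log.md")
    # Each call finishes before the next path is read; stdin is redirected so the
    # session cannot swallow the remaining paths.
    run_agent_session "$src" "${PRIVATE_CONTEXT:-disabled}" "$log_tail" </dev/null
  done
}
```

One possible source for the input is `git diff --name-only --diff-filter=A <before> <after> -- raw/`, using the commit range from the push event.
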
### Intel NPU offloading

If the AI compute node has an Intel NPU (e.g. Core Ultra series):

- Background tasks (embedding updates, index refresh) → Intel NPU via OpenVINO
- Active reasoning sessions (ingest, query, synthesis) → GPU

This keeps the GPU's KV cache free for interactive work and reduces power consumption
for background operations.

---

## Troubleshooting

### `git-crypt: command not found`

```bash
# Ubuntu/Debian
sudo apt install git-crypt

# macOS
brew install git-crypt
```

### `make setup` fails with "MISSING: jq"

```bash
make doctor          # identifies all missing tools
sudo apt install git git-crypt curl jq
```

### Pre-commit hook blocks a commit with "PLAINTEXT LEAK DETECTED"

The staged file is in a path matching `**/private/**` but is not encrypted.

Fix options:
1. Verify `.gitattributes` contains `**/private/** filter=git-crypt diff=git-crypt -text`
2. Run `git-crypt init` if git-crypt is not initialised in this repo
3. Run `git-crypt status` to check the encryption state of all files

Never use `git commit --no-verify` to bypass this check.

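
To see what such a check looks at: files encrypted by git-crypt begin with a fixed 10-byte header, a NUL byte, the ASCII text `GITCRYPT`, and another NUL byte. The sketch below tests a file for that header. It is an illustration with a hypothetical function name, not the hook shipped with this system.

```bash
# is_gitcrypt_blob FILE -> exit 0 if FILE starts with the git-crypt header
is_gitcrypt_blob() {
  # Dump the first 10 bytes as hex and remove the spacing that od adds.
  header=$(head -c 10 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$header" = "00474954435259505400" ]  # NUL G I T C R Y P T NUL
}
```

To inspect the staged version of a file rather than the working copy, write it out first with `git cat-file -p ":path/to/file" > /tmp/blob` and run the function on that copy.
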
### `git-crypt status` shows files as "not encrypted" after init

The `.gitattributes` rule must be committed before files in `private/` are staged.
If files were staged before `.gitattributes` was committed:

```bash
git rm -r --cached raw/private/ wiki/private/
git add raw/private/ wiki/private/
git commit -m "fix: re-stage private files for encryption"
```

This only affects new commits. Any plaintext version that was already committed remains in Git history.

### Agent returns stale or missing cross-references

Likely causes:
1. The session was too long and the model's recall of earlier context degraded. Use one source per session.
2. `wiki/index.md` was not read at session start, so the agent lacked the page catalog.
3. The qmd index is stale. Re-index with `qmd index <genome>/wiki/`.

### Submodules show as "modified" after `make sync`

This is normal if genome repos have new commits. Update master's pointers:

```bash
cd master-knowledge-genome
git add .
git commit -m "chore: update submodule pointers"
git push
```

### `bw unlock` fails

Verify you are using `bw` (standard Bitwarden CLI), not `bws` (Secrets Manager CLI).
`bws` does not work with self-hosted Vaultwarden.

```bash
bw --version         # should print e.g. "2024.x.x"
bw config server https://vault.yourserver.com
bw login
```