# Knowledge Genome System

> A distributed, encrypted, multi-domain personal knowledge base.
> No vector database. No embedding pipeline. No external retrieval server.

Built on the [LLM Wiki pattern](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
by Andrej Karpathy — extended with a multi-domain submodule architecture,
AES-256-CTR encryption via git-crypt, Vaultwarden runtime key injection,
and a human-in-the-loop Git Flow for quality control.

---

## Table of Contents

1. [Core Philosophy](#core-philosophy)
2. [Architecture](#architecture)
3. [System Requirements](#system-requirements)
4. [Prerequisites](#prerequisites)
5. [Configuration](#configuration)
6. [Quick Start](#quick-start)
7. [Makefile Reference](#makefile-reference)
8. [Testing](#testing)
9. [Genome Lifecycle](#genome-lifecycle)
10. [Security Model](#security-model)
11. [Key Management](#key-management)
12. [Agent Sessions](#agent-sessions)
13. [Workflows](#workflows)
14. [Knowledge Quality](#knowledge-quality)
15. [Knowledge Schema](#knowledge-schema)
16. [Collaboration Model](#collaboration-model)
17. [Optional Extensions](#optional-extensions)
18. [Troubleshooting](#troubleshooting)

---

## Core Philosophy

Most RAG systems make the LLM rediscover knowledge from scratch on every query.
A document is indexed; at query time, relevant chunks are retrieved; an answer is generated.
Nothing accumulates. Ask a question requiring synthesis across five documents and the LLM
pieces it together from fragments every single time.

This system is different. Instead of retrieval at query time, the LLM
**incrementally builds and maintains a persistent wiki** that sits between you and the raw
sources. When a new source arrives, the LLM reads it, extracts key information, updates
entity and concept pages, flags contradictions with existing claims, and strengthens the
evolving synthesis. Knowledge is compiled once and kept current.

**The wiki is a compounding artifact.** Cross-references are already there.
Contradictions have been flagged. The synthesis already reflects everything ingested.

This means:

- No vector database.
- No embedding pipeline.
- No external retrieval infrastructure.

The `wiki/index.md` of each genome is the retrieval layer. At moderate scale
(~100 sources, hundreds of pages) this performs better than RAG because cross-references,
contradictions, and syntheses are already resolved — not re-derived per query.

The human's job: curate sources, direct analysis, ask good questions, review PRs.
The LLM's job: everything else — summarising, cross-referencing, filing, maintaining consistency.

---

## Architecture

### Repository structure

```text
master-knowledge-genome/     ← Root orchestrator (submodule registry)
├── core-karpathy/           ← LLM Wiki reference pattern (read-only submodule)
├── genome-dev/              ← Submodule: web development, Angular, TUI
├── genome-finance/          ← Submodule: personal finance, investments
├── genome-homelab/          ← Submodule: Keru infrastructure, network configs
└── AGENTS.md                ← Global coordination schema (cross-genome rules)
```

Each genome is an independent git repository:

```text
genome-{name}/
├── .gitattributes           ← Encryption rules — **/private/** wildcard
├── .gitignore
├── .git/hooks/pre-commit    ← Security hook (dynamic git check-attr)
├── AGENTS.md                ← Per-genome agent contract and workflow rules
│
├── raw/                     ← Immutable sources — LLM reads, never writes
│   ├── articles/            ← Web clips, saved articles
│   ├── transcripts/         ← Audio/video transcripts
│   ├── code-packs/          ← Code snippets and repositories
│   ├── assets/              ← Images, PDFs, binary files
│   └── private/             ← AES-256-CTR encrypted — owner only
│
└── wiki/                    ← LLM-owned — agent creates and maintains
    ├── index.md             ← Primary catalog (read first every session)
    ├── log.md               ← Append-only operations ledger
    ├── sources/             ← One page per processed raw source
    ├── entities/            ← People, tools, organisations, projects
    ├── concepts/            ← Patterns, theories, architectural decisions
    ├── queries/             ← Preserved answers and conflict notes
    └── private/             ← AES-256-CTR encrypted — owner only
```

### Three layers

| Layer       | Path        | Owner       | Rule                                                  |
| ----------- | ----------- | ----------- | ----------------------------------------------------- |
| Raw sources | `raw/`      | Human       | Immutable. LLM reads only. Never modified.            |
| Wiki        | `wiki/`     | LLM         | Agent creates, updates, cross-links, maintains.       |
| Schema      | `AGENTS.md` | Human + LLM | Co-evolved contract defining structure and workflows. |

### Linked projects (optional)

A genome can optionally declare a **linked project repository** — a separate repo where
the knowledge in that genome is meant to be applied (e.g. `genome-dev` linked to an app
repo). The link is recorded as a third field in the registry and rendered into the
genome's `AGENTS.md` (`## Linked Project`). A genome with no link is _knowledge-only_ and
behaves exactly as before. See [Configuration](#configuration).

### Framework structure

```text
knowledge-genome-orchestrator/   ← This repository (setup tooling)
├── globals.env                  ← Static KEY=VALUE config (Make-includable)
├── registry.sh                  ← Bash-only: GENOMES array + dynamic paths
├── Makefile                     ← Entry point for all operations
├── lib/
│   ├── output.sh                ← Terminal helpers (colors, log levels)
│   ├── deps.sh                  ← Dependency validation
│   ├── scaffold.sh              ← Template rendering engine
│   ├── structure.sh             ← Canonical genome layout (single source of truth)
│   ├── lint.sh                  ← Per-file validation functions
│   └── git-crypt.sh             ← git-crypt lifecycle (init, export, verify, rotate)
├── providers/
│   ├── forgejo.sh               ← Forgejo REST API provider
│   └── github.sh                ← GitHub REST API provider
├── scripts/
│   ├── setup.sh                 ← Main entry point
│   ├── setup-master.sh          ← Master repo initialisation
│   ├── setup-genomes.sh         ← Genome provisioning loop
│   ├── add-genome.sh            ← Add a single new genome
│   ├── lint-genomes.sh          ← Quality control across all genomes
│   └── verify-genomes.sh        ← Structure verify / --sync across all genomes
├── templates/
│   ├── agents-genome.md         ← Per-genome agent contract template
│   ├── agents-master.md         ← Master coordination schema template
│   ├── readme-master.md         ← Master repo README template
│   ├── wiki-index.md            ← Index template (rendered per genome)
│   ├── wiki-log.md              ← Log template (rendered per genome)
│   ├── pr-description.md        ← PR review checklist template
│   ├── pre-commit.sh            ← Security hook template
│   ├── gitattributes            ← Git encryption rules template
│   └── gitignore                ← Git ignore template
├── skills/
│   └── ingest/                  ← pi skill: deployed to the AI node (vm101)
│       ├── SKILL.md             ← Semantic-only contract (read/edit, emits manifest)
│       ├── references/          ← On-demand reference docs for the agent
│       └── scripts/             ← Deterministic post-processor (runs outside the agent)
│           ├── run-ingest.sh    ← Orchestrator: consumes the manifest, emits one JSON line
│           ├── slug.sh          ← Slug normalisation
│           ├── index-append.py  ← Sorted insert into wiki/index.md + last_updated bump
│           ├── log-append.sh    ← Append a wiki/log.md entry
│           ├── scoped-lint.sh   ← Lint only the pages touched this run (reuses lib/lint.sh)
│           └── open-pr.sh       ← Branch / commit / push / open PR (DRY_RUN seam for tests)
└── tests/                       ← bats suite — deterministic, no LLM/GPU (see Testing)
    ├── helpers.bash
    ├── scripts.bats
    ├── lint.bats
    ├── structure.bats
    └── run-ingest.bats
```

> The `skills/ingest/` directory is version-controlled here but **deployed** to the AI
> node (vm101) under `~/.pi/agent/skills/ingest`. The agent (`pi`) does only semantic work
> and writes a manifest; `run-ingest.sh` does the mechanical steps. See [Workflows → Ingest](#ingest).

---

## System Requirements

### Linux — full support (primary target)

All scripts are written for GNU bash on Linux. Tested on Ubuntu 22.04+.
All tools (git-crypt, bw, qmd) have native Linux binaries.

### macOS — full support

All scripts are compatible with macOS. Requirements:

- bash 3.2+ (macOS default) — supported for the **setup scripts** (`make` targets, scaffolding).
  Two things need bash 4+: the `ingest` skill (`mapfile`), which runs on the Linux AI node (not a
  constraint on the macOS setup machine); and `gcrypt_rotate_key` (`compgen -G`), which **does**
  run on the laptop. For key rotation on macOS, use Homebrew bash (`brew install bash`).
- GNU coreutils not required — BSD variants of `date`, `grep`, `sed` all handled.
- `git-crypt`: install via Homebrew — `brew install git-crypt`
- `jq`, `curl`: pre-installed or via Homebrew

If you use Homebrew bash (`brew install bash`), the scripts work identically to Linux.

### Windows — WSL2 only

**Git Bash and native Windows are not supported.**

Reasons:

- `git-crypt` has no native Windows binary.
- Process substitution `<(...)` used for runtime key injection is not available
  in Git Bash or PowerShell.
- Several bash features used throughout (`compgen`, `BASH_SOURCE`, arrays) are not
  available outside bash.

**WSL2 (Windows Subsystem for Linux)** with Ubuntu gives full compatibility.
All setup and runtime operations work identically to native Linux inside WSL2.

### Hardware recommendations

The system is designed for a homelab architecture:

| Component       | Recommended               | Role                                                            |
| --------------- | ------------------------- | --------------------------------------------------------------- |
| Storage node    | Any Linux server with NFS | Hosts Forgejo, stores genome repos                              |
| AI compute node | GPU server (16GB+ VRAM)   | Runs local LLM agent sessions                                   |
| VRAM            | 16GB minimum              | 14B model at Q5_K_M ≈ 10GB weights; ~6GB for KV cache           |
| Local LLM       | 14B–32B quantised         | Active wiki maintenance sessions                                |
| Large LLM       | 70B (async)               | Deep reflection, complex synthesis (scheduled, not interactive) |

> **On VRAM constraints:** with a 16GB card and a 14B model, the KV cache budget
> is ~6GB — approximately 32k tokens of effective context. Every token in `AGENTS.md`,
> the index, and the log tail is a cost. This is why all agent files are token-optimised
> and sessions are kept to one source at a time.
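As a rough sanity check of the ~32k figure, the KV-cache cost per token can be estimated from the model shape. The sketch below assumes a hypothetical 14B-class architecture (48 layers, 8 KV heads with grouped-query attention, head dimension 128, 16-bit cache entries). These numbers are illustrative assumptions, not values taken from this project, and real runtimes add overhead on top.

```bash
# kv_tokens <budget_gib> <layers> <kv_heads> <head_dim> <bytes_per_value>
# Estimates how many tokens fit in a KV-cache budget.
# Per token, each layer stores one key and one value vector per KV head.
kv_tokens() {
  budget_bytes=$(( $1 * 1024 * 1024 * 1024 ))
  bytes_per_token=$(( 2 * $2 * $3 * $4 * $5 ))   # 2 = key + value
  echo $(( budget_bytes / bytes_per_token ))
}

# Assumed shape: 48 layers, 8 KV heads, head_dim 128, fp16 (2 bytes per value)
kv_tokens 6 48 8 128 2   # prints 32768
```

Under these assumptions each token costs 192 KiB of cache, so a 6 GiB budget holds 32768 tokens, which is in line with the estimate above.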

> **Reference deployment:** the table above is a target profile, not a hard requirement.
> The current setup runs a single 16GB GPU (RTX 5060 Ti) with a ~9B model for interactive
> ingest, and offloads heavy/async synthesis to a cloud model. Smaller models work — they
> just make the "one source per session" discipline and the token budget matter more.

---

## Prerequisites

### Required

| Tool        | Purpose                          |
| ----------- | -------------------------------- |
| `git`       | Version control                  |
| `git-crypt` | Transparent file encryption      |
| `curl`      | REST API calls to Forgejo/GitHub |
| `jq`        | JSON parsing                     |

### Optional

| Tool  | Purpose                                                                 |
| ----- | ----------------------------------------------------------------------- |
| `bw`  | Bitwarden CLI — runtime key injection from Vaultwarden (no key on disk) |
| `qmd` | Local BM25 + vector search for Markdown files with MCP server interface |

> **`bw` vs `bws`:** Use `bw` (standard Bitwarden CLI). `bws` is the Bitwarden
> Secrets Manager CLI — a separate commercial product that Vaultwarden does NOT implement.

### Install on Ubuntu/Debian

```bash
sudo apt update && sudo apt install -y git git-crypt curl jq
```

### Install on macOS

```bash
brew install git git-crypt curl jq
```

### Install Bitwarden CLI

```bash
# Linux
npm install -g @bitwarden/cli

# macOS
brew install bitwarden-cli
```

### Verify all tools

```bash
make doctor
```
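A dependency check of this kind can be sketched with `command -v`. The function name `check_tools` and its messages are hypothetical; the project's own logic lives in `lib/deps.sh` and may differ.

```bash
# check_tools <tool>... : report every missing tool, return 1 if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Required tools fail the check; bw is optional, so it only produces a warning.
check_tools git git-crypt curl jq || echo "install the missing tools first" >&2
command -v bw >/dev/null 2>&1 || echo "warning: bw not found (optional)" >&2
```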

---

## Configuration

Configuration is split into two files with distinct purposes:

### `globals.env` — static KEY=VALUE

Safe for Make's `include` directive, `docker-compose`, shell `source`, and any standard env parser.
Contains only simple scalar values — no bash syntax, no arrays.

```bash
# Provider selection
PROVIDER=forgejo # forgejo | github

# Forgejo (active when PROVIDER=forgejo)
FORGEJO_URL=https://git.yourserver.com
FORGEJO_USER=yourusername
FORGEJO_SSH_PORT=222 # Default for many homelab Forgejo setups; 22 for standard

# GitHub (active when PROVIDER=github — uncomment to use)
# GITHUB_USER=your-username
# GITHUB_ORG=your-org # Optional: for org repos, overrides GITHUB_USER

# Vaultwarden
VAULTWARDEN_URL=https://vault.yourserver.com

# Master repository
MASTER_REPO=master-knowledge-genome
GIST_URL=https://gist.github.com/442a6bf555914893e9891c11519de94f.git
```

### `registry.sh` — bash runtime config

Sourced by shell scripts only. Contains the genome registry array and dynamic path
resolution. Never included by Make.

```bash
# Dynamic paths (resolved at source time)
WORK_DIR="${HOME}/knowledge-genome-orchestrator"
KEYS_DIR="${WORK_DIR}/keys"

# Genome registry — format: "name|description|linked_repo|cross_source"
# The third and fourth fields are OPTIONAL:
#   - linked_repo empty → knowledge-only genome (no linked project)
#   - linked_repo = owner/repo → genome is linked to that project repository (rendered into AGENTS.md)
#   - cross_source → yes|no (default no): whether the cross-genome collector may read this genome as a source
GENOMES=(
  "genome-dev|Web development, TUI, Angular, software architecture|myorg/my-app|no"
  "genome-finance|Personal finance, investments, market analysis||no"
  "genome-homelab|Infrastructure, network configs, architecture logs||no"
)
```

To add a genome to the registry before running setup, append a line to `GENOMES`.
After initial setup, use `make add-genome` instead.
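A registry entry can be split on `|` with `read`. This is a minimal sketch: the helper name `parse_genome_entry` and its output format are assumptions made for illustration, not the orchestrator's own parser.

```bash
# parse_genome_entry "name|description|linked_repo|cross_source"
# Prints the four fields, applying the documented defaults to empty ones.
parse_genome_entry() {
  IFS='|' read -r name desc linked cross <<EOF
$1
EOF
  echo "name=$name"
  echo "desc=$desc"
  echo "linked=${linked:-none}"       # empty means knowledge-only genome
  echo "cross_source=${cross:-no}"    # default is no
}

parse_genome_entry "genome-finance|Personal finance, investments, market analysis||no"
```

Because `|` is a non-whitespace separator, two adjacent `|` characters yield an empty field rather than being collapsed, which is what lets the third field stay empty.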

### Tokens

Tokens are never stored in config files. Export them in your shell before running setup:

```bash
export FORGEJO_TOKEN="your_forgejo_token"
# or
export GITHUB_TOKEN="your_github_token"
```

---

## Quick Start

```bash
# 1. Clone the setup framework
git clone <setup-repo-url> knowledge-genome-orchestrator
cd knowledge-genome-orchestrator

# 2. Configure your environment
cp globals.env.example globals.env # edit with your values
# Edit registry.sh to define your genomes

# 3. Export your provider token
export FORGEJO_TOKEN="your_token_here"

# 4. Verify dependencies
make doctor

# 5. Run full setup
make setup
```

`make setup` executes in order:

1. **Dependency check** — verifies all required tools are installed
2. **Git identity check** — warns if `user.name` / `user.email` are not configured
3. **Master repo** — creates `master-knowledge-genome` on Forgejo, scaffolds with
   `AGENTS.md` and `README.md`, initialises git, adds `core-karpathy` as submodule, pushes
4. **Genome provisioning** — for each genome in `registry.sh`:
   - Creates remote repository on Forgejo
   - Adds it as a submodule in the master repo
   - Initialises git-crypt (**before any files are created**)
   - Scaffolds directory structure and renders all templates
   - Installs pre-commit security hook
   - Commits, pushes genome to remote
   - Exports symmetric key to `keys/<genome>.key`
   - Prints Vaultwarden upload instructions
   - Commits submodule pointer in master repo

After setup completes:

- Upload all files in `keys/` to Vaultwarden (see Key Management)
- Delete key files from disk: `rm keys/*.key`

---

## Makefile Reference

| Target                                                | Description                                                                           |
| ----------------------------------------------------- | ------------------------------------------------------------------------------------- |
| `make setup`                                          | Full system initialisation — master repo + all genomes in `registry.sh`               |
| `make add-genome NAME=x DESC="y" [LINKED=owner/repo]` | Scaffold and register a single new genome (optional linked project)                   |
| `make lint`                                           | Run quality checks across all genomes (schema, privacy, decay, page size)             |
| `make verify-structure`                               | Report directory drift of each genome vs the canonical layout (`lib/structure.sh`)    |
| `make sync-structure`                                 | Create any missing canonical directories across all genomes (safe, idempotent)        |
| `make test`                                           | Run the bats test suite (deterministic; no LLM/GPU/network) — see [Testing](#testing) |
| `make status`                                         | Show submodule status and per-genome git-crypt encryption state                       |
| `make lock`                                           | Lock all encrypted repos (master + all genome submodules)                             |
| `make doctor`                                         | Verify required tools: git, git-crypt, curl, jq; warn if bw missing                   |
| `make sync`                                           | `git submodule update --init --recursive` + report unpushed commits per genome        |
| `make help`                                           | Print all available targets                                                           |

### Examples

```bash
# Check system health
make doctor

# Add a new genome after initial setup
make add-genome NAME=genome-research DESC="Academic papers and deep research"

# Add a genome linked to a project repository
make add-genome NAME=genome-dev DESC="Web development" LINKED=myorg/my-app

# Check every genome against the canonical directory layout
make verify-structure

# Run full lint pass (bash deterministic checks)
make lint

# Sync all nodes after pulling on another machine
make sync

# Emergency lock — secures all repos before leaving a session
make lock
```

---

## Testing

The mechanical layer (slug, index, log, lint, structure, the ingest orchestrator) is
covered by a [bats](https://github.com/bats-core/bats-core) suite. The tests are
**deterministic and have zero dependency on the LLM, the GPU, or the network** — they
simulate the agent's output with fixtures and exercise the scripts directly, so they run
anywhere git + bash live (laptop, CI, a git hook). They are **not** meant to run on the AI
node or via n8n.

```bash
sudo apt install bats # once
make test # or: bats tests/
```

| File              | Covers                                                                         |
| ----------------- | ------------------------------------------------------------------------------ |
| `scripts.bats`    | `slug.sh`, `log-append.sh`, `index-append.py` (insert, sort, bump, idempotent) |
| `lint.bats`       | `lib/lint.sh` validators + `scoped-lint.sh`                                    |
| `structure.bats`  | `lib/structure.sh` report / sync                                               |
| `run-ingest.bats` | `run-ingest.sh` end-to-end (DRY_RUN, local bare remote) — needs `jq`           |

Each test builds its own throwaway genome with a local bare remote, configured to ignore
the operator's global git settings (signing, global hooks) so the suite is hermetic. The
`run-ingest` tests auto-`skip` if `jq` is absent. If you change the canonical layout in
`lib/structure.sh`, update `FIXTURE_DIRS` in `tests/helpers.bash` to match.

> Why this matters: the only non-deterministic part of the system is the model. Pinning
> the mechanical layer with tests means that when an ingest misbehaves, you know it's the
> model or the prompt — not the plumbing.

---

## Genome Lifecycle

### Initial setup

All genomes defined in `registry.sh` are provisioned by `make setup`.

### Adding a genome after initial setup

```bash
make add-genome NAME=genome-newname DESC="Domain description"
```

This creates the remote repo, adds it as a submodule, initialises git-crypt,
scaffolds the directory structure, installs the pre-commit hook, commits and pushes,
exports the key, and commits the submodule pointer in master.

After adding: upload the new key to Vaultwarden and delete the key file.

### Removing a genome

Manual process:

```bash
# In master repo
git submodule deinit genome-name
git rm genome-name
rm -rf .git/modules/genome-name   # drop the cached submodule repository
git commit -m "chore: remove genome-name submodule"
git push
# Archive or delete the remote repository on Forgejo
```

### Template rendering

When a genome is scaffolded, `render_template` replaces these placeholders in all
template files:

| Placeholder             | Source      | Example                        |
| ----------------------- | ----------- | ------------------------------ |
| `{{GENOME_NAME}}`       | registry.sh | `genome-dev`                   |
| `{{GENOME_NAME_UPPER}}` | derived     | `GENOME-DEV`                   |
| `{{GENOME_DESC}}`       | registry.sh | `Web development...`           |
| `{{LINKED_PROJECT}}`    | registry.sh | `myorg/my-app` (or `none`)     |
| `{{FORGEJO_URL}}`       | globals.env | `https://git.yourserver.com`   |
| `{{FORGEJO_USER}}`      | globals.env | `yourusername`                 |
| `{{VAULTWARDEN_URL}}`   | globals.env | `https://vault.yourserver.com` |
| `{{MASTER_REPO}}`       | globals.env | `master-knowledge-genome`      |
| `{{DATE}}`              | runtime     | `2026-05-11`                   |
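Placeholder substitution of this kind can be sketched with `sed`. The function below is a hypothetical stand-in (named `render_sketch` to avoid confusion with the real `render_template` in `lib/scaffold.sh`). It handles only two of the placeholders and assumes the values contain no `|`, `&`, or backslash characters, since those are special to `sed`.

```bash
# render_sketch: read a template on stdin, write the rendered text to stdout.
# '|' is used as the sed delimiter so URLs containing '/' need no escaping.
render_sketch() {
  sed -e "s|{{GENOME_NAME}}|${GENOME_NAME}|g" \
      -e "s|{{FORGEJO_URL}}|${FORGEJO_URL}|g"
}

GENOME_NAME="genome-dev"
FORGEJO_URL="https://git.yourserver.com"
printf '%s\n' '# {{GENOME_NAME}}' 'Remote: {{FORGEJO_URL}}/{{GENOME_NAME}}' | render_sketch
```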

---

## Security Model

### Encryption architecture

Each genome uses a unique symmetric AES-256-CTR key managed by git-crypt.
Two directories in every genome are always encrypted:

| Directory       | Contents                    | On remote          |
| --------------- | --------------------------- | ------------------ |
| `raw/private/`  | Sensitive source material   | Opaque binary blob |
| `wiki/private/` | Private synthesis and notes | Opaque binary blob |

All other directories (`raw/articles/`, `wiki/sources/`, etc.) are plaintext.
Collaborators without the key can contribute to public directories normally —
git handles encrypted files transparently.

### `.gitattributes` — dynamic encryption rules

Encryption rules use a glob wildcard that catches any `private/` directory at
any depth in the repository — including directories created at runtime by the LLM:

```gitattributes
# Text rules first
*.md text eol=lf
*.sh text eol=lf

# Encryption rules LAST (later rules override per-attribute)
# **/private/** ensures -text overrides "*.md text eol=lf", preventing EOL corruption
**/private/** filter=git-crypt diff=git-crypt -text
```

> Rule ordering matters: in `.gitattributes`, the last matching rule wins per attribute.
> Encryption rules must come after text rules so `-text` overrides `text eol=lf`
> for encrypted markdown files.

### Pre-commit hook — dynamic validation

The security hook installed at `.git/hooks/pre-commit` validates every staged file
dynamically — it reads encryption requirements from `.gitattributes` at runtime
rather than checking hardcoded paths:

```bash
# For each staged file, check if git-crypt encryption is required.
# (The loop over staged files is shown for completeness; the installed hook
# may enumerate them differently.)
while IFS= read -r file; do
  filter=$(git check-attr filter -- "$file" | sed 's/.*: //')
  if [[ "$filter" == "git-crypt" ]]; then
    # Verify the file is actually encrypted
    if git-crypt status "$file" | grep -q "not encrypted"; then
      echo "BLOCKED: $file must be encrypted but is not" >&2
      exit 1 # block the commit
    fi
  fi
done < <(git diff --cached --name-only --diff-filter=ACM)
```

This means: any file matching `**/private/**` in `.gitattributes` is protected,
including future `private/` directories created anywhere in the repo.
The hook never needs updating when the encryption rules change.

### Untrusted agent output — manifest validation

The ingest agent's output is stochastic: a hallucinated manifest could carry a missing field,
a wrong type, or a malicious path such as `wiki/../../etc/passwd`. `run-ingest.sh` therefore
**validates the manifest before trusting any field** — it must be well-formed JSON with a
string `raw_source` and an array `pages`, and **every `path` must be a string under `wiki/`
with no `..`**. Anything else fails fast with a structured `{"status":"error"}` and no
filesystem access outside the wiki, so a bad path can't drive a read or a lint outside the
knowledge tree. This is the trust boundary between the (stochastic) model and the
(deterministic, tested) post-processor.
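The path rule can be illustrated in plain shell. This is a sketch of the idea only: `is_safe_wiki_path` is a hypothetical helper, not the code in `run-ingest.sh`, and the JSON shape checks (string `raw_source`, array `pages`) would be done separately, for example with `jq`.

```bash
# is_safe_wiki_path <path>: succeed only for paths that start with wiki/
# and contain no ".." anywhere.
is_safe_wiki_path() {
  case "$1" in
    *..*)   return 1 ;;   # reject any ".." (blocks directory traversal)
    wiki/*) return 0 ;;   # must live under wiki/
    *)      return 1 ;;   # absolute paths and other trees are rejected
  esac
}

for p in "wiki/sources/example.md" "wiki/../../etc/passwd" "/etc/passwd"; do
  if is_safe_wiki_path "$p"; then echo "ok:     $p"; else echo "reject: $p"; fi
done
```

Rejecting every `..` is deliberately stricter than rejecting only `..` path components: it also refuses harmless names like `a..b.md`, which keeps the check trivial to reason about.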

### PRIVATE_CONTEXT toggle

The `PRIVATE_CONTEXT` toggle in `AGENTS.md` controls whether the LLM agent
accesses encrypted directories. It must be declared explicitly by the operator
at the start of every session:

```text
PRIVATE_CONTEXT: disabled   ← Default. private/ directories are treated as non-existent.
PRIVATE_CONTEXT: enabled    ← Agent may read/write private/. Requires git-crypt unlock.
```

Rules:

- Never inferred. Never carried over from a previous session.
- `enabled` requires the operator to confirm that `git-crypt unlock` has run on the host.
- Per-genome, per-session: enabling for `genome-finance` does NOT enable for `genome-dev`.
- Cloud LLM models: `PRIVATE_CONTEXT` must always be `disabled`. Private data never leaves the local network.
- All outputs derived from private data are prefixed `[PRIVATE DATA INCLUDED]`.
- Private synthesis goes exclusively to `wiki/private/` — never to public wiki paths.
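Some of these rules lend themselves to a mechanical pre-flight check before a session starts. The sketch below is hypothetical: the helper name and the `local` / `cloud` labels are assumptions, and it only shows how an orchestrator could refuse invalid combinations.

```bash
# private_context_allowed <state> <model_location>
#   state:          enabled | disabled
#   model_location: local | cloud   (hypothetical labels)
# Succeeds only when the combination is permitted by the rules above.
private_context_allowed() {
  case "$1" in
    disabled) return 0 ;;              # always permitted
    enabled)  [ "$2" = "local" ] ;;    # never with cloud models
    *)        return 1 ;;              # never inferred: anything else is refused
  esac
}

private_context_allowed enabled cloud || echo "refused: cloud model with private context" >&2
```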

### Runtime key injection — zero disk policy

Encryption keys are never stored as persistent files on the AI server.
They are injected at session start via the Bitwarden CLI (`bw`) against
your self-hosted Vaultwarden instance, using process substitution:

```bash
# Step 1: authenticate
bw config server https://vault.yourserver.com
export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw)

# Step 2: unlock genome (key lives only in a kernel file descriptor — never touches disk)
git-crypt unlock <(
  bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d
)
```

The key flows: Vaultwarden → `bw get notes` → `base64 -d` → kernel pipe → `git-crypt`.
At no point is the key written to any file on disk.

Lock a genome when the session ends:

```bash
git-crypt lock
```

---

## Key Management

> This section is for the operator. These commands are never issued by the LLM agent.

### Vaultwarden Secure Notes

Each genome key is stored as a base64-encoded Secure Note in Vaultwarden:

| Genome           | Vaultwarden Note Name |
| ---------------- | --------------------- |
| `genome-dev`     | `genome-dev key`      |
| `genome-finance` | `genome-finance key`  |
| `genome-homelab` | `genome-homelab key`  |

After `make setup` or `make add-genome`, key files are exported to `keys/`.
Upload procedure:

```bash
# Encode the key
base64 < keys/genome-dev.key

# Paste the output into a Vaultwarden Secure Note named "genome-dev key"
# Then delete the key file
rm keys/genome-dev.key
```

### Cloning on a new machine

```bash
# Full clone with all submodules
git clone --recurse-submodules \
  https://git.yourserver.com/yourusername/master-knowledge-genome.git

# Unlock a specific genome (with key file — development only)
cd master-knowledge-genome/genome-dev
git-crypt unlock /path/to/genome-dev.key

# Unlock via Vaultwarden (recommended — no key on disk)
export BW_SESSION=$(bw unlock --passwordenv BW_MASTER_PASSWORD --raw)
git-crypt unlock <(bw get notes "genome-dev key" --session "$BW_SESSION" | base64 -d)

# Single-genome clone — collaborator who only needs one genome
git clone https://git.yourserver.com/yourusername/genome-dev.git
```

### Key rotation (emergency)

If a key is lost or compromised:

```bash
# From the knowledge-genome-orchestrator/ directory
source lib/git-crypt.sh
# If gcrypt_rotate_key operates on the CWD: cd into .../master-knowledge-genome/genome-dev
# If it navigates by name instead: cd into .../master-knowledge-genome
cd ~/knowledge-genome-orchestrator/master-knowledge-genome
gcrypt_rotate_key "genome-dev"
```

> **macOS:** `gcrypt_rotate_key` uses `compgen -G` (bash 4+). The stock macOS bash 3.2 is not
> enough — run rotation under Homebrew bash (`brew install bash`).

`gcrypt_rotate_key` performs:

1. Unlocks repo with existing key
2. Removes old key material
3. Generates new symmetric key via `git-crypt init`
4. Re-stages and commits private files (encrypted with new key)
5. Exports new key to `keys/`
6. Prints Vaultwarden update instructions

> **Limitation:** git history still contains blobs encrypted with the old key.
> Anyone with the old key and git history access can decrypt them. To purge old
> encrypted blobs from history:
>
> ```bash
> git filter-repo --invert-paths --path raw/private --path wiki/private
> git push --force origin main
> ```
>
> This rewrites all commit hashes — coordinate with any collaborators first.

After rotation:

- Upload new key to Vaultwarden (replace existing note)
- Delete both `keys/genome-dev.key` and `keys/genome-dev-rotated-*.key` from disk
- Revoke access from previous key holders

---

## Agent Sessions

### Prerequisites for every session

Before starting an LLM agent session on a genome:

1. The host (AI server) runs `git-crypt unlock` for the required genomes
2. The orchestrator prepares context: `tail -n 20 wiki/log.md`
3. The operator declares the `PRIVATE_CONTEXT` state explicitly in the opening prompt
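The context-preparation step can be sketched as a small function that assembles the opening prompt. The function name and the exact prompt layout are assumptions; only `tail -n 20 wiki/log.md` and the explicit `PRIVATE_CONTEXT` declaration come from the steps above.

```bash
# build_session_context <genome_dir> <private_state>
# Prints the PRIVATE_CONTEXT declaration followed by the last 20 log lines.
# The state has no default here: it must be passed explicitly, never inferred.
build_session_context() {
  genome_dir=$1
  private_state=$2
  if [ -z "$private_state" ]; then
    echo "error: PRIVATE_CONTEXT state must be declared explicitly" >&2
    return 1
  fi
  echo "PRIVATE_CONTEXT: $private_state"
  echo "--- recent log ---"
  tail -n 20 "$genome_dir/wiki/log.md"
}

# Example call (assumes a genome checked out at this hypothetical path):
# build_session_context ~/master-knowledge-genome/genome-dev disabled
```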

### Session start protocol

The agent executes in this order at the start of every session:

1. Read `wiki/index.md` — primary catalog of all pages and maturity
2. Read last 20 log entries (injected by orchestrator — does NOT open `wiki/log.md` directly)
3. For tasks involving related pages: if the optional `qmd` extension is installed,
   `qmd search "<query>"` before opening files; otherwise navigate from `wiki/index.md`
4. Operate on individual files — never scan entire directories

### One source per session

With a 14B model and ~6GB KV cache budget, long sessions degrade.
As the session extends, the context fills with pages already created,
attention dilutes, and later entities receive worse cross-references than earlier ones.

**Hard rule: one source per session.**
If multiple sources are queued in `raw/`, process only the first.
Commit, close the session. The orchestrator (n8n or script) starts a new session
for the next source with a clean KV cache.

For automated pipelines: if 5 files arrive in `raw/`, trigger 5 agent sessions
sequentially — not one session with 5 files.
|
||
|
||
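The scheduling rule above can be sketched as a small driver loop. This is an illustrative sketch, not the project's orchestrator: `run_session` is a hypothetical callable standing in for whatever starts one agent session (an n8n node or a script) and returns once that session has finished. Stopping at the first failure is an assumption of the sketch, not something the rule itself specifies.

```python
from typing import Callable, Iterable, List


def process_queue(sources: Iterable[str],
                  run_session: Callable[[str], bool]) -> List[str]:
    """Run one agent session per source, strictly in order.

    `run_session` is a hypothetical callable: it starts a fresh session
    (clean KV cache) for a single source and returns True on success.
    The loop stops at the first failure so that later sources are not
    processed on top of an unknown state.
    """
    processed = []
    for source in sources:
        # One source per session: never pass more than one path.
        if not run_session(source):
            break
        processed.append(source)
    return processed
```

With five files queued, `process_queue(files, run_session)` calls `run_session` at most five times, each with exactly one path, and never starts the next session before the previous one has returned.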
### n8n automation

For Forgejo webhook → automated ingest:

1. Forgejo sends a webhook on push to `raw/`
2. n8n receives the webhook and identifies new files
3. n8n starts one agent session per new file (sequential, not parallel)
4. Each session: realign the checkout to the base (`git fetch && git switch <base> && git reset --hard origin/<base>`), then inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
5. Phase 1 agent (`/skill:ingest`) writes the manifest; Phase 2 `run-ingest.sh` opens the PR, then **stops**
6. Human reviews — **merge to accept**, or close the PR + delete the `feat` branch to reject

---

## Workflows

### Ingest

Triggered by a new file in `raw/` (manual or via webhook). Ingest is split into two
phases so that the small local model spends its limited context only on judgement, and
all the deterministic bookkeeping happens outside the model's loop.

**Phase 1 — agent (semantic only).** The `ingest` skill gives the agent read/edit tools
only (no shell). It:

1. Reads the source once
2. Creates `wiki/sources/<slug>.md` — summary and key points
3. Per entity (person, tool, organisation): creates or updates `wiki/entities/<name>.md`
4. Per concept (pattern, theory, decision): creates or updates `wiki/concepts/<name>.md`
5. Checks each touched page for contradictions → applies Conflict Resolution if found
6. Writes `.ingest-manifest.json` (the list of pages it created/modified, the model name,
   a one-line reasoning, the PR summary, and any contradictions) — then **stops**

**Phase 2 — `run-ingest.sh` (deterministic, outside the agent).** The post-processor first
**validates the manifest** — well-formed JSON, expected shape, and every page path confined to
`wiki/` with no `..` (see [Security Model](#security-model)) — then does the mechanical work the
model must not waste context on:

7. Inserts each page into the correct `wiki/index.md` section **in alphabetical order**,
   deduplicated by wikilink (a re-ingest updates the entry, never duplicates it), and bumps the
   index `last_updated` (`index-append.py`)
8. Appends the `INGEST | <slug>` entry to `wiki/log.md` (the model name comes from the
   orchestrator via `INGEST_MODEL` — the agent cannot reliably know its own tag)
9. Runs scoped lint on exactly the pages touched this run (`scoped-lint.sh`, reusing
   `lib/lint.sh`)
10. Commits **only `wiki/`** on `feat/ai-ingest-<slug>` and opens a PR against the integration
    base (`INGEST_BASE`, default `main`); the body matches the `templates/pr-description.md`
    structure (Summary / Pages / Contradictions / Scoped Lint)
11. Emits a single compact JSON line (status, slug, PR url, lint_clean, conflict) for n8n

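The manifest validation that opens Phase 2 can be sketched as follows. This is a minimal illustration, not the code in `run-ingest.sh`: the real manifest schema is not shown in this document, so the key name `pages` and the function name `validate_manifest` are assumptions. The path rule itself (confined to `wiki/`, no `..`) is the one stated above.

```python
import json
from pathlib import PurePosixPath


def validate_manifest(text: str) -> list:
    """Return a list of problems; an empty list means the manifest passed.

    The key name "pages" is an assumption made for this sketch.
    """
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    if not isinstance(manifest, dict):
        return ["manifest must be a JSON object"]
    pages = manifest.get("pages")
    if not isinstance(pages, list) or not pages:
        return ["'pages' must be a non-empty list"]

    problems = []
    for page in pages:
        if not isinstance(page, str):
            problems.append(f"page entry is not a string: {page!r}")
            continue
        parts = PurePosixPath(page).parts
        # Confine every path to wiki/: relative, rooted at wiki/, no "..".
        if len(parts) < 2 or parts[0] != "wiki" or ".." in parts:
            problems.append(f"path escapes wiki/: {page}")
    return problems
```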
The agent never runs git, never edits the index/log mechanically, and never lints — those
are deterministic and tested (see [Testing](#testing)). Invocation on the AI node:

```bash
pi --mode json -p "/skill:ingest raw/articles/<file>.md"   # phase 1 → writes manifest
run-ingest.sh <genome>                                      # phase 2 → index/log/lint/PR
```

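Step 7 of Phase 2 (the sorted, deduplicated index insert) can be illustrated with a short sketch. It is not the contents of `index-append.py`: the function name `upsert_entry` and the bullet layout `- [[target]]: summary` are assumptions chosen for the example. Only the behaviour (alphabetical order, one bullet per wikilink, re-ingest updates in place) comes from the description above.

```python
import re

# Captures the target of [[target]] or [[target|alias]].
WIKILINK = re.compile(r"\[\[([^\]|]+)")


def upsert_entry(section: list, wikilink: str, summary: str) -> list:
    """Insert or update one bullet in an index section, kept sorted.

    `section` holds the bullet lines of a single index section. The
    bullet layout "- [[target]]: summary" is assumed for this sketch.
    """
    new_line = f"- [[{wikilink}]]: {summary}"
    kept = []
    for line in section:
        match = WIKILINK.search(line)
        # Drop any existing bullet for the same target (re-ingest case).
        if match and match.group(1) == wikilink:
            continue
        kept.append(line)
    kept.append(new_line)

    def sort_key(line):
        match = WIKILINK.search(line)
        return (match.group(1) if match else line).lower()

    return sorted(kept, key=sort_key)
```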
For private sources (`PRIVATE_CONTEXT: enabled` required):

- All output goes to `wiki/private/<slug>.md` only
- PR title: `[PRIVATE] ingest: <slug>`

**Branch lifecycle & the manual gate.** `run-ingest.sh` / `open-pr.sh` are deliberately
"dumb": they create the `feat/ai-ingest-<slug>` branch, commit only `wiki/`, open the PR, and
stop. They never reset, revert, or touch the integration branch — that lifecycle belongs to
the orchestrator, around the human gate:

- **Before each session** the orchestrator realigns the checkout to the base
  (`git fetch && git switch <base> && git reset --hard origin/<base>`) — a reset of the _local_
  checkout to match the remote, never a force-push to the shared branch.
- **After the PR opens, everything stops** until a human approves: one source per session,
  sequential, no new ingest until the pending PR is closed.
- **Approve = merge. Reject = close the PR and delete the remote `feat` branch.** To undo an
  already-merged ingest, open a _revert PR_ against the base — never rewrite history on a
  shared branch.

The PR base is configurable via `INGEST_BASE` (default `main`). Per-page `maturity` already
encodes stability and tags/releases mark versioned snapshots, so `main` is the integration
branch today. If a linked project later _consumes_ a genome, set `INGEST_BASE=develop` to
buffer ingests on `develop` and cut manual `develop → main` releases — no code change.

### Query

Triggered by an operator question.

1. `qmd search "<query>"` (if the optional qmd extension is installed) → identify
   candidate pages; otherwise start from `wiki/index.md`
2. Read candidate pages directly (qmd already returns file paths — no intermediate index lookup)
3. Synthesise answer with `[[wikilink]]` citations
4. If answer is non-trivial: save as `wiki/queries/<slug>.md` and append to index
5. Append log entry: `QUERY | <subject>`

For general orientation without a specific query: read `wiki/index.md` directly.

### Lint

The lint workflow is split between deterministic bash checks and semantic LLM judgment.

**Step 1 — operator runs bash linter:**

```bash
make lint
```

The bash linter checks automatically:

- YAML frontmatter validity (all mandatory fields present)
- Domain consistency (domain field matches genome name)
- Type validity (value from allowed list)
- Privacy consistency (pages under `private/` directories have `private: true`)
- Page size (warn at 400 lines, error at 800 lines)
- Knowledge decay (stable > 180 days, draft > 90 days)
- Broken internal wikilinks (warnings only — cross-type links produce expected false positives)

**Step 2 — operator provides bash output to LLM agent:**

The agent supplies the semantic judgment that the bash linter cannot:

- **Orphan pages** (from bash list): for each orphan, identify 1-3 existing pages
  that should link to it; propose specific additions
- **Implicit concepts** (from bash term frequency list): determine if a candidate
  term warrants a dedicated page; draft stub if yes
- **Duplicate concepts**: `qmd search "<concept>"` for suspected duplicates;
  propose merge if confirmed
- **Maturity promotion**: pages with 2+ sources still marked `draft` → propose `stable`

The agent reports all findings as a structured list. It does not modify files
without operator approval. It appends a `LINT | <summary>` log entry.

---

## Knowledge Quality

### PR review workflow

Every agent session that modifies wiki pages opens a PR.
The PR description uses `templates/pr-description.md`:

```markdown
## Summary

One sentence: goal of this session and source processed.

## Pages Created

| Path | Type | Maturity |

## Pages Modified

| Path | Change |

## Contradictions Found

[ ] None / [ ] n conflict file(s) created

## Private Data Accessed

[ ] No (PRIVATE_CONTEXT: disabled) / [ ] Yes

## Scoped Lint (post-ingest)

[ ] Frontmatter valid [ ] No broken links [ ] No issues found
```

This makes human review fast and structured: read the table, scan the diff,
approve or request changes. No exploration required to understand what the agent did.

### Conflict resolution

When new evidence contradicts an existing wiki claim:

1. Keep the existing page unchanged
2. Create `wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md` with:
   - The existing claim and its source
   - The contradicting evidence and its source
   - Agent confidence assessment for each
   - Recommendation: `accept_b` | `keep_a` | `requires_human_review`
3. Add entry to `wiki/index.md` → Conflicts Pending Review section
4. Log entry: `CONFLICT | <concept>`
5. Open PR: `[CONFLICT] <concept> — human review required`

The operator resolves the conflict, updates relevant pages, closes the PR.

### Knowledge decay

Pages have a `last_updated` field in frontmatter. During lint passes:

| Maturity | Threshold | Action                                 |
| -------- | --------- | -------------------------------------- |
| `stable` | 180 days  | Flag as stale — add `⚠️ STALE` callout |
| `draft`  | 90 days   | Flag as stale — add `⚠️ STALE` callout |

The agent proposes re-validation but does not change `maturity` without new source evidence.

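The decay rule in the table reduces to a date comparison. The sketch below restates it in Python for illustration; the project's own check lives in the bash linter, and the function name `is_stale` is hypothetical. It assumes that "stale" means strictly older than the threshold, which the table does not spell out.

```python
from datetime import date
from typing import Optional

# Thresholds in days, taken from the table above.
STALE_AFTER_DAYS = {"stable": 180, "draft": 90}


def is_stale(maturity: str, last_updated: str,
             today: Optional[date] = None) -> bool:
    """Return True when a page is older than its maturity threshold.

    `last_updated` is the frontmatter value in YYYY-MM-DD form. Other
    maturities (e.g. deprecated) have no threshold and are never flagged.
    """
    limit = STALE_AFTER_DAYS.get(maturity)
    if limit is None:
        return False
    today = today or date.today()
    age = (today - date.fromisoformat(last_updated)).days
    # Assumption: "stale" means strictly older than the threshold.
    return age > limit
```

Passing `today` explicitly keeps the check reproducible in tests; in a lint run it defaults to the current date.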
### Cross-genome references

Cross-domain knowledge moves by **pull, never push**: the genome you are working in draws
material _in_; nothing is ever written into another genome. There are **no cross-genome
wikilinks** — submodule pointers make relative paths brittle.

When the working genome needs a concept that lives elsewhere, the **navigation skill** handles
it in the same two-phase shape as ingest:

1. A deterministic collector clones the relevant genomes **read-only at HEAD** (fresh — never the
   pinned submodule state) and assembles a dossier of excerpts with provenance.
2. A semantic pass reads only that dossier; the skill then deposits **one** abstract, non-private
   raw into the working genome at `raw/articles/crossgen-<topic>-<date>.md`.
3. That raw goes through the working genome's normal ingest → PR → human gate, like any source.

Which genomes may be read as **sources** is gated by a per-genome `cross_source: yes|no` flag: a
confidential genome (e.g. a client file) is marked `no` and is never read as a source — the wall
is structural, not a matter of the agent's discipline. The master `AGENTS.md` holds the full
boundary contract.

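The `cross_source` gate can be pictured as a filter that runs before the collector clones anything. This sketch is illustrative only: where the flag is stored and how the collector reads it are not described here, so the plain dict and the function name `readable_sources` are assumptions.

```python
def readable_sources(genomes: dict, working: str) -> list:
    """Return the genomes the collector may clone as read-only sources.

    `genomes` maps a genome name to its `cross_source` flag ("yes"/"no").
    Where the flag is actually stored is not specified here; a plain
    dict is used only for illustration.
    """
    allowed = []
    for name, flag in sorted(genomes.items()):
        if name == working:
            # The working genome pulls material in; it is not its own source.
            continue
        # Fail closed: anything other than an explicit "yes" is excluded.
        if flag == "yes":
            allowed.append(name)
    return allowed
```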
---

## Knowledge Schema

### Frontmatter

Every wiki page must start with valid YAML frontmatter:

```yaml
---
title: "Strict String Title"
type: source | entity | concept | query | conflict | private
domain: genome-name
tags: [lowercase, hyphen-separated]
maturity: draft | stable | deprecated
last_updated: YYYY-MM-DD
private: true | false
---
```

| Field                  | Rules                                                                    |
| ---------------------- | ------------------------------------------------------------------------ |
| `type`                 | Must be one of: `source entity concept query conflict private index log` |
| `maturity: draft`      | Single source or unvalidated                                             |
| `maturity: stable`     | Confirmed by 2+ independent sources                                      |
| `maturity: deprecated` | Superseded — add `> **DEPRECATED:** <reason>` callout at top             |
| `private: true`        | Required on all pages in `wiki/private/` and `raw/private/`              |

Do not use semantic versioning for content. Git history tracks every change.
`maturity` captures epistemic state; `last_updated` tracks recency.

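The field rules above can be restated as a small checker. This is a sketch for illustration, not the project's linter (which is bash, in `lib/lint.sh`): it assumes the frontmatter has already been parsed into a dict by some YAML parser, and the function name `check_frontmatter` is hypothetical.

```python
import re

ALLOWED_TYPES = {"source", "entity", "concept", "query",
                 "conflict", "private", "index", "log"}
ALLOWED_MATURITY = {"draft", "stable", "deprecated"}
REQUIRED = ("title", "type", "domain", "tags", "maturity",
            "last_updated", "private")
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_frontmatter(meta: dict, genome: str, path: str) -> list:
    """Return human-readable problems for one page's parsed frontmatter."""
    problems = [f"missing field: {key}" for key in REQUIRED if key not in meta]
    if problems:
        return problems
    if meta["type"] not in ALLOWED_TYPES:
        problems.append(f"invalid type: {meta['type']}")
    if meta["maturity"] not in ALLOWED_MATURITY:
        problems.append(f"invalid maturity: {meta['maturity']}")
    if meta["domain"] != genome:
        problems.append(
            f"domain {meta['domain']!r} does not match genome {genome!r}")
    if not DATE.match(str(meta["last_updated"])):
        problems.append("last_updated must be YYYY-MM-DD")
    # Pages under a private/ directory must carry private: true.
    if "/private/" in f"/{path}" and meta["private"] is not True:
        problems.append("page under private/ must set private: true")
    return problems
```

An empty list means the page passed; each string names one violated rule, which maps directly onto the checklist in the PR description.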
### Page types and directories

| Type       | Directory                    | Description                                  |
| ---------- | ---------------------------- | -------------------------------------------- |
| `source`   | `wiki/sources/`              | One page per processed raw source            |
| `entity`   | `wiki/entities/`             | People, tools, organisations, projects       |
| `concept`  | `wiki/concepts/`             | Patterns, theories, architectural decisions  |
| `query`    | `wiki/queries/`              | Preserved answers and analyses               |
| `conflict` | `wiki/queries/conflict-*.md` | Unresolved contradictions                    |
| `private`  | `wiki/private/`              | Private synthesis (PRIVATE_CONTEXT: enabled) |
| `index`    | `wiki/index.md`              | Primary navigation catalog (singleton)       |
| `log`      | `wiki/log.md`                | Operations ledger (singleton)                |

### Page size limits

| Limit    | Lines | Action                              |
| -------- | ----- | ----------------------------------- |
| Soft cap | 400   | Bash linter warns                   |
| Hard cap | 800   | Bash linter errors — split the page |

These limits ensure pages fit within the LLM context window without attention degradation
and keep the wiki atomically navigable.

### Linking conventions

- **Intra-genome:** `[[folder/file]]` — Obsidian wikilinks only.
- **Cross-genome:** NOT supported via wikilink — submodule pointers make relative paths brittle. When the working genome needs a concept that lives elsewhere, the navigation skill **pulls it in** as one abstract raw under _this_ genome's `raw/articles/`, which then goes through normal ingest. See [Cross-genome references](#cross-genome-references).
- **External:** `[text](https://...)` — standard Markdown.

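The intra-genome convention makes a basic broken-link check straightforward. The sketch below is illustrative and is not the bash linter's implementation: it assumes each `[[folder/file]]` target resolves to `<wiki_root>/folder/file.md`, and the function name `broken_links` is hypothetical.

```python
import re
from pathlib import Path

# Captures the target of [[target]], [[target|alias]] and [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def broken_links(page_text: str, wiki_root: Path) -> list:
    """Return wikilink targets in `page_text` with no matching .md file.

    Targets are resolved as `<wiki_root>/<folder>/<file>.md`, following
    the `[[folder/file]]` convention above.
    """
    missing = []
    for target in WIKILINK.findall(page_text):
        target = target.strip()
        if not (wiki_root / f"{target}.md").is_file():
            missing.append(target)
    return missing
```

Because the check resolves every target against a single root, links written in any other form would show up as missing, so results are best treated as warnings rather than errors.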
### Log format

Every operation appends one entry to `wiki/log.md`:

```markdown
## [YYYY-MM-DD] TYPE | Subject

- run_id: `<uuid>`
- model: `<model-name>`
- context_read: `[[path/A]]`, `[[path/B]]`
- output_written: `[[path/C]]`
- reasoning: One sentence — what changed and why.
```

Valid TYPEs: `INGEST` `LINT` `QUERY` `CONFLICT` `CONFIG` `SECURITY`

Parse examples:

```bash
grep "^## \[" wiki/log.md | tail -5           # Last 5 entries
grep "^## \[" wiki/log.md | grep "CONFLICT"   # All conflicts
grep "^## \[2026-05" wiki/log.md              # Entries from a specific month
```

The orchestrator always injects only `tail -n 20 wiki/log.md` into agent context.
The LLM never loads the full log.

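For tooling that needs structured data rather than `grep` output, the entry header can be parsed with one regular expression. This is a minimal sketch based only on the header format and TYPE list shown above; the function name `parse_headers` is hypothetical and the bullet fields under each header are ignored.

```python
import re

# Matches "## [YYYY-MM-DD] TYPE | Subject".
HEADER = re.compile(r"^## \[(\d{4}-\d{2}-\d{2})\] ([A-Z]+) \| (.+)$")
VALID_TYPES = {"INGEST", "LINT", "QUERY", "CONFLICT", "CONFIG", "SECURITY"}


def parse_headers(log_text: str) -> list:
    """Return (date, type, subject) tuples for each valid entry header."""
    entries = []
    for line in log_text.splitlines():
        match = HEADER.match(line)
        # Headers with an unknown TYPE are skipped rather than guessed at.
        if match and match.group(2) in VALID_TYPES:
            entries.append(match.groups())
    return entries
```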
---

## Collaboration Model

| Role           | Key access        | Permitted operations                                                          |
| -------------- | ----------------- | ----------------------------------------------------------------------------- |
| Owner          | Full — key holder | Read/write everywhere                                                         |
| Collaborator   | None              | Push to `raw/articles/`, `raw/transcripts/`, `raw/code-packs/`, `raw/assets/` |
| Local AI agent | Conditional       | `private/` only when `PRIVATE_CONTEXT: enabled`                               |
| Cloud AI model | Never             | `PRIVATE_CONTEXT` must be `disabled`; private data stays on local network     |

To grant collaborator access, add the user as a Forgejo contributor with the Write role.
Never share the git-crypt key — collaborators operate exclusively in public directories.

---

## Optional Extensions

### qmd — local Markdown search

[qmd](https://github.com/tobi/qmd) is a local, on-device BM25 + vector search
engine for Markdown files. It has both a CLI (for shell scripts and agent tool calls)
and an MCP server (for native LLM tool use).

Recommended at scale: once a genome exceeds ~150 pages, `qmd search` is significantly
faster and more accurate than navigating `wiki/index.md` manually.

```bash
# Index a genome
qmd index genome-dev/wiki/

# Search
qmd search "graph-based state management"

# Start MCP server (for Claude Code / Codex integration)
qmd serve --port 3333
```

### Obsidian integration

Obsidian is the recommended wiki browser. Open any genome directory as an Obsidian vault.

Recommended setup:

- **Graph view** — visualise page connections; spot orphans and hubs instantly
- **Obsidian Web Clipper** — browser extension to clip articles directly to `raw/articles/`
  as Markdown
- **Download attachments** — in Settings → Hotkeys, bind "Download attachments for current file"
  to a hotkey (e.g. Ctrl+Shift+D). After clipping, press it to download all images to `raw/assets/`
- **Dataview plugin** — query YAML frontmatter across the wiki;
  `TABLE maturity, last_updated WHERE domain = "genome-dev"` generates dynamic tables
- **Marp plugin** — render Markdown as slide decks directly from wiki content

Note: `.obsidian/` is in `.gitignore`. Workspace and plugin settings are local — not synced.

### n8n automation

n8n (running on the storage node) can automate the ingest pipeline:

1. Forgejo webhook fires on push to a genome's `raw/` directory
2. n8n flow identifies new files
3. For each new file: starts one agent session (sequential — never parallel)
4. Each session receives: `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
5. Phase 1 — agent runs `/skill:ingest` (semantic → writes manifest); Phase 2 —
   `run-ingest.sh` does index/log/lint and opens the PR, returning one JSON line to n8n
6. Human reviews the PR

Key constraint: one source per session, sessions sequential.
Never batch multiple sources into one agent session.

### Intel NPU offloading

If the AI compute node has an Intel NPU (e.g. Core Ultra series):

- Background/auxiliary tasks (OCR of `raw/assets/`, async summarisation, or qmd
  re-indexing **if** the optional qmd extension is in use) → Intel NPU via OpenVINO
- Active reasoning sessions (ingest, query, synthesis) → GPU

Note: the core system has no embedding pipeline (see [Core Philosophy](#core-philosophy)),
so there is nothing to embed here — the NPU is only for auxiliary work. This keeps the
GPU's KV cache free for interactive sessions and lowers power draw for background jobs.

---

## Troubleshooting

### `git-crypt: command not found`

```bash
# Ubuntu/Debian
sudo apt install git-crypt

# macOS
brew install git-crypt
```

### `make setup` fails with "MISSING: jq"

```bash
make doctor    # identifies all missing tools
sudo apt install git git-crypt curl jq
```

### Pre-commit hook blocks a commit with "PLAINTEXT LEAK DETECTED"

The staged file is in a path matching `**/private/**` but is not encrypted.

Fix options:

1. Verify `.gitattributes` contains `**/private/** filter=git-crypt diff=git-crypt -text`
2. Run `git-crypt init` if git-crypt is not initialised in this repo
3. Run `git-crypt status` to check the encryption state of all files

Never use `git commit --no-verify` to bypass this check.

### `git-crypt status` shows files as "not encrypted" after init

The `.gitattributes` rule must be committed before files in `private/` are staged.
If files were staged before `.gitattributes` was committed:

```bash
git rm -r --cached raw/private/ wiki/private/
git add raw/private/ wiki/private/
git commit -m "fix: re-stage private files for encryption"
```

### Agent returns stale or missing cross-references

Likely causes:

1. Session was too long — KV cache degraded. Use one source per session.
2. `wiki/index.md` was not read at session start — agent lacked the page catalog.
3. qmd index is stale — re-index: `qmd index <genome>/wiki/`

### Submodules show as "modified" after `make sync`

This is normal if genome repos have new commits. Update master's pointers:

```bash
cd master-knowledge-genome
git add .
git commit -m "chore: update submodule pointers"
git push
```

### bw unlock fails

Verify you are using `bw` (standard Bitwarden CLI), not `bws` (Secrets Manager CLI).
`bws` does not work with self-hosted Vaultwarden.

```bash
bw --version    # should print e.g. "2024.x.x"
bw config server https://vault.yourserver.com
bw login
```