Matteo Cherubini 22239f4bb5 docs(crossgen): Streamline knowledge pull, remove agent synthesis step

2026-06-10 16:52:50 +02:00

SYSTEM DIRECTIVE — `{{MASTER_REPO}}`

Identity

| Field | Value |
| --- | --- |
| Repo | `{{MASTER_REPO}}` |
| Owner | `{{FORGEJO_USER}}` |
| Remote | `{{FORGEJO_URL}}/{{FORGEJO_USER}}/{{MASTER_REPO}}` |

Role: Cross-genome coordinator for the Knowledge Genome network. Metrics: no cross-genome boundary violations · submodule pointers current · cross-genome discoveries routed to target raw/ · zero stale submodule-relative wikilinks.

Architecture

{{MASTER_REPO}}/
├── core-karpathy/      ← Reference pattern — read-only, never modify
├── genome-example/     ← Submodule placeholder (replace with your domain)
└── AGENTS.md

Each genome has its own AGENTS.md with domain-specific rules. Genome-level operations are governed by the genome's AGENTS.md, not this file.

Global Security Rules

PRIVATE_CONTEXT scope

Toggle is per-genome and per-session. Enabling for genome-finance does NOT enable for genome-dev.
Cloud LLMs: PRIVATE_CONTEXT must be disabled for all genomes. Private data never leaves the local network.

Log sanitization

Never print decrypted secrets, session tokens, or key contents to stdout or log files.
Document only run_id and genome name — never the key value.

Key management

Key injection is the host's responsibility — executed before this session starts.
Never write, suggest, or generate scripts that save .key files to disk.

Immutable Rules

Operate within ONE genome at a time. No atomic commits across multiple genomes.
core-karpathy is read-only. Never commit to it.
Cross-genome references are NEVER expressed as wikilinks. When a concept belongs to another genome, use the navigation skill to emit a raw stub into that genome's raw/articles/ and let its own ingest pipeline handle it asynchronously.
Never commit to main in any genome. PRs required; no self-merge.
Per-genome AGENTS.md governs all wiki operations within that genome. This file governs boundaries only.

NEVER

Load multiple wiki/index.md files simultaneously for cross-genome comparison — use qmd.
Run git-crypt, bw, or Vaultwarden commands — host responsibility.
Modify files in more than one genome in the same operation.
Create cross-genome wikilinks (e.g., [[../genome-*/wiki/...]]). All cross-domain connections must be routed via the navigation skill as raw stubs.
Modify core-karpathy in any way.
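
The wikilink ban above can be checked mechanically before a commit. The following is a minimal sketch, assuming cross-genome links look like the `[[../genome-*/wiki/...]]` example; the function name and the regular expression are illustrative, not part of any existing tooling.

```python
import re

# Assumed shape of a forbidden link: a wikilink whose target climbs out of
# the current genome ("../") into a sibling directory named genome-*.
CROSS_GENOME_LINK = re.compile(r"\[\[\s*(?:\.\./)+genome-[^\]]*\]\]")


def find_cross_genome_links(text: str) -> list[str]:
    """Return every wikilink in `text` that points into another genome."""
    return [match.group(0) for match in CROSS_GENOME_LINK.finditer(text)]


if __name__ == "__main__":
    page = "See [[../genome-dev/wiki/concepts/ci.md]] and [[local-page]]."
    for link in find_cross_genome_links(page):
        print("forbidden cross-genome wikilink:", link)
```

Local wikilinks such as `[[local-page]]` are left alone; only targets that leave the genome are reported.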

ASK FIRST

Any operation that touches two or more genomes.
Updating submodule pointers in master.
Any key rotation procedure.
Enabling PRIVATE_CONTEXT — operator must confirm git-crypt unlock ran on host.

Session Start

Identify which genome(s) this session involves.
Read the relevant genome's wiki/index.md — not all genomes' indexes.
For cross-genome discovery: qmd search "<concept>" across the multi-genome index.
Operate on one genome at a time. Switch genome only when the previous operation is committed.

Cross-genome knowledge moves by pull, never push: the genome you are working in draws material in; nothing is ever written into another genome. The cross-genome reading is performed by a deterministic collector outside any agent's context, so the agent still operates within ONE genome (Immutable Rule 1 holds). The cross_source registry flag decides which genomes may be read as sources.

There is no separate synthesis step: retrieving and then distilling twice would only add LLM cost and lose information. The collector retrieves (like a search) and deposits the result as a raw; the working genome's own ingest distills it once, for this genome's needs.

How it works

Two actors:

Collector (collect-crossgen.sh, deterministic, agent-free). Makes a disposable, read-only clone of each genome flagged cross_source: yes at its remote HEAD, for freshness; it never uses the pinned submodule state. The clone is keyless, so private/ stays an encrypted blob and is unreadable. The collector indexes the public wikis with qmd, runs qmd search "<topic>", and assembles a dossier: the text of the matching pages plus per-excerpt provenance (source genome, page, HEAD short-sha, date), with every [[wikilink]] neutralized to plain text. It deposits the dossier as one raw in the working genome at raw/articles/crossgen-<topic>-<YYYY-MM-DD>.md, commits, and pushes. Nothing is written to any source genome.
Target ingest. The working genome's standard ingest reads that raw as an ordinary source and distills it into wiki pages for the local domain — one semantic pass → PR → human gate. Same gate as any other source.
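
The wikilink neutralization step performed by the collector could look roughly like this. It is a sketch only: collect-crossgen.sh is a shell script whose actual implementation is not shown here, and the supported link forms (`[[target]]` and `[[target|label]]`) are an assumption.

```python
import re

# Assumed wikilink forms: [[target]] and [[target|label]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def flatten_wikilinks(text: str) -> str:
    """Replace each wikilink with its visible text so no link syntax survives."""

    def to_plain(match: re.Match) -> str:
        target, label = match.group(1), match.group(2)
        # Prefer the display label when one is given, else the target itself.
        return (label or target).strip()

    return WIKILINK.sub(to_plain, text)


if __name__ == "__main__":
    print(flatten_wikilinks("Uses [[Rate Limiter]] and [[concepts/backoff|backoff]]."))
```

Because the substitution is a pure text transform, running it twice gives the same result as running it once, which suits a deterministic collector.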

When to pull

Pull is initiated deliberately (operator- or context-driven, never on a timer). Produce a crossgen raw ONLY when all three hold:

Ownership elsewhere. The concept, entity, or pattern is defined and maintained in another genome, and you need it framed for the working domain.
Structural relevance. It influences decisions, patterns, or entities here — not a casual mention.
No fresh local coverage. qmd search "<concept>" in the working genome returns nothing, or only a stub that needs enrichment.

If in doubt, do NOT pull. A missed cross-reference is cheaper than crossgen spam.
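
The three conditions and the "if in doubt" default can be written as a single predicate. This is an illustrative sketch; the `PullCheck` type and its field names are hypothetical, and `None` is used here to stand for "not sure".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PullCheck:
    """Answers to the three conditions; None means the answer is uncertain."""

    owned_elsewhere: Optional[bool]
    structurally_relevant: Optional[bool]
    fresh_local_coverage: Optional[bool]


def should_pull(check: PullCheck) -> bool:
    """Pull only when all three conditions hold with certainty."""
    return (
        check.owned_elsewhere is True
        and check.structurally_relevant is True
        and check.fresh_local_coverage is False
    )


if __name__ == "__main__":
    print(should_pull(PullCheck(True, True, False)))  # all three hold
    print(should_pull(PullCheck(True, None, False)))  # doubt, so no pull
```

Comparing with `is True` and `is False` makes any uncertain answer fall through to "do not pull".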

Boundaries (enforced by the master)

Sources are restricted to cross_source: yes genomes. A genome flagged no (e.g., a confidential client genome) is NEVER read as a source: the collector never clones it. The wall is structural, not a matter of the agent's discipline.
Keyless collection. The collector holds no git-crypt key, so private/ stays ciphertext and cannot be read — privacy does not depend on the agent behaving.
Sources are read-only, at HEAD. No write, commit, branch, or PR in any genome other than the one being worked on.
NEVER git submodule update --remote. Read other genomes via disposable read-only clones — never by moving this master's submodule pointers (that is ASK FIRST).
The deposited raw must contain no wikilinks and no private data; it is processed by the working genome's normal ingest + human gate.
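
The source restriction can be sketched as a fail-closed filter over the registry. The registry layout below (a mapping from genome name to an entry holding `cross_source`) is an assumption made for illustration; the real registry format is not described in this file.

```python
def select_sources(registry: dict[str, dict], working: str) -> list[str]:
    """Return the genomes that may be cloned as sources, in a stable order.

    Fail closed: a genome is a source only if it is explicitly flagged yes.
    A missing flag, "no", or any other value excludes it.
    """
    sources = []
    for name, entry in registry.items():
        if name == working:
            continue  # the working genome is never its own source
        flag = entry.get("cross_source")
        # YAML 1.1 parsers read a bare `yes` as True, so accept both forms.
        if flag is True or flag == "yes":
            sources.append(name)
    return sorted(sources)


if __name__ == "__main__":
    registry = {
        "genome-dev": {"cross_source": "yes"},
        "genome-client": {"cross_source": "no"},
        "genome-finance": {"cross_source": "yes"},
    }
    print(select_sources(registry, working="genome-finance"))
```

Filtering before any clone happens is what keeps the wall structural: an excluded genome is never fetched, so nothing downstream can read it.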

Output raw (the only artifact written)

Path (in the working genome): raw/articles/crossgen-<topic>-<YYYY-MM-DD>.md

Plain text. No YAML frontmatter (raw is immutable input). No wikilinks of any kind: [[...]] from source pages are flattened to plain text so they never become broken cross-references here.

> Cross-genome pull | Into: genome-<working> | Query: "<topic>" | Date: YYYY-MM-DD

## From genome-<a> — wiki/concepts/<x>.md (HEAD <short-sha>)
[retrieved page text — wikilinks flattened to plain text, no private data]

## From genome-<b> — wiki/entities/<y>.md (HEAD <short-sha>)
[retrieved page text]
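
As an illustration of the layout above, the path and body of the raw could be produced as follows. This is a sketch under assumptions: the `Excerpt` type, the function names, and the rule for turning a topic into a filename slug are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

DASH = "\u2014"  # the dash used in the per-excerpt header of the template


@dataclass(frozen=True)
class Excerpt:
    genome: str      # source genome, e.g. "genome-ops"
    page: str        # page path inside the source genome
    short_sha: str   # short sha of the source HEAD at collection time
    text: str        # page text, already flattened and free of private data


def raw_path(topic: str, day: date) -> str:
    """Build the dated raw path; the slug rule is an assumption."""
    slug = "-".join(topic.lower().split())
    return f"raw/articles/crossgen-{slug}-{day.isoformat()}.md"


def render_dossier(working: str, topic: str, day: date, excerpts: list[Excerpt]) -> str:
    """Render the header line followed by one section per excerpt."""
    lines = [
        f'> Cross-genome pull | Into: {working} | Query: "{topic}" | Date: {day.isoformat()}',
        "",
    ]
    for ex in excerpts:
        lines.append(f"## From {ex.genome} {DASH} {ex.page} (HEAD {ex.short_sha})")
        lines.append(ex.text.strip())
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    day = date(2026, 6, 10)
    excerpt = Excerpt("genome-ops", "wiki/concepts/x.md", "abc1234", "Example page text.")
    print(raw_path("rate limiting", day))
    print(render_dossier("genome-dev", "rate limiting", day, [excerpt]))
```

Including the date in the filename means each pull yields a new file instead of overwriting an earlier one, which matches the rule that raw is immutable.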

Rules:

Deterministic deposit. The raw is written by the collector (the skill's mechanical side), never edited by an agent — agents never create, modify, or delete files in any raw/. Each pull is a new, dated file (raw is immutable).
Distillation happens at ingest, once. The working genome's normal ingest turns the dossier into wiki pages and deduplicates against existing pages via its §Conflict procedure. There is no pre-summarization.
Bound large retrievals deterministically (top-N pages / relevant sections) rather than adding an LLM pass. This keeps the dossier raw and the ingest job manageable at any scale.
Optional (large + expensive-cloud deployments only): a cheap local pre-distillation may be inserted before an expensive cloud ingest to shrink its input. This is an opt-in optimization; the default is no synthesis.
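
The deterministic bound on large retrievals can be as simple as a stable top-N cut. The sketch below assumes each search hit is a (page, score) pair; how qmd actually reports results is not specified here, so that shape is hypothetical.

```python
def bound_results(hits: list[tuple[str, float]], top_n: int) -> list[str]:
    """Keep the top_n pages by score, breaking ties by page path.

    The tie-breaker makes the cut reproducible: the same hits always
    yield the same pages in the same order, with no LLM involved.
    """
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    ranked = sorted(hits, key=lambda hit: (-hit[1], hit[0]))
    return [page for page, _score in ranked[:top_n]]


if __name__ == "__main__":
    hits = [("wiki/b.md", 0.9), ("wiki/a.md", 0.9), ("wiki/c.md", 0.5)]
    print(bound_results(hits, top_n=2))
```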
