From 22239f4bb557a5c80df5aeee38b04385d067d119 Mon Sep 17 00:00:00 2001
From: Matteo Cherubini
Date: Wed, 10 Jun 2026 16:52:45 +0200
Subject: [PATCH 1/3] docs(crossgen): Streamline knowledge pull, remove agent synthesis step

---
 templates/agents-master.md | 39 ++++++++++++++++++--------------------
 1 file changed, 18 insertions(+), 21 deletions(-)

diff --git a/templates/agents-master.md b/templates/agents-master.md
index 3cf2c52..cd08be1 100644
--- a/templates/agents-master.md
+++ b/templates/agents-master.md
@@ -84,13 +84,14 @@ Genome-level operations are governed by the genome's `AGENTS.md`, not this file.

 Cross-genome knowledge moves by **pull, never push**: the genome you are working in draws material *in*; nothing is ever written into another genome. The cross-genome reading is performed by a deterministic collector **outside any agent's context**, so the agent still operates within ONE genome (Immutable Rule 1 holds). The `cross_source` registry flag decides which genomes may be read as sources.

+There is **no separate synthesis step**: retrieving and then distilling twice would only add LLM cost and lose information. The collector *retrieves* (like a search) and deposits the result as a raw; the working genome's own ingest *distills* it once, for this genome's needs.
+
 ### How it works

-Three actors, mirroring the ingest two-phase split:
+Two actors:

-1. **Collector** (`collect-crossgen.sh`, deterministic, agent-free). Clones each genome flagged `cross_source: yes` **read-only at its remote HEAD** — a disposable checkout, for freshness; never the pinned submodule state. Reads each `wiki/index.md` plus the relevant pages and assembles a **dossier of excerpts with provenance** (source genome, page, date/commit). Writes nothing to any source genome.
-2. **Synthesis** (agent, navigation skill, `read`/`edit` only). Reads **only the dossier** — a single artifact inside the working genome's context — then the skill deposits **one** abstract, non-private raw into the working genome at `raw/articles/crossgen--.md`, and STOPS.
-3. **Target ingest.** The working genome's own standard pipeline processes that raw → PR → human gate. Same gate as any other source.
+1. **Collector** (`collect-crossgen.sh`, deterministic, agent-free). Clones each genome flagged `cross_source: yes` **read-only at its remote HEAD** — a disposable checkout, for freshness; never the pinned submodule state. The clone is **keyless**, so `private/` stays an encrypted blob and is unreadable. It indexes the public wikis with `qmd`, runs `qmd search ""`, and assembles a **dossier**: the text of the matching pages plus per-excerpt provenance (source genome, page, HEAD short-sha, date), with every `[[wikilink]]` neutralized to plain text. It deposits the dossier as **one** raw in the working genome at `raw/articles/crossgen--.md`, commits, and pushes. Nothing is written to any source genome.
+2. **Target ingest.** The working genome's standard ingest reads that raw as an ordinary source and distills it into wiki pages for the local domain — one semantic pass → PR → human gate. Same gate as any other source.

 ### When to pull

@@ -104,34 +105,30 @@ If in doubt, do NOT pull. A missed cross-reference is cheaper than crossgen spam

 ### Boundaries (enforced by the master)

-- **Sources are restricted to `cross_source: yes` genomes.** A genome flagged `no` (e.g., a client / confidential file) is NEVER read as a source — the collector skips it physically. The wall decides what may flow; it does not rely on the agent's discipline.
+- **Sources are restricted to `cross_source: yes` genomes.** A genome flagged `no` (e.g., a client / confidential file) is NEVER read as a source — the collector skips it physically. The wall is structural, not a matter of the agent's discipline.
+- **Keyless collection.** The collector holds no git-crypt key, so `private/` stays ciphertext and cannot be read — privacy does not depend on the agent behaving.
 - **Sources are read-only, at HEAD.** No write, commit, branch, or PR in any genome other than the one being worked on.
 - **NEVER `git submodule update --remote`.** Read other genomes via disposable read-only clones — never by moving this master's submodule pointers (that is ASK FIRST).
-- **NEVER read `*/private/*`.** The skill runs `PRIVATE_CONTEXT: disabled` and `private/` is an encrypted blob; even on an unlocked host, private paths are off-limits.
-- Confidential / client genomes are normally isolated from cross-genome pulls entirely (operator policy). Whatever genome a pull runs into, the output raw must be abstract and non-private.
+- The deposited raw must contain **no wikilinks and no private data**; it is processed by the working genome's normal ingest + human gate.

 ### Output raw (the only artifact written)

 **Path (in the working genome):** `raw/articles/crossgen--.md`

-Plain text. No YAML frontmatter (raw is immutable input). **No wikilinks of any kind** — never a `[[../genome-*/...]]` path.
+Plain text. No YAML frontmatter (raw is immutable input). **No wikilinks of any kind** — `[[...]]` from source pages are flattened to plain text so they never become broken cross-references here.

 ```markdown
-> Cross-genome pull | Into: genome- | Sources: genome- (wiki/concepts/x.md), genome- (wiki/entities/y.md) | HEAD: | Date: YYYY-MM-DD
+> Cross-genome pull | Into: genome- | Query: "" | Date: YYYY-MM-DD

-# (synthesized from other genomes)
+## From genome- — wiki/concepts/.md (HEAD )
+[retrieved page text — wikilinks flattened to plain text, no private data]

-## What the source genomes say
-[Abstract, faithful synthesis of the relevant material. Plain text, no private data, no wikilinks.]
-
-## Relevance to this genome
-[Why it matters in the working domain; textual references to existing local entities, if any.]
-
-## Suggested local action
-[Semantic hint for this genome's ingest: e.g., create/update wiki/concepts/.md, map local relationships.]
+## From genome- — wiki/entities/.md (HEAD )
+[retrieved page text]
 ```

 **Rules:**
-- Each pull writes a **new, dated** crossgen file — never overwrite or edit an existing raw (raw is immutable). Deduplication happens later, at the **wiki** level: the working genome's normal ingest reconciles against existing pages via its §Conflict procedure.
-- The raw is processed by the working genome's standard ingest as an ordinary `raw/articles/` source — no special path.
-- The collector and the raw deposit are the **deterministic** side of the skill; the agent only synthesizes content. Agents never create, modify, or delete files in any `raw/` directly.
+- **Deterministic deposit.** The raw is written by the collector (the skill's mechanical side), never edited by an agent — agents never create, modify, or delete files in any `raw/`. Each pull is a **new, dated** file (raw is immutable).
+- **Distillation happens at ingest, once.** The working genome's normal ingest turns the dossier into wiki pages and **deduplicates against existing pages** via its §Conflict procedure. There is no pre-summarization.
+- **Bound large retrievals deterministically** (top-N pages / relevant sections) rather than adding an LLM pass — keeps the dossier-raw and the ingest job reasonable at any scale.
+- *Optional (large + expensive-cloud deployments only):* a cheap **local** pre-distillation may be inserted before an expensive cloud ingest to shrink its input. This is an opt-in optimization; the default is no synthesis.
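
Reviewer sketch for PATCH 1/3: the patch states that every `[[wikilink]]` is neutralized to plain text before the dossier is deposited, but does not show how. The snippet below is one possible way to do that with `sed`, assuming only the two common forms `[[target]]` and `[[target|label]]`. The function name `flatten_wikilinks` is hypothetical and is not taken from `collect-crossgen.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the patch.
# [[target|label]] becomes "label"; [[target]] becomes "target".
flatten_wikilinks() {
    sed -E \
        -e 's/\[\[([^]|]*)\|([^]]*)\]\]/\2/g' \
        -e 's/\[\[([^]]*)\]\]/\1/g'
}

# Reads stdin, writes the flattened text to stdout.
printf '%s\n' 'See [[concepts/x|the X concept]] and [[entities/y]].' | flatten_wikilinks
# prints: See the X concept and entities/y.
```

The labelled form is handled first so that the label, not the link target, survives. Text without wikilinks passes through unchanged, so the filter could be applied to whole pages.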
From bad41d63132a5bd00a50056d44ad974736ce6cc1 Mon Sep 17 00:00:00 2001
From: Matteo Cherubini
Date: Wed, 10 Jun 2026 17:20:02 +0200
Subject: [PATCH 2/3] feat: Introduce `cross_source` flag for genome registry entries

---
 registry.sh              | 12 +++++++-----
 scripts/add-genome.sh    | 18 ++++++++++++++----
 scripts/setup-genomes.sh |  7 ++++---
 3 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/registry.sh b/registry.sh
index 596c462..2c20200 100644
--- a/registry.sh
+++ b/registry.sh
@@ -19,13 +19,15 @@ LIB_DIR="${PROJECT_ROOT}/lib"
 PROVIDERS_DIR="${PROJECT_ROOT}/providers"

 # --- GENOME REGISTRY ---
-# Format: "name|description|linked_repo"
-# - linked_repo is OPTIONAL. Leave empty (trailing pipe) for knowledge-only genomes.
+# Format: "name|description|linked_repo|cross_source"
+# - linked_repo: OPTIONAL. Leave empty for knowledge-only genomes.
+# - cross_source: "yes" or "no" (default: no). Controls whether the collector
+#   may read this genome as a source during cross-genome pulls.
 #
 # HOW TO CUSTOMIZE:
 # Replace the placeholder below with your actual genome domains.
-# Example: "genome-work|Work notes and architecture logs|"
-#          "genome-finance|Personal finance|user/repo-finance"
+# Example: "genome-work|Work notes and architecture logs||no"
+#          "genome-finance|Personal finance|user/repo-finance|no"
 GENOMES=(
-    "genome-example|Template genome description for knowledge management|"
+    "genome-example|Template genome description for knowledge management||no"
 )
diff --git a/scripts/add-genome.sh b/scripts/add-genome.sh
index dc4dd6c..0bf6b54 100644
--- a/scripts/add-genome.sh
+++ b/scripts/add-genome.sh
@@ -11,18 +11,28 @@ source "registry.sh"

 GENOME_NAME="${1:-}"
 GENOME_DESC="${2:-}"
-GENOME_LINKED="${3:-}" # optional: linked project repo reference
+GENOME_LINKED="${3:-}"          # optional: linked project repo reference
+GENOME_CROSS_SOURCE="${4:-no}"  # optional: cross_source flag (default: no)

+# 1. Check mandatory arguments first
 if [[ -z "$GENOME_NAME" || -z "$GENOME_DESC" ]]; then
     error "Missing arguments."
-    echo "Usage: $0 [linked-repo]"
+    echo "Usage: $0 [linked-repo] [cross_source]"
+    echo " cross_source: yes|no (default: no)"
+    exit 1
+fi
+
+# 2. Then validate the flag if a non-default value was passed
+if [[ "$GENOME_CROSS_SOURCE" != "yes" && "$GENOME_CROSS_SOURCE" != "no" ]]; then
+    error "Invalid cross_source value: $GENOME_CROSS_SOURCE"
+    echo "cross_source must be 'yes' or 'no'"
     exit 1
 fi

 step "Adding New Genome: ${GENOME_NAME}"

-# Build a 3-field registry entry (linked_repo may be empty)
-GENOMES=("${GENOME_NAME}|${GENOME_DESC}|${GENOME_LINKED}")
+# Build a 4-field registry entry (linked_repo may be empty, cross_source defaults to no)
+GENOMES=("${GENOME_NAME}|${GENOME_DESC}|${GENOME_LINKED}|${GENOME_CROSS_SOURCE}")

 # NOTE — Maintenance smell
 # We source setup-genomes.sh as a library/orchestrator hybrid. This works because:
diff --git a/scripts/setup-genomes.sh b/scripts/setup-genomes.sh
index b30af18..f3aa6cd 100644
--- a/scripts/setup-genomes.sh
+++ b/scripts/setup-genomes.sh
@@ -20,10 +20,11 @@ step "Processing Genome Registry"

 for entry in "${GENOMES[@]}"; do
     # 3-field format: name|description|linked_repo (linked_repo optional → may be empty)
-    IFS='|' read -r GENOME_NAME GENOME_DESC GENOME_LINKED <<< "$entry"
-    export GENOME_NAME GENOME_DESC GENOME_LINKED
+    IFS='|' read -r GENOME_NAME GENOME_DESC GENOME_LINKED GENOME_CROSS_SOURCE <<< "$entry"
+    GENOME_CROSS_SOURCE="${GENOME_CROSS_SOURCE:-no}"
+    export GENOME_NAME GENOME_DESC GENOME_LINKED GENOME_CROSS_SOURCE

-    info "Processing: ${GENOME_NAME}..."
+    info "Processing: ${GENOME_NAME} (cross_source: ${GENOME_CROSS_SOURCE})..."

     # 1. Remote Creation (Idempotent)
     provider_create_repo "${GENOME_NAME}" "${GENOME_DESC}" "true"
From 678c203fde466814b29bbb03dfcc9fe69c0cf0bf Mon Sep 17 00:00:00 2001
From: Matteo Cherubini
Date: Thu, 11 Jun 2026 10:25:11 +0200
Subject: [PATCH 3/3] Update version

---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index bcde255..26c149c 100644
--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # =============================================================================
-# Knowledge Genome - Makefile v. 1.1.5
+# Knowledge Genome - Makefile v. 1.1.6
 # Orchestrates the setup and management of the knowledge base.
 # =============================================================================
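
Reviewer sketch for PATCH 2/3: the 4-field registry format and its `no` default can be exercised in isolation. The snippet below assumes the `name|description|linked_repo|cross_source` layout from `registry.sh` and the same `IFS='|' read` parsing used in `setup-genomes.sh`. The function `list_cross_sources` and the sample entries are hypothetical and are not part of the patch.

```shell
#!/usr/bin/env bash
# Hypothetical sample registry; the last entry uses the legacy 3-field form.
GENOMES=(
    "genome-work|Work notes||yes"
    "genome-client|Client file|user/repo-client|no"
    "genome-legacy|Old 3-field entry|"
)

# Print the name of every genome that may be read as a cross-genome source.
list_cross_sources() {
    local entry name desc linked cross
    for entry in "${GENOMES[@]}"; do
        IFS='|' read -r name desc linked cross <<< "$entry"
        cross="${cross:-no}"    # a missing 4th field falls back to "no"
        if [[ "$cross" == "yes" ]]; then
            printf '%s\n' "$name"
        fi
    done
}

list_cross_sources
# prints: genome-work
```

Because `|` is a non-whitespace separator, `read` keeps the empty `linked_repo` field instead of collapsing it, so `cross_source` always lands in the fourth variable. A legacy 3-field entry leaves that variable empty and is treated as `no`, which matches the fallback added in `setup-genomes.sh`.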