diff --git a/README.md b/README.md
index 96fe9c2..c0521a3 100644
--- a/README.md
+++ b/README.md
@@ -580,6 +580,17 @@ This means: any file matching `**/private/**` in `.gitattributes` is protected,
 including future `private/` directories created anywhere in the repo. The hook never
 needs updating when the encryption rules change.
 
+### Untrusted agent output — manifest validation
+
+The ingest agent's output is stochastic: a hallucinated manifest could carry a missing field,
+a wrong type, or a malicious path such as `wiki/../../etc/passwd`. `run-ingest.sh` therefore
+**validates the manifest before trusting any field** — it must be well-formed JSON with a
+string `raw_source` and an array `pages`, and **every `path` must be a string under `wiki/`
+with no `..`**. Anything else fails fast with a structured `{"status":"error"}` and no
+filesystem access outside the wiki, so a bad path can't drive a read or a lint outside the
+knowledge tree. This is the trust boundary between the (stochastic) model and the
+(deterministic, tested) post-processor.
+
 ### PRIVATE_CONTEXT toggle
 
 The `PRIVATE_CONTEXT` toggle in `AGENTS.md` controls whether the LLM agent
@@ -753,9 +764,9 @@ For Forgejo webhook → automated ingest:
 1. Forgejo sends webhook on push to `raw/`
 2. n8n receives webhook, identifies new files
 3. n8n starts one agent session per new file (sequential, not parallel)
-4. Each session: inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
-5. Phase 1 agent (`/skill:ingest`) writes the manifest; Phase 2 `run-ingest.sh` opens the PR
-6. Human reviews and merges PR
+4. Each session: realign the checkout to the base (`git switch && git reset --hard origin/`), then inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
+5. Phase 1 agent (`/skill:ingest`) writes the manifest; Phase 2 `run-ingest.sh` opens the PR, then **stops**
+6. Human reviews — **merge to accept**, or close the PR + delete the `feat` branch to reject
 
 ---
 
@@ -778,15 +789,21 @@ only (no shell). It:
 6. Writes `.ingest-manifest.json` (the list of pages it created/modified, the model name, a
    one-line reasoning, the PR summary, and any contradictions) — then **stops**
 
-**Phase 2 — `run-ingest.sh` (deterministic, outside the agent).** The post-processor
-consumes the manifest and does the mechanical work the model must not waste context on:
+**Phase 2 — `run-ingest.sh` (deterministic, outside the agent).** The post-processor first
+**validates the manifest** — well-formed JSON, expected shape, and every page path confined to
+`wiki/` with no `..` (see [Security Model](#security-model)) — then does the mechanical work the
+model must not waste context on:
 
-7. Inserts each page into the correct `wiki/index.md` section **in alphabetical order**
-   (`index-append.py`) and bumps the index `last_updated`
-8. Appends the `INGEST | ` entry to `wiki/log.md`
+7. Inserts each page into the correct `wiki/index.md` section **in alphabetical order**,
+   deduplicated by wikilink (a re-ingest updates the entry, never duplicates it), and bumps the
+   index `last_updated` (`index-append.py`)
+8. Appends the `INGEST | ` entry to `wiki/log.md` (the model name comes from the
+   orchestrator via `INGEST_MODEL` — the agent cannot reliably know its own tag)
 9. Runs scoped lint on exactly the pages touched this run (`scoped-lint.sh`, reusing
    `lib/lint.sh`)
-10. Commits on `feat/ai-ingest-` and opens the PR using `templates/pr-description.md`
+10. Commits **only `wiki/`** on `feat/ai-ingest-` and opens a PR against the integration
+    base (`INGEST_BASE`, default `main`); the body matches the `templates/pr-description.md`
+    structure (Summary / Pages / Contradictions / Scoped Lint)
 11. Emits a single compact JSON line (status, slug, PR url, lint_clean, conflict) for n8n
 
 The agent never runs git, never edits the index/log mechanically, and never lints — those
@@ -802,6 +819,25 @@ For private sources (`PRIVATE_CONTEXT: enabled` required):
 - All output goes to `wiki/private/.md` only
 - PR title: `[PRIVATE] ingest: `
 
+**Branch lifecycle & the manual gate.** `run-ingest.sh` / `open-pr.sh` are deliberately
+"dumb": they create the `feat/ai-ingest-` branch, commit only `wiki/`, open the PR, and
+stop. They never reset, revert, or touch the integration branch — that lifecycle belongs to
+the orchestrator, around the human gate:
+
+- **Before each session** the orchestrator realigns the checkout to the base
+  (`git fetch && git switch && git reset --hard origin/`) — a reset of the _local_
+  checkout to match the remote, never a force-push to the shared branch.
+- **After the PR opens, everything stops** until a human approves: one source per session,
+  sequential, no new ingest until the pending PR is closed.
+- **Approve = merge. Reject = close the PR and delete the remote `feat` branch.** To undo an
+  already-merged ingest, open a _revert PR_ against the base — never rewrite history on a
+  shared branch.
+
+The PR base is configurable via `INGEST_BASE` (default `main`). Per-page `maturity` already
+encodes stability and tags/releases mark versioned snapshots, so `main` is the integration
+branch today. If a linked project later _consumes_ a genome, set `INGEST_BASE=develop` to
+buffer ingests on `develop` and cut manual `develop → main` releases — no code change.
+
 ### Query
 
 Triggered by an operator question.
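Review note on the manifest-validation rules added to the README in this diff: they can be illustrated with a small Python sketch. This is not the code in `run-ingest.sh`, which the diff does not show. The function name `validate_manifest`, the `reason` field in the error object, and the assumption that each entry of `pages` is an object carrying a `path` key are all hypothetical, chosen only to mirror the prose (well-formed JSON, string `raw_source`, array `pages`, every `path` a string under `wiki/` with no `..`).

```python
import json
from pathlib import PurePosixPath


def validate_manifest(text: str) -> dict:
    """Check an untrusted manifest string; return a status object.

    Hypothetical helper: the real validation lives in run-ingest.sh and
    may differ. Only the rules stated in the README are encoded here.
    """

    def error(reason: str) -> dict:
        # The "reason" field is an assumption; the README only names "status".
        return {"status": "error", "reason": reason}

    try:
        manifest = json.loads(text)
    except json.JSONDecodeError:
        return error("manifest is not well-formed JSON")

    if not isinstance(manifest, dict):
        return error("manifest must be a JSON object")
    if not isinstance(manifest.get("raw_source"), str):
        return error("raw_source must be a string")

    pages = manifest.get("pages")
    if not isinstance(pages, list):
        return error("pages must be an array")

    for page in pages:
        # Assumption: each page entry is an object with a "path" key.
        path = page.get("path") if isinstance(page, dict) else None
        if not isinstance(path, str):
            return error("every page path must be a string")
        # PurePosixPath never touches the filesystem and keeps ".." as a
        # separate component, so traversal attempts stay visible.
        parts = PurePosixPath(path).parts
        if len(parts) < 2 or parts[0] != "wiki" or ".." in parts:
            return error(f"path is not confined to wiki/: {path!r}")

    return {"status": "ok"}


if __name__ == "__main__":
    bad = '{"raw_source": "raw/a.md", "pages": [{"path": "wiki/../../etc/passwd"}]}'
    print(json.dumps(validate_manifest(bad)))
```

The check is purely lexical, so it rejects a bad path before any read or lint could be driven by it. It compares path components rather than searching the raw string for `..`, which is one possible reading of "no `..`"; a stricter substring check would also reject names such as `wiki/a..b.md`.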