feat: Document ingest security model and human-gated workflow

2026-06-05 12:02:18 +02:00 · 2026-06-05 12:02:18 +02:00 · ab1141e132
commit ab1141e132
parent 2426b09b50
1 changed file with 45 additions and 9 deletions
--- a/README.md
+++ b/README.md
@@ -580,6 +580,17 @@ This means: any file matching `**/private/**` in `.gitattributes` is protected,
 including future `private/` directories created anywhere in the repo.
 The hook never needs updating when the encryption rules change.

+### Untrusted agent output — manifest validation
+
+The ingest agent's output is stochastic: a hallucinated manifest could omit a field, use the
+wrong type, or carry a malicious path such as `wiki/../../etc/passwd`. `run-ingest.sh` therefore
+**validates the manifest before trusting any field** — it must be well-formed JSON with a
+string `raw_source` and an array `pages`, and **every `path` must be a string under `wiki/`
+with no `..`**. Anything else fails fast with a structured `{"status":"error"}` and no
+filesystem access outside the wiki, so a bad path can't drive a read or a lint outside the
+knowledge tree. This is the trust boundary between the (stochastic) model and the
+(deterministic, tested) post-processor.
+
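
The validation rules above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual `run-ingest.sh` logic: the function name `validate_manifest`, the `reason` key, and the assumption that each entry of `pages` is an object with a `path` key are hypothetical. It also checks `..` per path component, which may be looser or stricter than the real check.

```python
import json
from pathlib import PurePosixPath


def _error(reason: str) -> dict:
    # Structured failure, mirroring the {"status":"error"} shape described above.
    # The "reason" key is an assumption added for illustration.
    return {"status": "error", "reason": reason}


def validate_manifest(text: str) -> dict:
    """Check manifest shape and path confinement before any field is trusted."""
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as exc:
        return _error(f"invalid JSON: {exc.msg}")
    if not isinstance(manifest, dict):
        return _error("manifest must be a JSON object")
    if not isinstance(manifest.get("raw_source"), str):
        return _error("raw_source must be a string")
    pages = manifest.get("pages")
    if not isinstance(pages, list):
        return _error("pages must be an array")
    for index, page in enumerate(pages):
        # Assumption: each page is an object carrying a "path" key.
        path = page.get("path") if isinstance(page, dict) else None
        if not isinstance(path, str):
            return _error(f"pages[{index}].path must be a string")
        parts = PurePosixPath(path).parts
        # Accept only relative paths of the form wiki/<...> with no ".." component.
        if len(parts) < 2 or parts[0] != "wiki" or ".." in parts:
            return _error(f"pages[{index}].path is not confined to wiki/: {path!r}")
    return {"status": "ok"}
```

The check runs on the path string alone and never touches the filesystem, so a rejected manifest cannot trigger a read outside `wiki/`.
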
 ### PRIVATE_CONTEXT toggle

 The `PRIVATE_CONTEXT` toggle in `AGENTS.md` controls whether the LLM agent
@@ -753,9 +764,9 @@ For Forgejo webhook → automated ingest:
 1. Forgejo sends webhook on push to `raw/`
 2. n8n receives webhook, identifies new files
 3. n8n starts one agent session per new file (sequential, not parallel)
-4. Each session: inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
-5. Phase 1 agent (`/skill:ingest`) writes the manifest; Phase 2 `run-ingest.sh` opens the PR
-6. Human reviews and merges PR
+4. Each session: realign the checkout to the base (`git fetch && git switch <base> && git reset --hard origin/<base>`), then inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
+5. Phase 1 agent (`/skill:ingest`) writes the manifest; Phase 2 `run-ingest.sh` opens the PR, then **stops**
+6. Human reviews — **merge to accept**, or close the PR + delete the `feat` branch to reject

 ---

@@ -778,15 +789,21 @@ only (no shell). It:
 6. Writes `.ingest-manifest.json` (the list of pages it created/modified, the model name,
   a one-line reasoning, the PR summary, and any contradictions) — then **stops**

-**Phase 2 — `run-ingest.sh` (deterministic, outside the agent).** The post-processor
-consumes the manifest and does the mechanical work the model must not waste context on:
+**Phase 2 — `run-ingest.sh` (deterministic, outside the agent).** The post-processor first
+**validates the manifest** — well-formed JSON, expected shape, and every page path confined to
+`wiki/` with no `..` (see [Security Model](#security-model)) — then does the mechanical work the
+model must not waste context on:

-7. Inserts each page into the correct `wiki/index.md` section **in alphabetical order**
-   (`index-append.py`) and bumps the index `last_updated`
-8. Appends the `INGEST | <slug>` entry to `wiki/log.md`
+7. Inserts each page into the correct `wiki/index.md` section **in alphabetical order**,
+   deduplicated by wikilink (a re-ingest updates the entry, never duplicates it), and bumps the
+   index `last_updated` (`index-append.py`)
+8. Appends the `INGEST | <slug>` entry to `wiki/log.md` (the model name comes from the
+   orchestrator via `INGEST_MODEL` — the agent cannot reliably know its own tag)
 9. Runs scoped lint on exactly the pages touched this run (`scoped-lint.sh`, reusing
   `lib/lint.sh`)
-10. Commits on `feat/ai-ingest-<slug>` and opens the PR using `templates/pr-description.md`
+10. Commits **only `wiki/`** on `feat/ai-ingest-<slug>` and opens a PR against the integration
+    base (`INGEST_BASE`, default `main`); the body matches the `templates/pr-description.md`
+    structure (Summary / Pages / Contradictions / Scoped Lint)
 11. Emits a single compact JSON line (status, slug, PR url, lint_clean, conflict) for n8n
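
Step 7 (alphabetical insert, deduplicated by wikilink) could look roughly like the sketch below. It is not the real `index-append.py`: the entry format `- [[slug]] description`, the helper names `link_key` and `upsert_entry`, and the case-insensitive ordering are all assumptions made for illustration.

```python
import re

# Captures the link target of the first [[wikilink]] on a line,
# stopping before any "|alias" or "#anchor" suffix.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def link_key(line: str) -> str:
    """Return a normalised wikilink target used for both dedup and sorting."""
    match = WIKILINK.search(line)
    return (match.group(1) if match else line).strip().lower()


def upsert_entry(entries: list[str], new_entry: str) -> list[str]:
    """Insert new_entry into one index section, replacing any entry with the same wikilink."""
    key = link_key(new_entry)
    kept = [entry for entry in entries if link_key(entry) != key]
    kept.append(new_entry)
    return sorted(kept, key=link_key)


section = ["- [[alpha]] first", "- [[gamma]] third"]
section = upsert_entry(section, "- [[beta]] second")
section = upsert_entry(section, "- [[alpha]] updated")
```

Because the existing entry is dropped before the new one is added, running the same ingest twice leaves one entry per wikilink.
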

 The agent never runs git, never edits the index/log mechanically, and never lints — those
@@ -802,6 +819,25 @@ For private sources (`PRIVATE_CONTEXT: enabled` required):
 - All output goes to `wiki/private/<slug>.md` only
 - PR title: `[PRIVATE] ingest: <slug>`

+**Branch lifecycle & the manual gate.** `run-ingest.sh` / `open-pr.sh` are deliberately
+"dumb": they create the `feat/ai-ingest-<slug>` branch, commit only `wiki/`, open the PR, and
+stop. They never reset, revert, or touch the integration branch — that lifecycle belongs to
+the orchestrator, around the human gate:
+
+- **Before each session** the orchestrator realigns the checkout to the base
+  (`git fetch && git switch <base> && git reset --hard origin/<base>`) — a reset of the _local_
+  checkout to match the remote, never a force-push to the shared branch.
+- **After the PR opens, everything stops** until a human decides: one source per session,
+  sequential, no new ingest until the pending PR is merged or closed.
+- **Approve = merge. Reject = close the PR and delete the remote `feat` branch.** To undo an
+  already-merged ingest, open a _revert PR_ against the base — never rewrite history on a
+  shared branch.
+
+The PR base is configurable via `INGEST_BASE` (default `main`). Per-page `maturity` already
+encodes stability and tags/releases mark versioned snapshots, so `main` is the integration
+branch today. If a linked project later _consumes_ a genome, set `INGEST_BASE=develop` to
+buffer ingests on `develop` and cut manual `develop → main` releases — no code change.
+
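
As a rough sketch of the orchestrator side of this lifecycle, the helpers below only build the git command lines and leave execution to the caller. The function names are hypothetical, and closing the PR itself (a Forgejo operation) is deliberately left out. The remote name `origin` and the `feat/ai-ingest-<slug>` pattern are taken from the text above.

```python
import os


def ingest_base(env=None) -> str:
    """Resolve the PR base from INGEST_BASE, falling back to main."""
    env = os.environ if env is None else env
    return env.get("INGEST_BASE") or "main"


def realign_commands(base: str) -> list[list[str]]:
    """Commands that reset the local checkout to match the remote base (no force-push)."""
    return [
        ["git", "fetch"],
        ["git", "switch", base],
        ["git", "reset", "--hard", f"origin/{base}"],
    ]


def reject_commands(slug: str) -> list[list[str]]:
    """Command that deletes the remote feat branch after a rejected PR is closed."""
    return [["git", "push", "origin", "--delete", f"feat/ai-ingest-{slug}"]]
```

An orchestrator could run each entry with `subprocess.run(cmd, check=True)` so that a failed realign aborts the session before the agent starts.
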
 ### Query

 Triggered by an operator question.