feat: Document ingest security model and human-gated workflow

Matteo Cherubini 2026-06-05 12:02:18 +02:00
parent 2426b09b50
commit ab1141e132


@ -580,6 +580,17 @@ This means: any file matching `**/private/**` in `.gitattributes` is protected,
including future `private/` directories created anywhere in the repo.
The hook never needs updating when the encryption rules change.
### Untrusted agent output — manifest validation
The ingest agent's output is stochastic: a hallucinated manifest could carry a missing field,
a wrong type, or a malicious path such as `wiki/../../etc/passwd`. `run-ingest.sh` therefore
**validates the manifest before trusting any field** — it must be well-formed JSON with a
string `raw_source` and an array `pages`, and **every `path` must be a string under `wiki/`
with no `..`**. Anything else fails fast with a structured `{"status":"error"}` and no
filesystem access outside the wiki, so a bad path can't drive a read or a lint outside the
knowledge tree. This is the trust boundary between the (stochastic) model and the
(deterministic, tested) post-processor.
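
The checks described above can be sketched in Python. This is a minimal illustration, not the actual logic of `run-ingest.sh`: the function name `validate_manifest`, the `reason` field, and the assumption that each entry of `pages` is an object carrying a `path` key are all hypothetical.

```python
import json
from pathlib import PurePosixPath


def _error(reason: str) -> dict:
    # Structured failure; the "reason" field is an illustrative addition.
    return {"status": "error", "reason": reason}


def validate_manifest(text: str) -> dict:
    """Validate manifest text without touching the filesystem."""
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as exc:
        return _error(f"invalid JSON: {exc.msg}")
    if not isinstance(manifest, dict):
        return _error("manifest must be a JSON object")
    if not isinstance(manifest.get("raw_source"), str):
        return _error("raw_source must be a string")
    pages = manifest.get("pages")
    if not isinstance(pages, list):
        return _error("pages must be an array")

    paths = []
    for page in pages:
        # Assumption: each page is an object with a "path" key.
        path = page.get("path") if isinstance(page, dict) else None
        if not isinstance(path, str):
            return _error("every path must be a string")
        parts = PurePosixPath(path).parts
        # Must be a relative path below wiki/ and contain no ".." segment.
        if len(parts) < 2 or parts[0] != "wiki" or ".." in parts:
            return _error(f"path escapes wiki/: {path!r}")
        paths.append(path)
    return {"status": "ok", "paths": paths}
```

Checking path segments rather than substrings keeps a legitimate name such as `wiki/a..b.md` valid while still rejecting `wiki/../../etc/passwd` and absolute paths.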
### PRIVATE_CONTEXT toggle
The `PRIVATE_CONTEXT` toggle in `AGENTS.md` controls whether the LLM agent
@ -753,9 +764,9 @@ For Forgejo webhook → automated ingest:
1. Forgejo sends webhook on push to `raw/`
2. n8n receives webhook, identifies new files
3. n8n starts one agent session per new file (sequential, not parallel)
4. Each session: realign the checkout to the base (`git switch <base> && git reset --hard origin/<base>`), then inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
5. Phase 1 agent (`/skill:ingest`) writes the manifest; Phase 2 `run-ingest.sh` opens the PR, then **stops**
6. Human reviews: **merge to accept**, or close the PR and delete the `feat` branch to reject
---
@ -778,15 +789,21 @@ only (no shell). It:
6. Writes `.ingest-manifest.json` (the list of pages it created/modified, the model name,
   a one-line reasoning, the PR summary, and any contradictions) — then **stops**
**Phase 2 — `run-ingest.sh` (deterministic, outside the agent).** The post-processor first
**validates the manifest** — well-formed JSON, expected shape, and every page path confined to
`wiki/` with no `..` (see [Security Model](#security-model)) — then does the mechanical work the
model must not waste context on:

7. Inserts each page into the correct `wiki/index.md` section **in alphabetical order**,
   deduplicated by wikilink (a re-ingest updates the entry, never duplicates it), and bumps the
   index `last_updated` (`index-append.py`)
8. Appends the `INGEST | <slug>` entry to `wiki/log.md` (the model name comes from the
   orchestrator via `INGEST_MODEL` — the agent cannot reliably know its own tag)
9. Runs scoped lint on exactly the pages touched this run (`scoped-lint.sh`, reusing
   `lib/lint.sh`)
10. Commits **only `wiki/`** on `feat/ai-ingest-<slug>` and opens a PR against the integration
    base (`INGEST_BASE`, default `main`); the body matches the `templates/pr-description.md`
    structure (Summary / Pages / Contradictions / Scoped Lint)
11. Emits a single compact JSON line (status, slug, PR url, lint_clean, conflict) for n8n
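
The alphabetical, deduplicated insert in step 7 can be illustrated with a short Python sketch. It is not the code of `index-append.py`: the entry format `- [[slug]]: summary`, the helper names, and the choice to re-sort the whole section on every insert are assumptions made for illustration.

```python
import re

# Captures the link target inside [[...]], ignoring any |alias or #anchor.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def link_of(line: str):
    """Return the wikilink target of an index line, or None if it has none."""
    match = WIKILINK.search(line)
    return match.group(1).strip() if match else None


def upsert_entry(entries: list, slug: str, summary: str) -> list:
    """Insert or update one entry, keeping the section alphabetical."""
    new_line = f"- [[{slug}]]: {summary}"  # hypothetical entry format
    # Drop any existing entry for the same wikilink so a re-ingest updates it.
    kept = [entry for entry in entries if link_of(entry) != slug]
    kept.append(new_line)
    # Sort by link target, case-insensitively; lines without a link sort by text.
    return sorted(kept, key=lambda entry: (link_of(entry) or entry).casefold())
```

Because the entry is keyed on the wikilink rather than on the full line, ingesting the same source twice changes the summary in place and leaves the section length unchanged.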
The agent never runs git, never edits the index/log mechanically, and never lints — those
@ -802,6 +819,25 @@ For private sources (`PRIVATE_CONTEXT: enabled` required):
- All output goes to `wiki/private/<slug>.md` only
- PR title: `[PRIVATE] ingest: <slug>`
**Branch lifecycle & the manual gate.** `run-ingest.sh` / `open-pr.sh` are deliberately
"dumb": they create the `feat/ai-ingest-<slug>` branch, commit only `wiki/`, open the PR, and
stop. They never reset, revert, or touch the integration branch — that lifecycle belongs to
the orchestrator, around the human gate:
- **Before each session** the orchestrator realigns the checkout to the base
(`git fetch && git switch <base> && git reset --hard origin/<base>`) — a reset of the _local_
checkout to match the remote, never a force-push to the shared branch.
- **After the PR opens, everything stops** until a human approves: one source per session,
sequential, no new ingest until the pending PR is closed.
- **Approve = merge. Reject = close the PR and delete the remote `feat` branch.** To undo an
already-merged ingest, open a _revert PR_ against the base — never rewrite history on a
shared branch.
The PR base is configurable via `INGEST_BASE` (default `main`). Per-page `maturity` already
encodes stability and tags/releases mark versioned snapshots, so `main` is the integration
branch today. If a linked project later _consumes_ a genome, set `INGEST_BASE=develop` to
buffer ingests on `develop` and cut manual `develop → main` releases — no code change.
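
As a rough sketch of the orchestrator side, the realign step before each session might look like the following in Python. The function names are hypothetical, and reusing `INGEST_BASE` to pick the branch to realign is an assumption; the text above only states that it sets the PR base.

```python
import os
import subprocess


def realign_commands(base: str) -> list:
    """Git commands that reset the local checkout to match origin/<base>."""
    return [
        ["git", "fetch"],
        ["git", "switch", base],
        # Local reset only: nothing here pushes to the shared branch.
        ["git", "reset", "--hard", f"origin/{base}"],
    ]


def realign(base=None, run=subprocess.run):
    """Run the realign sequence, stopping at the first failing command."""
    # Assumption: the orchestrator reads the same INGEST_BASE variable.
    base = base or os.environ.get("INGEST_BASE", "main")
    for argv in realign_commands(base):
        run(argv, check=True)
```

Passing `run` as a parameter lets the sequence be exercised in tests without touching a real repository, and `check=True` makes a failed fetch or switch abort before the hard reset.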
### Query
Triggered by an operator question.