feat: Document ingest security model and human-gated workflow

Matteo Cherubini 2026-06-05 12:02:18 +02:00
parent 2426b09b50
commit ab1141e132


@@ -580,6 +580,17 @@ This means: any file matching `**/private/**` in `.gitattributes` is protected,
including future `private/` directories created anywhere in the repo.
The hook never needs updating when the encryption rules change.
### Untrusted agent output — manifest validation
The ingest agent's output is stochastic: a hallucinated manifest could carry a missing field,
a wrong type, or a malicious path such as `wiki/../../etc/passwd`. `run-ingest.sh` therefore
**validates the manifest before trusting any field** — it must be well-formed JSON with a
string `raw_source` and an array `pages`, and **every `path` must be a string under `wiki/`
with no `..`**. Anything else fails fast with a structured `{"status":"error"}` and no
filesystem access outside the wiki, so a bad path can't drive a read or a lint outside the
knowledge tree. This is the trust boundary between the (stochastic) model and the
(deterministic, tested) post-processor.
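The checks described above could be sketched as follows. This is a minimal illustration, not the actual `run-ingest.sh` logic: the function name `validate_manifest`, the `reason` field, and the assumption that each entry in `pages` is an object carrying a `path` key are all hypothetical.

```python
import json
from pathlib import PurePosixPath


def _error(reason: str) -> dict:
    # Structured failure; the "reason" field is an illustrative assumption.
    return {"status": "error", "reason": reason}


def validate_manifest(text: str) -> dict:
    """Validate untrusted manifest text before any field is used."""
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as exc:
        return _error(f"invalid JSON: {exc.msg}")
    if not isinstance(manifest, dict):
        return _error("manifest must be a JSON object")
    if not isinstance(manifest.get("raw_source"), str):
        return _error("raw_source must be a string")
    pages = manifest.get("pages")
    if not isinstance(pages, list):
        return _error("pages must be an array")

    safe_paths = []
    for page in pages:
        # Assumption: each page entry is an object with a "path" key.
        path = page.get("path") if isinstance(page, dict) else None
        if not isinstance(path, str):
            return _error("every path must be a string")
        parts = PurePosixPath(path).parts
        # Must sit strictly under wiki/ and contain no parent-directory step.
        if len(parts) < 2 or parts[0] != "wiki" or ".." in parts:
            return _error(f"path escapes wiki/: {path!r}")
        safe_paths.append(path)
    return {"status": "ok", "pages": safe_paths}
```

`PurePosixPath` does not collapse `..` components, so a path such as `wiki/../../etc/passwd` keeps its `..` parts and is rejected, and an absolute path fails the `parts[0] != "wiki"` check because its first part is `/`.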
### PRIVATE_CONTEXT toggle
The `PRIVATE_CONTEXT` toggle in `AGENTS.md` controls whether the LLM agent
@@ -753,9 +764,9 @@ For Forgejo webhook → automated ingest:
1. Forgejo sends webhook on push to `raw/`
2. n8n receives webhook, identifies new files
3. n8n starts one agent session per new file (sequential, not parallel)
4. Each session: inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
5. Phase 1 agent (`/skill:ingest`) writes the manifest; Phase 2 `run-ingest.sh` opens the PR
6. Human reviews and merges PR
4. Each session: realign the checkout to the base (`git switch <base> && git reset --hard origin/<base>`), then inject `tail -n 20 wiki/log.md` + `PRIVATE_CONTEXT` state + source path
5. Phase 1 agent (`/skill:ingest`) writes the manifest; Phase 2 `run-ingest.sh` opens the PR, then **stops**
6. Human reviews: **merge to accept**, or close the PR and delete the `feat` branch to reject
---
@@ -778,15 +789,21 @@ only (no shell). It:
6. Writes `.ingest-manifest.json` (the list of pages it created/modified, the model name,
a one-line reasoning, the PR summary, and any contradictions) — then **stops**
**Phase 2 — `run-ingest.sh` (deterministic, outside the agent).** The post-processor
consumes the manifest and does the mechanical work the model must not waste context on:
**Phase 2 — `run-ingest.sh` (deterministic, outside the agent).** The post-processor first
**validates the manifest** — well-formed JSON, expected shape, and every page path confined to
`wiki/` with no `..` (see [Security Model](#security-model)) — then does the mechanical work the
model must not waste context on:
7. Inserts each page into the correct `wiki/index.md` section **in alphabetical order**
(`index-append.py`) and bumps the index `last_updated`
8. Appends the `INGEST | <slug>` entry to `wiki/log.md`
7. Inserts each page into the correct `wiki/index.md` section **in alphabetical order**,
deduplicated by wikilink (a re-ingest updates the entry, never duplicates it), and bumps the
index `last_updated` (`index-append.py`)
8. Appends the `INGEST | <slug>` entry to `wiki/log.md` (the model name comes from the
orchestrator via `INGEST_MODEL` — the agent cannot reliably know its own tag)
9. Runs scoped lint on exactly the pages touched this run (`scoped-lint.sh`, reusing
`lib/lint.sh`)
10. Commits on `feat/ai-ingest-<slug>` and opens the PR using `templates/pr-description.md`
10. Commits **only `wiki/`** on `feat/ai-ingest-<slug>` and opens a PR against the integration
base (`INGEST_BASE`, default `main`); the body matches the `templates/pr-description.md`
structure (Summary / Pages / Contradictions / Scoped Lint)
11. Emits a single compact JSON line (status, slug, PR url, lint_clean, conflict) for n8n
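The alphabetical, deduplicated index update from step 7 could look roughly like this. It is a sketch only: the real `index-append.py` is not shown here, the entry format `- [[slug]]: description` is an assumption, and `upsert_entry` is a hypothetical name.

```python
import re

# Captures the link target of the first [[wikilink]] on a line.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def _link(line: str) -> str:
    match = WIKILINK.search(line)
    if match is None:
        raise ValueError(f"no wikilink in index line: {line!r}")
    return match.group(1).strip()


def upsert_entry(section_lines: list[str], entry: str) -> list[str]:
    """Insert `entry` into one index section, replacing any line that
    carries the same wikilink, and return the section sorted by link."""
    key = _link(entry)
    kept = [line for line in section_lines if _link(line) != key]
    kept.append(entry)
    return sorted(kept, key=lambda line: _link(line).casefold())
```

With a section holding `- [[alpha]]: first` and `- [[gamma]]: third`, upserting `- [[beta]]: second` places it between the two, while upserting `- [[gamma]]: updated` replaces the existing `gamma` line instead of adding a second one.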
The agent never runs git, never edits the index/log mechanically, and never lints — those
@@ -802,6 +819,25 @@ For private sources (`PRIVATE_CONTEXT: enabled` required):
- All output goes to `wiki/private/<slug>.md` only
- PR title: `[PRIVATE] ingest: <slug>`
**Branch lifecycle & the manual gate.** `run-ingest.sh` / `open-pr.sh` are deliberately
"dumb": they create the `feat/ai-ingest-<slug>` branch, commit only `wiki/`, open the PR, and
stop. They never reset, revert, or touch the integration branch — that lifecycle belongs to
the orchestrator, around the human gate:
- **Before each session** the orchestrator realigns the checkout to the base
(`git fetch && git switch <base> && git reset --hard origin/<base>`) — a reset of the _local_
checkout to match the remote, never a force-push to the shared branch.
- **After the PR opens, everything stops** until a human approves: one source per session,
sequential, no new ingest until the pending PR is merged or closed.
- **Approve = merge. Reject = close the PR and delete the remote `feat` branch.** To undo an
already-merged ingest, open a _revert PR_ against the base — never rewrite history on a
shared branch.
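On the orchestrator side, the pre-session realignment could be wrapped as below. This is a hedged sketch: the helper names `realign_commands` and `realign` are hypothetical, and the check on the branch name is an added precaution rather than something the source describes. Only the three git commands themselves come from the text above.

```python
import subprocess


def realign_commands(base: str = "main") -> list[list[str]]:
    """Return the git invocations that reset the local checkout to origin/<base>."""
    if not base or base.startswith("-"):
        # Precaution (assumption): never let a branch name be parsed as a flag.
        raise ValueError(f"suspicious base branch name: {base!r}")
    return [
        ["git", "fetch"],
        ["git", "switch", base],
        ["git", "reset", "--hard", f"origin/{base}"],
    ]


def realign(base: str = "main", run=subprocess.run) -> None:
    """Run the realignment; `run` is injectable so the sequence can be tested."""
    for argv in realign_commands(base):
        # check=True stops the sequence on the first failing command.
        run(argv, check=True)
```

The reset only rewrites the local working copy. Nothing here pushes, so the shared branch is never force-updated.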
The PR base is configurable via `INGEST_BASE` (default `main`). Per-page `maturity` already
encodes stability and tags/releases mark versioned snapshots, so `main` is the integration
branch today. If a linked project later _consumes_ a genome, set `INGEST_BASE=develop` to
buffer ingests on `develop` and cut manual `develop → main` releases — no code change.
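Resolving the PR target from the environment might be sketched like this. The function name `pr_target` and the returned dictionary shape are hypothetical, and treating an empty `INGEST_BASE` as unset is an assumption.

```python
import os
from typing import Mapping


def pr_target(slug: str, env: Mapping[str, str] = os.environ) -> dict:
    """Return the head and base branches for an ingest PR."""
    # Assumption: an empty INGEST_BASE falls back to the default, like an unset one.
    base = env.get("INGEST_BASE") or "main"
    return {"head": f"feat/ai-ingest-{slug}", "base": base}
```

Called with `INGEST_BASE=develop` in the environment, the same code targets `develop` without any other change, which is the switch the paragraph above describes.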
### Query
Triggered by an operator question.