knowledge-genome-orchestrator/skills/ingest/SKILL.md

92 lines
4.7 KiB
Markdown

---
name: ingest
description: Semantic pass of a single raw source into the current genome's wiki. The model ONLY extracts structured semantic content (summary, entities, concepts, contradictions) and returns one JSON object — it does not write files, produce frontmatter, slugs, git, index, log or PRs. A deterministic conform script (ingest-semantic.py) turns that JSON into properly-structured wiki pages + a manifest; run-ingest.sh then does index/log/lint/PR.
license: see repository
compatibility: Driven by scripts/ingest-semantic.py (one schema-constrained call to a local model via Ollama /api/chat). NO agent tools are used — no read, no edit, no bash. The model never touches the filesystem. PRIVATE_CONTEXT must be disabled.
metadata:
framework: knowledge-genome
phase: "1-ingest-semantic"
mode: structured-json # lightweight agent + deterministic conform
---
# Ingest — semantic pass (structured-JSON)
This is the **light** semantic pass. The model's only job is to read one source
and return a single JSON object describing what the source contains. It does
**not** write files, choose paths, produce frontmatter, pick slugs, or touch
git / index / log / PRs. All structure is owned by `scripts/ingest-semantic.py`,
which conforms the model's JSON into wiki pages with enforced kebab-case paths
and frontmatter, and writes `.ingest-manifest.json` in the exact schema
`run-ingest.sh` consumes. This keeps the agent minimal and makes the output
impossible to mis-shape, regardless of how small or quirky the local model is.
Pipeline:
cd <genome checkout>
scripts/ingest-semantic.py <genome> raw/articles/<file>.md # phase 1 (this)
scripts/run-ingest.sh <genome> # phase 2 (deterministic)
## Pre-flight (enforced by ingest-semantic.py, not by the model)
1. Refuse if the source path is under any `private/` directory.
2. Refuse if `PRIVATE_CONTEXT` is not `disabled`.
3. Confirm the file exists under `raw/` and is non-empty.
## What the model returns (the only contract)
A single JSON object, decoding-constrained to this shape via Ollama's `format`:
```json
{
"source_title": "Human title of the source",
"source_summary": "Faithful, self-contained prose summary of the source.",
"key_points": ["Concrete fact or claim worth indexing", "..."],
"entities": [
{ "name": "Acme", "kind": "org", "description": "Vendor referenced by the source." }
],
"concepts": [
{ "name": "JWT RS256", "description": "Asymmetric token signing scheme the source uses." }
],
"contradictions": [
{ "concept": "auth", "description": "Source claims X, contradicting the existing claim Y." }
],
"reasoning": "One sentence for the log: what this source adds.",
"pr_summary": "One or two sentences describing this ingest for the PR."
}
```
Field rules (guidance for the model; the script enforces _structure_):
- `source_summary` is faithful and in the source's own language. No markdown
headings inside any description field. No padding.
- `entities` = every person, tool, org or product the source names. `kind`
`person|tool|org|product`. `description` = one or two factual sentences.
- `concepts` = every pattern, theory, decision or named idea the source explains.
- `contradictions` = only a claim that directly contradicts a widely-known fact
or contradicts the source itself; otherwise an empty list.
- Names are the natural name of the thing. The script normalises them to
kebab-case and guarantees a single stable page per entity/concept.
## What the conform script guarantees (so the model cannot break it)
- **Paths:** `wiki/sources/<slug>.md`, `wiki/entities/<slug>.md`,
`wiki/concepts/<slug>.md`, `wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md`.
- **Slugs:** minimal kebab-case (lowercase, digits, hyphens; no spaces /
underscores / capitals).
- **Frontmatter:** `type`, `domain: <genome>`, `maturity: draft`,
`last_updated: <today>`, `private: false`, `tags`.
- **Create-vs-update:** existing entity/concept pages are **appended to** (a
section attributed to the new source), never overwritten. The source page is
the canonical summary of that exact source and is (re)written.
- **Manifest:** `.ingest-manifest.json` with `raw_source`, `reasoning`,
`pr_summary`, `contradictions` (string), and `pages[]` (`path`, `summary`,
`status`, plus `maturity` on created pages) — exactly what `run-ingest.sh`
validates.
The model name is recorded by the orchestrator (`INGEST_MODEL`); the model does
not self-report it. No `run_id`, branch, commit or PR is invented here — those
belong to phase 2.
> Interactive use of `pi` (TUI) is unaffected and still available for manual
> exploration. The **automated** ingest path no longer relies on `pi` or on
> native tool-calling: it is the single schema-constrained call above.