92 lines
4.7 KiB
Markdown
92 lines
4.7 KiB
Markdown
---
|
|
name: ingest
|
|
description: Semantic pass of a single raw source into the current genome's wiki. The model ONLY extracts structured semantic content (summary, entities, concepts, contradictions) and returns one JSON object — it does not write files, produce frontmatter, slugs, git, index, log or PRs. A deterministic conform script (ingest-semantic.py) turns that JSON into properly-structured wiki pages + a manifest; run-ingest.sh then does index/log/lint/PR.
|
|
license: see repository
|
|
compatibility: Driven by scripts/ingest-semantic.py (one schema-constrained call to a local model via Ollama /api/chat). NO agent tools are used — no read, no edit, no bash. The model never touches the filesystem. PRIVATE_CONTEXT must be disabled.
|
|
metadata:
|
|
framework: knowledge-genome
|
|
phase: "1-ingest-semantic"
|
|
mode: structured-json # lightweight agent + deterministic conform
|
|
---
|
|
|
|
# Ingest — semantic pass (structured-JSON)
|
|
|
|
This is the **light** semantic pass. The model's only job is to read one source
|
|
and return a single JSON object describing what the source contains. It does
|
|
**not** write files, choose paths, produce frontmatter, pick slugs, or touch
|
|
git / index / log / PRs. All structure is owned by `scripts/ingest-semantic.py`,
|
|
which conforms the model's JSON into wiki pages with enforced kebab-case paths
|
|
and frontmatter, and writes `.ingest-manifest.json` in the exact schema
|
|
`run-ingest.sh` consumes. This keeps the agent minimal and makes the output
|
|
impossible to mis-shape, regardless of how small or quirky the local model is.
|
|
|
|
Pipeline:
|
|
|
|
cd <genome checkout>
|
|
scripts/ingest-semantic.py <genome> raw/articles/<file>.md # phase 1 (this)
|
|
scripts/run-ingest.sh <genome> # phase 2 (deterministic)
|
|
|
|
## Pre-flight (enforced by ingest-semantic.py, not by the model)
|
|
|
|
1. Refuse if the source path is under any `private/` directory.
|
|
2. Refuse if `PRIVATE_CONTEXT` is not `disabled`.
|
|
3. Confirm the file exists under `raw/` and is non-empty.
|
|
|
|
## What the model returns (the only contract)
|
|
|
|
A single JSON object, decoding-constrained to this shape via Ollama's `format`:
|
|
|
|
```json
|
|
{
|
|
"source_title": "Human title of the source",
|
|
"source_summary": "Faithful, self-contained prose summary of the source.",
|
|
"key_points": ["Concrete fact or claim worth indexing", "..."],
|
|
"entities": [
|
|
{ "name": "Acme", "kind": "org", "description": "Vendor referenced by the source." }
|
|
],
|
|
"concepts": [
|
|
{ "name": "JWT RS256", "description": "Asymmetric token signing scheme the source uses." }
|
|
],
|
|
"contradictions": [
|
|
{ "concept": "auth", "description": "Source claims X, contradicting the existing claim Y." }
|
|
],
|
|
"reasoning": "One sentence for the log: what this source adds.",
|
|
"pr_summary": "One or two sentences describing this ingest for the PR."
|
|
}
|
|
```
|
|
|
|
Field rules (guidance for the model; the script enforces _structure_):
|
|
|
|
- `source_summary` is faithful and in the source's own language. No markdown
|
|
headings inside any description field. No padding.
|
|
- `entities` = every person, tool, org or product the source names. `kind` ∈
|
|
`person|tool|org|product`. `description` = one or two factual sentences.
|
|
- `concepts` = every pattern, theory, decision or named idea the source explains.
|
|
- `contradictions` = only a claim that directly contradicts a widely-known fact
|
|
or contradicts the source itself; otherwise an empty list.
|
|
- Names are the natural name of the thing. The script normalises them to
|
|
kebab-case and guarantees a single stable page per entity/concept.
|
|
|
|
## What the conform script guarantees (so the model cannot break it)
|
|
|
|
- **Paths:** `wiki/sources/<slug>.md`, `wiki/entities/<slug>.md`,
|
|
`wiki/concepts/<slug>.md`, `wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md`.
|
|
- **Slugs:** minimal kebab-case (lowercase, digits, hyphens; no spaces /
|
|
underscores / capitals).
|
|
- **Frontmatter:** `type`, `domain: <genome>`, `maturity: draft`,
|
|
`last_updated: <today>`, `private: false`, `tags`.
|
|
- **Create-vs-update:** existing entity/concept pages are **appended to** (a
|
|
section attributed to the new source), never overwritten. The source page is
|
|
the canonical summary of that exact source and is (re)written.
|
|
- **Manifest:** `.ingest-manifest.json` with `raw_source`, `reasoning`,
|
|
`pr_summary`, `contradictions` (string), and `pages[]` (`path`, `summary`,
|
|
`status`, plus `maturity` on created pages) — exactly what `run-ingest.sh`
|
|
validates.
|
|
|
|
The model name is recorded by the orchestrator (`INGEST_MODEL`); the model does
|
|
not self-report it. No `run_id`, branch, commit or PR is invented here — those
|
|
belong to phase 2.
|
|
|
|
> Interactive use of `pi` (TUI) is unaffected and still available for manual
|
|
> exploration. The **automated** ingest path no longer relies on `pi` or on
|
|
> native tool-calling: it is the single schema-constrained call above.
|