4.7 KiB
| name | description | license | compatibility | metadata | ||||||
|---|---|---|---|---|---|---|---|---|---|---|
| ingest | Semantic pass of a single raw source into the current genome's wiki. The model ONLY extracts structured semantic content (summary, entities, concepts, contradictions) and returns one JSON object — it does not write files, produce frontmatter, slugs, git, index, log or PRs. A deterministic conform script (ingest-semantic.py) turns that JSON into properly-structured wiki pages + a manifest; run-ingest.sh then does index/log/lint/PR. | see repository | Driven by scripts/ingest-semantic.py (one schema-constrained call to a local model via Ollama /api/chat). NO agent tools are used — no read, no edit, no bash. The model never touches the filesystem. PRIVATE_CONTEXT must be disabled. |
|
Ingest — semantic pass (structured-JSON)
This is the light semantic pass. The model's only job is to read one source
and return a single JSON object describing what the source contains. It does
not write files, choose paths, produce frontmatter, pick slugs, or touch
git / index / log / PRs. All structure is owned by scripts/ingest-semantic.py,
which conforms the model's JSON into wiki pages with enforced kebab-case paths
and frontmatter, and writes .ingest-manifest.json in the exact schema
run-ingest.sh consumes. This keeps the agent minimal and makes the output
impossible to mis-shape, regardless of how small or quirky the local model is.
Pipeline:
cd <genome checkout>
scripts/ingest-semantic.py <genome> raw/articles/<file>.md # phase 1 (this)
scripts/run-ingest.sh <genome> # phase 2 (deterministic)
Pre-flight (enforced by ingest-semantic.py, not by the model)
- Refuse if the source path is under any
private/directory. - Refuse if
PRIVATE_CONTEXTis notdisabled. - Confirm the file exists under
raw/and is non-empty.
What the model returns (the only contract)
A single JSON object, decoding-constrained to this shape via Ollama's format:
{
"source_title": "Human title of the source",
"source_summary": "Faithful, self-contained prose summary of the source.",
"key_points": ["Concrete fact or claim worth indexing", "..."],
"entities": [
{ "name": "Acme", "kind": "org", "description": "Vendor referenced by the source." }
],
"concepts": [
{ "name": "JWT RS256", "description": "Asymmetric token signing scheme the source uses." }
],
"contradictions": [
{ "concept": "auth", "description": "Source claims X, contradicting the existing claim Y." }
],
"reasoning": "One sentence for the log: what this source adds.",
"pr_summary": "One or two sentences describing this ingest for the PR."
}
Field rules (guidance for the model; the script enforces structure):
source_summaryis faithful and in the source's own language. No markdown headings inside any description field. No padding.entities= every person, tool, org or product the source names.kind∈person|tool|org|product.description= one or two factual sentences.concepts= every pattern, theory, decision or named idea the source explains.contradictions= only a claim that directly contradicts a widely-known fact or contradicts the source itself; otherwise an empty list.- Names are the natural name of the thing. The script normalises them to kebab-case and guarantees a single stable page per entity/concept.
What the conform script guarantees (so the model cannot break it)
- Paths:
wiki/sources/<slug>.md,wiki/entities/<slug>.md,wiki/concepts/<slug>.md,wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md. - Slugs: minimal kebab-case (lowercase, digits, hyphens; no spaces / underscores / capitals).
- Frontmatter:
type,domain: <genome>,maturity: draft,last_updated: <today>,private: false,tags. - Create-vs-update: existing entity/concept pages are appended to (a section attributed to the new source), never overwritten. The source page is the canonical summary of that exact source and is (re)written.
- Manifest:
.ingest-manifest.jsonwithraw_source,reasoning,pr_summary,contradictions(string), andpages[](path,summary,status, plusmaturityon created pages) — exactly whatrun-ingest.shvalidates.
The model name is recorded by the orchestrator (INGEST_MODEL); the model does
not self-report it. No run_id, branch, commit or PR is invented here — those
belong to phase 2.
Interactive use of
pi(TUI) is unaffected and still available for manual exploration. The automated ingest path no longer relies onpior on native tool-calling: it is the single schema-constrained call above.