feat(ingest): Implement 'light' semantic ingest with ingest-semantic.py

Matteo Cherubini 2026-06-18 15:26:53 +02:00
parent d207a0fc91
commit fdd7e1e92b
2 changed files with 346 additions and 70 deletions

@@ -1,93 +1,92 @@
 ---
 name: ingest
-description: Semantic pass of a single raw source into the current genome's wiki — read the source, write sources/entities/concepts, handle contradictions, then emit a manifest and STOP. Use when a new file lands in raw/. Does NOT do git, log, index, lint, or PRs (a post-processor handles those), and does NOT handle private sources or project repos.
+description: Semantic pass of a single raw source into the current genome's wiki. The model ONLY extracts structured semantic content (summary, entities, concepts, contradictions) and returns one JSON object — it does not write files, produce frontmatter, slugs, git, index, log or PRs. A deterministic conform script (ingest-semantic.py) turns that JSON into properly-structured wiki pages + a manifest; run-ingest.sh then does index/log/lint/PR.
 license: see repository
-compatibility: Runs inside one genome checkout (cwd = genome root). Tools needed — read, edit only. NO bash, NO git. The deterministic steps (index, log, scoped lint, PR) run AFTER you exit, via run-ingest.sh. PRIVATE_CONTEXT must be disabled.
-allowed-tools: read edit
+compatibility: Driven by scripts/ingest-semantic.py (one schema-constrained call to a local model via Ollama /api/chat). NO agent tools are used — no read, no edit, no bash. The model never touches the filesystem. PRIVATE_CONTEXT must be disabled.
 metadata:
   framework: knowledge-genome
   phase: "1-ingest-semantic"
+  mode: structured-json # lightweight agent + deterministic conform
 ---

-# Ingest — semantic pass
+# Ingest — semantic pass (structured-JSON)

-You run inside ONE genome checkout. `AGENTS.md` (already in your context) is the
-authoritative contract. Your job is the **semantic pass only**: read the source, write
-the wiki pages, handle contradictions. You do **not** touch git, the log, the index, the
-linter, or PRs — a post-processor (`run-ingest.sh`) does all of that _after you stop_,
-from the manifest you leave behind. This keeps your context clean and your turns few,
-which matters on a small local model.
+This is the **light** semantic pass. The model's only job is to read one source
+and return a single JSON object describing what the source contains. It does
+**not** write files, choose paths, produce frontmatter, pick slugs, or touch
+git / index / log / PRs. All structure is owned by `scripts/ingest-semantic.py`,
+which conforms the model's JSON into wiki pages with enforced kebab-case paths
+and frontmatter, and writes `.ingest-manifest.json` in the exact schema
+`run-ingest.sh` consumes. This keeps the agent minimal and makes the output
+impossible to mis-shape, regardless of how small or quirky the local model is.

-**Argument:** the relative path of the single raw source to ingest
-(e.g. `raw/articles/foo.md`). Process only this one.
+Pipeline:

-## Pre-flight — stop the session if any check fails
+    cd <genome checkout>
+    scripts/ingest-semantic.py <genome> raw/articles/<file>.md # phase 1 (this)
+    scripts/run-ingest.sh <genome> # phase 2 (deterministic)

-1. Refuse if the argument path is under any `private/` directory.
+## Pre-flight (enforced by ingest-semantic.py, not by the model)
+
+1. Refuse if the source path is under any `private/` directory.
 2. Refuse if `PRIVATE_CONTEXT` is not `disabled`.
-3. Confirm the file exists under `raw/`.
+3. Confirm the file exists under `raw/` and is non-empty.

-## Semantic work (your only job)
+## What the model returns (the only contract)

-1. Read the source once.
-2. Write `wiki/sources/<kebab-slug>.md` — faithful summary + key points, with the required
-   frontmatter (`type: source`, `domain: <genome>`, `maturity: draft`,
-   `last_updated: <today>`, `private: false`, sensible `tags`).
-3. For each entity (person, tool, org) → create or update `wiki/entities/<kebab-name>.md`.
-4. For each concept (pattern, theory, decision) → create or update
-   `wiki/concepts/<kebab-name>.md`.
-5. On a real contradiction with an existing claim, follow `AGENTS.md` §Conflict: create
-   `wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md`. Never overwrite the existing page.
-
-**Naming — you are the sole author of these names; nothing renames your files.** Use
-minimal kebab-case: lowercase letters, digits and hyphens only — no spaces, no underscores,
-no capitals. Pick stable names so the same entity is never created twice (always `acme`,
-never also `acme-corp`). The path you write a file to MUST be byte-for-byte the path you
-list in the manifest.
-
-**Deciding create-vs-update and spotting contradictions — mind the context budget.** Use
-`wiki/index.md` to locate existing pages, then read **only** the handful that _this source
-actually names_ — the entities and concepts in the source's title and opening paragraphs —
-not everything the index lists. When in doubt, read fewer: a missed cross-link is far
-cheaper than a saturated context. Never scan whole directories.
-
-## Finish: write the manifest, then STOP
-
-As your **final action**, write `.ingest-manifest.json` at the genome root
-(NOT under `wiki/`) describing exactly what you did. Then stop — do not commit, lint,
-append to the log/index, or open anything.
+A single JSON object, decoding-constrained to this shape via Ollama's `format`:

 ```json
 {
-  "raw_source": "raw/articles/foo.md",
-  "reasoning": "One sentence for the log: what changed and why.",
-  "pr_summary": "One or two sentences describing this ingest for the PR.",
-  "contradictions": "None (or: 1 conflict file created — <concept>)",
-  "pages": [
-    {
-      "path": "wiki/sources/foo.md",
-      "summary": "One-line index summary.",
-      "maturity": "draft",
-      "status": "created"
-    },
-    {
-      "path": "wiki/entities/acme.md",
-      "summary": "Acme — vendor.",
-      "status": "modified"
-    }
-  ]
+  "source_title": "Human title of the source",
+  "source_summary": "Faithful, self-contained prose summary of the source.",
+  "key_points": ["Concrete fact or claim worth indexing", "..."],
+  "entities": [
+    { "name": "Acme", "kind": "org", "description": "Vendor referenced by the source." }
+  ],
+  "concepts": [
+    { "name": "JWT RS256", "description": "Asymmetric token signing scheme the source uses." }
+  ],
+  "contradictions": [
+    { "concept": "auth", "description": "Source claims X, contradicting the existing claim Y." }
+  ],
+  "reasoning": "One sentence for the log: what this source adds.",
+  "pr_summary": "One or two sentences describing this ingest for the PR."
 }
 ```

-Manifest rules:
+Field rules (guidance for the model; the script enforces _structure_):

-- List every page you created or modified, with `status` `created` or `modified`.
-- `summary` is the one-line index description (≈12 words max). For conflict pages the
-  summary is ignored — the index lists conflicts by slug only.
-- `maturity` is required only on `created` pages (it seeds the new index entry). It is
-  ignored for `modified` pages, so omit it there.
-- Do NOT add a `model` field — the orchestrator records which model produced this run; you
-  cannot know your own model name reliably, so do not guess one.
-- Do not invent a `run_id`, branch, commit, or PR — those belong to the post-processor.
+- `source_summary` is faithful and in the source's own language. No markdown
+  headings inside any description field. No padding.
+- `entities` = every person, tool, org or product the source names. `kind`
+  is `person|tool|org|product`. `description` = one or two factual sentences.
+- `concepts` = every pattern, theory, decision or named idea the source explains.
+- `contradictions` = only a claim that directly contradicts a widely-known fact
+  or contradicts the source itself; otherwise an empty list.
+- Names are the natural name of the thing. The script normalises them to
+  kebab-case and guarantees a single stable page per entity/concept.

-One source per session. After writing the manifest, stop.
+## What the conform script guarantees (so the model cannot break it)
+
+- **Paths:** `wiki/sources/<slug>.md`, `wiki/entities/<slug>.md`,
+  `wiki/concepts/<slug>.md`, `wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md`.
+- **Slugs:** minimal kebab-case (lowercase, digits, hyphens; no spaces /
+  underscores / capitals).
+- **Frontmatter:** `type`, `domain: <genome>`, `maturity: draft`,
+  `last_updated: <today>`, `private: false`, `tags`.
+- **Create-vs-update:** existing entity/concept pages are **appended to** (a
+  section attributed to the new source), never overwritten. The source page is
+  the canonical summary of that exact source and is (re)written.
+- **Manifest:** `.ingest-manifest.json` with `raw_source`, `reasoning`,
+  `pr_summary`, `contradictions` (string), and `pages[]` (`path`, `summary`,
+  `status`, plus `maturity` on created pages) — exactly what `run-ingest.sh`
+  validates.
+
+The model name is recorded by the orchestrator (`INGEST_MODEL`); the model does
+not self-report it. No `run_id`, branch, commit or PR is invented here — those
+belong to phase 2.
+
+> Interactive use of `pi` (TUI) is unaffected and still available for manual
+> exploration. The **automated** ingest path no longer relies on `pi` or on
+> native tool-calling: it is the single schema-constrained call above.
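The manifest shape named in the skill text (`raw_source`, `reasoning`, `pr_summary`, a string `contradictions`, and `pages[]` with `path`, `summary`, `status`, plus `maturity` on created pages) can be checked mechanically. The following is a minimal sketch under that assumption only: `manifest_problems` and `ALLOWED_STATUS` are hypothetical names, and the actual validation inside `run-ingest.sh` is not part of this commit, so it may be stricter or looser than this.

```python
ALLOWED_STATUS = {"created", "modified"}


def manifest_problems(manifest):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    # top-level fields are all plain strings (contradictions included)
    for key in ("raw_source", "reasoning", "pr_summary", "contradictions"):
        if not isinstance(manifest.get(key), str):
            problems.append(f"{key}: expected a string")
    pages = manifest.get("pages")
    if not isinstance(pages, list):
        problems.append("pages: expected a list")
        return problems
    for i, page in enumerate(pages):
        if not isinstance(page, dict):
            problems.append(f"pages[{i}]: expected an object")
            continue
        for key in ("path", "summary"):
            if not isinstance(page.get(key), str):
                problems.append(f"pages[{i}].{key}: expected a string")
        status = page.get("status")
        if status not in ALLOWED_STATUS:
            problems.append(f"pages[{i}].status: expected 'created' or 'modified'")
        elif status == "created" and "maturity" not in page:
            # maturity is only needed where a new index entry is seeded
            problems.append(f"pages[{i}].maturity: required on created pages")
    return problems


example = {
    "raw_source": "raw/articles/foo.md",
    "reasoning": "Adds a source page and one entity.",
    "pr_summary": "Semantic ingest of raw/articles/foo.md",
    "contradictions": "None",
    "pages": [
        {"path": "wiki/sources/foo.md", "summary": "Foo",
         "maturity": "draft", "status": "created"},
        {"path": "wiki/entities/acme.md", "summary": "Acme, a vendor.",
         "status": "modified"},
    ],
}
print(manifest_problems(example))
```

A check like this could run between phase 1 and phase 2 to fail early with a readable message instead of a shell error.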

@@ -0,0 +1,277 @@
#!/usr/bin/env python3
# =============================================================================
# skills/ingest/scripts/ingest-semantic.py
# Phase 1 (semantic) of the Knowledge Genome ingest — the LIGHT version.
#
# The model does ONLY semantic extraction and returns ONE schema-constrained JSON
# object (no tools, no file writing, no git, no frontmatter, no slugs). This script
# then CONFORMS that output deterministically into wiki pages with enforced
# frontmatter + kebab-case paths, and writes a .ingest-manifest.json in EXACTLY the
# schema run-ingest.sh expects. run-ingest.sh (phase 2) then does index / log /
# scoped-lint / PR, unchanged.
#
#   cd <genome checkout>
#   ingest-semantic.py <genome> raw/articles/<file>.md   # phase 1 (this)
#   run-ingest.sh <genome>                               # phase 2 (deterministic)
#
# Why this shape: local tool-calling via pi/ollama proved fragile, and a small
# model does not reliably honour folders / naming / frontmatter / manifest schema
# when it writes files itself. Here the model cannot break the contract because it
# never touches the filesystem — the script owns all structure. Stdlib only.
#
# Emits a single JSON status line on stdout (for n8n / logs).
# =============================================================================
import json, os, re, sys, datetime, urllib.request, urllib.error
# --- config (override via env; these live in ~/.config/knowledge-genome.env) ---
OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434/api/chat")
MODEL = os.environ.get("INGEST_MODEL", "qwen2.5:14b")
NUM_CTX = int(os.environ.get("INGEST_NUM_CTX", "16384"))
TIMEOUT = int(os.environ.get("INGEST_TIMEOUT", "600"))
TODAY = datetime.date.today().isoformat()
def die(stage, reason):
    # emit the single JSON status line, then exit non-zero
    print(json.dumps({"status": "error", "stage": stage, "reason": reason}))
    sys.exit(1)
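The script reports through one JSON line on stdout: `die` prints `status`, `stage` and `reason` and exits with code 1, while the success line at the end of the script carries `status`, `stage`, `pages`, `model` and `manifest`. A caller such as an n8n step could read it roughly as follows. This is a sketch, not part of the commit; `parse_status_line` and `describe` are hypothetical helper names.

```python
import json


def parse_status_line(stdout_text):
    """Parse the last non-empty stdout line as the JSON status object."""
    lines = [ln for ln in stdout_text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("no status line on stdout")
    status = json.loads(lines[-1])
    if not isinstance(status, dict) or "status" not in status:
        raise ValueError("unexpected status line: " + lines[-1])
    return status


def describe(status):
    """One-line summary suitable for a workflow log."""
    if status["status"] == "ok":
        return f"ok: {status.get('pages')} page(s) written by {status.get('model')}"
    return f"error at {status.get('stage')}: {status.get('reason')}"


ok_line = ('{"status": "ok", "stage": "semantic", "pages": 3, '
           '"model": "qwen2.5:14b", "manifest": ".ingest-manifest.json"}')
err_line = ('{"status": "error", "stage": "preflight", '
            '"reason": "PRIVATE_CONTEXT must be disabled"}')
print(describe(parse_status_line(ok_line)))
print(describe(parse_status_line(err_line)))
```

Reading only the last non-empty line keeps the caller tolerant of any stray output printed before the status line.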
# --- args + pre-flight (mirror the old skill's guards, enforced in code) ---
if len(sys.argv) < 3:
    die("args", "usage: ingest-semantic.py <genome> <raw/rel/path.md>")
genome = sys.argv[1]
# normpath drops a leading "./" and collapses "..", so "raw/../wiki/x.md" cannot
# pass the raw/ check below; str.lstrip("./") strips any run of "." and "/"
# characters rather than the "./" prefix, which would turn "../raw/x" into "raw/x"
raw_rel = os.path.normpath(sys.argv[2])
if "private/" in raw_rel or raw_rel.startswith("private"):
    die("preflight", "refusing private source: " + raw_rel)
if os.environ.get("PRIVATE_CONTEXT", "disabled") != "disabled":
    die("preflight", "PRIVATE_CONTEXT must be disabled")
if not raw_rel.startswith("raw/"):
    die("preflight", "source must live under raw/: " + raw_rel)
if not os.path.isfile(raw_rel):
    die("preflight", "source not found in cwd: " + raw_rel)
with open(raw_rel, "r", encoding="utf-8") as fh:
    source_text = fh.read()
if not source_text.strip():
    die("preflight", "source is empty: " + raw_rel)
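Because the pre-flight guards run at module level, they are awkward to unit-test. One option is to express the path-level checks as a pure function. The sketch below assumes POSIX-style relative paths and normalises with `os.path.normpath`; `preflight_problem` is a hypothetical name, and the file-existence and non-empty checks are left out because they need the filesystem.

```python
import os


def preflight_problem(raw_rel, private_context="disabled"):
    """Return a refusal reason, or None when the path-level guards pass."""
    # collapse "./" and ".." first so later prefix checks see the real target
    rel = os.path.normpath(raw_rel).replace(os.sep, "/")
    if rel.startswith("private") or "private/" in rel:
        return "refusing private source"
    if private_context != "disabled":
        return "PRIVATE_CONTEXT must be disabled"
    if not rel.startswith("raw/"):
        return "source must live under raw/"
    return None


for candidate in ["raw/articles/foo.md", "./raw/articles/foo.md",
                  "raw/../wiki/x.md", "raw/private/notes.md"]:
    print(candidate, "->", preflight_problem(candidate))
```

The order of the checks matches the order in the script, so the first failing guard is the one reported.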
# --- the semantic contract (authoritative copy; SKILL.md documents it) ---
SYSTEM_PROMPT = """You perform the SEMANTIC PASS of a single source into a knowledge wiki.
Read the source and return ONLY structured data describing what it contains.
You do not write files, you do not produce frontmatter, and you do not invent
paths, slugs, branches, commits or PRs; a deterministic script does all of that.
Rules:
- source_summary: a faithful, self-contained summary of the source, in the
source's own language. Plain prose, no markdown headings.
- key_points: the handful of concrete facts/claims worth indexing.
- entities: every person, tool, organisation or product the source names.
kind is one of person|tool|org|product. description is one or two factual
sentences. No markdown headings inside the description.
- concepts: every pattern, theory, decision or named idea the source explains.
description is one or two factual sentences.
- contradictions: ONLY when the source makes a claim that directly contradicts a
widely-known fact or contradicts itself. Otherwise return an empty list.
- Names must be the natural name of the thing; the script will normalise them.
Do not pad. Be faithful to the source."""
# --- JSON schema -> constrained decoding (Ollama structured outputs) ---
SCHEMA = {
    "type": "object",
    "properties": {
        "source_title": {"type": "string"},
        "source_summary": {"type": "string"},
        "key_points": {"type": "array", "items": {"type": "string"}},
        "entities": {"type": "array", "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "kind": {"type": "string",
                         "enum": ["person", "tool", "org", "product"]},
                "description": {"type": "string"},
            },
            "required": ["name", "description"],
        }},
        "concepts": {"type": "array", "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "description": {"type": "string"},
            },
            "required": ["name", "description"],
        }},
        "contradictions": {"type": "array", "items": {
            "type": "object",
            "properties": {
                "concept": {"type": "string"},
                "description": {"type": "string"},
            },
            "required": ["concept", "description"],
        }},
        "reasoning": {"type": "string"},
        "pr_summary": {"type": "string"},
    },
    "required": ["source_title", "source_summary", "entities", "concepts"],
}
def call_model():
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content":
                "Source path: " + raw_rel + "\n\n--- SOURCE START ---\n"
                + source_text + "\n--- SOURCE END ---\n\nReturn the JSON now."},
        ],
        "format": SCHEMA,  # schema-constrained generation
        "stream": False,
        # low temperature for stable (not strictly deterministic) extraction;
        # repeat_penalty 1.0 turns the repetition penalty off for structured output
        "options": {"temperature": 0.2, "repeat_penalty": 1.0, "num_ctx": NUM_CTX},
    }
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=TIMEOUT) as r:
            resp = json.loads(r.read().decode("utf-8"))
    except (OSError, ValueError) as e:
        # OSError covers URLError/HTTPError and socket timeouts;
        # ValueError covers a response body that is not valid JSON
        die("model", "ollama request failed: " + str(e))
    content = ((resp.get("message") or {}).get("content") or "").strip()
    # schema-constrained, but stay defensive if a model wraps it in a fence
    if content.startswith("```"):
        content = content.strip("`")
    brace = content.find("{")
    if brace >= 0:
        content = content[brace:]
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        die("model", "model did not return valid JSON: " + str(e))
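The fence handling in `call_model` covers a reply that is wrapped in a code fence, but `json.loads` still rejects any text that follows the object. If that ever turns out to matter, a slightly more tolerant decoder is possible with the standard library's `json.JSONDecoder.raw_decode`, which parses one value from a given index and ignores what comes after. The sketch below is an alternative, not what the commit does, and `extract_json_object` is a hypothetical name.

```python
import json


def extract_json_object(content):
    """Decode the first JSON object in content, ignoring fences and trailing text."""
    content = (content or "").strip()
    brace = content.find("{")
    if brace < 0:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one value starting at index `brace` and returns the
    # position where it ended, so a closing fence or trailing prose is ignored
    obj, _end = json.JSONDecoder().raw_decode(content, brace)
    return obj


fence = "`" * 3  # built at runtime so this example contains no literal fence
samples = [
    '{"source_title": "Foo"}',
    fence + 'json\n{"source_title": "Foo"}\n' + fence,
    fence + 'json\n{"source_title": "Foo"}\n' + fence + "\nDone.",
]
for sample in samples:
    print(extract_json_object(sample))
```

The trade-off is that a stray `{` in any preamble would be taken as the start of the object; with schema-constrained decoding that preamble should not occur in the first place.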
# --- conform helpers (the script OWNS all structure) ---
def slugify(s):
    # anything outside a-z0-9 becomes a hyphen; runs collapse; edges are trimmed
    s = re.sub(r"[^a-z0-9]+", "-", (s or "").strip().lower())
    return re.sub(r"-+", "-", s).strip("-") or "untitled"

def twords(s, n=12):
    # collapse whitespace, then keep at most the first n words
    s = " ".join((s or "").split())
    w = s.split(" ")
    return s if len(w) <= n else " ".join(w[:n])
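It helps to see what `slugify` does on a few names, because the stable-page guarantee depends on it. The function below is copied from the script. `slugify_ascii` is a hypothetical variant, not in the commit, that folds accents to ASCII first; it is shown only to illustrate one possible way to handle non-ASCII names.

```python
import re
import unicodedata


def slugify(s):
    # same normalisation as the script
    s = re.sub(r"[^a-z0-9]+", "-", (s or "").strip().lower())
    return re.sub(r"-+", "-", s).strip("-") or "untitled"


def slugify_ascii(s):
    # hypothetical variant: decompose accented letters and drop the accents
    folded = unicodedata.normalize("NFKD", s or "")
    folded = folded.encode("ascii", "ignore").decode("ascii")
    return slugify(folded)


names = ["JWT RS256", "Acme Corp.", "acme_corp", "Caf\u00e9", "\u65e5\u672c\u8a9e"]
for name in names:
    print(f"{name!r}: {slugify(name)} / {slugify_ascii(name)}")
```

Two things follow from the regex as written. Names made only of non-ASCII characters all collapse to `untitled`, so they would share one page. And the mapping is deterministic but does not merge aliases: `Acme` and `Acme Corp.` still produce two different slugs.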
def frontmatter(ptype, tags):
    # de-duplicated, sorted tags rendered as a YAML flow list
    taglist = "[" + ", ".join(sorted(set(t for t in tags if t))) + "]"
    return ("---\n"
            f"type: {ptype}\n"
            f"domain: {genome}\n"
            "maturity: draft\n"
            f"last_updated: {TODAY}\n"
            "private: false\n"
            f"tags: {taglist}\n"
            "---\n")

def write_new(path, ptype, title, body, tags):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(frontmatter(ptype, tags))
        f.write(f"\n# {title}\n\n{body}\n")
def append_section(path, source_slug, body):
    # never overwrite an existing page: accumulate, attributed to the new source
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"\n\n## From [[sources/{source_slug}]]\n\n{body}\n")
    try:  # best-effort bump of last_updated in the existing frontmatter
        with open(path, "r", encoding="utf-8") as f:
            txt = f.read()
        txt = re.sub(r"(?m)^last_updated:.*$", "last_updated: " + TODAY, txt, count=1)
        with open(path, "w", encoding="utf-8") as f:
            f.write(txt)
    except Exception:
        pass
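The `last_updated` bump in `append_section` relies on a multiline regex with `count=1`. The sketch below isolates that one substitution so its behaviour can be checked on a string; `bump_last_updated` is a hypothetical wrapper name, and the sample page is made up for illustration.

```python
import re


def bump_last_updated(text, today):
    """Rewrite the first line that starts with last_updated: and leave the rest."""
    return re.sub(r"(?m)^last_updated:.*$", "last_updated: " + today, text, count=1)


page = (
    "---\n"
    "type: entity\n"
    "last_updated: 2026-01-02\n"
    "---\n"
    "\n# Acme\n\n"
    "last_updated: this body line is not frontmatter\n"
)
bumped = bump_last_updated(page, "2026-06-18")
print(bumped)
```

Since the pattern is not anchored to the frontmatter block, a page that lacks the field in its frontmatter but has such a line in its body would get that body line rewritten instead. A page with no matching line is returned unchanged.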
# --- run the semantic pass ---
sem = call_model()
source_slug = slugify(os.path.splitext(os.path.basename(raw_rel))[0])
pages = []

# 1. source page — canonical summary of THIS source (re)written
src_path = f"wiki/sources/{source_slug}.md"
src_status = "modified" if os.path.exists(src_path) else "created"
kp_lines = "\n".join("- " + p for p in (sem.get("key_points") or []) if p.strip())
src_body = (sem.get("source_summary") or "").strip()
if kp_lines:
    src_body += "\n\n## Key points\n\n" + kp_lines
src_body += f"\n\n## Source\n\n- [[{raw_rel}]]\n"
# tags come from entity and concept slugs, capped at eight
src_tags = ([slugify(e.get("name", "")) for e in sem.get("entities", [])]
            + [slugify(c.get("name", "")) for c in sem.get("concepts", [])])[:8]
os.makedirs("wiki/sources", exist_ok=True)
with open(src_path, "w", encoding="utf-8") as f:
    f.write(frontmatter("source", src_tags))
    f.write(f"\n# {sem.get('source_title') or source_slug}\n\n{src_body}\n")
pages.append({"path": src_path,
              "summary": twords(sem.get("source_title") or source_slug),
              "maturity": "draft", "status": src_status})
def handle(kind_dir, ptype, items):
    for it in items or []:
        name = (it.get("name") or "").strip()
        if not name:
            continue
        slug = slugify(name)
        path = f"wiki/{kind_dir}/{slug}.md"
        desc = (it.get("description") or "").strip()
        if os.path.exists(path):
            # existing page: append a section attributed to this source
            append_section(path, source_slug, desc)
            pages.append({"path": path, "summary": twords(desc), "status": "modified"})
        else:
            body = desc + f"\n\n## Sources\n\n- [[sources/{source_slug}]]\n"
            write_new(path, ptype, name, body, [genome, ptype])
            pages.append({"path": path, "summary": twords(desc),
                          "maturity": "draft", "status": "created"})

# 2. entities, 3. concepts
handle("entities", "entity", sem.get("entities", []))
handle("concepts", "concept", sem.get("concepts", []))
# 4. contradictions -> conflict pages (run-ingest routes wiki/queries/conflict-*)
conflicts = sem.get("contradictions") or []
conf_slugs = []
for c in conflicts:
    cslug = slugify(c.get("concept", "unknown"))
    conf_slugs.append(cslug)
    path = f"wiki/queries/conflict-{cslug}-{TODAY}.md"
    write_new(path, "query", f"Conflict: {c.get('concept', '')}",
              (c.get("description") or "").strip()
              + f"\n\n## Source\n\n- [[sources/{source_slug}]]\n",
              [genome, "conflict"])
    pages.append({"path": path, "summary": "", "maturity": "draft",
                  "status": "created"})
contradictions_str = ("None" if not conflicts
                      else f"{len(conflicts)} conflict file(s) created — "
                      + ", ".join(conf_slugs))
# --- write the manifest in EXACTLY run-ingest.sh's schema ---
manifest = {
    "raw_source": raw_rel,
    "reasoning": sem.get("reasoning") or ("Ingest of " + raw_rel),
    "pr_summary": sem.get("pr_summary") or ("Semantic ingest of " + raw_rel),
    "contradictions": contradictions_str,
    "pages": pages,
}
with open(".ingest-manifest.json", "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2, ensure_ascii=False)
print(json.dumps({"status": "ok", "stage": "semantic",
                  "pages": len(pages), "model": MODEL,
                  "manifest": ".ingest-manifest.json"}))