feat(ingest): Implement 'light' semantic ingest with ingest-semantic.py

2026-06-18 15:26:53 +02:00 · 2026-06-18 15:26:53 +02:00 · fdd7e1e92b
commit fdd7e1e92b
parent d207a0fc91
2 changed files with 346 additions and 70 deletions
--- a/skills/ingest/SKILL.md
+++ b/skills/ingest/SKILL.md
@@ -1,93 +1,92 @@
 ---
 name: ingest
-description: Semantic pass of a single raw source into the current genome's wiki — read the source, write sources/entities/concepts, handle contradictions, then emit a manifest and STOP. Use when a new file lands in raw/. Does NOT do git, log, index, lint, or PRs (a post-processor handles those), and does NOT handle private sources or project repos.
+description: Semantic pass of a single raw source into the current genome's wiki. The model ONLY extracts structured semantic content (summary, entities, concepts, contradictions) and returns one JSON object; it does not write files, produce frontmatter or slugs, or touch git, the index, the log or PRs. A deterministic conform script (ingest-semantic.py) turns that JSON into properly structured wiki pages plus a manifest; run-ingest.sh then does index/log/lint/PR.
 license: see repository
-compatibility: Runs inside one genome checkout (cwd = genome root). Tools needed — read, edit only. NO bash, NO git. The deterministic steps (index, log, scoped lint, PR) run AFTER you exit, via run-ingest.sh. PRIVATE_CONTEXT must be disabled.
+compatibility: Driven by scripts/ingest-semantic.py (one schema-constrained call to a local model via Ollama /api/chat). NO agent tools are used — no read, no edit, no bash. The model never touches the filesystem. PRIVATE_CONTEXT must be disabled.
 allowed-tools: read edit
 metadata:
  framework: knowledge-genome
  phase: "1-ingest-semantic"
  mode: structured-json # lightweight agent + deterministic conform
 ---
-# Ingest — semantic pass
+# Ingest — semantic pass (structured-JSON)
-You run inside ONE genome checkout. `AGENTS.md` (already in your context) is the
+This is the **light** semantic pass. The model's only job is to read one source
-authoritative contract. Your job is the **semantic pass only**: read the source, write
+and return a single JSON object describing what the source contains. It does
-the wiki pages, handle contradictions. You do **not** touch git, the log, the index, the
+**not** write files, choose paths, produce frontmatter, pick slugs, or touch
-linter, or PRs — a post-processor (`run-ingest.sh`) does all of that _after you stop_,
+git / index / log / PRs. All structure is owned by `scripts/ingest-semantic.py`,
-from the manifest you leave behind. This keeps your context clean and your turns few,
+which conforms the model's JSON into wiki pages with enforced kebab-case paths
-which matters on a small local model.
+and frontmatter, and writes `.ingest-manifest.json` in the exact schema
 `run-ingest.sh` consumes. This keeps the agent minimal and makes the output
 impossible to mis-shape, regardless of how small or quirky the local model is.
-**Argument:** the relative path of the single raw source to ingest
+Pipeline:
 (e.g. `raw/articles/foo.md`). Process only this one.
-## Pre-flight — stop the session if any check fails
+    cd <genome checkout>
    scripts/ingest-semantic.py <genome> raw/articles/<file>.md   # phase 1 (this)
    scripts/run-ingest.sh      <genome>                          # phase 2 (deterministic)
-1. Refuse if the argument path is under any `private/` directory.
+## Pre-flight (enforced by ingest-semantic.py, not by the model)
 1. Refuse if the source path is under any `private/` directory.
 2. Refuse if `PRIVATE_CONTEXT` is not `disabled`.
-3. Confirm the file exists under `raw/`.
+3. Confirm the file exists under `raw/` and is non-empty.
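
The three pre-flight checks can be sketched as one small pure function. This is a minimal illustration, not the shipped code: `preflight` is a hypothetical helper that returns a refusal reason instead of exiting, and it assumes "under a `private/` directory" means any directory component named exactly `private`.

```python
import os
import tempfile


def preflight(raw_rel, env, root="."):
    """Return None if the source may be ingested, else a refusal reason.

    Hypothetical helper for illustration; the real script checks inline.
    """
    rel = os.path.normpath(raw_rel).replace(os.sep, "/")
    parts = rel.split("/")
    # 1. refuse anything under a directory named private
    if "private" in parts[:-1]:
        return "refusing private source"
    # 2. PRIVATE_CONTEXT must be disabled (unset counts as disabled here)
    if env.get("PRIVATE_CONTEXT", "disabled") != "disabled":
        return "PRIVATE_CONTEXT must be disabled"
    # 3. the file must live under raw/, exist, and be non-empty
    if parts[0] != "raw":
        return "source must live under raw/"
    full = os.path.join(root, *parts)
    if not os.path.isfile(full):
        return "source not found"
    with open(full, "r", encoding="utf-8") as fh:
        if not fh.read().strip():
            return "source is empty"
    return None
```

Returning a reason rather than exiting keeps the checks easy to test in isolation; a caller can turn any non-`None` result into its own error report.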
-## Semantic work (your only job)
+## What the model returns (the only contract)
-1. Read the source once.
+A single JSON object, decoding-constrained to this shape via Ollama's `format`:
 2. Write `wiki/sources/<kebab-slug>.md` — faithful summary + key points, with the required
   frontmatter (`type: source`, `domain: <genome>`, `maturity: draft`,
   `last_updated: <today>`, `private: false`, sensible `tags`).
 3. For each entity (person, tool, org) → create or update `wiki/entities/<kebab-name>.md`.
 4. For each concept (pattern, theory, decision) → create or update
   `wiki/concepts/<kebab-name>.md`.
 5. On a real contradiction with an existing claim, follow `AGENTS.md` §Conflict: create
   `wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md`. Never overwrite the existing page.
 **Naming — you are the sole author of these names; nothing renames your files.** Use
 minimal kebab-case: lowercase letters, digits and hyphens only — no spaces, no underscores,
 no capitals. Pick stable names so the same entity is never created twice (always `acme`,
 never also `acme-corp`). The path you write a file to MUST be byte-for-byte the path you
 list in the manifest.
 **Deciding create-vs-update and spotting contradictions — mind the context budget.** Use
 `wiki/index.md` to locate existing pages, then read **only** the handful that _this source
 actually names_ — the entities and concepts in the source's title and opening paragraphs —
 not everything the index lists. When in doubt, read fewer: a missed cross-link is far
 cheaper than a saturated context. Never scan whole directories.
 ## Finish: write the manifest, then STOP
 As your **final action**, write `.ingest-manifest.json` at the genome root
 (NOT under `wiki/`) describing exactly what you did. Then stop — do not commit, lint,
 append to the log/index, or open anything.
 ```json
 {
-  "raw_source": "raw/articles/foo.md",
+  "source_title": "Human title of the source",
-  "reasoning": "One sentence for the log: what changed and why.",
+  "source_summary": "Faithful, self-contained prose summary of the source.",
-  "pr_summary": "One or two sentences describing this ingest for the PR.",
+  "key_points": ["Concrete fact or claim worth indexing", "..."],
-  "contradictions": "None   (or: 1 conflict file created — <concept>)",
+  "entities": [
-  "pages": [
+    { "name": "Acme", "kind": "org", "description": "Vendor referenced by the source." }
-    {
+  ],
-      "path": "wiki/sources/foo.md",
+  "concepts": [
-      "summary": "One-line index summary.",
+    { "name": "JWT RS256", "description": "Asymmetric token signing scheme the source uses." }
-      "maturity": "draft",
+  ],
-      "status": "created"
+  "contradictions": [
-    },
+    { "concept": "auth", "description": "Source claims X, contradicting the existing claim Y." }
-    {
+  ],
-      "path": "wiki/entities/acme.md",
+  "reasoning": "One sentence for the log: what this source adds.",
-      "summary": "Acme — vendor.",
+  "pr_summary": "One or two sentences describing this ingest for the PR."
      "status": "modified"
    }
  ]
 }
 ```
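
To make the decoding constraint concrete, here is a hedged sketch of how a request body for Ollama's `/api/chat` might carry a JSON schema in its `format` field. `build_payload` and `MINI_SCHEMA` are illustrative names, the schema is trimmed to three fields, the model name is only an example value, and nothing is sent over the network.

```python
import json

# Trimmed schema for illustration; the full contract has more fields.
MINI_SCHEMA = {
    "type": "object",
    "properties": {
        "source_title": {"type": "string"},
        "source_summary": {"type": "string"},
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "description": {"type": "string"},
                },
                "required": ["name", "description"],
            },
        },
    },
    "required": ["source_title", "source_summary", "entities"],
}


def build_payload(model, system_prompt, source_text, schema):
    """Assemble a non-streaming chat request whose output is schema-constrained."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": source_text},
        ],
        "format": schema,   # the schema the decoder is constrained to
        "stream": False,    # one complete response instead of chunks
    }


body = json.dumps(build_payload(
    "qwen2.5:14b",                      # example model name
    "Return only structured data.",     # placeholder system prompt
    "Acme signs tokens with JWT RS256.",
    MINI_SCHEMA,
))
```

The serialized `body` is what would be POSTed; the model's reply is then expected in the response's `message.content` as a JSON string.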
-Manifest rules:
+Field rules (guidance for the model; the script enforces _structure_):
- List every page you created or modified, with `status` `created` or `modified`.
+- `source_summary` is faithful and in the source's own language. No markdown
- `summary` is the one-line index description (≈12 words max). For conflict pages the
+  headings inside any description field. No padding.
-  summary is ignored — the index lists conflicts by slug only.
+- `entities` = every person, tool, org or product the source names. `kind` ∈
- `maturity` is required only on `created` pages (it seeds the new index entry). It is
+  `person|tool|org|product`. `description` = one or two factual sentences.
-  ignored for `modified` pages, so omit it there.
+- `concepts` = every pattern, theory, decision or named idea the source explains.
- Do NOT add a `model` field — the orchestrator records which model produced this run; you
+- `contradictions` = only a claim that directly contradicts a widely-known fact
-  cannot know your own model name reliably, so do not guess one.
+  or contradicts the source itself; otherwise an empty list.
- Do not invent a `run_id`, branch, commit, or PR — those belong to the post-processor.
+- Names are the natural name of the thing. The script normalises them to
  kebab-case and guarantees a single stable page per entity/concept.
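
The name normalisation can be illustrated with a small sketch that follows the kebab-case rule described here (lowercase letters, digits and hyphens only). Treat it as an approximation of the script's helper rather than a copy; the `untitled` fallback for empty names is an assumption of this sketch.

```python
import re


def slugify(name):
    """Lowercase, turn every run of other characters into one hyphen, trim hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", (name or "").strip().lower())
    return slug.strip("-") or "untitled"


print(slugify("JWT RS256"))     # jwt-rs256
print(slugify("Acme Corp."))    # acme-corp
print(slugify("  acme_corp "))  # acme-corp
```

Note that normalisation only merges spellings that differ in case, spacing or punctuation: `Acme` and `Acme Corp` still produce two different slugs, so a single page per entity also depends on the model using the same natural name each time.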
-One source per session. After writing the manifest, stop.
+## What the conform script guarantees (so the model cannot break it)
 - **Paths:** `wiki/sources/<slug>.md`, `wiki/entities/<slug>.md`,
  `wiki/concepts/<slug>.md`, `wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md`.
 - **Slugs:** minimal kebab-case (lowercase, digits, hyphens; no spaces /
  underscores / capitals).
 - **Frontmatter:** `type`, `domain: <genome>`, `maturity: draft`,
  `last_updated: <today>`, `private: false`, `tags`.
 - **Create-vs-update:** existing entity/concept pages are **appended to** (a
  section attributed to the new source), never overwritten. The source page is
  the canonical summary of that exact source and is (re)written.
 - **Manifest:** `.ingest-manifest.json` with `raw_source`, `reasoning`,
  `pr_summary`, `contradictions` (string), and `pages[]` (`path`, `summary`,
  `status`, plus `maturity` on created pages) — exactly what `run-ingest.sh`
  validates.
 The model name is recorded by the orchestrator (`INGEST_MODEL`); the model does
 not self-report it. No `run_id`, branch, commit or PR is invented here — those
 belong to phase 2.
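
As a rough illustration of the manifest shape listed above, the following sketch checks a manifest dictionary against those fields. `check_manifest` is a hypothetical helper; the actual validation performed by `run-ingest.sh` is not shown in this change and may differ.

```python
REQUIRED_TOP = ("raw_source", "reasoning", "pr_summary", "contradictions", "pages")


def check_manifest(manifest):
    """Return a list of human-readable problems; an empty list means the shape is fine."""
    problems = []
    for key in REQUIRED_TOP:
        if key not in manifest:
            problems.append("missing " + key)
    if not isinstance(manifest.get("contradictions", ""), str):
        problems.append("contradictions must be a string")
    for i, page in enumerate(manifest.get("pages", [])):
        if page.get("status") not in ("created", "modified"):
            problems.append(f"pages[{i}]: status must be created or modified")
        if not str(page.get("path", "")).startswith("wiki/"):
            problems.append(f"pages[{i}]: path must be under wiki/")
        # maturity seeds the index entry, so it is needed on created pages only
        if page.get("status") == "created" and "maturity" not in page:
            problems.append(f"pages[{i}]: created page needs maturity")
    return problems


good = {
    "raw_source": "raw/articles/foo.md",
    "reasoning": "Adds a summary of foo.",
    "pr_summary": "Semantic ingest of foo.",
    "contradictions": "None",
    "pages": [
        {"path": "wiki/sources/foo.md", "summary": "Foo", "maturity": "draft", "status": "created"},
        {"path": "wiki/entities/acme.md", "summary": "Acme, vendor.", "status": "modified"},
    ],
}
print(check_manifest(good))  # []
```

Collecting every problem instead of stopping at the first one makes a failed run easier to diagnose from a single log line.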
 > Interactive use of `pi` (TUI) is unaffected and still available for manual
 > exploration. The **automated** ingest path no longer relies on `pi` or on
 > native tool-calling: it is the single schema-constrained call above.
--- a/skills/ingest/scripts/ingest-semantic.py
+++ b/skills/ingest/scripts/ingest-semantic.py
@@ -0,0 +1,277 @@
 #!/usr/bin/env python3
 # =============================================================================
 # skills/ingest/scripts/ingest-semantic.py
 # Phase 1 (semantic) of the Knowledge Genome ingest — the LIGHT version.
 #
 # The model does ONLY semantic extraction and returns ONE schema-constrained JSON
 # object (no tools, no file writing, no git, no frontmatter, no slugs). This script
 # then CONFORMS that output deterministically into wiki pages with enforced
 # frontmatter + kebab-case paths, and writes a .ingest-manifest.json in EXACTLY the
 # schema run-ingest.sh expects. run-ingest.sh (phase 2) then does index / log /
 # scoped-lint / PR, unchanged.
 #
 #   cd <genome checkout>
 #   ingest-semantic.py <genome> raw/articles/<file>.md      # phase 1 (this)
 #   run-ingest.sh      <genome>                             # phase 2 (deterministic)
 #
 # Why this shape: local tool-calling via pi/ollama proved fragile, and a small
 # model does not reliably honour folders / naming / frontmatter / manifest schema
 # when it writes files itself. Here the model cannot break the contract because it
 # never touches the filesystem — the script owns all structure. Stdlib only.
 #
 # Emits a single JSON status line on stdout (for n8n / logs).
 # =============================================================================
 import json, os, re, sys, datetime, urllib.request, urllib.error
 # --- config (override via env; these live in ~/.config/knowledge-genome.env) ---
 OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434/api/chat")
 MODEL      = os.environ.get("INGEST_MODEL", "qwen2.5:14b")
 NUM_CTX    = int(os.environ.get("INGEST_NUM_CTX", "16384"))
 TIMEOUT    = int(os.environ.get("INGEST_TIMEOUT", "600"))
 TODAY      = datetime.date.today().isoformat()
 def die(stage, reason):
    print(json.dumps({"status": "error", "stage": stage, "reason": reason}))
    sys.exit(1)
 # --- args + pre-flight (mirror the old skill's guards, enforced in code) ---
 if len(sys.argv) < 3:
    die("args", "usage: ingest-semantic.py <genome> <raw/rel/path.md>")
 genome  = sys.argv[1]
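 # normpath drops a leading "./" and collapses ".." segments; str.lstrip("./") strips
 # every leading "." and "/" character, so "../raw/x.md" would have become "raw/x.md"
 raw_rel = os.path.normpath(sys.argv[2]).replace(os.sep, "/")
 # compare whole directory names so e.g. "raw/nonprivate/x.md" is not refused by mistake
 if "private" in raw_rel.split("/")[:-1]:
    die("preflight", "refusing private source: " + raw_rel)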
 if os.environ.get("PRIVATE_CONTEXT", "disabled") != "disabled":
    die("preflight", "PRIVATE_CONTEXT must be disabled")
 if not raw_rel.startswith("raw/"):
    die("preflight", "source must live under raw/: " + raw_rel)
 if not os.path.isfile(raw_rel):
    die("preflight", "source not found in cwd: " + raw_rel)
 with open(raw_rel, "r", encoding="utf-8") as fh:
    source_text = fh.read()
 if not source_text.strip():
    die("preflight", "source is empty: " + raw_rel)
 # --- the semantic contract (authoritative copy; SKILL.md documents it) ---
 SYSTEM_PROMPT = """You perform the SEMANTIC PASS of a single source into a knowledge wiki.
 Read the source and return ONLY structured data describing what it contains.
 You do not write files, you do not produce frontmatter, and you do not invent
 paths, slugs, branches, commits or PRs — a deterministic script does all of that.
 Rules:
 - source_summary: a faithful, self-contained summary of the source, in the
  source's own language. Plain prose, no markdown headings.
 - key_points: the handful of concrete facts/claims worth indexing.
 - entities: every person, tool, organisation or product the source names.
  kind is one of person|tool|org|product. description is one or two factual
  sentences. No markdown headings inside the description.
 - concepts: every pattern, theory, decision or named idea the source explains.
  description is one or two factual sentences.
 - contradictions: ONLY when the source makes a claim that directly contradicts a
  widely-known fact or contradicts itself. Otherwise return an empty list.
 - Names must be the natural name of the thing; the script will normalise them.
 Do not pad. Be faithful to the source."""
 # --- JSON schema -> constrained decoding (Ollama structured outputs) ---
 SCHEMA = {
    "type": "object",
    "properties": {
        "source_title":   {"type": "string"},
        "source_summary": {"type": "string"},
        "key_points":     {"type": "array", "items": {"type": "string"}},
        "entities": {"type": "array", "items": {
            "type": "object",
            "properties": {
                "name":        {"type": "string"},
                "kind":        {"type": "string",
                                "enum": ["person", "tool", "org", "product"]},
                "description": {"type": "string"},
            },
            "required": ["name", "description"],
        }},
        "concepts": {"type": "array", "items": {
            "type": "object",
            "properties": {
                "name":        {"type": "string"},
                "description": {"type": "string"},
            },
            "required": ["name", "description"],
        }},
        "contradictions": {"type": "array", "items": {
            "type": "object",
            "properties": {
                "concept":     {"type": "string"},
                "description": {"type": "string"},
            },
            "required": ["concept", "description"],
        }},
        "reasoning":  {"type": "string"},
        "pr_summary": {"type": "string"},
    },
    "required": ["source_title", "source_summary", "entities", "concepts"],
 }
 def call_model():
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content":
                "Source path: " + raw_rel + "\n\n--- SOURCE START ---\n"
                + source_text + "\n--- SOURCE END ---\n\nReturn the JSON now."},
        ],
        "format": SCHEMA,          # schema-constrained generation
        "stream": False,
        # low temperature for near-deterministic extraction; repeat_penalty 1.0
        # disables the repetition penalty, which suits structured output
        "options": {"temperature": 0.2, "repeat_penalty": 1.0, "num_ctx": NUM_CTX},
    }
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=TIMEOUT) as r:
            resp = json.loads(r.read().decode("utf-8"))
    except (OSError, ValueError) as e:
        # URLError and read timeouts are OSError; a non-JSON response body is ValueError
        die("model", "ollama request failed: " + str(e))
    content = ((resp.get("message") or {}).get("content") or "").strip()
    # schema-constrained, but stay defensive if a model wraps it in a fence
    if content.startswith("```"):
        content = content.strip("`")
        brace = content.find("{")
        if brace >= 0:
            content = content[brace:]
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        die("model", "model did not return valid JSON: " + str(e))
 # --- conform helpers (the script OWNS all structure) ---
 def slugify(s):
    s = re.sub(r"[^a-z0-9]+", "-", (s or "").strip().lower())
    return re.sub(r"-+", "-", s).strip("-") or "untitled"
 def twords(s, n=12):
    s = " ".join((s or "").split())
    w = s.split(" ")
    return s if len(w) <= n else " ".join(w[:n]) + "…"
 def frontmatter(ptype, tags):
    taglist = "[" + ", ".join(sorted(set(t for t in tags if t))) + "]"
    return ("---\n"
            f"type: {ptype}\n"
            f"domain: {genome}\n"
            "maturity: draft\n"
            f"last_updated: {TODAY}\n"
            "private: false\n"
            f"tags: {taglist}\n"
            "---\n")
 def write_new(path, ptype, title, body, tags):
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(frontmatter(ptype, tags))
        f.write(f"\n# {title}\n\n{body}\n")
 def append_section(path, source_slug, body):
    # never overwrite an existing page: accumulate, attributed to the new source
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"\n\n## From [[sources/{source_slug}]]\n\n{body}\n")
    try:  # best-effort bump of last_updated in the existing frontmatter
        with open(path, "r", encoding="utf-8") as f:
            txt = f.read()
        txt = re.sub(r"(?m)^last_updated:.*$", "last_updated: " + TODAY, txt, count=1)
        with open(path, "w", encoding="utf-8") as f:
            f.write(txt)
    except Exception:
        pass
 # --- run the semantic pass ---
 sem = call_model()
 source_slug = slugify(os.path.splitext(os.path.basename(raw_rel))[0])
 pages = []
 # 1. source page — canonical summary of THIS source (re)written
 src_path   = f"wiki/sources/{source_slug}.md"
 src_status = "modified" if os.path.exists(src_path) else "created"
 kp_lines   = "\n".join("- " + p for p in (sem.get("key_points") or []) if p.strip())
 src_body   = (sem.get("source_summary") or "").strip()
 if kp_lines:
    src_body += "\n\n## Key points\n\n" + kp_lines
 src_body += f"\n\n## Source\n\n- [[{raw_rel}]]\n"
 src_tags = ([slugify(e.get("name", "")) for e in sem.get("entities", [])]
            + [slugify(c.get("name", "")) for c in sem.get("concepts", [])])[:8]
 os.makedirs("wiki/sources", exist_ok=True)
 with open(src_path, "w", encoding="utf-8") as f:
    f.write(frontmatter("source", src_tags))
    f.write(f"\n# {sem.get('source_title') or source_slug}\n\n{src_body}\n")
 pages.append({"path": src_path,
              "summary": twords(sem.get("source_title") or source_slug),
              "maturity": "draft", "status": src_status})
 def handle(kind_dir, ptype, items):
    for it in items or []:
        name = (it.get("name") or "").strip()
        if not name:
            continue
        slug = slugify(name)
        path = f"wiki/{kind_dir}/{slug}.md"
        desc = (it.get("description") or "").strip()
        if os.path.exists(path):
            append_section(path, source_slug, desc)
            pages.append({"path": path, "summary": twords(desc), "status": "modified"})
        else:
            body = desc + f"\n\n## Sources\n\n- [[sources/{source_slug}]]\n"
            write_new(path, ptype, name, body, [genome, ptype])
            pages.append({"path": path, "summary": twords(desc),
                          "maturity": "draft", "status": "created"})
 # 2. entities, 3. concepts
 handle("entities", "entity", sem.get("entities", []))
 handle("concepts", "concept", sem.get("concepts", []))
 # 4. contradictions -> conflict pages (run-ingest routes wiki/queries/conflict-*)
 conflicts = sem.get("contradictions") or []
 conf_slugs = []
 for c in conflicts:
    cslug = slugify(c.get("concept", "unknown"))
    conf_slugs.append(cslug)
    path = f"wiki/queries/conflict-{cslug}-{TODAY}.md"
    write_new(path, "query", f"Conflict: {c.get('concept', '')}",
              (c.get("description") or "").strip()
              + f"\n\n## Source\n\n- [[sources/{source_slug}]]\n",
              [genome, "conflict"])
    pages.append({"path": path, "summary": "", "maturity": "draft",
                  "status": "created"})
 contradictions_str = ("None" if not conflicts
                      else f"{len(conflicts)} conflict file(s) created — "
                           + ", ".join(conf_slugs))
 # --- write the manifest in EXACTLY run-ingest.sh's schema ---
 manifest = {
    "raw_source":     raw_rel,
    "reasoning":      sem.get("reasoning") or ("Ingest of " + raw_rel),
    "pr_summary":     sem.get("pr_summary") or ("Semantic ingest of " + raw_rel),
    "contradictions": contradictions_str,
    "pages":          pages,
 }
 with open(".ingest-manifest.json", "w", encoding="utf-8") as f:
    json.dump(manifest, f, indent=2, ensure_ascii=False)
 print(json.dumps({"status": "ok", "stage": "semantic",
                  "pages": len(pages), "model": MODEL,
                  "manifest": ".ingest-manifest.json"}))