Merge branch 'release/1.3.0' into main

2026-06-18 15:28:19 +02:00 · 2026-06-18 15:28:19 +02:00 · 4b99b0acd2
commit 4b99b0acd2
parent 15fc829e46 b7b5da0c3b
4 changed files with 361 additions and 74 deletions
--- a/2
+++ b/2
@@ -1,5 +1,5 @@
 # =============================================================================
-# Knowledge Genome - Makefile v. 1.2.5
+# Knowledge Genome - Makefile v. 1.3.0
 # Orchestrates the setup and management of the knowledge base.
 # =============================================================================

--- a/skills/ingest/SKILL.md
+++ b/skills/ingest/SKILL.md
@@ -1,93 +1,92 @@
 ---
 name: ingest
-description: Semantic pass of a single raw source into the current genome's wiki — read the source, write sources/entities/concepts, handle contradictions, then emit a manifest and STOP. Use when a new file lands in raw/. Does NOT do git, log, index, lint, or PRs (a post-processor handles those), and does NOT handle private sources or project repos.
+description: Semantic pass of a single raw source into the current genome's wiki. The model ONLY extracts structured semantic content (summary, entities, concepts, contradictions) and returns one JSON object; it does not write files or produce frontmatter or slugs, and it does not touch git, the index, the log, or PRs. A deterministic conform script (ingest-semantic.py) turns that JSON into properly structured wiki pages plus a manifest; run-ingest.sh then does index/log/lint/PR.
 license: see repository
-compatibility: Runs inside one genome checkout (cwd = genome root). Tools needed — read, edit only. NO bash, NO git. The deterministic steps (index, log, scoped lint, PR) run AFTER you exit, via run-ingest.sh. PRIVATE_CONTEXT must be disabled.
-allowed-tools: read edit
+compatibility: Driven by scripts/ingest-semantic.py (one schema-constrained call to a local model via Ollama /api/chat). NO agent tools are used — no read, no edit, no bash. The model never touches the filesystem. PRIVATE_CONTEXT must be disabled.
 metadata:
  framework: knowledge-genome
  phase: "1-ingest-semantic"
+  mode: structured-json # lightweight agent + deterministic conform
 ---

-# Ingest — semantic pass
+# Ingest — semantic pass (structured-JSON)

-You run inside ONE genome checkout. `AGENTS.md` (already in your context) is the
-authoritative contract. Your job is the **semantic pass only**: read the source, write
-the wiki pages, handle contradictions. You do **not** touch git, the log, the index, the
-linter, or PRs — a post-processor (`run-ingest.sh`) does all of that _after you stop_,
-from the manifest you leave behind. This keeps your context clean and your turns few,
-which matters on a small local model.
+This is the **light** semantic pass. The model's only job is to read one source
+and return a single JSON object describing what the source contains. It does
+**not** write files, choose paths, produce frontmatter, pick slugs, or touch
+git / index / log / PRs. All structure is owned by `scripts/ingest-semantic.py`,
+which conforms the model's JSON into wiki pages with enforced kebab-case paths
+and frontmatter, and writes `.ingest-manifest.json` in the exact schema
+`run-ingest.sh` consumes. This keeps the agent minimal and makes the output
+impossible to mis-shape, regardless of how small or quirky the local model is.

-**Argument:** the relative path of the single raw source to ingest
-(e.g. `raw/articles/foo.md`). Process only this one.
+Pipeline:

-## Pre-flight — stop the session if any check fails
+    cd <genome checkout>
+    scripts/ingest-semantic.py <genome> raw/articles/<file>.md   # phase 1 (this)
+    scripts/run-ingest.sh      <genome>                          # phase 2 (deterministic)

-1. Refuse if the argument path is under any `private/` directory.
+## Pre-flight (enforced by ingest-semantic.py, not by the model)
+
+1. Refuse if the source path is under any `private/` directory.
 2. Refuse if `PRIVATE_CONTEXT` is not `disabled`.
-3. Confirm the file exists under `raw/`.
+3. Confirm the file exists under `raw/` and is non-empty.

-## Semantic work (your only job)
+## What the model returns (the only contract)

-1. Read the source once.
-2. Write `wiki/sources/<kebab-slug>.md` — faithful summary + key points, with the required
-   frontmatter (`type: source`, `domain: <genome>`, `maturity: draft`,
-   `last_updated: <today>`, `private: false`, sensible `tags`).
-3. For each entity (person, tool, org) → create or update `wiki/entities/<kebab-name>.md`.
-4. For each concept (pattern, theory, decision) → create or update
-   `wiki/concepts/<kebab-name>.md`.
-5. On a real contradiction with an existing claim, follow `AGENTS.md` §Conflict: create
-   `wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md`. Never overwrite the existing page.
-
-**Naming — you are the sole author of these names; nothing renames your files.** Use
-minimal kebab-case: lowercase letters, digits and hyphens only — no spaces, no underscores,
-no capitals. Pick stable names so the same entity is never created twice (always `acme`,
-never also `acme-corp`). The path you write a file to MUST be byte-for-byte the path you
-list in the manifest.
-
-**Deciding create-vs-update and spotting contradictions — mind the context budget.** Use
-`wiki/index.md` to locate existing pages, then read **only** the handful that _this source
-actually names_ — the entities and concepts in the source's title and opening paragraphs —
-not everything the index lists. When in doubt, read fewer: a missed cross-link is far
-cheaper than a saturated context. Never scan whole directories.
-
-## Finish: write the manifest, then STOP
-
-As your **final action**, write `.ingest-manifest.json` at the genome root
-(NOT under `wiki/`) describing exactly what you did. Then stop — do not commit, lint,
-append to the log/index, or open anything.
+A single JSON object, decoding-constrained to this shape via Ollama's `format`:

 ```json
 {
-  "raw_source": "raw/articles/foo.md",
-  "reasoning": "One sentence for the log: what changed and why.",
-  "pr_summary": "One or two sentences describing this ingest for the PR.",
-  "contradictions": "None   (or: 1 conflict file created — <concept>)",
-  "pages": [
-    {
-      "path": "wiki/sources/foo.md",
-      "summary": "One-line index summary.",
-      "maturity": "draft",
-      "status": "created"
-    },
-    {
-      "path": "wiki/entities/acme.md",
-      "summary": "Acme — vendor.",
-      "status": "modified"
-    }
-  ]
+  "source_title": "Human title of the source",
+  "source_summary": "Faithful, self-contained prose summary of the source.",
+  "key_points": ["Concrete fact or claim worth indexing", "..."],
+  "entities": [
+    { "name": "Acme", "kind": "org", "description": "Vendor referenced by the source." }
+  ],
+  "concepts": [
+    { "name": "JWT RS256", "description": "Asymmetric token signing scheme the source uses." }
+  ],
+  "contradictions": [
+    { "concept": "auth", "description": "Source claims X, contradicting the widely known fact Y." }
+  ],
+  "reasoning": "One sentence for the log: what this source adds.",
+  "pr_summary": "One or two sentences describing this ingest for the PR."
 }
 ```

-Manifest rules:
+Field rules (guidance for the model; the script enforces _structure_):

- List every page you created or modified, with `status` `created` or `modified`.
- `summary` is the one-line index description (≈12 words max). For conflict pages the
-  summary is ignored — the index lists conflicts by slug only.
- `maturity` is required only on `created` pages (it seeds the new index entry). It is
-  ignored for `modified` pages, so omit it there.
- Do NOT add a `model` field — the orchestrator records which model produced this run; you
-  cannot know your own model name reliably, so do not guess one.
- Do not invent a `run_id`, branch, commit, or PR — those belong to the post-processor.
+- `source_summary` is faithful and in the source's own language. No markdown
+  headings inside any description field. No padding.
+- `entities` = every person, tool, org or product the source names. `kind` ∈
+  `person|tool|org|product`. `description` = one or two factual sentences.
+- `concepts` = every pattern, theory, decision or named idea the source explains.
+- `contradictions` = only a claim that directly contradicts a widely-known fact
+  or contradicts the source itself; otherwise an empty list.
+- Names are the natural name of the thing. The script normalises them to
+  kebab-case and guarantees a single stable page per entity/concept.

-One source per session. After writing the manifest, stop.
+## What the conform script guarantees (so the model cannot break it)
+
+- **Paths:** `wiki/sources/<slug>.md`, `wiki/entities/<slug>.md`,
+  `wiki/concepts/<slug>.md`, `wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md`.
+- **Slugs:** minimal kebab-case (lowercase, digits, hyphens; no spaces /
+  underscores / capitals).
+- **Frontmatter:** `type`, `domain: <genome>`, `maturity: draft`,
+  `last_updated: <today>`, `private: false`, `tags`.
+- **Create-vs-update:** existing entity/concept pages are **appended to** (a
+  section attributed to the new source), never overwritten. The source page is
+  the canonical summary of that exact source and is (re)written.
+- **Manifest:** `.ingest-manifest.json` with `raw_source`, `reasoning`,
+  `pr_summary`, `contradictions` (string), and `pages[]` (`path`, `summary`,
+  `status`, plus `maturity` on created pages) — exactly what `run-ingest.sh`
+  validates.
+
+The model name is recorded by the orchestrator (`INGEST_MODEL`); the model does
+not self-report it. No `run_id`, branch, commit or PR is invented here — those
+belong to phase 2.
+
+> Interactive use of `pi` (TUI) is unaffected and still available for manual
+> exploration. The **automated** ingest path no longer relies on `pi` or on
+> native tool-calling: it is the single schema-constrained call above.
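
The field rules in SKILL.md are guidance for the model, and Ollama's `format` constrains only the JSON shape, so rules such as "no markdown headings inside any description field" are not enforced by the schema. A minimal sketch of a check that could run on the decoded object before it is conformed, assuming a hypothetical helper named `check_semantic` that is not part of this commit:

```python
import re

ALLOWED_KINDS = {"person", "tool", "org", "product"}
# An ATX markdown heading: up to three spaces, one to six '#', then whitespace.
HEADING = re.compile(r"(?m)^ {0,3}#{1,6}\s")


def check_semantic(obj):
    """Hypothetical helper: return a list of problems; an empty list means usable."""
    problems = []
    # source_title and source_summary are required, non-empty strings.
    for key in ("source_title", "source_summary"):
        value = obj.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key}: missing or empty")
    summary = obj.get("source_summary")
    if isinstance(summary, str) and HEADING.search(summary):
        problems.append("source_summary: contains a markdown heading")
    # Each group must be a list of objects with heading-free descriptions.
    for group in ("entities", "concepts", "contradictions"):
        items = obj.get(group, [])
        if not isinstance(items, list):
            problems.append(f"{group}: must be a list")
            continue
        for i, item in enumerate(items):
            if not isinstance(item, dict):
                problems.append(f"{group}[{i}]: must be an object")
                continue
            desc = item.get("description")
            if isinstance(desc, str) and HEADING.search(desc):
                problems.append(f"{group}[{i}].description: contains a markdown heading")
            kind = item.get("kind")
            if group == "entities" and kind is not None and kind not in ALLOWED_KINDS:
                problems.append(f"{group}[{i}].kind: {kind!r} is not allowed")
    return problems
```

Returning a list rather than raising lets a caller decide whether a problem is fatal or only worth logging; which of the two fits this pipeline is left open here.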
--- a/skills/ingest/scripts/ingest-semantic.py
+++ b/skills/ingest/scripts/ingest-semantic.py
@@ -0,0 +1,277 @@
+#!/usr/bin/env python3
+# =============================================================================
+# skills/ingest/scripts/ingest-semantic.py
+# Phase 1 (semantic) of the Knowledge Genome ingest — the LIGHT version.
+#
+# The model does ONLY semantic extraction and returns ONE schema-constrained JSON
+# object (no tools, no file writing, no git, no frontmatter, no slugs). This script
+# then CONFORMS that output deterministically into wiki pages with enforced
+# frontmatter + kebab-case paths, and writes a .ingest-manifest.json in EXACTLY the
+# schema run-ingest.sh expects. run-ingest.sh (phase 2) then does index / log /
+# scoped-lint / PR, unchanged.
+#
+#   cd <genome checkout>
+#   ingest-semantic.py <genome> raw/articles/<file>.md      # phase 1 (this)
+#   run-ingest.sh      <genome>                             # phase 2 (deterministic)
+#
+# Why this shape: local tool-calling via pi/ollama proved fragile, and a small
+# model does not reliably honour folders / naming / frontmatter / manifest schema
+# when it writes files itself. Here the model cannot break the contract because it
+# never touches the filesystem — the script owns all structure. Stdlib only.
+#
+# Emits a single JSON status line on stdout (for n8n / logs).
+# =============================================================================
+import json, os, re, sys, datetime, urllib.request, urllib.error
+
+# --- config (override via env; these live in ~/.config/knowledge-genome.env) ---
+OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434/api/chat")
+MODEL      = os.environ.get("INGEST_MODEL", "qwen2.5:14b")
+NUM_CTX    = int(os.environ.get("INGEST_NUM_CTX", "16384"))
+TIMEOUT    = int(os.environ.get("INGEST_TIMEOUT", "600"))
+TODAY      = datetime.date.today().isoformat()
+
+
+def die(stage, reason):
+    print(json.dumps({"status": "error", "stage": stage, "reason": reason}))
+    sys.exit(1)
+
+
+# --- args + pre-flight (mirror the old skill's guards, enforced in code) ---
+if len(sys.argv) < 3:
+    die("args", "usage: ingest-semantic.py <genome> <raw/rel/path.md>")
+genome  = sys.argv[1]
+raw_rel = os.path.normpath(sys.argv[2]).replace(os.sep, "/")  # drops "./", resolves ".." before the guards
+
+if "private" in raw_rel.split("/")[:-1]:  # any directory component named exactly "private"
+    die("preflight", "refusing private source: " + raw_rel)
+if os.environ.get("PRIVATE_CONTEXT", "disabled") != "disabled":
+    die("preflight", "PRIVATE_CONTEXT must be disabled")
+if not raw_rel.startswith("raw/"):
+    die("preflight", "source must live under raw/: " + raw_rel)
+if not os.path.isfile(raw_rel):
+    die("preflight", "source not found in cwd: " + raw_rel)
+
+with open(raw_rel, "r", encoding="utf-8") as fh:
+    source_text = fh.read()
+if not source_text.strip():
+    die("preflight", "source is empty: " + raw_rel)
+
+
+# --- the semantic contract (authoritative copy; SKILL.md documents it) ---
+SYSTEM_PROMPT = """You perform the SEMANTIC PASS of a single source into a knowledge wiki.
+Read the source and return ONLY structured data describing what it contains.
+You do not write files, you do not produce frontmatter, and you do not invent
+paths, slugs, branches, commits or PRs — a deterministic script does all of that.
+
+Rules:
+- source_summary: a faithful, self-contained summary of the source, in the
+  source's own language. Plain prose, no markdown headings.
+- key_points: the handful of concrete facts/claims worth indexing.
+- entities: every person, tool, organisation or product the source names.
+  kind is one of person|tool|org|product. description is one or two factual
+  sentences. No markdown headings inside the description.
+- concepts: every pattern, theory, decision or named idea the source explains.
+  description is one or two factual sentences.
+- contradictions: ONLY when the source makes a claim that directly contradicts a
+  widely-known fact or contradicts itself. Otherwise return an empty list.
+- Names must be the natural name of the thing; the script will normalise them.
+Do not pad. Be faithful to the source."""
+
+# --- JSON schema -> constrained decoding (Ollama structured outputs) ---
+SCHEMA = {
+    "type": "object",
+    "properties": {
+        "source_title":   {"type": "string"},
+        "source_summary": {"type": "string"},
+        "key_points":     {"type": "array", "items": {"type": "string"}},
+        "entities": {"type": "array", "items": {
+            "type": "object",
+            "properties": {
+                "name":        {"type": "string"},
+                "kind":        {"type": "string",
+                                "enum": ["person", "tool", "org", "product"]},
+                "description": {"type": "string"},
+            },
+            "required": ["name", "description"],
+        }},
+        "concepts": {"type": "array", "items": {
+            "type": "object",
+            "properties": {
+                "name":        {"type": "string"},
+                "description": {"type": "string"},
+            },
+            "required": ["name", "description"],
+        }},
+        "contradictions": {"type": "array", "items": {
+            "type": "object",
+            "properties": {
+                "concept":     {"type": "string"},
+                "description": {"type": "string"},
+            },
+            "required": ["concept", "description"],
+        }},
+        "reasoning":  {"type": "string"},
+        "pr_summary": {"type": "string"},
+    },
+    "required": ["source_title", "source_summary", "entities", "concepts"],
+}
+
+
+def call_model():
+    payload = {
+        "model": MODEL,
+        "messages": [
+            {"role": "system", "content": SYSTEM_PROMPT},
+            {"role": "user", "content":
+                "Source path: " + raw_rel + "\n\n--- SOURCE START ---\n"
+                + source_text + "\n--- SOURCE END ---\n\nReturn the JSON now."},
+        ],
+        "format": SCHEMA,          # schema-constrained generation
+        "stream": False,
+        # low temperature for near-deterministic extraction; repeat_penalty 1.0 turns the penalty off
+        "options": {"temperature": 0.2, "repeat_penalty": 1.0, "num_ctx": NUM_CTX},
+    }
+    data = json.dumps(payload).encode("utf-8")
+    req = urllib.request.Request(
+        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"})
+    try:
+        with urllib.request.urlopen(req, timeout=TIMEOUT) as r:
+            resp = json.loads(r.read().decode("utf-8"))
+    except (OSError, ValueError) as e:  # URLError, read timeouts, non-JSON response bodies
+        die("model", "ollama request failed: " + str(e))
+    content = ((resp.get("message") or {}).get("content") or "").strip()
+    # schema-constrained, but stay defensive if a model wraps it in a fence
+    if content.startswith("```"):
+        content = content.strip("`")
+        brace = content.find("{")
+        if brace >= 0:
+            content = content[brace:]
+    try:
+        return json.loads(content)
+    except json.JSONDecodeError as e:
+        die("model", "model did not return valid JSON: " + str(e))
+
+
+# --- conform helpers (the script OWNS all structure) ---
+def slugify(s):
+    s = re.sub(r"[^a-z0-9]+", "-", (s or "").strip().lower())
+    return re.sub(r"-+", "-", s).strip("-") or "untitled"
+
+
+def twords(s, n=12):
+    s = " ".join((s or "").split())
+    w = s.split(" ")
+    return s if len(w) <= n else " ".join(w[:n]) + "…"
+
+
+def frontmatter(ptype, tags):
+    taglist = "[" + ", ".join(sorted(set(t for t in tags if t))) + "]"
+    return ("---\n"
+            f"type: {ptype}\n"
+            f"domain: {genome}\n"
+            "maturity: draft\n"
+            f"last_updated: {TODAY}\n"
+            "private: false\n"
+            f"tags: {taglist}\n"
+            "---\n")
+
+
+def write_new(path, ptype, title, body, tags):
+    os.makedirs(os.path.dirname(path), exist_ok=True)
+    with open(path, "w", encoding="utf-8") as f:
+        f.write(frontmatter(ptype, tags))
+        f.write(f"\n# {title}\n\n{body}\n")
+
+
+def append_section(path, source_slug, body):
+    # never overwrite an existing page: accumulate, attributed to the new source
+    with open(path, "a", encoding="utf-8") as f:
+        f.write(f"\n\n## From [[sources/{source_slug}]]\n\n{body}\n")
+    try:  # best-effort bump of last_updated in the existing frontmatter
+        with open(path, "r", encoding="utf-8") as f:
+            txt = f.read()
+        txt = re.sub(r"(?m)^last_updated:.*$", "last_updated: " + TODAY, txt, count=1)
+        with open(path, "w", encoding="utf-8") as f:
+            f.write(txt)
+    except Exception:
+        pass
+
+
+# --- run the semantic pass ---
+sem = call_model()
+source_slug = slugify(os.path.splitext(os.path.basename(raw_rel))[0])
+pages = []
+
+# 1. source page — canonical summary of THIS source (re)written
+src_path   = f"wiki/sources/{source_slug}.md"
+src_status = "modified" if os.path.exists(src_path) else "created"
+kp_lines   = "\n".join("- " + p for p in (sem.get("key_points") or []) if p.strip())
+src_body   = (sem.get("source_summary") or "").strip()
+if kp_lines:
+    src_body += "\n\n## Key points\n\n" + kp_lines
+src_body += f"\n\n## Source\n\n- [[{raw_rel}]]\n"
+src_tags = ([slugify(e.get("name", "")) for e in sem.get("entities", [])]
+            + [slugify(c.get("name", "")) for c in sem.get("concepts", [])])[:8]
+os.makedirs("wiki/sources", exist_ok=True)
+with open(src_path, "w", encoding="utf-8") as f:
+    f.write(frontmatter("source", src_tags))
+    f.write(f"\n# {sem.get('source_title') or source_slug}\n\n{src_body}\n")
+pages.append({"path": src_path,
+              "summary": twords(sem.get("source_title") or source_slug),
+              "maturity": "draft", "status": src_status})
+
+
+def handle(kind_dir, ptype, items):
+    for it in items or []:
+        name = (it.get("name") or "").strip()
+        if not name:
+            continue
+        slug = slugify(name)
+        path = f"wiki/{kind_dir}/{slug}.md"
+        desc = (it.get("description") or "").strip()
+        if os.path.exists(path):
+            append_section(path, source_slug, desc)
+            pages.append({"path": path, "summary": twords(desc), "status": "modified"})
+        else:
+            body = desc + f"\n\n## Sources\n\n- [[sources/{source_slug}]]\n"
+            write_new(path, ptype, name, body, [genome, ptype])
+            pages.append({"path": path, "summary": twords(desc),
+                          "maturity": "draft", "status": "created"})
+
+
+# 2. entities, 3. concepts
+handle("entities", "entity", sem.get("entities", []))
+handle("concepts", "concept", sem.get("concepts", []))
+
+# 4. contradictions -> conflict pages (run-ingest routes wiki/queries/conflict-*)
+conflicts = sem.get("contradictions") or []
+conf_slugs = []
+for c in conflicts:
+    cslug = slugify(c.get("concept", "unknown"))
+    conf_slugs.append(cslug)
+    path = f"wiki/queries/conflict-{cslug}-{TODAY}.md"
+    write_new(path, "query", f"Conflict: {c.get('concept', '')}",
+              (c.get("description") or "").strip()
+              + f"\n\n## Source\n\n- [[sources/{source_slug}]]\n",
+              [genome, "conflict"])
+    pages.append({"path": path, "summary": "", "maturity": "draft",
+                  "status": "created"})
+
+contradictions_str = ("None" if not conflicts
+                      else f"{len(conflicts)} conflict file(s) created — "
+                           + ", ".join(conf_slugs))
+
+# --- write the manifest in EXACTLY run-ingest.sh's schema ---
+manifest = {
+    "raw_source":     raw_rel,
+    "reasoning":      sem.get("reasoning") or ("Ingest of " + raw_rel),
+    "pr_summary":     sem.get("pr_summary") or ("Semantic ingest of " + raw_rel),
+    "contradictions": contradictions_str,
+    "pages":          pages,
+}
+with open(".ingest-manifest.json", "w", encoding="utf-8") as f:
+    json.dump(manifest, f, indent=2, ensure_ascii=False)
+
+print(json.dumps({"status": "ok", "stage": "semantic",
+                  "pages": len(pages), "model": MODEL,
+                  "manifest": ".ingest-manifest.json"}))
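
To see what the slug normalisation does and does not collapse, the `slugify` helper from the script can be run on a few names. The function body below is copied from the diff; the sample names are illustrative assumptions.

```python
import re


def slugify(s):
    # Copied from ingest-semantic.py in this commit.
    s = re.sub(r"[^a-z0-9]+", "-", (s or "").strip().lower())
    return re.sub(r"-+", "-", s).strip("-") or "untitled"


# Illustrative names only; none of these come from a real source.
for name in ["JWT RS256", "jwt_rs256", "Acme, Inc.", "Acme Corp", "Café", "日本語"]:
    print(f"{name!r} -> {slugify(name)}")
```

Case, spacing and punctuation variants collapse to one slug (`JWT RS256` and `jwt_rs256` both give `jwt-rs256`), but `Acme, Inc.` and `Acme Corp` still produce two different pages (`acme-inc`, `acme-corp`). Non-ASCII characters are dropped, so `Café` becomes `caf` and a name with no ASCII letters or digits falls back to `untitled`. The single-page guarantee therefore appears to hold per slug rather than per real-world entity.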
--- a/skills/ingest/scripts/run-ingest.sh
+++ b/skills/ingest/scripts/run-ingest.sh
@@ -1,13 +1,17 @@
 #!/usr/bin/env bash
 # =============================================================================
 # skills/ingest/scripts/run-ingest.sh
-# Post-pi orchestrator. Runs OUTSIDE pi's loop, on vm101, in the genome checkout.
-# Consumes .ingest-manifest.json (written by the ingest skill) and performs every
-# deterministic step — index, log, scoped lint, PR — so pi's context stays clean.
+# Post-semantic orchestrator. Runs OUTSIDE the model, on vm101, in the genome
+# checkout. Consumes .ingest-manifest.json (written by ingest-semantic.py) and
+# performs every deterministic step — index, log, scoped lint, PR.
 #
 #   run-ingest.sh <genome_name> [manifest_path]
 #
 # Emits a single JSON result line on stdout for n8n to parse.
+#
+# Every page listed in the manifest must exist on disk before we trust the run.
+# Everything else is unchanged: the manifest the semantic phase now produces is
+# already in this script's expected schema.
 # =============================================================================
 set -euo pipefail

@@ -57,6 +61,13 @@ mapfile -t modified_paths < <(jq -r '.pages[] | select(.status=="modified") | .p
 all_paths=( "${created_paths[@]}" "${modified_paths[@]}" )
 [[ ${#all_paths[@]} -gt 0 ]] || fail "manifest" "no pages reported"

+# --- the semantic phase (ingest-semantic.py) writes the files; verify
+# every manifest page actually exists on disk before trusting the run. Catches any
+# drift between what the manifest claims and what was really written. ---
+for _p in "${all_paths[@]}"; do
+  [[ -f "$_p" ]] || fail "pages" "manifest lists a file not present on disk: ${_p}"
+done
+
 conflict_label=""

 # NOTE: No rollback. The steps below modify the working tree in order (index → log → commit).
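
The on-disk check added in this hunk, together with the manifest fields named in SKILL.md, can be sketched in Python for use in a unit test of the semantic phase. This is an illustration only: `check_manifest` is a hypothetical helper, and the full validation that `run-ingest.sh` performs is not shown in this diff.

```python
import os

REQUIRED_KEYS = ("raw_source", "reasoning", "pr_summary", "contradictions", "pages")


def check_manifest(manifest, root="."):
    """Hypothetical helper: return a list of problems found in a manifest dict."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in manifest]
    pages = manifest.get("pages") or []
    if not pages:
        problems.append("no pages reported")
    for page in pages:
        path = page.get("path", "")
        status = page.get("status")
        if status not in ("created", "modified"):
            problems.append(f"{path}: bad status {status!r}")
        # SKILL.md lists maturity as a field on created pages.
        if status == "created" and "maturity" not in page:
            problems.append(f"{path}: created page without maturity")
        # Mirrors the [[ -f "$_p" ]] loop: every listed page must exist.
        if not os.path.isfile(os.path.join(root, path)):
            problems.append(f"{path}: not present on disk")
    return problems
```

Passing `root` explicitly keeps the check independent of the current working directory, which the scripts themselves assume to be the genome checkout.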