diff --git a/skills/ingest/SKILL.md b/skills/ingest/SKILL.md
index 1f143d7..c386dbe 100644
--- a/skills/ingest/SKILL.md
+++ b/skills/ingest/SKILL.md
@@ -1,93 +1,92 @@
 ---
 name: ingest
-description: Semantic pass of a single raw source into the current genome's wiki — read the source, write sources/entities/concepts, handle contradictions, then emit a manifest and STOP. Use when a new file lands in raw/. Does NOT do git, log, index, lint, or PRs (a post-processor handles those), and does NOT handle private sources or project repos.
+description: Semantic pass of a single raw source into the current genome's wiki. The model ONLY extracts structured semantic content (summary, entities, concepts, contradictions) and returns one JSON object — it does not write files, produce frontmatter, slugs, git, index, log or PRs. A deterministic conform script (ingest-semantic.py) turns that JSON into properly-structured wiki pages + a manifest; run-ingest.sh then does index/log/lint/PR.
 license: see repository
-compatibility: Runs inside one genome checkout (cwd = genome root). Tools needed — read, edit only. NO bash, NO git. The deterministic steps (index, log, scoped lint, PR) run AFTER you exit, via run-ingest.sh. PRIVATE_CONTEXT must be disabled.
-allowed-tools: read edit
+compatibility: Driven by scripts/ingest-semantic.py (one schema-constrained call to a local model via Ollama /api/chat). NO agent tools are used — no read, no edit, no bash. The model never touches the filesystem. PRIVATE_CONTEXT must be disabled.
 metadata:
   framework: knowledge-genome
   phase: "1-ingest-semantic"
+  mode: structured-json # lightweight agent + deterministic conform
 ---

-# Ingest — semantic pass
+# Ingest — semantic pass (structured-JSON)

-You run inside ONE genome checkout. `AGENTS.md` (already in your context) is the
-authoritative contract. Your job is the **semantic pass only**: read the source, write
-the wiki pages, handle contradictions. You do **not** touch git, the log, the index, the
-linter, or PRs — a post-processor (`run-ingest.sh`) does all of that _after you stop_,
-from the manifest you leave behind. This keeps your context clean and your turns few,
-which matters on a small local model.
+This is the **light** semantic pass. The model's only job is to read one source
+and return a single JSON object describing what the source contains. It does
+**not** write files, choose paths, produce frontmatter, pick slugs, or touch
+git / index / log / PRs. All structure is owned by `scripts/ingest-semantic.py`,
+which conforms the model's JSON into wiki pages with enforced kebab-case paths
+and frontmatter, and writes `.ingest-manifest.json` in the exact schema
+`run-ingest.sh` consumes. This keeps the agent minimal and makes the output
+impossible to mis-shape, regardless of how small or quirky the local model is.

-**Argument:** the relative path of the single raw source to ingest
-(e.g. `raw/articles/foo.md`). Process only this one.
+Pipeline:

-## Pre-flight — stop the session if any check fails
+    cd
+    scripts/ingest-semantic.py raw/articles/.md  # phase 1 (this)
+    scripts/run-ingest.sh  # phase 2 (deterministic)

-1. Refuse if the argument path is under any `private/` directory.
+## Pre-flight (enforced by ingest-semantic.py, not by the model)
+
+1. Refuse if the source path is under any `private/` directory.
 2. Refuse if `PRIVATE_CONTEXT` is not `disabled`.
-3. Confirm the file exists under `raw/`.
+3. Confirm the file exists under `raw/` and is non-empty.

-## Semantic work (your only job)
+## What the model returns (the only contract)

-1. Read the source once.
-2. Write `wiki/sources/.md` — faithful summary + key points, with the required
-   frontmatter (`type: source`, `domain: `, `maturity: draft`,
-   `last_updated: `, `private: false`, sensible `tags`).
-3. For each entity (person, tool, org) → create or update `wiki/entities/.md`.
-4. For each concept (pattern, theory, decision) → create or update
-   `wiki/concepts/.md`.
-5. On a real contradiction with an existing claim, follow `AGENTS.md` §Conflict: create
-   `wiki/queries/conflict--.md`. Never overwrite the existing page.
-
-**Naming — you are the sole author of these names; nothing renames your files.** Use
-minimal kebab-case: lowercase letters, digits and hyphens only — no spaces, no underscores,
-no capitals. Pick stable names so the same entity is never created twice (always `acme`,
-never also `acme-corp`). The path you write a file to MUST be byte-for-byte the path you
-list in the manifest.
-
-**Deciding create-vs-update and spotting contradictions — mind the context budget.** Use
-`wiki/index.md` to locate existing pages, then read **only** the handful that _this source
-actually names_ — the entities and concepts in the source's title and opening paragraphs —
-not everything the index lists. When in doubt, read fewer: a missed cross-link is far
-cheaper than a saturated context. Never scan whole directories.
-
-## Finish: write the manifest, then STOP
-
-As your **final action**, write `.ingest-manifest.json` at the genome root
-(NOT under `wiki/`) describing exactly what you did. Then stop — do not commit, lint,
-append to the log/index, or open anything.
+A single JSON object, decoding-constrained to this shape via Ollama's `format`:

 ```json
 {
-  "raw_source": "raw/articles/foo.md",
-  "reasoning": "One sentence for the log: what changed and why.",
-  "pr_summary": "One or two sentences describing this ingest for the PR.",
-  "contradictions": "None (or: 1 conflict file created — )",
-  "pages": [
-    {
-      "path": "wiki/sources/foo.md",
-      "summary": "One-line index summary.",
-      "maturity": "draft",
-      "status": "created"
-    },
-    {
-      "path": "wiki/entities/acme.md",
-      "summary": "Acme — vendor.",
-      "status": "modified"
-    }
-  ]
+  "source_title": "Human title of the source",
+  "source_summary": "Faithful, self-contained prose summary of the source.",
+  "key_points": ["Concrete fact or claim worth indexing", "..."],
+  "entities": [
+    { "name": "Acme", "kind": "org", "description": "Vendor referenced by the source." }
+  ],
+  "concepts": [
+    { "name": "JWT RS256", "description": "Asymmetric token signing scheme the source uses." }
+  ],
+  "contradictions": [
+    { "concept": "auth", "description": "Source claims X, contradicting the existing claim Y." }
+  ],
+  "reasoning": "One sentence for the log: what this source adds.",
+  "pr_summary": "One or two sentences describing this ingest for the PR."
 }
 ```

-Manifest rules:
+Field rules (guidance for the model; the script enforces _structure_):

-- List every page you created or modified, with `status` `created` or `modified`.
-- `summary` is the one-line index description (≈12 words max). For conflict pages the
-  summary is ignored — the index lists conflicts by slug only.
-- `maturity` is required only on `created` pages (it seeds the new index entry). It is
-  ignored for `modified` pages, so omit it there.
-- Do NOT add a `model` field — the orchestrator records which model produced this run; you
-  cannot know your own model name reliably, so do not guess one.
-- Do not invent a `run_id`, branch, commit, or PR — those belong to the post-processor.
+- `source_summary` is faithful and in the source's own language. No markdown
+  headings inside any description field. No padding.
+- `entities` = every person, tool, org or product the source names. `kind` ∈
+  `person|tool|org|product`. `description` = one or two factual sentences.
+- `concepts` = every pattern, theory, decision or named idea the source explains.
+- `contradictions` = only a claim that directly contradicts a widely-known fact
+  or contradicts the source itself; otherwise an empty list.
+- Names are the natural name of the thing. The script normalises them to
+  kebab-case and guarantees a single stable page per entity/concept.

-One source per session. After writing the manifest, stop.
+## What the conform script guarantees (so the model cannot break it)
+
+- **Paths:** `wiki/sources/.md`, `wiki/entities/.md`,
+  `wiki/concepts/.md`, `wiki/queries/conflict--.md`.
+- **Slugs:** minimal kebab-case (lowercase, digits, hyphens; no spaces /
+  underscores / capitals).
+- **Frontmatter:** `type`, `domain: `, `maturity: draft`,
+  `last_updated: `, `private: false`, `tags`.
+- **Create-vs-update:** existing entity/concept pages are **appended to** (a
+  section attributed to the new source), never overwritten. The source page is
+  the canonical summary of that exact source and is (re)written.
+- **Manifest:** `.ingest-manifest.json` with `raw_source`, `reasoning`,
+  `pr_summary`, `contradictions` (string), and `pages[]` (`path`, `summary`,
+  `status`, plus `maturity` on created pages) — exactly what `run-ingest.sh`
+  validates.
+
+The model name is recorded by the orchestrator (`INGEST_MODEL`); the model does
+not self-report it. No `run_id`, branch, commit or PR is invented here — those
+belong to phase 2.
+
+> Interactive use of `pi` (TUI) is unaffected and still available for manual
+> exploration. The **automated** ingest path no longer relies on `pi` or on
+> native tool-calling: it is the single schema-constrained call above.
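Review note on the contract documented in SKILL.md: constrained decoding fixes the JSON shape, but not empty strings or a missing optional `kind`, so a cheap check before conforming may be worthwhile. The sketch below is a hypothetical, stdlib-only validator; the name `check_payload` and its message strings are assumptions made for illustration and are not part of this PR. It mirrors the required keys and the `kind` enum listed in the contract.

```python
# Hypothetical helper (not part of this PR): sanity-check a decoded model
# response against the documented contract before any page is written.
REQUIRED = ("source_title", "source_summary", "entities", "concepts")
KINDS = ("person", "tool", "org", "product")


def check_payload(obj):
    """Return a list of problems; an empty list means the shape looks usable."""
    if not isinstance(obj, dict):
        return ["top level is not a JSON object"]
    problems = [f"missing required key: {k}" for k in REQUIRED if k not in obj]
    for key in ("entities", "concepts", "contradictions"):
        items = obj.get(key, [])
        if not isinstance(items, list):
            problems.append(f"{key} is not a list")
            continue
        for i, item in enumerate(items):
            if not isinstance(item, dict):
                problems.append(f"{key}[{i}] is not an object")
                continue
            # contradictions are keyed by "concept"; entities/concepts by "name"
            name_key = "concept" if key == "contradictions" else "name"
            for field in (name_key, "description"):
                if not str(item.get(field) or "").strip():
                    problems.append(f"{key}[{i}] has empty {field}")
            kind = item.get("kind")
            if key == "entities" and kind is not None and kind not in KINDS:
                problems.append(f"entities[{i}] has unknown kind: {kind}")
    return problems
```

Returning a list of problems rather than raising lets a caller report every issue in one status line, which fits the single-JSON-line reporting style the script already uses.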
diff --git a/skills/ingest/scripts/ingest-semantic.py b/skills/ingest/scripts/ingest-semantic.py
new file mode 100644
index 0000000..337d9f2
--- /dev/null
+++ b/skills/ingest/scripts/ingest-semantic.py
@@ -0,0 +1,277 @@
+#!/usr/bin/env python3
+# =============================================================================
+# skills/ingest/scripts/ingest-semantic.py
+# Phase 1 (semantic) of the Knowledge Genome ingest — the LIGHT version.
+#
+# The model does ONLY semantic extraction and returns ONE schema-constrained JSON
+# object (no tools, no file writing, no git, no frontmatter, no slugs). This script
+# then CONFORMS that output deterministically into wiki pages with enforced
+# frontmatter + kebab-case paths, and writes a .ingest-manifest.json in EXACTLY the
+# schema run-ingest.sh expects. run-ingest.sh (phase 2) then does index / log /
+# scoped-lint / PR, unchanged.
+#
+#   cd
+#   ingest-semantic.py raw/articles/.md  # phase 1 (this)
+#   run-ingest.sh  # phase 2 (deterministic)
+#
+# Why this shape: local tool-calling via pi/ollama proved fragile, and a small
+# model does not reliably honour folders / naming / frontmatter / manifest schema
+# when it writes files itself. Here the model cannot break the contract because it
+# never touches the filesystem — the script owns all structure. Stdlib only.
+#
+# Emits a single JSON status line on stdout (for n8n / logs).
+# =============================================================================
+import json, os, re, sys, datetime, urllib.request, urllib.error
+
+# --- config (override via env; these live in ~/.config/knowledge-genome.env) ---
+OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434/api/chat")
+MODEL = os.environ.get("INGEST_MODEL", "qwen2.5:14b")
+NUM_CTX = int(os.environ.get("INGEST_NUM_CTX", "16384"))
+TIMEOUT = int(os.environ.get("INGEST_TIMEOUT", "600"))
+TODAY = datetime.date.today().isoformat()
+
+
+def die(stage, reason):
+    print(json.dumps({"status": "error", "stage": stage, "reason": reason}))
+    sys.exit(1)
+
+
+# --- args + pre-flight (mirror the old skill's guards, enforced in code) ---
+if len(sys.argv) < 3:
+    die("args", "usage: ingest-semantic.py ")
+genome = sys.argv[1]
+raw_rel = os.path.normpath(sys.argv[2]).replace(os.sep, "/")  # lstrip("./") strips characters, not a prefix, and would turn ../raw/x into raw/x
+
+if "private/" in raw_rel or raw_rel.startswith("private"):
+    die("preflight", "refusing private source: " + raw_rel)
+if os.environ.get("PRIVATE_CONTEXT", "disabled") != "disabled":
+    die("preflight", "PRIVATE_CONTEXT must be disabled")
+if not raw_rel.startswith("raw/"):
+    die("preflight", "source must live under raw/: " + raw_rel)
+if not os.path.isfile(raw_rel):
+    die("preflight", "source not found in cwd: " + raw_rel)
+
+with open(raw_rel, "r", encoding="utf-8") as fh:
+    source_text = fh.read()
+if not source_text.strip():
+    die("preflight", "source is empty: " + raw_rel)
+
+
+# --- the semantic contract (authoritative copy; SKILL.md documents it) ---
+SYSTEM_PROMPT = """You perform the SEMANTIC PASS of a single source into a knowledge wiki.
+Read the source and return ONLY structured data describing what it contains.
+You do not write files, you do not produce frontmatter, and you do not invent
+paths, slugs, branches, commits or PRs — a deterministic script does all of that.
+
+Rules:
+- source_summary: a faithful, self-contained summary of the source, in the
+  source's own language. Plain prose, no markdown headings.
+- key_points: the handful of concrete facts/claims worth indexing.
+- entities: every person, tool, organisation or product the source names.
+  kind is one of person|tool|org|product. description is one or two factual
+  sentences. No markdown headings inside the description.
+- concepts: every pattern, theory, decision or named idea the source explains.
+  description is one or two factual sentences.
+- contradictions: ONLY when the source makes a claim that directly contradicts a
+  widely-known fact or contradicts itself. Otherwise return an empty list.
+- Names must be the natural name of the thing; the script will normalise them.
+Do not pad. Be faithful to the source."""
+
+# --- JSON schema -> constrained decoding (Ollama structured outputs) ---
+SCHEMA = {
+    "type": "object",
+    "properties": {
+        "source_title": {"type": "string"},
+        "source_summary": {"type": "string"},
+        "key_points": {"type": "array", "items": {"type": "string"}},
+        "entities": {"type": "array", "items": {
+            "type": "object",
+            "properties": {
+                "name": {"type": "string"},
+                "kind": {"type": "string",
+                         "enum": ["person", "tool", "org", "product"]},
+                "description": {"type": "string"},
+            },
+            "required": ["name", "description"],
+        }},
+        "concepts": {"type": "array", "items": {
+            "type": "object",
+            "properties": {
+                "name": {"type": "string"},
+                "description": {"type": "string"},
+            },
+            "required": ["name", "description"],
+        }},
+        "contradictions": {"type": "array", "items": {
+            "type": "object",
+            "properties": {
+                "concept": {"type": "string"},
+                "description": {"type": "string"},
+            },
+            "required": ["concept", "description"],
+        }},
+        "reasoning": {"type": "string"},
+        "pr_summary": {"type": "string"},
+    },
+    "required": ["source_title", "source_summary", "entities", "concepts"],
+}
+
+
+def call_model():
+    payload = {
+        "model": MODEL,
+        "messages": [
+            {"role": "system", "content": SYSTEM_PROMPT},
+            {"role": "user", "content":
+             "Source path: " + raw_rel + "\n\n--- SOURCE START ---\n"
+             + source_text + "\n--- SOURCE END ---\n\nReturn the JSON now."},
+        ],
+        "format": SCHEMA,  # schema-constrained generation
+        "stream": False,
+        # low-temperature extraction (not strictly deterministic); repetition penalty OFF for structured output
+        "options": {"temperature": 0.2, "repeat_penalty": 1.0, "num_ctx": NUM_CTX},
+    }
+    data = json.dumps(payload).encode("utf-8")
+    req = urllib.request.Request(
+        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"})
+    try:
+        with urllib.request.urlopen(req, timeout=TIMEOUT) as r:
+            resp = json.loads(r.read().decode("utf-8"))
+    except (OSError, ValueError) as e:  # URLError, socket timeout during read, non-JSON response body
+        die("model", "ollama request failed: " + str(e))
+    content = ((resp.get("message") or {}).get("content") or "").strip()
+    # schema-constrained, but stay defensive if a model wraps it in a fence
+    if content.startswith("```"):
+        content = content.strip("`")
+        brace = content.find("{")
+        if brace >= 0:
+            content = content[brace:]
+    try:
+        return json.loads(content)
+    except json.JSONDecodeError as e:
+        die("model", "model did not return valid JSON: " + str(e))
+
+
+# --- conform helpers (the script OWNS all structure) ---
+def slugify(s):
+    s = re.sub(r"[^a-z0-9]+", "-", (s or "").strip().lower())
+    return re.sub(r"-+", "-", s).strip("-") or "untitled"
+
+
+def twords(s, n=12):
+    s = " ".join((s or "").split())
+    w = s.split(" ")
+    return s if len(w) <= n else " ".join(w[:n]) + "…"
+
+
+def frontmatter(ptype, tags):
+    taglist = "[" + ", ".join(sorted(set(t for t in tags if t))) + "]"
+    return ("---\n"
+            f"type: {ptype}\n"
+            f"domain: {genome}\n"
+            "maturity: draft\n"
+            f"last_updated: {TODAY}\n"
+            "private: false\n"
+            f"tags: {taglist}\n"
+            "---\n")
+
+
+def write_new(path, ptype, title, body, tags):
+    os.makedirs(os.path.dirname(path), exist_ok=True)
+    with open(path, "w", encoding="utf-8") as f:
+        f.write(frontmatter(ptype, tags))
+        f.write(f"\n# {title}\n\n{body}\n")
+
+
+def append_section(path, source_slug, body):
+    # never overwrite an existing page: accumulate, attributed to the new source
+    with open(path, "a", encoding="utf-8") as f:
+        f.write(f"\n\n## From [[sources/{source_slug}]]\n\n{body}\n")
+    try:  # best-effort bump of last_updated in the existing frontmatter
+        with open(path, "r", encoding="utf-8") as f:
+            txt = f.read()
+        txt = re.sub(r"(?m)^last_updated:.*$", "last_updated: " + TODAY, txt, count=1)
+        with open(path, "w", encoding="utf-8") as f:
+            f.write(txt)
+    except Exception:
+        pass
+
+
+# --- run the semantic pass ---
+sem = call_model()
+source_slug = slugify(os.path.splitext(os.path.basename(raw_rel))[0])
+pages = []
+
+# 1. source page — canonical summary of THIS source (re)written
+src_path = f"wiki/sources/{source_slug}.md"
+src_status = "modified" if os.path.exists(src_path) else "created"
+kp_lines = "\n".join("- " + p for p in (sem.get("key_points") or []) if p.strip())
+src_body = (sem.get("source_summary") or "").strip()
+if kp_lines:
+    src_body += "\n\n## Key points\n\n" + kp_lines
+src_body += f"\n\n## Source\n\n- [[{raw_rel}]]\n"
+src_tags = ([slugify(e.get("name", "")) for e in sem.get("entities", [])]
+            + [slugify(c.get("name", "")) for c in sem.get("concepts", [])])[:8]
+os.makedirs("wiki/sources", exist_ok=True)
+with open(src_path, "w", encoding="utf-8") as f:
+    f.write(frontmatter("source", src_tags))
+    f.write(f"\n# {sem.get('source_title') or source_slug}\n\n{src_body}\n")
+pages.append({"path": src_path,
+              "summary": twords(sem.get("source_title") or source_slug),
+              "maturity": "draft", "status": src_status})
+
+
+def handle(kind_dir, ptype, items):
+    for it in items or []:
+        name = (it.get("name") or "").strip()
+        if not name:
+            continue
+        slug = slugify(name)
+        path = f"wiki/{kind_dir}/{slug}.md"
+        desc = (it.get("description") or "").strip()
+        if os.path.exists(path):
+            append_section(path, source_slug, desc)
+            pages.append({"path": path, "summary": twords(desc), "status": "modified"})
+        else:
+            body = desc + f"\n\n## Sources\n\n- [[sources/{source_slug}]]\n"
+            write_new(path, ptype, name, body, [genome, ptype])
+            pages.append({"path": path, "summary": twords(desc),
+                          "maturity": "draft", "status": "created"})
+
+
+# 2. entities, 3. concepts
+handle("entities", "entity", sem.get("entities", []))
+handle("concepts", "concept", sem.get("concepts", []))
+
+# 4. contradictions -> conflict pages (run-ingest routes wiki/queries/conflict-*)
+conflicts = sem.get("contradictions") or []
+conf_slugs = []
+for c in conflicts:
+    cslug = slugify(c.get("concept", "unknown"))
+    conf_slugs.append(cslug)
+    path = f"wiki/queries/conflict-{cslug}-{TODAY}.md"
+    write_new(path, "query", f"Conflict: {c.get('concept', '')}",
+              (c.get("description") or "").strip()
+              + f"\n\n## Source\n\n- [[sources/{source_slug}]]\n",
+              [genome, "conflict"])
+    pages.append({"path": path, "summary": "", "maturity": "draft",
+                  "status": "created"})
+
+contradictions_str = ("None" if not conflicts
+                      else f"{len(conflicts)} conflict file(s) created — "
+                      + ", ".join(conf_slugs))
+
+# --- write the manifest in EXACTLY run-ingest.sh's schema ---
+manifest = {
+    "raw_source": raw_rel,
+    "reasoning": sem.get("reasoning") or ("Ingest of " + raw_rel),
+    "pr_summary": sem.get("pr_summary") or ("Semantic ingest of " + raw_rel),
+    "contradictions": contradictions_str,
+    "pages": pages,
+}
+with open(".ingest-manifest.json", "w", encoding="utf-8") as f:
+    json.dump(manifest, f, indent=2, ensure_ascii=False)
+
+print(json.dumps({"status": "ok", "stage": "semantic",
+                  "pages": len(pages), "model": MODEL,
+                  "manifest": ".ingest-manifest.json"}))
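Review note on the conform helpers: the "single stable page per entity/concept" guarantee rests entirely on `slugify`, so it helps to see what it does with a few inputs. The snippet below copies `slugify` and `twords` from this diff unchanged; the sample names are illustrative assumptions, not data from any real ingest.

```python
import re


def slugify(s):
    # copied unchanged from ingest-semantic.py in this diff
    s = re.sub(r"[^a-z0-9]+", "-", (s or "").strip().lower())
    return re.sub(r"-+", "-", s).strip("-") or "untitled"


def twords(s, n=12):
    # copied unchanged from ingest-semantic.py in this diff
    s = " ".join((s or "").split())
    w = s.split(" ")
    return s if len(w) <= n else " ".join(w[:n]) + "…"


# Spelling variants of the same name collapse to one page path.
print(slugify("JWT RS256"))                          # jwt-rs256
print(slugify("Acme Corp"), slugify("acme_corp"))    # acme-corp acme-corp

# Different natural names still produce different pages.
print(slugify("Acme"))                               # acme

# Names with no ASCII letters or digits all fall back to the same slug.
print(slugify("日本語"), slugify("Ω"))                # untitled untitled

# Index summaries are cut at n words.
print(twords("one two three", n=2))                  # one two…
```

Two things follow from this behaviour. Normalisation merges spelling variants, but it cannot merge `Acme` with `Acme Corp`, so de-duplication still depends on the model naming things consistently. And any non-Latin name maps to `untitled`, so two such entities would append to the same page; if sources in other scripts are expected, a transliteration or hash-based fallback might be worth considering.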