Compare commits

..

No commits in common. "e396bc93e25111121bf115fc62c96cb67b3f6d3f" and "d207a0fc91afa0bd7a266d69f2486bd1fe4844dd" have entirely different histories.

3 changed files with 73 additions and 360 deletions

View file

@ -1,92 +1,93 @@
--- ---
name: ingest name: ingest
description: Semantic pass of a single raw source into the current genome's wiki. The model ONLY extracts structured semantic content (summary, entities, concepts, contradictions) and returns one JSON object — it does not write files, produce frontmatter, slugs, git, index, log or PRs. A deterministic conform script (ingest-semantic.py) turns that JSON into properly-structured wiki pages + a manifest; run-ingest.sh then does index/log/lint/PR. description: Semantic pass of a single raw source into the current genome's wiki — read the source, write sources/entities/concepts, handle contradictions, then emit a manifest and STOP. Use when a new file lands in raw/. Does NOT do git, log, index, lint, or PRs (a post-processor handles those), and does NOT handle private sources or project repos.
license: see repository license: see repository
compatibility: Driven by scripts/ingest-semantic.py (one schema-constrained call to a local model via Ollama /api/chat). NO agent tools are used — no read, no edit, no bash. The model never touches the filesystem. PRIVATE_CONTEXT must be disabled. compatibility: Runs inside one genome checkout (cwd = genome root). Tools needed — read, edit only. NO bash, NO git. The deterministic steps (index, log, scoped lint, PR) run AFTER you exit, via run-ingest.sh. PRIVATE_CONTEXT must be disabled.
allowed-tools: read edit
metadata: metadata:
framework: knowledge-genome framework: knowledge-genome
phase: "1-ingest-semantic" phase: "1-ingest-semantic"
mode: structured-json # lightweight agent + deterministic conform
--- ---
# Ingest — semantic pass (structured-JSON) # Ingest — semantic pass
This is the **light** semantic pass. The model's only job is to read one source You run inside ONE genome checkout. `AGENTS.md` (already in your context) is the
and return a single JSON object describing what the source contains. It does authoritative contract. Your job is the **semantic pass only**: read the source, write
**not** write files, choose paths, produce frontmatter, pick slugs, or touch the wiki pages, handle contradictions. You do **not** touch git, the log, the index, the
git / index / log / PRs. All structure is owned by `scripts/ingest-semantic.py`, linter, or PRs — a post-processor (`run-ingest.sh`) does all of that _after you stop_,
which conforms the model's JSON into wiki pages with enforced kebab-case paths from the manifest you leave behind. This keeps your context clean and your turns few,
and frontmatter, and writes `.ingest-manifest.json` in the exact schema which matters on a small local model.
`run-ingest.sh` consumes. This keeps the agent minimal and makes the output
impossible to mis-shape, regardless of how small or quirky the local model is.
Pipeline: **Argument:** the relative path of the single raw source to ingest
(e.g. `raw/articles/foo.md`). Process only this one.
cd <genome checkout> ## Pre-flight — stop the session if any check fails
scripts/ingest-semantic.py <genome> raw/articles/<file>.md # phase 1 (this)
scripts/run-ingest.sh <genome> # phase 2 (deterministic)
## Pre-flight (enforced by ingest-semantic.py, not by the model) 1. Refuse if the argument path is under any `private/` directory.
1. Refuse if the source path is under any `private/` directory.
2. Refuse if `PRIVATE_CONTEXT` is not `disabled`. 2. Refuse if `PRIVATE_CONTEXT` is not `disabled`.
3. Confirm the file exists under `raw/` and is non-empty. 3. Confirm the file exists under `raw/`.
## What the model returns (the only contract) ## Semantic work (your only job)
A single JSON object, decoding-constrained to this shape via Ollama's `format`: 1. Read the source once.
2. Write `wiki/sources/<kebab-slug>.md` — faithful summary + key points, with the required
frontmatter (`type: source`, `domain: <genome>`, `maturity: draft`,
`last_updated: <today>`, `private: false`, sensible `tags`).
3. For each entity (person, tool, org) → create or update `wiki/entities/<kebab-name>.md`.
4. For each concept (pattern, theory, decision) → create or update
`wiki/concepts/<kebab-name>.md`.
5. On a real contradiction with an existing claim, follow `AGENTS.md` §Conflict: create
`wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md`. Never overwrite the existing page.
**Naming — you are the sole author of these names; nothing renames your files.** Use
minimal kebab-case: lowercase letters, digits and hyphens only — no spaces, no underscores,
no capitals. Pick stable names so the same entity is never created twice (always `acme`,
never also `acme-corp`). The path you write a file to MUST be byte-for-byte the path you
list in the manifest.
**Deciding create-vs-update and spotting contradictions — mind the context budget.** Use
`wiki/index.md` to locate existing pages, then read **only** the handful that _this source
actually names_ — the entities and concepts in the source's title and opening paragraphs —
not everything the index lists. When in doubt, read fewer: a missed cross-link is far
cheaper than a saturated context. Never scan whole directories.
## Finish: write the manifest, then STOP
As your **final action**, write `.ingest-manifest.json` at the genome root
(NOT under `wiki/`) describing exactly what you did. Then stop — do not commit, lint,
append to the log/index, or open anything.
```json ```json
{ {
"source_title": "Human title of the source", "raw_source": "raw/articles/foo.md",
"source_summary": "Faithful, self-contained prose summary of the source.", "reasoning": "One sentence for the log: what changed and why.",
"key_points": ["Concrete fact or claim worth indexing", "..."], "pr_summary": "One or two sentences describing this ingest for the PR.",
"entities": [ "contradictions": "None (or: 1 conflict file created — <concept>)",
{ "name": "Acme", "kind": "org", "description": "Vendor referenced by the source." } "pages": [
], {
"concepts": [ "path": "wiki/sources/foo.md",
{ "name": "JWT RS256", "description": "Asymmetric token signing scheme the source uses." } "summary": "One-line index summary.",
], "maturity": "draft",
"contradictions": [ "status": "created"
{ "concept": "auth", "description": "Source claims X, contradicting the existing claim Y." } },
], {
"reasoning": "One sentence for the log: what this source adds.", "path": "wiki/entities/acme.md",
"pr_summary": "One or two sentences describing this ingest for the PR." "summary": "Acme — vendor.",
"status": "modified"
}
]
} }
``` ```
Field rules (guidance for the model; the script enforces _structure_): Manifest rules:
- `source_summary` is faithful and in the source's own language. No markdown - List every page you created or modified, with `status` `created` or `modified`.
headings inside any description field. No padding. - `summary` is the one-line index description (≈12 words max). For conflict pages the
- `entities` = every person, tool, org or product the source names. `kind` summary is ignored — the index lists conflicts by slug only.
`person|tool|org|product`. `description` = one or two factual sentences. - `maturity` is required only on `created` pages (it seeds the new index entry). It is
- `concepts` = every pattern, theory, decision or named idea the source explains. ignored for `modified` pages, so omit it there.
- `contradictions` = only a claim that directly contradicts a widely-known fact - Do NOT add a `model` field — the orchestrator records which model produced this run; you
or contradicts the source itself; otherwise an empty list. cannot know your own model name reliably, so do not guess one.
- Names are the natural name of the thing. The script normalises them to - Do not invent a `run_id`, branch, commit, or PR — those belong to the post-processor.
kebab-case and guarantees a single stable page per entity/concept.
## What the conform script guarantees (so the model cannot break it) One source per session. After writing the manifest, stop.
- **Paths:** `wiki/sources/<slug>.md`, `wiki/entities/<slug>.md`,
`wiki/concepts/<slug>.md`, `wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md`.
- **Slugs:** minimal kebab-case (lowercase, digits, hyphens; no spaces /
underscores / capitals).
- **Frontmatter:** `type`, `domain: <genome>`, `maturity: draft`,
`last_updated: <today>`, `private: false`, `tags`.
- **Create-vs-update:** existing entity/concept pages are **appended to** (a
section attributed to the new source), never overwritten. The source page is
the canonical summary of that exact source and is (re)written.
- **Manifest:** `.ingest-manifest.json` with `raw_source`, `reasoning`,
`pr_summary`, `contradictions` (string), and `pages[]` (`path`, `summary`,
`status`, plus `maturity` on created pages) — exactly what `run-ingest.sh`
validates.
The model name is recorded by the orchestrator (`INGEST_MODEL`); the model does
not self-report it. No `run_id`, branch, commit or PR is invented here — those
belong to phase 2.
> Interactive use of `pi` (TUI) is unaffected and still available for manual
> exploration. The **automated** ingest path no longer relies on `pi` or on
> native tool-calling: it is the single schema-constrained call above.

View file

@ -1,277 +0,0 @@
#!/usr/bin/env python3
# =============================================================================
# skills/ingest/scripts/ingest-semantic.py
# Phase 1 (semantic) of the Knowledge Genome ingest — the LIGHT version.
#
# The model does ONLY semantic extraction and returns ONE schema-constrained JSON
# object (no tools, no file writing, no git, no frontmatter, no slugs). This script
# then CONFORMS that output deterministically into wiki pages with enforced
# frontmatter + kebab-case paths, and writes a .ingest-manifest.json in EXACTLY the
# schema run-ingest.sh expects. run-ingest.sh (phase 2) then does index / log /
# scoped-lint / PR, unchanged.
#
# cd <genome checkout>
# ingest-semantic.py <genome> raw/articles/<file>.md # phase 1 (this)
# run-ingest.sh <genome> # phase 2 (deterministic)
#
# Why this shape: local tool-calling via pi/ollama proved fragile, and a small
# model does not reliably honour folders / naming / frontmatter / manifest schema
# when it writes files itself. Here the model cannot break the contract because it
# never touches the filesystem — the script owns all structure. Stdlib only.
#
# Emits a single JSON status line on stdout (for n8n / logs).
# =============================================================================
import json, os, re, sys, datetime, urllib.request, urllib.error
# --- config (override via env; these live in ~/.config/knowledge-genome.env) ---
OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434/api/chat")
MODEL = os.environ.get("INGEST_MODEL", "qwen2.5:14b")
NUM_CTX = int(os.environ.get("INGEST_NUM_CTX", "16384"))
TIMEOUT = int(os.environ.get("INGEST_TIMEOUT", "600"))
TODAY = datetime.date.today().isoformat()
def die(stage, reason):
print(json.dumps({"status": "error", "stage": stage, "reason": reason}))
sys.exit(1)
# --- args + pre-flight (mirror the old skill's guards, enforced in code) ---
if len(sys.argv) < 3:
die("args", "usage: ingest-semantic.py <genome> <raw/rel/path.md>")
genome = sys.argv[1]
raw_rel = sys.argv[2].lstrip("./")
if "private/" in raw_rel or raw_rel.startswith("private"):
die("preflight", "refusing private source: " + raw_rel)
if os.environ.get("PRIVATE_CONTEXT", "disabled") != "disabled":
die("preflight", "PRIVATE_CONTEXT must be disabled")
if not raw_rel.startswith("raw/"):
die("preflight", "source must live under raw/: " + raw_rel)
if not os.path.isfile(raw_rel):
die("preflight", "source not found in cwd: " + raw_rel)
with open(raw_rel, "r", encoding="utf-8") as fh:
source_text = fh.read()
if not source_text.strip():
die("preflight", "source is empty: " + raw_rel)
# --- the semantic contract (authoritative copy; SKILL.md documents it) ---
SYSTEM_PROMPT = """You perform the SEMANTIC PASS of a single source into a knowledge wiki.
Read the source and return ONLY structured data describing what it contains.
You do not write files, you do not produce frontmatter, and you do not invent
paths, slugs, branches, commits or PRs a deterministic script does all of that.
Rules:
- source_summary: a faithful, self-contained summary of the source, in the
source's own language. Plain prose, no markdown headings.
- key_points: the handful of concrete facts/claims worth indexing.
- entities: every person, tool, organisation or product the source names.
kind is one of person|tool|org|product. description is one or two factual
sentences. No markdown headings inside the description.
- concepts: every pattern, theory, decision or named idea the source explains.
description is one or two factual sentences.
- contradictions: ONLY when the source makes a claim that directly contradicts a
widely-known fact or contradicts itself. Otherwise return an empty list.
- Names must be the natural name of the thing; the script will normalise them.
Do not pad. Be faithful to the source."""
# --- JSON schema -> constrained decoding (Ollama structured outputs) ---
SCHEMA = {
"type": "object",
"properties": {
"source_title": {"type": "string"},
"source_summary": {"type": "string"},
"key_points": {"type": "array", "items": {"type": "string"}},
"entities": {"type": "array", "items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"kind": {"type": "string",
"enum": ["person", "tool", "org", "product"]},
"description": {"type": "string"},
},
"required": ["name", "description"],
}},
"concepts": {"type": "array", "items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"description": {"type": "string"},
},
"required": ["name", "description"],
}},
"contradictions": {"type": "array", "items": {
"type": "object",
"properties": {
"concept": {"type": "string"},
"description": {"type": "string"},
},
"required": ["concept", "description"],
}},
"reasoning": {"type": "string"},
"pr_summary": {"type": "string"},
},
"required": ["source_title", "source_summary", "entities", "concepts"],
}
def call_model():
payload = {
"model": MODEL,
"messages": [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content":
"Source path: " + raw_rel + "\n\n--- SOURCE START ---\n"
+ source_text + "\n--- SOURCE END ---\n\nReturn the JSON now."},
],
"format": SCHEMA, # schema-constrained generation
"stream": False,
# deterministic extraction; repetition penalties OFF for structured output
"options": {"temperature": 0.2, "repeat_penalty": 1.0, "num_ctx": NUM_CTX},
}
data = json.dumps(payload).encode("utf-8")
req = urllib.request.Request(
OLLAMA_URL, data=data, headers={"Content-Type": "application/json"})
try:
with urllib.request.urlopen(req, timeout=TIMEOUT) as r:
resp = json.loads(r.read().decode("utf-8"))
except urllib.error.URLError as e:
die("model", "ollama request failed: " + str(e))
content = ((resp.get("message") or {}).get("content") or "").strip()
# schema-constrained, but stay defensive if a model wraps it in a fence
if content.startswith("```"):
content = content.strip("`")
brace = content.find("{")
if brace >= 0:
content = content[brace:]
try:
return json.loads(content)
except json.JSONDecodeError as e:
die("model", "model did not return valid JSON: " + str(e))
# --- conform helpers (the script OWNS all structure) ---
def slugify(s):
s = re.sub(r"[^a-z0-9]+", "-", (s or "").strip().lower())
return re.sub(r"-+", "-", s).strip("-") or "untitled"
def twords(s, n=12):
s = " ".join((s or "").split())
w = s.split(" ")
return s if len(w) <= n else " ".join(w[:n]) + ""
def frontmatter(ptype, tags):
taglist = "[" + ", ".join(sorted(set(t for t in tags if t))) + "]"
return ("---\n"
f"type: {ptype}\n"
f"domain: {genome}\n"
"maturity: draft\n"
f"last_updated: {TODAY}\n"
"private: false\n"
f"tags: {taglist}\n"
"---\n")
def write_new(path, ptype, title, body, tags):
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w", encoding="utf-8") as f:
f.write(frontmatter(ptype, tags))
f.write(f"\n# {title}\n\n{body}\n")
def append_section(path, source_slug, body):
# never overwrite an existing page: accumulate, attributed to the new source
with open(path, "a", encoding="utf-8") as f:
f.write(f"\n\n## From [[sources/{source_slug}]]\n\n{body}\n")
try: # best-effort bump of last_updated in the existing frontmatter
with open(path, "r", encoding="utf-8") as f:
txt = f.read()
txt = re.sub(r"(?m)^last_updated:.*$", "last_updated: " + TODAY, txt, count=1)
with open(path, "w", encoding="utf-8") as f:
f.write(txt)
except Exception:
pass
# --- run the semantic pass ---
sem = call_model()
source_slug = slugify(os.path.splitext(os.path.basename(raw_rel))[0])
pages = []
# 1. source page — canonical summary of THIS source (re)written
src_path = f"wiki/sources/{source_slug}.md"
src_status = "modified" if os.path.exists(src_path) else "created"
kp_lines = "\n".join("- " + p for p in (sem.get("key_points") or []) if p.strip())
src_body = (sem.get("source_summary") or "").strip()
if kp_lines:
src_body += "\n\n## Key points\n\n" + kp_lines
src_body += f"\n\n## Source\n\n- [[{raw_rel}]]\n"
src_tags = ([slugify(e.get("name", "")) for e in sem.get("entities", [])]
+ [slugify(c.get("name", "")) for c in sem.get("concepts", [])])[:8]
os.makedirs("wiki/sources", exist_ok=True)
with open(src_path, "w", encoding="utf-8") as f:
f.write(frontmatter("source", src_tags))
f.write(f"\n# {sem.get('source_title') or source_slug}\n\n{src_body}\n")
pages.append({"path": src_path,
"summary": twords(sem.get("source_title") or source_slug),
"maturity": "draft", "status": src_status})
def handle(kind_dir, ptype, items):
for it in items or []:
name = (it.get("name") or "").strip()
if not name:
continue
slug = slugify(name)
path = f"wiki/{kind_dir}/{slug}.md"
desc = (it.get("description") or "").strip()
if os.path.exists(path):
append_section(path, source_slug, desc)
pages.append({"path": path, "summary": twords(desc), "status": "modified"})
else:
body = desc + f"\n\n## Sources\n\n- [[sources/{source_slug}]]\n"
write_new(path, ptype, name, body, [genome, ptype])
pages.append({"path": path, "summary": twords(desc),
"maturity": "draft", "status": "created"})
# 2. entities, 3. concepts
handle("entities", "entity", sem.get("entities", []))
handle("concepts", "concept", sem.get("concepts", []))
# 4. contradictions -> conflict pages (run-ingest routes wiki/queries/conflict-*)
conflicts = sem.get("contradictions") or []
conf_slugs = []
for c in conflicts:
cslug = slugify(c.get("concept", "unknown"))
conf_slugs.append(cslug)
path = f"wiki/queries/conflict-{cslug}-{TODAY}.md"
write_new(path, "query", f"Conflict: {c.get('concept', '')}",
(c.get("description") or "").strip()
+ f"\n\n## Source\n\n- [[sources/{source_slug}]]\n",
[genome, "conflict"])
pages.append({"path": path, "summary": "", "maturity": "draft",
"status": "created"})
contradictions_str = ("None" if not conflicts
else f"{len(conflicts)} conflict file(s) created — "
+ ", ".join(conf_slugs))
# --- write the manifest in EXACTLY run-ingest.sh's schema ---
manifest = {
"raw_source": raw_rel,
"reasoning": sem.get("reasoning") or ("Ingest of " + raw_rel),
"pr_summary": sem.get("pr_summary") or ("Semantic ingest of " + raw_rel),
"contradictions": contradictions_str,
"pages": pages,
}
with open(".ingest-manifest.json", "w", encoding="utf-8") as f:
json.dump(manifest, f, indent=2, ensure_ascii=False)
print(json.dumps({"status": "ok", "stage": "semantic",
"pages": len(pages), "model": MODEL,
"manifest": ".ingest-manifest.json"}))

View file

@ -1,17 +1,13 @@
#!/usr/bin/env bash #!/usr/bin/env bash
# ============================================================================= # =============================================================================
# skills/ingest/scripts/run-ingest.sh # skills/ingest/scripts/run-ingest.sh
# Post-semantic orchestrator. Runs OUTSIDE the model, on vm101, in the genome # Post-pi orchestrator. Runs OUTSIDE pi's loop, on vm101, in the genome checkout.
# checkout. Consumes .ingest-manifest.json (written by ingest-semantic.py) and # Consumes .ingest-manifest.json (written by the ingest skill) and performs every
# performs every deterministic step — index, log, scoped lint, PR. # deterministic step — index, log, scoped lint, PR — so pi's context stays clean.
# #
# run-ingest.sh <genome_name> [manifest_path] # run-ingest.sh <genome_name> [manifest_path]
# #
# Emits a single JSON result line on stdout for n8n to parse. # Emits a single JSON result line on stdout for n8n to parse.
#
# every page listed in the manifest must exist on disk before we trust the run.
# Everything else is unchanged: the manifest the semantic phase now produces is
# already in this script's expected schema.
# ============================================================================= # =============================================================================
set -euo pipefail set -euo pipefail
@ -61,13 +57,6 @@ mapfile -t modified_paths < <(jq -r '.pages[] | select(.status=="modified") | .p
all_paths=( "${created_paths[@]}" "${modified_paths[@]}" ) all_paths=( "${created_paths[@]}" "${modified_paths[@]}" )
[[ ${#all_paths[@]} -gt 0 ]] || fail "manifest" "no pages reported" [[ ${#all_paths[@]} -gt 0 ]] || fail "manifest" "no pages reported"
# --- the semantic phase (ingest-semantic.py) writes the files; verify
# every manifest page actually exists on disk before trusting the run. Catches any
# drift between what the manifest claims and what was really written. ---
for _p in "${all_paths[@]}"; do
[[ -f "$_p" ]] || fail "pages" "manifest lists a file not present on disk: ${_p}"
done
conflict_label="" conflict_label=""
# NOTE: No rollback. The steps below modify the working tree in order (index → log → commit). # NOTE: No rollback. The steps below modify the working tree in order (index → log → commit).