knowledge-genome-orchestrator/skills/ingest/SKILL.md

4.7 KiB

name description license compatibility metadata
ingest Semantic pass of a single raw source into the current genome's wiki. The model ONLY extracts structured semantic content (summary, entities, concepts, contradictions) and returns one JSON object — it does not write files, produce frontmatter, slugs, git, index, log or PRs. A deterministic conform script (ingest-semantic.py) turns that JSON into properly-structured wiki pages + a manifest; run-ingest.sh then does index/log/lint/PR. see repository Driven by scripts/ingest-semantic.py (one schema-constrained call to a local model via Ollama /api/chat). NO agent tools are used — no read, no edit, no bash. The model never touches the filesystem. PRIVATE_CONTEXT must be disabled.
framework phase mode
knowledge-genome 1-ingest-semantic structured-json

Ingest — semantic pass (structured-JSON)

This is the light semantic pass. The model's only job is to read one source and return a single JSON object describing what the source contains. It does not write files, choose paths, produce frontmatter, pick slugs, or touch git / index / log / PRs. All structure is owned by scripts/ingest-semantic.py, which conforms the model's JSON into wiki pages with enforced kebab-case paths and frontmatter, and writes .ingest-manifest.json in the exact schema run-ingest.sh consumes. This keeps the agent minimal and makes the output impossible to mis-shape, regardless of how small or quirky the local model is.

Pipeline:

cd <genome checkout>
scripts/ingest-semantic.py <genome> raw/articles/<file>.md   # phase 1 (this)
scripts/run-ingest.sh      <genome>                          # phase 2 (deterministic)

Pre-flight (enforced by ingest-semantic.py, not by the model)

  1. Refuse if the source path is under any private/ directory.
  2. Refuse if PRIVATE_CONTEXT is not disabled.
  3. Confirm the file exists under raw/ and is non-empty.

What the model returns (the only contract)

A single JSON object, decoding-constrained to this shape via Ollama's format:

{
  "source_title": "Human title of the source",
  "source_summary": "Faithful, self-contained prose summary of the source.",
  "key_points": ["Concrete fact or claim worth indexing", "..."],
  "entities": [
    { "name": "Acme", "kind": "org", "description": "Vendor referenced by the source." }
  ],
  "concepts": [
    { "name": "JWT RS256", "description": "Asymmetric token signing scheme the source uses." }
  ],
  "contradictions": [
    { "concept": "auth", "description": "Source claims X, contradicting the existing claim Y." }
  ],
  "reasoning": "One sentence for the log: what this source adds.",
  "pr_summary": "One or two sentences describing this ingest for the PR."
}

Field rules (guidance for the model; the script enforces structure):

  • source_summary is faithful and in the source's own language. No markdown headings inside any description field. No padding.
  • entities = every person, tool, org or product the source names. kindperson|tool|org|product. description = one or two factual sentences.
  • concepts = every pattern, theory, decision or named idea the source explains.
  • contradictions = only a claim that directly contradicts a widely-known fact or contradicts the source itself; otherwise an empty list.
  • Names are the natural name of the thing. The script normalises them to kebab-case and guarantees a single stable page per entity/concept.

What the conform script guarantees (so the model cannot break it)

  • Paths: wiki/sources/<slug>.md, wiki/entities/<slug>.md, wiki/concepts/<slug>.md, wiki/queries/conflict-<concept>-<YYYY-MM-DD>.md.
  • Slugs: minimal kebab-case (lowercase, digits, hyphens; no spaces / underscores / capitals).
  • Frontmatter: type, domain: <genome>, maturity: draft, last_updated: <today>, private: false, tags.
  • Create-vs-update: existing entity/concept pages are appended to (a section attributed to the new source), never overwritten. The source page is the canonical summary of that exact source and is (re)written.
  • Manifest: .ingest-manifest.json with raw_source, reasoning, pr_summary, contradictions (string), and pages[] (path, summary, status, plus maturity on created pages) — exactly what run-ingest.sh validates.

The model name is recorded by the orchestrator (INGEST_MODEL); the model does not self-report it. No run_id, branch, commit or PR is invented here — those belong to phase 2.

Interactive use of pi (TUI) is unaffected and still available for manual exploration. The automated ingest path no longer relies on pi or on native tool-calling: it is the single schema-constrained call above.