# The AI Knowledge Layer for Policy and Political Work

## Kernel Ontology, Dynamic Extensions, and Ontology-Augmented Generation

**TACITUS Research · White Paper v2 · 2026**

---

## Abstract

Policy and political work is knowledge work over typed structure. Documents are the input, decisions are the output, but the load-bearing artefact in between — the structure that captures who claimed what, when, against which constraint, with what leverage, under which commitment, in which framing — is the part the institution has historically failed to persist. The dominant 2024–2026 stack for this task pairs an LLM with retrieval over unstructured text. We argue this stack is structurally inadequate along five measurable axes: temporal reasoning, causal precision, span-level provenance, contradiction handling, and long-horizon context.

This paper specifies an alternative. We define an eight-primitive kernel ontology — the Agentic Conflict Ontology (ACO), comprising Actor, Claim, Interest, Constraint, Leverage, Commitment, Event, Narrative — together with a discipline for per-case extensions that inherit from and validate against the kernel. We formalise the data model as a bi-temporal typed property graph: every node and edge carries valid-time and transaction-time intervals, every claim carries character-level source spans, and contradictions are represented as first-class edges rather than averaged into smoothed paraphrase. We define **Ontology-Augmented Generation (OAG)** as a generation pattern that produces output typed against the kernel, grounded in source spans, contested by construction, and temporally honest. We compare OAG to RAG, GraphRAG, OG-RAG, Think-on-Graph, and agentic-deep-graph-reasoning along the same axes, and specify the DIALECTICA engine that implements OAG as a seven-graph architecture with distinct update disciplines per graph.

The first specialisation we measure in public is conflict reasoning, through the TACITUS Conflict Grammar Corpus (TCGC), because it stresses every property of the data model simultaneously. The kernel is MIT-licensed; the pipeline is MIT-licensed; the benchmark is a shared artefact. This is a technical white paper, not a roadmap. Implementation details that are not yet open we mark as such.

---

## 1. Problem statement

Operational analysts in policy, mediation, regulatory, and political-affairs settings produce structured outputs (memos, briefings, options papers, agreements, compliance reviews) from unstructured inputs (cables, transcripts, statutes, reports, recorded interviews). The structure they produce is, in current institutional practice, *re-derived per analyst per case* from prose representations. Five concrete failures follow.

**F1 — Temporal flattening.** The output represents events as an ordered list, but the relationship between *when something happened* and *when the institution learned about it* is not preserved. A commitment made on Monday and a denial issued Thursday collapse into a single descriptive sentence. Benchmarks measuring LLM temporal reasoning report systematic failure when explicit dates are absent and order must be inferred from discourse cues (Chen et al., 2023, arXiv:2306.08952).

**F2 — Causal collapse.** Multi-hop inferences of the form *A caused B via mechanism M under condition C* are reduced to co-occurrence or simple succession. Kıcıman et al. (2023, arXiv:2305.00050) documented this systematically across the GPT-3.5/GPT-4 generation; the pattern persists into the Claude 4.x / GPT-5 / Gemini 3 generation when no external causal scaffold is provided.

**F3 — Provenance absence.** A claim emitted by a generic LLM is, architecturally, an uncited assertion. RAG reduces hallucination rates but operates at document granularity; the span from which a claim was extracted is not preserved through generation. Ji et al. (2023) survey the breadth of this problem; OG-RAG (Sharma, Kumar, Li, EMNLP 2025) demonstrates that ontology-grounded retrieval improves attribution speed by ~30% but does not produce span-level binding.

**F4 — Contradiction averaging.** When retrieved passages disagree, the generation step tends to smooth the disagreement into a single confident paraphrase. WikiContradict (Hou et al., NeurIPS 2024, arXiv:2406.13805) measured this across 253 human-annotated cases of equally trustworthy contradictory passages, finding all evaluated frontier models struggled to generate answers that accurately reflected the conflict, especially for implicit conflicts requiring reasoning. ConflictBank (Su et al., NeurIPS 2024, arXiv:2408.12076) scaled the analysis to 7.45M claim-evidence pairs across misinformation, temporal, and semantic conflict types, with consistent results across the Llama, Qwen, Mistral, and GPT families.

**F5 — Context decay under self-generation.** As agentic systems accumulate their own outputs across long, multi-step tasks, accuracy decays in ways that look fluent and read confident. Letta's Context-Bench (2025) puts the ceiling for strong 2025–2026 frontier models at ~74% on multi-step, contamination-proof context-engineering tasks. Anthropic's context-engineering work (2025) reframes the discipline accordingly: context is "a critical but finite resource," and the engineering question becomes how to configure it across long horizons. The corollary for institutional work: the chat transcript is not a sufficient persistence layer.

The five failures compound. A system that flattens time, collapses causality, drops provenance, averages contradictions, and degrades under its own outputs produces fluent prose that does not survive contact with a reviewer who has access to the underlying source material. The corrective is not larger models. The corrective is to put the typed structure outside the model, bind every generated claim to a source span, represent contradictions as first-class objects, and enforce the temporal model at the data layer rather than the prompt layer.

---

## 2. The eight primitives

The kernel of the Agentic Conflict Ontology (ACO) is a typed property graph schema with eight node primitives and a closed set of relation types. The kernel is intentionally small: closed ontologies of the SNOMED-CT / FIBO / LKIF class fail by maximalism (Héja, Surján, Varga, 2008; Schulz et al., 2023), while schema-free agentic graph builders (Buehler 2025) fail by minimalism. Eight primitives is the smallest set we have found that preserves the structure on which the conflict-theory, policy-process, and mediation literatures converge.

Each primitive is specified by (i) a tuple of required fields, (ii) admitted relation types to other primitives, (iii) kernel invariants any extension subclass must preserve, and (iv) the set of canonical task types in the TCGC benchmark that exercise it.

### 2.1 Actor

```
Actor := (id, label, type, role*, jurisdiction?, valid_time, transaction_time, source_spans*)
```

`type ∈ {Individual, Institution, State, Coalition, Organization, AdHocGroup}`. `role` is a multi-valued field drawn from `{Principal, Agent, VetoPlayer, BridgeActor, Mediator, ThirdParty, Observer}`. Actors admit relations `PARTICIPATES_IN(Event)`, `HOLDS(Leverage)`, `MAKES(Claim)`, `UNDERTAKES(Commitment)`, `FRAMES(Narrative)`, `REPRESENTS(Actor)`. Kernel invariant: every Actor instance must have at least one source span establishing the entity's presence in the corpus; pure inference of an Actor without evidentiary anchor is rejected at validation.

TCGC task: `actor-resolution` (F1 score against gold actor-graph; cross-document co-reference required).

### 2.2 Claim

```
Claim := (id, asserter:Actor, content, type, modality, valid_time, transaction_time,
          source_span, certainty, contestation_status)
```

`type ∈ {Factual, Evaluative, Normative, Position, Interest, RedLine, BATNA}`. `modality ∈ {Asserted, Implied, Reported, Denied, Withdrawn}`. The distinction between Claim and Commitment is enforced at the type system: a Claim is a speech act with propositional content; a Commitment is an undertaking with bi-temporal status transitions (§2.6). Separating Position (what the party demands) from Interest (what the party needs, §2.3) at the type level is what operationalises the Fisher–Ury distinction; in TCGC v0.1 this is the `position-interest-separation` task.

Kernel invariant: every Claim instance must point to a character-level source span; document-level provenance is insufficient. Contestation status is one of `{uncontested, contested, withdrawn, refuted}` and is updated atomically with any new edge of type `CONTRADICTS` pointing into the claim.
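The two Claim invariants can be sketched in a few lines of Python. This is a minimal illustration, not the DIALECTICA implementation: the class names `SourceSpan`, `ClaimGraph`, and `ValidationError`, and the in-memory storage, are assumptions made for the example. "Atomic" here only means that the edge and the status change are written in the same call.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SourceSpan:
    document_id: str
    char_start: int
    char_end: int  # exclusive end offset


@dataclass
class Claim:
    id: str
    asserter: str
    content: str
    source_span: Optional[SourceSpan]
    contestation_status: str = "uncontested"


class ValidationError(ValueError):
    """Raised when an instance violates a kernel invariant (hypothetical name)."""


def validate_claim(claim: Claim) -> None:
    # Document-level provenance is insufficient: a character range is required.
    span = claim.source_span
    if span is None:
        raise ValidationError(f"{claim.id}: no source span")
    if not 0 <= span.char_start < span.char_end:
        raise ValidationError(f"{claim.id}: span is not a valid character range")


@dataclass
class ClaimGraph:
    claims: Dict[str, Claim] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_claim(self, claim: Claim) -> None:
        validate_claim(claim)
        self.claims[claim.id] = claim

    def add_contradiction(self, source_id: str, target_id: str) -> None:
        if source_id not in self.claims or target_id not in self.claims:
            raise KeyError("both claims must exist before the edge is written")
        # The CONTRADICTS edge and the status update are applied together.
        self.edges.append(("CONTRADICTS", source_id, target_id))
        self.claims[target_id].contestation_status = "contested"
```

A persistent store would wrap the two writes in `add_contradiction` in a single transaction; the sketch only shows that they belong to one operation.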

### 2.3 Interest

```
Interest := (id, holder:Actor, content, derivation, valid_time, transaction_time,
            source_spans*, confidence)
```

Interests are inferred, not directly asserted. The `derivation` field captures the inference chain: source claims, behavioural evidence, declared positions, and prior commitments that, in combination, ground the interest. Inter-annotator agreement on interest extraction is structurally lower than on actor resolution; TCGC reports per-task agreement transparently rather than collapsing it into a single headline. Kernel invariant: every Interest must specify its derivation; an Interest with no evidentiary chain is rejected at validation.

### 2.4 Constraint

```
Constraint := (id, type, content, source, scope, binding_strength, valid_time,
              transaction_time, source_span)
```

`type ∈ {Statutory, Regulatory, Mandate, Norm, Resource, Capacity, Jurisdictional, Procedural}`. `binding_strength ∈ {Hard, Soft, Aspirational}`. Constraints admit relations `LIMITS(Actor)`, `APPLIES_TO(Event)`, `BLOCKS(Commitment)`, `ENABLES(Leverage)`. The Constraint primitive is the negative space against which options are tested; in policy-options analysis it carries most of the load.

### 2.5 Leverage

```
Leverage := (id, holder:Actor, target:Actor, mechanism, magnitude, valid_time,
            transaction_time, source_spans*)
```

`mechanism ∈ {Resource, Information, Network, Procedural, Coercive, Normative, Reputational}`. Leverage is asymmetric by definition: each instance is directed from `holder` to `target`, and the `HOLDS(Leverage)` relation attaches it to the holder only. Kernel invariant: the holder and target must be distinct Actor instances, and the mechanism must be substantiated by source evidence; abstract leverage assertions without mechanism are rejected.

### 2.6 Commitment

```
Commitment := (id, undertaker:Actor, beneficiary:Actor*, content, type,
              status, valid_time, transaction_time, source_spans*,
              revision_chain*)
```

`status ∈ {Active, Renegotiated, Contested, Broken, Fulfilled, Lapsed, Withdrawn}`. The Commitment primitive is where the bi-temporal model carries most of its load (§3). The `revision_chain` is an ordered list of prior status transitions with their transaction-time stamps and triggering claims. Kernel invariant: a Commitment cannot transition directly from `Active` to `Fulfilled` without intervening evidence; status transitions must be backed by source spans or by typed edges from Event instances.

Extension subclasses include `HR-Commitment`, `Ceasefire-Commitment`, `SLA-Commitment`, `RegulatoryCommitment`, `CampaignCommitment`. Each adds domain-specific fields (e.g., `HR-Commitment` adds `accommodation_type`, `effective_date`, `review_period`) while preserving the kernel invariants on status transitions.

### 2.7 Event

```
Event := (id, type, participants:Actor*, valid_time, transaction_time,
        causal_predecessors:Event*, mechanism?, conditions?,
        source_spans*, confidence)
```

Events form the temporal DAG underneath every case. The distinction the kernel enforces is between *temporal succession* (Event A precedes Event B) and *causal relation* (Event A caused Event B via mechanism M under condition C). The latter requires the `mechanism` and `conditions` fields to be populated; without them, the relation is recorded as `PRECEDES` rather than `CAUSES`. The TCGC `causal-chain` task isolates this distinction and reports correlation-vs-causation precision separately from extraction F1.
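The downgrade rule can be stated as a small function. The sketch below assumes that `mechanism` and `conditions` are read from the successor Event, following the schema above; the `Event` class is trimmed to the fields the rule needs, and the function name `link` is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Event:
    id: str
    mechanism: Optional[str] = None
    conditions: Optional[str] = None


def link(predecessor: Event, successor: Event) -> Tuple[str, str, str]:
    """Return (relation, from_id, to_id) for a predecessor/successor pair."""
    # CAUSES requires both fields to be populated with non-blank text.
    populated = all(
        value is not None and value.strip()
        for value in (successor.mechanism, successor.conditions)
    )
    relation = "CAUSES" if populated else "PRECEDES"
    return (relation, predecessor.id, successor.id)
```

An extractor that cannot name a mechanism therefore still records the ordering, but never the stronger causal edge.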

### 2.8 Narrative

```
Narrative := (id, framer:Actor, frame, target:{Event|Commitment|Claim|Actor}*,
            valid_time, transaction_time, source_spans*, drift_chain*)
```

Narrative is a first-class primitive because reframing is data. A re-frame from "incident" to "atrocity" or from "negotiating position" to "red line" is a measurable shift, often the most consequential in a case. The `drift_chain` records ordered re-framings of the same target by the same or different framers across time. TCGC task: `narrative-drift` (drift detection precision and recall, scored separately per framer).
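As an illustration of what a drift chain holds, the sketch below orders framings of one target by valid-time start and reports each point at which the frame changes. The `Framing` record, the numeric timestamps, and both function names are assumptions for the example; they are not the TCGC scoring code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Framing:
    framer: str
    target_id: str
    frame: str
    t_valid_start: float  # numeric time stands in for a real timestamp


def drift_chain(framings: List[Framing], target_id: str) -> List[Framing]:
    """Ordered framings of one target, across all framers."""
    selected = [f for f in framings if f.target_id == target_id]
    return sorted(selected, key=lambda f: f.t_valid_start)


def drift_points(chain: List[Framing]) -> List[Tuple[str, str, str]]:
    """(old_frame, new_frame, framer_of_new_frame) for every change of frame."""
    return [
        (prev.frame, cur.frame, cur.framer)
        for prev, cur in zip(chain, chain[1:])
        if prev.frame != cur.frame
    ]
```

Restricting `chain` to one framer before calling `drift_points` gives the per-framer view.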

### 2.9 Summary table

| # | Primitive | Required fields | Critical kernel invariant |
|---|-----------|-----------------|---------------------------|
| 01 | Actor | `id, label, type, source_spans*` | At least one source span |
| 02 | Claim | `asserter, content, type, source_span, valid_time` | Character-level span; atomic contestation update |
| 03 | Interest | `holder, content, derivation, confidence` | Derivation chain mandatory |
| 04 | Constraint | `type, content, source, binding_strength` | Source must be cited |
| 05 | Leverage | `holder, target, mechanism, magnitude` | Holder ≠ target; mechanism substantiated |
| 06 | Commitment | `undertaker, content, status, revision_chain` | Status transitions require evidence |
| 07 | Event | `type, participants, valid_time` | `PRECEDES` vs `CAUSES` distinction enforced |
| 08 | Narrative | `framer, frame, target, drift_chain` | Drift chain ordered, framer-scoped |

The same eight primitives are used across the seven domains the TCGC covers (workplace HR, commercial dispute, governance, peace process, policy/regulatory consultation, family/estate, state-to-state diplomatic friction). The empirical claim that the grammar travels is what the `cross-domain-primitive-transfer` task in TCGC v0.2 is designed to test.

---

## 3. The bi-temporal data model

Every node and edge in the ACO graph carries two time intervals.

```
valid_time := (t_valid_start, t_valid_end | ∞)
transaction_time := (t_tx_start, t_tx_end | ∞)
```

`valid_time` represents when the asserted fact holds (or held) in the world. `transaction_time` represents when the system recorded the fact. The pattern is standard in bi-temporal database design but is not preserved by default in LLM-driven graph construction. The Zep/Graphiti architecture (Rasmussen et al., 2025, arXiv:2501.13956) is the closest adjacent implementation; ACO extends the pattern across all eight primitives, not only entities.

### 3.1 Invalidation, not deletion

When a new edge contradicts an existing edge, the existing edge is *invalidated*, not deleted. Invalidation sets `t_tx_end` on the prior edge to the current transaction time and leaves the `valid_time` interval unchanged; a new edge is inserted with `t_tx_start` at the current transaction time and the new valid-time interval as asserted. Both edges remain queryable: a query scoped to transaction time T selects the edges with `t_tx_start ≤ T < t_tx_end` and returns the state of knowledge at time T, while a query scoped to `t_valid ∈ [T1, T2]` returns what was held true during that valid-time window regardless of when the system learned about it.
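A minimal in-memory sketch of invalidation follows. The `Edge` record and the functions `supersede` and `known_at` are hypothetical names, and `None` stands for an open (∞) interval end; the point is only that the prior edge is closed in transaction time and kept.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Edge:
    id: str
    content: str
    t_valid_start: datetime
    t_valid_end: Optional[datetime]  # None means the interval is open (∞)
    t_tx_start: datetime
    t_tx_end: Optional[datetime] = None


def supersede(edges: List[Edge], old_id: str, new: Edge) -> List[Edge]:
    """Close old_id in transaction time at new.t_tx_start, then append new."""
    result, found = [], False
    for edge in edges:
        if edge.id == old_id and edge.t_tx_end is None:
            # valid_time is left untouched; only the transaction interval closes.
            result.append(replace(edge, t_tx_end=new.t_tx_start))
            found = True
        else:
            result.append(edge)
    if not found:
        raise KeyError(f"no open edge with id {old_id!r}")
    result.append(new)
    return result


def known_at(edges: List[Edge], t: datetime) -> List[Edge]:
    """Edges the system held at transaction time t."""
    return [
        e for e in edges
        if e.t_tx_start <= t and (e.t_tx_end is None or t < e.t_tx_end)
    ]
```

Because `supersede` returns a new list and never drops an element, a later `known_at` call for an earlier time still finds the superseded edge.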

### 3.2 Status transitions for Commitments

Commitments are the primitive on which the bi-temporal model carries the most load. A status transition is recorded as an ordered tuple:

```
StatusTransition := (commitment_id, from_status, to_status, t_valid, t_tx,
                    triggering_claim_id | triggering_event_id, source_span)
```

The `revision_chain` of a Commitment is the ordered sequence of its StatusTransitions. The kernel invariant that `Active → Fulfilled` requires intervening evidence is enforced by validating that the triggering object is either an Event of admitted type or a Claim with `modality ∈ {Asserted, Reported}` rather than `{Denied, Withdrawn}`.
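The check on a single transition might look like the sketch below. It reads the admitted Claim values as modalities (§2.2). The set `ADMITTED_EVENT_TYPES` is a placeholder, since the admitted Event types are not enumerated here, and requiring both a trigger and a source span on every transition is an assumption drawn from the shape of the `StatusTransition` tuple.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder values: the paper does not list the admitted Event types.
ADMITTED_EVENT_TYPES = {"Delivery", "Signature", "Publication"}
SUPPORTING_MODALITIES = {"Asserted", "Reported"}


@dataclass(frozen=True)
class Trigger:
    kind: str  # "Event" or "Claim"
    event_type: Optional[str] = None
    modality: Optional[str] = None


def transition_allowed(
    from_status: str,
    to_status: str,
    trigger: Optional[Trigger],
    has_source_span: bool,
) -> bool:
    # Every transition needs a triggering object and a source span.
    if trigger is None or not has_source_span:
        return False
    if (from_status, to_status) == ("Active", "Fulfilled"):
        if trigger.kind == "Event":
            return trigger.event_type in ADMITTED_EVENT_TYPES
        if trigger.kind == "Claim":
            return trigger.modality in SUPPORTING_MODALITIES
        return False
    return True
```

A denial can therefore move a Commitment to `Contested`, but it can never be the evidence that a Commitment was fulfilled.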

### 3.3 Query semantics

The bi-temporal model supports four canonical query forms:

- **As-of-system-time**: `query @ t_tx = T` returns the graph state as the system knew it at time T.
- **As-of-valid-time**: `query @ t_valid = T` returns the graph state as it held in the world at time T, using current system knowledge.
- **As-of-both**: `query @ (t_tx = T1, t_valid = T2)` returns what the system at time T1 believed about the world at time T2. This is the form that supports retrospective analysis of analyst decisions.
- **Drift query**: `query drift(target_id)` returns the ordered sequence of valid-time intervals during which a node or edge was held true, including invalidations.

The query semantics are implemented in Cypher (Neo4j 5.x) with explicit interval predicates; the underlying schema is portable to any property-graph store that supports indexed temporal range queries.
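The four forms reduce to interval predicates. The sketch below expresses them over an in-memory list rather than in Cypher, purely to make the semantics concrete; the `Record` shape, the numeric timestamps, and the half-open interval convention are assumptions for the illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

INF = float("inf")
Interval = Tuple[float, float]  # half-open: start <= t < end


@dataclass(frozen=True)
class Record:
    target_id: str
    value: str
    valid: Interval
    tx: Interval


def covers(interval: Interval, t: float) -> bool:
    return interval[0] <= t < interval[1]


def as_of_tx(records: List[Record], t_tx: float) -> List[Record]:
    """Graph state as the system knew it at transaction time t_tx."""
    return [r for r in records if covers(r.tx, t_tx)]


def as_of_valid(records: List[Record], t_valid: float) -> List[Record]:
    """What held at t_valid according to current knowledge (open tx interval)."""
    return [r for r in records if r.tx[1] == INF and covers(r.valid, t_valid)]


def as_of_both(records: List[Record], t_tx: float, t_valid: float) -> List[Record]:
    """What the system at t_tx believed about the world at t_valid."""
    return [r for r in records if covers(r.tx, t_tx) and covers(r.valid, t_valid)]


def drift(records: List[Record], target_id: str) -> List[Record]:
    """All versions of one target, invalidated ones included, in recording order."""
    rows = [r for r in records if r.target_id == target_id]
    return sorted(rows, key=lambda r: (r.tx[0], r.valid[0]))
```

For example, suppose a fact was recorded at time 3 as holding from time 1 and was corrected at time 6 to hold from time 2. Then `as_of_both(records, 4, 1.5)` returns the superseded version, while `as_of_valid(records, 1.5)` returns nothing.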

### 3.4 Worked example

A regulatory agency issues a written statement at 09:00 Monday: *"The rule will be redrafted before Q3."* The OAG pipeline emits:

```
Commitment(id=c1, undertaker=Agency, content="redraft rule before Q3",
  status=Active, t_valid=[Mon 09:00, ∞), t_tx=[Mon 09:30, ∞),
  source_span=PressRelease.pdf:chars[1024:1198])
```

A trade body posts an ambiguous acknowledgement at 11:00 Monday. OAG emits a `Claim` from the trade body with relation `ACKNOWLEDGES_AMBIGUOUSLY` to `c1`. On Thursday at 14:00, the agency's spokesperson on a public panel says: *"The redraft is exploratory; nothing has been promised."* OAG emits:

```
Claim(id=cl2, asserter=AgencySpokesperson, content="redraft exploratory; nothing promised",
  modality=Denied, type=Factual, t_valid=[Thu 14:00, ∞), t_tx=[Thu 14:30, ∞),
  source_span=PanelTranscript.txt:chars[8421:8503])

Edge(type=CONTRADICTS, from=cl2, to=c1, t_tx=[Thu 14:30, ∞))

StatusTransition(commitment_id=c1, from=Active, to=Contested,
  t_valid=Thu 14:00, t_tx=Thu 14:30, triggering_claim_id=cl2,
  source_span=PanelTranscript.txt:chars[8421:8503])
```

A query at Friday 08:00 asking *"What did the agency commit to regarding the redraft?"* returns the Commitment `c1` with status `Contested`, the original Monday statement as the basis, the Thursday denial as the contradicting claim, and the full revision chain. A generic RAG system asked the same question would, in our internal evaluations, return a smoothed paraphrase. The structure is the output.

---

## 4. Kernel + dynamic extensions

The kernel ontology is fixed and versioned (semantic versioning; kernel breaking changes require major-version bumps and are governed by public review). Dynamic extensions are per-domain subclasses inheriting from kernel primitives. The discipline that makes this safe is the kernel invariant validation step in the ingestion pipeline.

### 4.1 Extension specification

An extension subclass is declared as:

```
Extension := (id, name, parent_primitive, added_fields*,
             added_relations*, kernel_invariants_inherited*,
             extension_invariants*, provenance, version)
```

Example: `HR-Commitment` extends `Commitment` with `accommodation_type`, `effective_date`, `review_period`, `escalation_path`. It inherits all kernel invariants of `Commitment` (status transitions require evidence; `Active → Fulfilled` requires intervening evidence) and adds extension invariants (`review_period > 0`; `escalation_path` references valid Actor IDs).
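The two extension invariants named for `HR-Commitment` can be checked as follows. This is a sketch under assumptions: the Python class name, the field types (for instance `review_period` as a number of days), and the function `extension_violations` are illustrative and are not taken from the published schema.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class HRCommitment:
    # Kernel fields inherited from Commitment (subset shown).
    id: str
    undertaker: str
    content: str
    status: str
    # Fields added by the extension.
    accommodation_type: str
    effective_date: str
    review_period: int  # assumed unit: days
    escalation_path: List[str] = field(default_factory=list)


def extension_violations(c: HRCommitment, known_actor_ids: Set[str]) -> List[str]:
    """Return one message per violated extension invariant (empty list if none)."""
    problems = []
    if c.review_period <= 0:
        problems.append("review_period must be > 0")
    unknown = [a for a in c.escalation_path if a not in known_actor_ids]
    if unknown:
        problems.append(f"escalation_path references unknown actors: {unknown}")
    return problems
```

Returning the list of violations, rather than raising on the first one, lets a reviewer see every problem with an instance at once.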

### 4.2 Validation pipeline

Every extension instance written to the graph passes through a four-stage validator:

1. **Schema conformance**: required fields present and well-typed.
2. **Kernel invariant validation**: parent primitive's invariants hold (the `kernel-invariant-validation` TCGC task in v0.2 is designed to measure how often LLM-driven extension induction violates these).
3. **Extension invariant validation**: subclass-specific invariants hold.
4. **Provenance binding**: every field that is not derived from a kernel field is bound to at least one source span.

A subclass that fails any stage is either rejected (in strict mode) or written to a quarantine subgraph for analyst review (in interactive mode). The quarantine subgraph is itself queryable, enabling diagnostic workflows for extension drift.
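The control flow of the validator can be sketched as an ordered list of checks with two failure modes. The individual checks below are deliberately simplistic stand-ins for an `HR-Commitment` instance. The dictionary layout, the `field_spans` key, and all function names are assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Instance = Dict[str, object]
Stage = Tuple[str, Callable[[Instance], bool]]

KERNEL_FIELDS = {"id", "undertaker", "content", "status"}
STATUSES = {"Active", "Renegotiated", "Contested", "Broken",
            "Fulfilled", "Lapsed", "Withdrawn"}


def schema_ok(x: Instance) -> bool:
    return KERNEL_FIELDS.issubset(x) and isinstance(x.get("review_period"), int)


def kernel_ok(x: Instance) -> bool:
    return x["status"] in STATUSES


def extension_ok(x: Instance) -> bool:
    return x["review_period"] > 0


def provenance_ok(x: Instance) -> bool:
    # Every non-kernel field must be bound to at least one source span.
    spans = x.get("field_spans", {})
    added = set(x) - KERNEL_FIELDS - {"field_spans"}
    return all(spans.get(name) for name in added)


STAGES: List[Stage] = [
    ("schema conformance", schema_ok),
    ("kernel invariant validation", kernel_ok),
    ("extension invariant validation", extension_ok),
    ("provenance binding", provenance_ok),
]


def run_validator(instance: Instance, strict: bool,
                  graph: List[Instance],
                  quarantine: List[Tuple[str, Instance]]) -> Tuple[str, Optional[str]]:
    """Run the stages in order; stop at the first failure."""
    for name, check in STAGES:
        if not check(instance):
            if strict:
                return ("rejected", name)
            quarantine.append((name, instance))  # kept for analyst review
            return ("quarantined", name)
    graph.append(instance)
    return ("committed", None)
```

Running schema conformance first means the later checks can rely on the required fields being present.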

### 4.3 Closed-ontology versus schema-free failure modes

Closed ontologies (SNOMED-CT ~350,000 concepts; FIBO; LKIF; Gene Ontology) achieve interoperability at the cost of slow extension. Héja, Surján, and Varga's 2008 ontological analysis of SNOMED CT and Schulz et al.'s 2023 SNOMED–BFO convergence analysis document how decades of bottom-up evolution produced hierarchical errors that impede formal reasoning; Keet and Grütter (2021) propose systematic conflict-resolution frameworks specifically because ad hoc resolution at scale is intractable. For policy and political work the cost of a closed ontology is higher than for biomedicine: the categories themselves are partly what the parties are arguing about.

Schema-free agentic graph builders (Buehler 2025, arXiv:2502.13025, *J. Mater. Res.* 40, 2204) produce remarkable emergent structure — scale-free networks with hub formation and stable modularity over hundreds of reasoning iterations — but the resulting schema is per-corpus. Two analysts on the same file converge on two ontologies. Cross-case interoperability fails.

The kernel-with-extensions pattern sits between the two. The kernel is small enough that it does not require committee governance (eight primitives, ~50 relation types, ~30 kernel invariants); the extensions are flexible enough that domain transfer does not require kernel changes; the validation pipeline is what keeps extensions from drifting into mutually unintelligible dialects. The convergent pattern in adjacent fields supports this design:

- **OG-RAG** (Sharma, Kumar, Li, EMNLP 2025; arXiv:2412.15235) constructs a hypergraph representation in which each hyperedge clusters factual knowledge under a domain-specific ontology, reporting a 55% increase in recall of accurate facts and a 40% improvement in response correctness across four LLM backbones.
- **MedKGent** (Zhang et al., 2025; arXiv:2508.12393) constructs a temporally evolving medical KG over 10M PubMed abstracts with an Extractor agent (confidence scoring via sampling) and a Constructor agent (day-by-day integration with timestamp-aware conflict resolution); the resulting graph contains 156,275 entities and 2,971,384 triples, accuracy ~90%.
- **LLMs4OL 2025** (Babaei Giglou et al., eds., ISWC 2025) ran four ontology-learning tasks (Text2Onto, Term Typing, Taxonomy Discovery, Non-Taxonomic Relation Extraction); the consistent winning pattern was hybrid pipelines integrating commercial LLMs with domain-tuned embeddings, with specialised domain models winning inside biomedical and technical datasets.

The TCGC v0.2 task type `schema-extension-induction` measures the rate at which a system, presented with a novel domain corpus, can induce extension subclasses that (i) inherit correctly from kernel primitives, (ii) preserve kernel invariants, and (iii) maintain provenance binding. This is the explicit benchmark surface for the kernel-with-extensions design.

---

## 5. Ontology-Augmented Generation

Ontology-Augmented Generation (OAG) is a generation pattern in which a language model produces output that satisfies four properties simultaneously.

**Typed.** Every produced object is an instance of a kernel primitive or a validated extension subclass. The output is not free prose; it is a set of typed graph operations (`CREATE Actor`, `CREATE Claim`, `CREATE Edge`, `UPDATE Status`, `INVALIDATE Edge`) that the validator can check before commit.

**Grounded.** Every claim cites the source span it was extracted from. Spans are tracked at character level — `(document_id, char_start, char_end)` — not document level. The grounding pass is enforced at the validation stage: an emitted Claim without a span pointer is rejected.

**Contested.** Counter-claims, contradictions, and competing narratives are first-class objects in the graph the LLM reads from and writes to. When two retrieved spans disagree, the model is required to emit two Claim nodes plus a `CONTRADICTS` edge, not a single smoothed Claim. Validation rejects emissions that paper over contradicting source spans.

**Temporally honest.** Every claim carries `valid_time` and `transaction_time` stamps. The model cannot collapse a Monday claim and a Thursday revision into a single sentence: the validator detects two source spans with non-overlapping valid-time intervals and refuses to merge them.

### 5.1 Generation procedure

The procedure is a four-stage pipeline.

**Stage 1 — Retrieval.** Span-aware retrieval over the corpus returns ranked passages with character-level spans. Unlike RAG, retrieval is *typed*: the retriever knows which kernel primitive the downstream extraction is targeting and prioritises spans containing canonical surface forms (e.g., commitment-language verbs for `Commitment` extraction, framing-language for `Narrative` extraction).

**Stage 2 — Typed extraction.** The LLM is prompted with the kernel schema, the relevant extension subclasses for the case, and the retrieved spans. It emits a JSON Lines stream of typed graph operations, each carrying a source-span pointer. The prompt template enforces the typed output contract; output that does not parse against the schema is rejected and re-prompted.

**Stage 3 — Validation.** Each operation passes through the four-stage validator (§4.2). Operations that fail are quarantined; the analyst is shown the failure mode (which invariant was violated, which span is missing, which contradiction was averaged) and offered the choice to accept, revise, or reject.

**Stage 4 — Commit.** Accepted operations are committed to the graph with bi-temporal stamps. Commits are append-only at the transaction-time layer; invalidations are recorded as new transaction-time edges, not as deletions.
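The hand-off between Stage 2 and Stage 3 can be illustrated with a small parser for the JSON Lines stream. The operation names come from §5; the JSON keys `op` and `source_span` and the function `parse_operations` are assumptions for the sketch, since the wire format is not specified here.

```python
import json
from typing import Dict, List, Tuple

ALLOWED_OPS = {"CREATE Actor", "CREATE Claim", "CREATE Edge",
               "UPDATE Status", "INVALIDATE Edge"}
SPAN_KEYS = {"document_id", "char_start", "char_end"}


def parse_operations(jsonl: str) -> Tuple[List[Dict], List[Tuple[int, str]]]:
    """Split a JSON Lines stream into accepted operations and (line_no, reason) rejections."""
    accepted: List[Dict] = []
    rejected: List[Tuple[int, str]] = []
    for line_no, line in enumerate(jsonl.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines carry no operation
        try:
            op = json.loads(line)
        except json.JSONDecodeError:
            rejected.append((line_no, "not valid JSON"))
            continue
        if not isinstance(op, dict) or op.get("op") not in ALLOWED_OPS:
            rejected.append((line_no, "unknown operation"))
            continue
        span = op.get("source_span")
        if not (isinstance(span, dict) and SPAN_KEYS.issubset(span)):
            rejected.append((line_no, "missing source span"))
            continue
        accepted.append(op)
    return accepted, rejected
```

The rejection reasons are what a re-prompt or an analyst review would be given; accepted operations would then go on to the §4.2 validator before commit.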

### 5.2 Comparison with related patterns

| Pattern | Schema | Provenance | Contestation | Temporality |
|---|---|---|---|---|
| **RAG** (Lewis et al., 2020) | None | Document-level | Averaged | Flattened |
| **GraphRAG** (Edge et al., 2024, arXiv:2404.16130) | Auto-induced, flat | Entity-level | Averaged | Mostly flattened |
| **OG-RAG** (Sharma et al., EMNLP 2025) | Domain ontology, fixed | Entity-level | Not modelled | Not modelled |
| **Think-on-Graph** (Sun et al., ICLR 2024) | Domain KG, fixed | Path-level | Not modelled | Not modelled |
| **Agentic deep graph reasoning** (Buehler 2025) | Schema-free, self-organising | Path-level | Not modelled | Not modelled |
| **OAG (TACITUS)** | Kernel + dynamic extensions | Span-level | First-class objects | Bi-temporal |

Each row is the right architecture for a different question. RAG is optimal for "find me relevant text." GraphRAG is optimal for query-focused summarisation over corpora too large to read. OG-RAG is optimal when a stable domain ontology exists and retrieval needs grounding against it. Think-on-Graph is optimal when reasoning structure dominates corpus structure. Buehler's agentic deep graph reasoning is optimal for open-ended discovery where the categories themselves are part of what is being discovered. OAG is optimal for typed institutional analysis with contested provenance and bi-temporal honesty — the question form that policy desks, mediation teams, regulatory bodies, and political-affairs offices ask by default.

---

## 6. Why LLM-only architectures fail at this task

The five failure modes specified in §1 map onto a body of measured results across 2023–2026.

**Temporal flattening (F1).** Chen et al. (2023, arXiv:2306.08952) benchmarked LLM temporal reasoning across several categories (event-event ordering, event-time relations, frequency, duration); the consistent finding is that performance degrades sharply when explicit dates are absent and order must be inferred from discourse cues. The TCGC `event-ordering` task is constructed against this finding: dates are stripped from ~50% of items and systems are scored by Kendall's τ against the gold chronological order. The Zep/Graphiti and MedKGent results show that temporal honesty is achievable, but only when the bi-temporal structure is enforced *outside* the LLM by the surrounding graph, with the LLM constrained to extraction.
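For readers unfamiliar with the metric, Kendall's τ over two orderings of the same events is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. The sketch below computes the tie-free variant (τ-a) in plain Python. It illustrates the metric only and is not the TCGC scorer, whose handling of ties and missing events is not described here.

```python
from itertools import combinations
from typing import List


def kendall_tau(gold: List[str], predicted: List[str]) -> float:
    """Kendall's tau-a between two orderings of the same distinct event ids."""
    n = len(gold)
    if n < 2:
        raise ValueError("need at least two events")
    if len(predicted) != n or len(set(gold)) != n or set(gold) != set(predicted):
        raise ValueError("orderings must contain the same distinct events")
    rank = {event: i for i, event in enumerate(predicted)}
    concordant = discordant = 0
    # combinations() yields pairs (a, b) in which a precedes b in the gold order.
    for a, b in combinations(gold, 2):
        if rank[a] < rank[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A perfectly ordered prediction scores 1.0, a fully reversed one scores -1.0, and a single adjacent swap in a three-event chain scores 1/3.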

**Causal collapse (F2).** Kıcıman et al. (2023, arXiv:2305.00050) documented that multi-hop causal inferences of the form *A caused B via mechanism M under condition C* exceed what attention reliably constructs at the scale required for policy analysis. The TCGC `causal-chain` task isolates this capability and reports correlation-vs-causation precision separately from extraction. In internal runs, strong frontier models (Claude Opus 4.7, GPT-5, Gemini 3, Llama 4) produce confident causal narratives that, on careful inspection, are dressed-up correlations.

**Provenance absence (F3).** Ji et al. (2023, ACM Computing Surveys 55(12)) catalogued the breadth of the hallucination problem. RAG reduces the rate but operates at document granularity. OG-RAG (Sharma et al., 2025) demonstrates that ontology-grounded retrieval delivers 30% faster attribution, but the underlying problem remains: generation is still performed by the same transformer, and span-to-output binding breaks at the generation boundary. OAG's span-level grounding is enforced at the validation stage rather than learned by the model.

**Contradiction averaging (F4).** WikiContradict (Hou et al., NeurIPS 2024) measured 253 human-annotated cases of equally trustworthy contradictory passages from Wikipedia; all evaluated frontier models tended to produce answers that did not accurately reflect the conflict, especially for implicit conflicts that required reasoning. ConflictBank (Su et al., NeurIPS 2024, arXiv:2408.12076) scaled the analysis to 7.45M claim-evidence pairs and 553,117 QA pairs across misinformation, temporal, and semantic conflict types. The consistent result across Llama, Qwen, Mistral, and GPT families is that knowledge conflicts represent a structural source of hallucination not solvable by larger models.

**Context decay (F5).** Letta's Context-Bench (2025) measured strong 2025–2026 frontier models on contamination-proof, multi-step agentic tasks; the ceiling is ~74% with substantial degradation as conversation length grows and self-generated context accumulates. Anthropic's "Effective context engineering for AI agents" (September 2025) reframes the discipline around exactly this finding: "context is a critical but finite resource." Anthropic's follow-up on "effective harnesses for long-running agents" (November 2025) extends to multi-session work and describes the failure mode in operational terms: "imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift." The architectural response is to put the state outside the model.

The convergent conclusion across the 2024–2026 literature is that LLM-only architectures are excellent at pattern recognition over unstructured text and unreliable at maintaining typed structure, temporal honesty, causal precision, span-level provenance, contradiction handling, and long-horizon context. The right response is not a larger model; it is a typed graph the model reads from and writes to under validation. That is OAG.

---

## 7. The DIALECTICA engine: seven-graph architecture

DIALECTICA implements OAG as a neurosymbolic Option-2 architecture (Garcez & Lamb, 2023, *Artificial Intelligence Review* 56(11), 12387–12406): a neural system (LLM extraction) interacting with a separate symbolic reasoning system (typed graph plus validator). The two components cooperate by construction rather than by patching. The neural side performs the task LLMs are reliable at (pattern recognition over noisy text); the symbolic side performs the task LLMs are not reliable at (maintaining typed structure that does not drift under self-generated context).

The engine partitions state across seven logical graphs, each with its own update discipline.

| Graph | Content | Update discipline | Mutability |
|---|---|---|---|
| Base | Raw ingested facts as extracted | Append-only at ingest | Immutable |
| Evidence | Source spans, citations, dataset references | Immutable on write; cryptographic-style binding to Base | Immutable |
| Analytical | Typed primitive instances and edges (the working ACO subgraph) | Bi-temporal; revisions logged as new edges with new `t_tx`; invalidation, not deletion | Append-only at `t_tx` |
| Narrative | Framings, drift chains, competing stories | Drift-tolerant by design; all framings preserved | Append-only |
| Commitment | Bi-temporal commitment register with status transitions | StatusTransition records; revision chains preserved | Append-only at `t_tx` |
| Case | Practitioner-facing subgraph for a particular dossier | Forkable per case; case-scoped extensions | Mutable within case |
| Scenario | Counterfactual / prospective subgraphs for options analysis | Forked from Case; isolated; never merged back without explicit promotion | Mutable within scenario |

### 7.1 Update discipline

The partition is the engineering thesis. Each graph has a different mutability contract because each serves a different query workload.

Base and Evidence are immutable because they constitute the audit trail. Any operation that would mutate Base or Evidence is rejected at the validator; corrections to ingested facts are written as new Analytical edges with a `CORRECTS` relation and explicit transaction-time stamps. The Base/Evidence layer is what makes the system auditable by external reviewers: a partner institution can verify that any Analytical claim traces back to an immutable source span.
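
A minimal sketch of this contract, assuming an in-memory store: a second write to an existing Base fact is rejected, and a correction is recorded as an Analytical edge carrying the `CORRECTS` relation and a transaction time. The names `Store`, `AnalyticalEdge`, `ImmutableLayerError`, and the id formats are hypothetical and are not taken from the DIALECTICA codebase.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


class ImmutableLayerError(Exception):
    """Raised when a write would mutate the Base layer (hypothetical name)."""


@dataclass(frozen=True)
class AnalyticalEdge:
    src: str        # id of the corrected Analytical node (illustrative id format)
    relation: str   # "CORRECTS" for corrections to ingested facts
    dst: str        # id of the Base fact being corrected
    t_tx: datetime  # explicit transaction-time stamp


class Store:
    def __init__(self) -> None:
        self._base: dict[str, str] = {}
        self.analytical: list[AnalyticalEdge] = []

    def ingest(self, fact_id: str, text: str) -> None:
        # Append-only: a fact id is written once and never overwritten.
        if fact_id in self._base:
            raise ImmutableLayerError(f"Base fact {fact_id!r} already exists")
        self._base[fact_id] = text

    def read_base(self, fact_id: str) -> str:
        return self._base[fact_id]

    def correct(self, fact_id: str, corrected_node_id: str, t_tx: datetime) -> AnalyticalEdge:
        # The Base record stays untouched; the correction lives in Analytical.
        if fact_id not in self._base:
            raise KeyError(fact_id)
        edge = AnalyticalEdge(corrected_node_id, "CORRECTS", fact_id, t_tx)
        self.analytical.append(edge)
        return edge


store = Store()
store.ingest("base:1", "Meeting held on 3 March")
store.correct("base:1", "analytical:1", datetime(2026, 3, 9, tzinfo=timezone.utc))
```

After the correction, `store.read_base("base:1")` still returns the text as ingested, which is the property an external reviewer would check.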

Analytical is append-only at the transaction-time layer. Revisions are not destructive; an analyst who corrects a typed instance generates a new edge with a new `t_tx_start`, and the prior edge has its `t_tx_end` set to the current transaction time. Both versions remain queryable. This is what supports retrospective analysis (*what did the desk officer believe on day T about the situation on day T'?*).
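
The revision rule can be sketched as follows, assuming half-open intervals and an in-memory edge list. The class `AnalyticalGraph`, the method names, and the `HOLDS_POSITION` relation are illustrative assumptions; only the `t_tx_start` / `t_tx_end` behaviour comes from the description above.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Edge:
    edge_id: str
    src: str
    relation: str
    dst: str
    t_valid_start: datetime
    t_valid_end: Optional[datetime]
    t_tx_start: datetime
    t_tx_end: Optional[datetime] = None  # None means "still the current belief"


class AnalyticalGraph:
    def __init__(self) -> None:
        self._edges: list[Edge] = []

    def assert_edge(self, edge: Edge) -> None:
        self._edges.append(edge)

    def revise(self, edge_id: str, now: datetime, **changes) -> Edge:
        """Close the current version at `now` and append the revised version."""
        for i, e in enumerate(self._edges):
            if e.edge_id == edge_id and e.t_tx_end is None:
                # Closing t_tx_end is the only change made to the prior version.
                self._edges[i] = replace(e, t_tx_end=now)
                revised = replace(e, t_tx_start=now, t_tx_end=None, **changes)
                self._edges.append(revised)
                return revised
        raise KeyError(edge_id)

    def as_of(self, t_tx: datetime, t_valid: datetime) -> list[Edge]:
        """What was believed at t_tx about the state of the world at t_valid."""
        return [
            e for e in self._edges
            if e.t_tx_start <= t_tx and (e.t_tx_end is None or t_tx < e.t_tx_end)
            and e.t_valid_start <= t_valid and (e.t_valid_end is None or t_valid < e.t_valid_end)
        ]


def day(n: int) -> datetime:
    return datetime(2026, 1, n)


g = AnalyticalGraph()
g.assert_edge(Edge("e1", "actor:X", "HOLDS_POSITION", "claim:A", day(1), None, day(2)))
g.revise("e1", now=day(10), dst="claim:B")

believed_on_day_5 = g.as_of(t_tx=day(5), t_valid=day(3))    # the pre-revision edge
believed_on_day_12 = g.as_of(t_tx=day(12), t_valid=day(3))  # the revised edge
```

Both versions remain in the edge list, so the retrospective query is a filter on two intervals rather than a reconstruction from logs.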

Narrative is drift-tolerant. All framings of the same target by all framers are preserved; the `drift_chain` orders them by valid-time. Narrative is the graph most adjacent to natural language; it is also the graph where typed structure is hardest to enforce, which is why the `narrative-drift` TCGC task scores drift-detection precision and recall separately rather than collapsing to a single metric.
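
As an illustration only, the sketch below orders framings of one target by valid-time and scores predicted drift points with precision and recall kept apart. The `Framing` fields, the representation of drift points as chain indices, and the convention of returning 0.0 for an empty denominator are assumptions, not the TCGC scoring specification.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Framing:
    framer: str
    target: str
    text: str
    t_valid: date


def drift_chain(framings: list[Framing], target: str) -> list[Framing]:
    """All framings of one target, by every framer, ordered by valid-time."""
    return sorted((f for f in framings if f.target == target), key=lambda f: f.t_valid)


def precision_recall(predicted: set[int], gold: set[int]) -> tuple[float, float]:
    """Score predicted drift points (indices into the chain) against gold ones."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


framings = [
    Framing("union", "policy:P", "a safety measure", date(2026, 3, 1)),
    Framing("ministry", "policy:P", "a cost measure", date(2026, 1, 10)),
    Framing("ministry", "policy:Q", "unrelated", date(2026, 2, 1)),
]
chain = drift_chain(framings, "policy:P")

# A system that flags one drift point correctly but misses another
# is precise without being complete.
precision, recall = precision_recall(predicted={2}, gold={2, 4})
```

Reporting the two numbers separately keeps a cautious system (high precision, low recall) distinguishable from a noisy one.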

Commitment is the graph where bi-temporal load is highest. StatusTransition records are the unit of mutation; the kernel invariant that an `Active → Fulfilled` transition requires intervening evidence is enforced at this layer. Commitment is the graph practitioners interact with most directly in PRAXIS, because the commitment register is the dossier object that brief and re-brief operations are built around.
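
A sketch of how such an invariant might be checked on write, assuming an append-only transition log. The `Proposed` status, the `CommitmentRegister` class, and the field names are hypothetical; the only rule taken from the text is that `Active → Fulfilled` must cite evidence.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


class InvariantViolation(ValueError):
    """Raised when a transition breaks a kernel invariant (hypothetical name)."""


@dataclass(frozen=True)
class StatusTransition:
    commitment_id: str
    from_status: str
    to_status: str
    t_tx: datetime
    evidence_span_ids: tuple[str, ...] = ()


# Transitions that must cite at least one evidence span.
EVIDENCE_REQUIRED = {("Active", "Fulfilled")}


class CommitmentRegister:
    def __init__(self) -> None:
        self._log: list[StatusTransition] = []  # append-only revision chain

    def current_status(self, commitment_id: str) -> Optional[str]:
        for t in reversed(self._log):
            if t.commitment_id == commitment_id:
                return t.to_status
        return None

    def record(self, t: StatusTransition) -> None:
        current = self.current_status(t.commitment_id)
        if current is not None and current != t.from_status:
            raise InvariantViolation(f"expected from_status {current!r}, got {t.from_status!r}")
        if (t.from_status, t.to_status) in EVIDENCE_REQUIRED and not t.evidence_span_ids:
            raise InvariantViolation(f"{t.from_status} -> {t.to_status} requires evidence")
        self._log.append(t)


register = CommitmentRegister()
register.record(StatusTransition("c1", "Proposed", "Active", datetime(2026, 2, 1)))

rejected = None
try:
    # No evidence cited, so the validator refuses the transition.
    register.record(StatusTransition("c1", "Active", "Fulfilled", datetime(2026, 3, 1)))
except InvariantViolation as err:
    rejected = str(err)

register.record(
    StatusTransition("c1", "Active", "Fulfilled", datetime(2026, 3, 2), ("span:17",))
)
```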

Case is forkable per dossier. A Case is a named subgraph of the Analytical, Narrative, Commitment, and Scenario layers, scoped to a particular file. Extensions defined within a Case remain scoped to that Case and can graduate to the kernel only through explicit review.

Scenario is forked from Case and never merged back without explicit promotion. Scenarios support counterfactual queries (*what would the constraint landscape look like if Actor X withdrew the commitment?*) and prospective queries (*what is the predicted reception of policy option P among coalitions C1, C2, C3?*). The Wind Tunnel reception-modelling component reads from and writes to Scenario; CONCORDIA writes Case-scoped live-deliberation graphs that promote to Analytical on session close.
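
The fork-and-promote discipline can be sketched with plain edge sets. Everything here is an assumption made for illustration: the `Subgraph` class, the `promote` function and its `approved_by` argument, and the `COMMITS_TO` / `WITHDRAWS` relation names.

```python
from typing import Iterable, Optional

Edge = tuple[str, str, str]  # (source, relation, destination)


class Subgraph:
    def __init__(self, name: str, edges: Iterable[Edge] = (),
                 parent: Optional["Subgraph"] = None) -> None:
        self.name = name
        self.edges: set[Edge] = set(edges)
        self.parent = parent

    def fork(self, name: str) -> "Subgraph":
        # The fork gets its own copy of the edge set, so edits stay isolated.
        return Subgraph(name, self.edges, parent=self)


def promote(scenario: Subgraph, edges: Iterable[Edge], approved_by: str) -> None:
    """Copy selected scenario edges into the parent, only with a named approver."""
    selected = set(edges)
    if scenario.parent is None:
        raise ValueError("only forked subgraphs can be promoted")
    if not approved_by:
        raise PermissionError("promotion requires an explicit approver")
    if not selected <= scenario.edges:
        raise ValueError("cannot promote edges that are not in the scenario")
    scenario.parent.edges |= selected


commits: Edge = ("actor:X", "COMMITS_TO", "commitment:1")
withdraws: Edge = ("actor:X", "WITHDRAWS", "commitment:1")

case = Subgraph("case:dossier-7", {commits})
scenario = case.fork("scenario:x-withdraws")

# Counterfactual edit: Actor X withdraws the commitment.
scenario.edges.discard(commits)
scenario.edges.add(withdraws)
case_unchanged = case.edges == {commits}

promote(scenario, {withdraws}, approved_by="reviewer:1")
```

Until `promote` is called, nothing done inside the scenario is visible from the Case.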

### 7.2 Reasoning layers

Above the seven graphs, DIALECTICA exposes four reasoning layers used by client applications:

- **GND (Grounded)** — point-in-time queries against Base+Evidence: "what does the source corpus say, character-exact?"
- **CTX (Context)** — typed queries against Analytical+Narrative+Commitment: "what is the structure of the case?"
- **EVD (Evidence)** — provenance traversal: "for this claim, what is the chain of source spans, derivations, and revisions?"
- **RZN (Reasoning)** — open-question layer where the LLM is brought back in, constrained by the typed graph as a hard scaffold rather than a soft retrieval source.

The RZN layer is where the cooperation between symbolic and neural components is most visible. The LLM is given the typed subgraph, the question, the source spans, and the kernel schema; it is required to emit a reasoned answer plus a graph-trace showing which typed nodes and edges supported each step. Answers without traces are rejected; traces that reference nodes that do not exist are rejected; traces that reference nodes whose `t_tx` is later than the question's reference time are rejected. The discipline is what allows the engine to "explain why" rather than "guess in confident prose."
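
The three rejection rules can be expressed as a small validator. This is a sketch under assumptions: the `Node` and `Answer` shapes and the function name `validate_answer` are hypothetical, and a real trace would reference edges as well as nodes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Node:
    node_id: str
    t_tx: datetime  # transaction time at which the node was recorded


@dataclass(frozen=True)
class Answer:
    text: str
    trace: tuple[str, ...]  # ids of the typed nodes that support the answer


def validate_answer(answer: Answer, graph: dict[str, Node],
                    reference_time: datetime) -> list[str]:
    """Return the reasons for rejection; an empty list means the answer is accepted."""
    errors: list[str] = []
    if not answer.trace:
        errors.append("answer has no graph-trace")
    for node_id in answer.trace:
        node = graph.get(node_id)
        if node is None:
            errors.append(f"trace references unknown node {node_id}")
        elif node.t_tx > reference_time:
            errors.append(f"node {node_id} was recorded after the reference time")
    return errors


graph = {
    "n1": Node("n1", datetime(2026, 1, 5)),
    "n2": Node("n2", datetime(2026, 2, 1)),
}
reference_time = datetime(2026, 1, 15)

accepted = validate_answer(Answer("X held position A.", ("n1",)), graph, reference_time)
no_trace = validate_answer(Answer("X held position A.", ()), graph, reference_time)
unknown = validate_answer(Answer("X held position A.", ("n9",)), graph, reference_time)
too_late = validate_answer(Answer("X held position A.", ("n2",)), graph, reference_time)
```

Only the first call returns an empty list; each of the other three returns one rejection reason.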

---

## 8. Specialisations

The kernel and the OAG pattern are the architecture. Specialisations are per-domain extension stacks that ride on top.

### 8.1 Conflict reasoning (TCGC)

Conflict reasoning is the first specialisation because it stresses every property of the data model simultaneously: time, causality, provenance, commitment tracking, interest/position separation, narrative drift, cross-actor contradiction. The TACITUS Conflict Grammar Corpus (TCGC) is the public academic benchmark.

**v0.1**: 14 task types — `actor-resolution`, `claim-extraction`, `interest-extraction`, `constraint-extraction`, `leverage-mapping`, `commitment-tracking`, `event-ordering`, `narrative-drift`, `causal-chain`, `contradiction-detection`, `provenance-attribution`, `commitment-claim-mismatch`, `position-interest-separation`, `cross-document-synthesis` — across seven domains (workplace, commercial, governance, peace process, policy, family, diplomatic). Public v0.1 sample: 5 items with JSON schema and submission process. Per-metric, per-task-type, per-domain reporting — no single headline score.
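
To illustrate what reporting without a headline score could look like, the sketch below groups item-level results by metric, task type, and domain and never collapses across cells. The record layout, the metric names, the use of a per-cell mean, and the values are all placeholders, not TCGC results or the official harness.

```python
from collections import defaultdict

Cell = tuple[str, str, str]  # (metric, task_type, domain)


def report(results: list[dict]) -> dict[Cell, float]:
    """Mean value per (metric, task_type, domain) cell; no cross-cell average."""
    cells: dict[Cell, list[float]] = defaultdict(list)
    for r in results:
        cells[(r["metric"], r["task_type"], r["domain"])].append(r["value"])
    return {cell: sum(values) / len(values) for cell, values in cells.items()}


# Placeholder values for illustration only.
results = [
    {"task_type": "narrative-drift", "domain": "workplace", "metric": "precision", "value": 1.0},
    {"task_type": "narrative-drift", "domain": "workplace", "metric": "precision", "value": 0.0},
    {"task_type": "event-ordering", "domain": "policy", "metric": "accuracy", "value": 1.0},
]
table = report(results)
```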

**v0.2**: adds 3 task types specifically targeting the dynamic-ontology behaviour — `schema-extension-induction`, `kernel-invariant-validation`, `cross-domain-primitive-transfer` — and a target corpus size of 480+ items. The v0.2 design lets the benchmark measure not only how well a system extracts against a fixed ontology but how well it induces, validates, and transfers extensions.

Evaluation harness: HELM- and lm-eval-harness-compatible via a thin adapter, shipping alongside the first public split. Frontier model coverage: Claude Opus 4.7 / Sonnet 4.6 / Haiku 4.5, GPT-5 / GPT-4.1, Gemini 3 / 2.5 Pro, Llama 4, Grok 3, Mistral Large 3, DeepSeek R2. Baselines: RAG (vanilla), GraphRAG (Edge et al., Microsoft Research), OG-RAG (Sharma et al., EMNLP 2025), and a strong hybrid neurosymbolic baseline.

Reference comparison: generic 2026 policy comprehension benchmarks (RAND, PolicyBench) report 48–54% accuracy. The TCGC measures the harder layer underneath those benchmarks.

### 8.2 Policy options analysis

Same kernel; extensions for `PolicyOption`, `RegulatoryConstraint`, `JurisdictionalLeverage`, `StakeholderResponse`. The architectural constraint: an options memo produced through OAG should be traceable, position-by-position, back to the source spans that produced each Interest and each Constraint, and bi-temporally accurate about which positions were held by which actors at which moments. Specialisation benchmark in design.

### 8.3 Mediation and ADR

Extensions for `HR-Commitment`, `AgreementClause`, `FacilitationEvent`, `RoomNarrative`. Surfaces in CONCORDIA: the system listens during a live session, structures the transcript against the ACO, surfaces commitments and contradictions in near-real-time, and binds every utterance to a primitive. The CONCORDIA boundary is enforced architecturally: the system writes to Narrative and Commitment, surfaces structure, and does not write to a hypothetical `Adjudication` primitive (which does not exist in the kernel and will not).

### 8.4 Regulatory contestation

Extensions for `RegulatoryAct`, `StakeholderResponse`, `JudicialReviewEvent`, `ConsultationWindow`. This is where the bi-temporal grounding earns its keep most visibly: consultation periods open and close on specific dates, commitments made during them have legal weight, and the difference between "we will redraft" (active commitment, valid time T) and "the redraft is exploratory" (denial of scope, transaction time T+3) is the difference between a grounded challenge and a speculative one.

### Appendix · Products as engine surfaces

PRAXIS is the analyst workbench (read/write surface over Case, Analytical, Commitment, Narrative). DIALECTICA is the engine described in §7. Wind Tunnel writes Scenario subgraphs for reception modelling. CONCORDIA writes Case-scoped live-deliberation graphs. ARGUS is the document-intelligence ingestion surface (writes Base and Evidence). All products read and write the same graph schema; their boundaries are surface-level, not engine-level.

---

## 9. Risks

**Misuse for adversarial analysis.** The same engine that structures a workplace dispute structures a corporate leverage map. The same engine that assembles an options memo assembles an opposition file. Architectural mitigation is limited: kernel and pipeline are deliberately neutral; deployment governance is where the risk is managed. Reference engine runs on-premises or in partner-controlled tenants; we publish methodology, not cases; partner agreements specify reviewable use cases.

**Schema drift.** Per-case extensions can drift. An `HR-Commitment` and a `RegulatoryCommitment` should remain distinct subclasses; if validation lets them blur, cross-domain transfer breaks. Mitigation: kernel invariants are enforced at the validator (§4.2), and the `kernel-invariant-validation` TCGC task explicitly measures drift rates. The question of when an extension graduates to the kernel (TCGC open research question Q8) remains unsettled and is under public review.

**Provenance forgery.** Span-level provenance is only as good as the binding mechanism. If an upstream extraction agent fabricates a span pointer that does not correspond to actual text, the validator catches the syntactic violation but not the semantic one. Mitigation: cryptographic-style binding between Base and Evidence layers (content hashes on ingest; commits to Evidence are accompanied by hash receipts; downstream Claim and Commitment spans are verified against the receipt chain at validation time). This is implementation work in progress.
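
A minimal sketch of hash-receipt checking, assuming SHA-256 over the UTF-8 text and Python string offsets as the character positions. The `Receipt` and `Span` shapes and the function names are hypothetical, and the check only covers the case where a span pointer does not match the ingested text; it says nothing about whether the span supports the claim attached to it.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    doc_id: str
    sha256: str  # content hash taken at ingest


@dataclass(frozen=True)
class Span:
    doc_id: str
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    quoted: str  # text the extraction agent says it found at [start, end)


def ingest(doc_id: str, text: str) -> Receipt:
    return Receipt(doc_id, hashlib.sha256(text.encode("utf-8")).hexdigest())


def verify_span(span: Span, text: str, receipt: Receipt) -> bool:
    """True only if the text matches the receipt and the span matches the text."""
    if span.doc_id != receipt.doc_id:
        return False
    if hashlib.sha256(text.encode("utf-8")).hexdigest() != receipt.sha256:
        return False  # the document changed after ingest
    if not (0 <= span.start <= span.end <= len(text)):
        return False  # pointer falls outside the document
    return text[span.start:span.end] == span.quoted


document = "We will redraft the clause by March."
receipt = ingest("doc:1", document)

genuine = Span("doc:1", 0, 15, "We will redraft")
fabricated = Span("doc:1", 0, 15, "We will withdraw")

genuine_ok = verify_span(genuine, document, receipt)
fabricated_ok = verify_span(fabricated, document, receipt)
tampered_ok = verify_span(genuine, document + " Maybe.", receipt)
```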

**Anchoring and over-trust.** A typed graph with span-level citations *looks* authoritative. The PRAXIS surfaces are built around inspect-edit-export-trace flows precisely because the architecture is contestable by design but practitioners may default to treating the output as a verdict. Mitigation is partly UI (every output is editable; every claim shows its source span on hover; every status is shown with its revision chain) and partly training.

**Training-data leakage.** Policy and political data is sensitive. Mitigation: reference engine never trains on partner-provided graphs; partner deployments run on-premises or in partner-controlled tenants; TCGC corpus construction anonymises real cases and uses synthetic scaffolding for the dynamic-ontology splits; public v0.1 sample is deliberately small (5 items).

**Surveillance adjacency.** A platform that structures who said what to whom with bi-temporal stamps is architecturally adjacent to a surveillance platform. Mitigation is both architectural and policy-based. The `Leverage` primitive captures asymmetric power because asymmetric power is the substance of conflict, not so that its holders can optimise it. The audit trail is designed to be inspected by the parties to a dispute, not weaponised against them.

---

## 10. Open source and benchmark commitments

1. **Kernel public and forkable.** The Agentic Conflict Ontology — eight primitives, ~50 relation types, ~30 kernel invariants, extension specification format — lives in the `tacitus-ontology` GitHub repository, MIT-licensed.

2. **Pipeline MIT-licensed.** The TACITUS Knowledge Pipeline (ingestion, extraction, typed validation, span binding, bi-temporal commit) and the DIALECTICA A2 engine reference implementation are open source.

3. **Extension log reviewable.** Every per-case subclass induced through the pipeline is logged with provenance (proposer, source case, validated against which invariants, accepted or rejected). Extensions proposing to graduate to the kernel pass through public review.

4. **Benchmark shared.** TCGC v0.1 public sample (5 items) plus JSON schema; full corpus available under a light data-use agreement for academic researchers and pilot partners; HELM and lm-eval-harness adapters ship with the first public split. Dataset paper targeted for Q4 2026; OAG methodology paper with reference implementation targeted for ACL Q1 2027.

---

## 11. References

### Foundations

1. Fisher, R., Ury, W., Patton, B. (1981). *Getting to Yes*. Houghton Mifflin.
2. Glasl, F. (1999). *Confronting Conflict*. Hawthorn Press.
3. Galtung, J. (1990). Cultural Violence. *Journal of Peace Research*, 27(3), 291–305.
4. Axelrod, R. (1984). *The Evolution of Cooperation*. Basic Books.
5. Sunstein, C. R. (1994). Incommensurability and Valuation in Law. *Michigan Law Review*, 92(4), 779–861.
6. Kingdon, J. W. (2003 [1984]). *Agendas, Alternatives and Public Policies* (2nd ed.). Longman.
7. Sabatier, P. A., Weible, C. M., eds. (2014). *Theories of the Policy Process* (3rd ed.). Westview.
8. Baumgartner, F. R., Jones, B. D. (1993). *Agendas and Instability in American Politics*. University of Chicago Press.
9. Hogan, A. et al. (2021). Knowledge Graphs. *ACM Computing Surveys*, 54(4).
10. Garcez, A. d'A., Lamb, L. C. (2023). Neurosymbolic AI: The 3rd Wave. *Artificial Intelligence Review*, 56(11), 12387–12406.

### LLM failure-mode literature

11. Chen, W. et al. (2023). Benchmarking LLMs on Temporal Reasoning. arXiv:2306.08952.
12. Kıcıman, E. et al. (2023). Causal Reasoning and Large Language Models. arXiv:2305.00050.
13. Ji, Z. et al. (2023). Survey of Hallucination in Natural Language Generation. *ACM Computing Surveys*, 55(12).
14. Hou, Y. et al. (2024). WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts. NeurIPS 2024 D&B Track. arXiv:2406.13805.
15. Su, Z. et al. (2024). ConflictBank: A Benchmark for Evaluating the Influence of Knowledge Conflicts in LLMs. NeurIPS 2024 D&B Track. arXiv:2408.12076.

### Retrieval, ontology, and generation patterns

16. Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020.
17. Edge, D. et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. arXiv:2404.16130.
18. Sharma, K., Kumar, P., Li, Y. (2025). OG-RAG: Ontology-Grounded RAG for Large Language Models. EMNLP 2025. ACL Anthology 2025.emnlp-main.1674. arXiv:2412.15235.
19. Sun, J. et al. (2024). Think-on-Graph. ICLR 2024.
20. Buehler, M. J. (2025). Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks. arXiv:2502.13025; *J. Mater. Res.* 40, 2204.
21. Babaei Giglou, H., D'Souza, J., Mihindukulasooriya, N., Auer, S., eds. (2025). LLMs4OL 2025: The 2nd Large Language Models for Ontology Learning Challenge at the 24th ISWC. Open Conference Proceedings 6.
22. Beliaeva, A., Rahmatullaev, T. (2025). Heterogeneous LLM Methods for Ontology Learning. LLMs4OL 2025. arXiv:2508.19428.
23. Bian, H. (2025). LLM-Empowered Knowledge Graph Construction: A Survey. arXiv:2510.20345.

### Temporal knowledge graphs and agentic memory

24. Zhang, D. et al. (2025). MedKGent: A Large Language Model Agent Framework for Constructing Temporally Evolving Medical Knowledge Graphs. arXiv:2508.12393.
25. Rasmussen, P. et al. (2025). Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956.
26. Neo4j / Zep (2025). Graphiti: Build Real-Time Knowledge Graphs for AI Agents.
27. Anthropic (2025). Effective Context Engineering for AI Agents. Anthropic engineering blog, 29 September 2025.
28. Anthropic (2025). Effective Harnesses for Long-Running Agents. Anthropic engineering blog, 26 November 2025.
29. Letta (2025). Context-Bench: Benchmarking LLMs on Agentic Context Engineering.

### Ontology engineering

30. Héja, G., Surján, G., Varga, P. (2008). Ontological Analysis of SNOMED CT. *BMC Medical Informatics and Decision Making*, 8(S1), S8.
31. Schulz, S. et al. (2023). SNOMED CT and Basic Formal Ontology — Convergence or Contradiction Between Standards? *Applied Ontology*. doi:10.3233/AO-230018.
32. Keet, C. M., Grütter, R. (2021). Toward a Systematic Conflict Resolution Framework for Ontologies. *Journal of Biomedical Semantics*. doi:10.1186/s13326-021-00246-0.

### Institutional context

33. NIST (2023). Artificial Intelligence Risk Management Framework (AI RMF 1.0).
34. OECD (2025). Governing with Artificial Intelligence: Are governments ready? OECD Publishing.
35. ACLED. Conflict categories and codebook methodology.

---

## Cite this

```bibtex
@techreport{tacitus2026knowledgelayerv2,
  title   = {The AI Knowledge Layer for Policy and Political Work:
             Kernel Ontology, Dynamic Extensions, and
             Ontology-Augmented Generation},
  author  = {{TACITUS Research}},
  year    = {2026},
  note    = {Version 2.0},
  url     = {https://www.tacitus.me/research/vision}
}
```

[Read the kernel](https://www.tacitus.me/research/grammar) · [Read the OAG definition](https://www.tacitus.me/research/oag) · [Read the TCGC benchmark](https://www.tacitus.me/research/tcgc) · [Open the engine](https://www.tacitus.me/product/dialectica)