BENCHMARK · OPEN
The TACITUS Conflict Grammar Corpus is an open benchmark for structural conflict reasoning — grounded in the Agentic Conflict Ontology. We do not score free prose. We score the typed subgraph the system actually produces: actors, claims, interests, constraints, leverage, commitments, events, narratives, and the typed edges between them. Every primitive must cite the source span it was extracted from. Every commitment is bi-temporal. Every contradiction is a first-class edge — not a smoothed paragraph.
WHY THIS BENCHMARK EXISTS
Generic language models — even at frontier scale — fail on five measurable axes when handed a real conflict, mediation, or policy file. A larger model does not fix any of them; they are properties of transformer architecture. TCGC measures the gap directly, on the same models, with and without the typed layer.
F1 · Temporal flattening
Events ordered narratively, but bi-temporal structure collapses when dates are stripped.
F2 · Causal collapse
A→B via mechanism M under condition C reduces to co-occurrence.
F3 · Provenance absence
RAG cites at document level; the span a claim came from is lost in generation.
F4 · Contradiction averaging
Disagreeing passages get smoothed into a single paraphrase.
F5 · Context decay
Self-generated context accumulates; accuracy degrades while fluency holds.
STATUS · 2026-05-13
The benchmark is wired end-to-end (schema, harness, scorers, baselines, leaderboard CI), but no model has been run against it in a reproducible way yet. We have not yet wired the model backends in baselines/ to a real API key. We have not yet run the LLM judge for the interpretive tasks. We have not yet published a system card with a scorer commit SHA. Until those four things land we cannot honestly report numbers, and we will not.
What we do have is the shape of what the gap looks like, documented in three side-by-side illustrative comparisons (Melian leverage, regulatory commitment drift, sanctions causal chain). The vanilla columns are hand-written approximations of what a competent frontier chat model would plausibly say. The typed columns are constructed from the gold subgraphs to show the target output a passing system would produce. The axes (mechanism typing, bi-temporal stamps, status transitions, contradiction edges, span pointers) are well-defined; the magnitudes are an empirical question for v0.2 (Q3 2026).
TCGC-0010 · diplomatic · AXIS · Mechanism typing on Leverage
TCGC-0016 · policy / regulatory · AXIS · Two distinct commitments preserved or merged
TCGC-0018 · policy / sanctions · AXIS · Causal claim under validator constraint
ILLUSTRATIVE · NOT MEASURED · NO MODEL API CALLED FOR THESE COMPARISONS
THE NOVEL SURFACE
The repo holds paired worked examples that run the same model on the same passage through (A) a vanilla chat prompt and (B) the PRAXIS-style typed pipeline. The diff is what the typed layer adds.
| Axis | Vanilla LLM | PRAXIS-layered |
|---|---|---|
| Schema | None — free prose response. | Typed kernel: 8 primitives + 18 edges + bi-temporal stamps. |
| Provenance | No span pointers; uncited assertions. | Every primitive cites (doc_id, char_start, char_end). Orphans fail validation. |
| Time | Narrative order; date-stripped chronology collapses. | valid-time + transaction-time on every node and edge; queryable independently. |
| Contradiction | Smoothed paraphrase ("broadly agree…"). | Two Claim nodes + CONTRADICTS edge with materiality + rationale. |
| Position / Interest | Conflated; position restated as interest. | Distinct primitives. Interest carries mandatory derivation chain. |
| Commitment | Any stated-future-action becomes "commitment". | Commitment requires status + binding strength; mere assertion stays a Claim. |
| Output shape | Paragraph the analyst must re-parse. | JSON Lines of typed graph ops the next system reads directly. |
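The provenance and output-shape rows can be made concrete with a small sketch. Only the `(doc_id, char_start, char_end)` triple, the rule that orphans fail validation, and the JSON Lines output come from the table above. The class names, the `find_orphans` and `to_jsonl` helpers, the exclusive end offset, and the `"add_node"` op name are illustrative assumptions, not the repo's schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    doc_id: str
    char_start: int
    char_end: int  # assumed to be an exclusive end offset


@dataclass(frozen=True)
class Primitive:
    id: str
    kind: str              # e.g. "Claim", "Actor", "Commitment"
    text: str
    span: Optional[Span]   # None marks a primitive with no provenance


def find_orphans(primitives, docs):
    """Return ids of primitives whose span is missing or does not resolve."""
    orphans = []
    for p in primitives:
        s = p.span
        if s is None or s.doc_id not in docs:
            orphans.append(p.id)
            continue
        if not (0 <= s.char_start < s.char_end <= len(docs[s.doc_id])):
            orphans.append(p.id)
    return orphans


def to_jsonl(primitives):
    """Serialise primitives as one hypothetical 'add_node' op per line."""
    return "\n".join(json.dumps({"op": "add_node", **asdict(p)}) for p in primitives)


docs = {"d1": "Athens demands tribute. Melos refuses."}
prims = [
    Primitive("c1", "Claim", "Athens demands tribute", Span("d1", 0, 22)),
    Primitive("c2", "Claim", "Melos refuses", Span("d1", 24, 37)),
    Primitive("c3", "Claim", "Sparta will intervene", None),  # uncited
]
orphans = find_orphans(prims, docs)
print(orphans)
print(to_jsonl(prims[:2]))
```

In this toy input the third claim has no span, so a validator built this way would reject the subgraph rather than let an uncited assertion through.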
CORPORA · PROJECT GUTENBERG
We ground the v0.2 items in public-domain texts: Thucydides, the Federalist Papers, Hobbes, Machiavelli, Sun Tzu, Caesar. The text is saturated in pre-training — so the benchmark is not a memorisation test. It is a structural-reasoning test: can the system produce the typed subgraph the source actually supports?
PUBLIC DOMAIN
Hamilton, Madison, Jay
commitments, counter-positions, constitutional negotiation
PUBLIC DOMAIN
Thucydides · Crawley tr.
leverage asymmetry, ultimatum, commitment breach
PUBLIC DOMAIN
Machiavelli · Marriott tr.
virtù, fortuna, deliberate commitment-breach
PUBLIC DOMAIN
Sun Tzu · Giles tr.
leverage typology, narrative as mechanism
TASK TYPES · 14 + 3
| Task | Metric | What it stresses |
|---|---|---|
| Actor resolution | graph_overlap | Cross-document alias clusters; role and jurisdiction attribution. |
| Claim extraction | graph_overlap | Speech-act typing under noise — Asserted / Denied / Reported / Withdrawn. |
| Interest extraction | llm_judge_anchored | Inferred interests with mandatory derivation chains (Fisher/Ury). |
| Constraint extraction | graph_overlap | Statutory / regulatory / procedural classification with binding strength. |
| Leverage mapping | graph_overlap | Mechanism typing: Coercive / Normative / Information / Network / Procedural / Resource / Reputational. |
| Commitment tracking | graph_overlap | Bi-temporal status transitions (Active → Contested → Broken / Fulfilled). |
| Event ordering | kendall_tau | Chronology reconstruction with dates stripped on ~50% of items. |
| Narrative drift | llm_judge_anchored | Reframing across time and framer; cross-corpus drift chains. |
| Causal chain | graph_overlap | Multi-hop A→B via mechanism M under condition C; PRECEDES vs CAUSES distinction enforced. |
| Contradiction detection | contradiction_pair_f1 | Material vs cosmetic disagreement; CONTRADICTS edges with rationale. |
| Provenance attribution | provenance_f1 | Span-exact source binding — orphan provenance fails validation. |
| Commitment / claim mismatch | graph_overlap | Said X, signed Y, did Z — the gap between Claim modality and Commitment status. |
| Position / interest separation | llm_judge_anchored | Distinct primitives with auditable derivation; tiered confidence. |
| Cross-document synthesis | graph_overlap | Coherent subgraph assembled from multiple, partially-contradictory sources. |
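For the commitment-tracking row, a minimal sketch of what a bi-temporal status history might look like. The status names and their order come from the table. The `StatusRecord` fields, the two helper functions, and the strict reading of the transition chain (Active only to Contested, Contested only to Broken or Fulfilled) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Strict reading of "Active -> Contested -> Broken / Fulfilled" (an assumption).
ALLOWED = {
    "Active": {"Contested"},
    "Contested": {"Broken", "Fulfilled"},
    "Broken": set(),
    "Fulfilled": set(),
}


@dataclass(frozen=True)
class StatusRecord:
    status: str
    valid_from: date    # valid time: when the status held in the world
    recorded_on: date   # transaction time: when the graph learned of it


def status_as_of(history, valid_at, known_at):
    """Status holding at `valid_at`, using only records known by `known_at`."""
    visible = [
        r for r in history
        if r.recorded_on <= known_at and r.valid_from <= valid_at
    ]
    if not visible:
        return None
    return max(visible, key=lambda r: (r.valid_from, r.recorded_on)).status


def check_transitions(history):
    """True if consecutive statuses, ordered by valid time, are all allowed."""
    ordered = sorted(history, key=lambda r: r.valid_from)
    return all(b.status in ALLOWED[a.status] for a, b in zip(ordered, ordered[1:]))


history = [
    StatusRecord("Active", date(2024, 1, 1), date(2024, 1, 5)),
    StatusRecord("Contested", date(2024, 3, 1), date(2024, 6, 1)),
    StatusRecord("Broken", date(2024, 5, 1), date(2024, 6, 10)),
]

# Same valid-time question, two different states of knowledge.
print(status_as_of(history, date(2024, 4, 1), date(2024, 4, 1)))
print(status_as_of(history, date(2024, 4, 1), date(2024, 7, 1)))
print(check_transitions(history))
```

The two queries ask about the same day in the world but differ in what had been recorded by then, which is the distinction a single timestamp cannot express.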
V0.2 · DYNAMIC-ONTOLOGY TASKS
Given a case in a new domain, induce per-case extension subclasses that inherit cleanly from kernel primitives.
Given a proposed extension, hold every parent-primitive invariant. Adversarial split includes plausible-but-invalid extensions.
Same kernel, different domain; primitives transfer while extensions specialize.
DOMAINS · 7
workplace
HR grievances, performance disputes, promotion blocks
commercial
Contract breach, vendor dispute, joint-venture friction
governance
Board disagreement, mandate conflict, committee deadlock
peace-process
Ceasefire, DDR, political-track multilateral negotiation
policy
Regulatory contestation, stakeholder reception
family
Inheritance, custody, intergenerational wealth dispute
diplomatic
State-to-state friction, border incident, multilateral block
METRICS · 5
We do not publish a single headline score. A system at 0.9 on actor resolution and 0.4 on commitment / claim mismatch is not a 0.65 system; those numbers tell a reviewer two different things about where it breaks.
| Metric | Formula | When it can mislead |
|---|---|---|
| graph_overlap | weighted(node_jaccard, edge_jaccard) with partial credit on near-miss edge types | When gold has no edges, it collapses to node Jaccard. The harness flags this in `notes`. |
| provenance_f1 | F1 over (primitive_id, source_span) pairs | A system that emits no primitives at all is uncomparable; treat empty pred as 0 with a diagnostic. |
| kendall_tau | (τ-b + 1) / 2; missing events get worst-case rank | Short sequences (n < 4) have unstable τ; ordering tasks only include items with n ≥ 5. |
| contradiction_pair_f1 | materiality-weighted F1 over unordered (claim_a, claim_b) pairs | Cosmetic flagged as material inflates precision; ground-truth materiality is set by annotators with κ reported. |
| llm_judge_anchored | LLM-judge raw → isotonic regression against 20-entry human anchor set | Position, verbosity, and self-preference bias, all disclosed in the docstring. Gated behind TCGC_RUN_API=1. |
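Three of the formulas in the table are simple enough to sketch directly. This is a simplified reading, not the repo's scorers: the equal node/edge weighting (`edge_weight=0.5`) is an assumption, partial credit for near-miss edge types is omitted, and "worst-case rank" is interpreted as tying every missing event at the last position.

```python
from itertools import combinations
from math import sqrt


def jaccard(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def graph_overlap(gold_nodes, pred_nodes, gold_edges, pred_edges, edge_weight=0.5):
    """Weighted node/edge Jaccard; falls back to node Jaccard if gold has no edges."""
    node_j = jaccard(gold_nodes, pred_nodes)
    if not gold_edges:
        return node_j
    return (1 - edge_weight) * node_j + edge_weight * jaccard(gold_edges, pred_edges)


def provenance_f1(gold_pairs, pred_pairs):
    """F1 over (primitive_id, source_span) pairs; an empty prediction scores 0."""
    gold, pred = set(gold_pairs), set(pred_pairs)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def kendall_tau_scaled(gold_order, pred_order):
    """(tau-b + 1) / 2, with events missing from pred tied at the worst rank."""
    gold_set = set(gold_order)
    kept = [e for e in pred_order if e in gold_set]   # ignore spurious events
    rank = {e: i for i, e in enumerate(kept)}
    worst = len(gold_order)
    y = [rank.get(e, worst) for e in gold_order]      # gold ranks are 0..n-1
    conc = disc = ties_y = 0
    for i, j in combinations(range(len(y)), 2):
        if y[j] > y[i]:
            conc += 1
        elif y[j] < y[i]:
            disc += 1
        else:
            ties_y += 1
    n0 = len(y) * (len(y) - 1) // 2
    denom = sqrt(n0 * (n0 - ties_y))
    tau_b = (conc - disc) / denom if denom else 0.0   # no ordering info -> 0
    return (tau_b + 1) / 2


print(graph_overlap(["a", "b"], ["a"], [("a", "b", "CAUSES")], [("a", "b", "CAUSES")]))
print(kendall_tau_scaled(list("abcde"), list("edcba")))
```

Keeping each function separate mirrors the point made above the table: the scores answer different questions and are not meant to be averaged into one number.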
EXPERIMENTS · REPEATABLE
The repo ships a concrete, provider-agnostic harness (Anthropic, OpenAI, plus an echo no-op client for CI). For every item, it runs the same model twice — once with a vanilla chat prompt, once with the typed ACO contract — and writes per-run records (prompt hash, response hash, elapsed ms, timestamp, model id) plus a side-by-side REPORT.md. All wiring is reproducible from a single Makefile target.
    # Dry run: no API calls, deterministic.
    make experiment-dry ITEMS=items/v0.2-public-domain/

    # Real run with Anthropic.
    export ANTHROPIC_API_KEY=sk-ant-...
    make experiment-anthropic ITEMS=items/v0.2-public-domain/tcgc-0010.json

    # Side-by-side report.
    make experiment-report RUN=runs/anthropic-claude-opus-4-7-20260513T180000Z
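To illustrate the per-run record described above, here is a minimal sketch built around an echo client. The five recorded fields come from the text. The key names, the `run_once` helper, and both prompt strings are hypothetical and are not taken from the repo's harness.

```python
import hashlib
import time
from datetime import datetime, timezone


def echo_client(prompt: str) -> str:
    """Stand-in for a provider call: returns the prompt unchanged."""
    return prompt


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_once(client, model_id: str, prompt: str) -> dict:
    """Call the client once and capture the fields needed to audit the run."""
    start = time.perf_counter()
    response = client(prompt)
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    return {
        "model_id": model_id,
        "prompt_sha256": sha256_hex(prompt),
        "response_sha256": sha256_hex(response),
        "elapsed_ms": elapsed_ms,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


passage = "The Melians refuse the Athenian ultimatum."
# Hypothetical prompts: one vanilla, one asking for typed output.
records = [
    run_once(echo_client, "echo", "Summarise the conflict:\n" + passage),
    run_once(echo_client, "echo", "Emit typed graph ops as JSON Lines:\n" + passage),
]
for record in records:
    print(record)
```

Hashing the prompt and response rather than storing them inline keeps the record small while still letting a reviewer confirm that two runs used identical inputs.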
QUICKSTART · 60 SECONDS
    git clone https://github.com/sargonxg/TCGC_TACITUS-Conflict-Grammar-Corpus_BENCHMARK tcgc
    cd tcgc && pip install -e '.[dev]'

    # Validate every item against the schema (structural + semantic).
    tcgc validate items/

    # Look at the canonical JSON Schema.
    tcgc schema --version v0.1 | jq .

    # Run your system against the v0.1-sample items.
    tcgc run --system mypkg.runner:predict items/v0.1-sample/ --out predictions.jsonl

    # Score and report.
    tcgc score predictions.jsonl items/v0.1-sample/ --out scores.json
    tcgc report scores.json
No GPUs required. No paid API calls unless your runner needs them. The llm_judge_anchored metric is gated behind TCGC_RUN_API=1.
SUBMIT
01 · Fork
Fork the repo. Branch from main.
02 · Run
Produce predictions.jsonl + SYSTEM_CARD.md.
03 · PR
Add files under submissions/<name>/<date>/. Open PR.
04 · Leaderboard
CI scores. Per-task-type table posted as PR comment.
Alternative path: email per-metric CSV + system card. Every leaderboard row carries the scorer commit SHA, so submissions are reproducible.
METHODOLOGY
PASS 1 · Primitive tagging
Two independent annotators identify the 8 primitives. Disagreements logged.
PASS 2 · Edge labeling
Third annotator labels edges (typed relations) on the union of primitives.
PASS 3 · Ground-truth QA
Senior reviewer adjudicates disagreements; per-task IAA (Cohen κ) computed and stored on each item.
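The inter-annotator agreement in Pass 3 is Cohen's κ. A minimal sketch of the standard two-annotator formula follows. The label sequences are invented for illustration, and how the repo aligns annotator outputs before computing κ is not specified here.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: primitive types assigned to six spans by two annotators.
annotator_1 = ["Claim", "Claim", "Interest", "Actor", "Claim", "Interest"]
annotator_2 = ["Claim", "Interest", "Interest", "Actor", "Claim", "Claim"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on four of six spans, but κ is lower than the raw agreement rate because some of that agreement would be expected by chance.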
FAQ
The harness, schema, sample items, public-domain corpus manifests, and the kernel ontology are CC/MIT-licensed and live in the repo. The full corpus (~480 items, target Q3 2026) requires a Data Use Agreement. Submit, run, or extend — the citation is the contract.