BENCHMARK · FIRST SPECIALIZATION · v0.1-sample PUBLIC · v0.2 IN DEVELOPMENT
The first specialization of the knowledge layer measured in public. Conflict reasoning stresses every property of the layer at once — time, causality, provenance, commitment tracking, interest/position separation, narrative drift, cross-actor contradiction. If the layer holds here, the architecture generalizes to policy options, regulatory work, and mediation. TCGC is how we make that claim falsifiable.
Grounded in the kernel ontology with dynamic extensions. v0.2 introduces three task types for dynamic-ontology behavior. Other specializations (policy options, regulatory contestation, ADR) get their own benchmarks as they mature.
The full corpus is under a light data-use agreement. A public sample (v0.1, 5 items) plus the item schema are available right here on this page, so any agent, model, or researcher can start running against the format today.
QUICK-START FOR MACHINES
Download the sample, validate against the schema, run your system, report per task-type. A thin adapter for HELM and lm-eval-harness ships alongside the full release; for now, loop over the items directly.
{
  "id": "tcgc-0001",
  "task_type": "commitment-tracking",
  "domain": "workplace",
  "inputs": {
    "messages": [
      { "day": 1, "time": "Mon 09:14", "from": "Sam",
        "text": "So we're agreed — you own the Q4 launch deck content, I handle design. Lock it in by Thursday?" },
      { "day": 1, "time": "Mon 09:47", "from": "Alex",
        "text": "Sounds good. I'll pick it up after the Jenkins pitch." },
      { "day": 4, "time": "Thu 09:02", "from": "Alex",
        "text": "I never said I'd own it. Just help." }
    ],
    "question": "Was there a commitment on content ownership, when was it made, and who asserted it?"
  },
  "gold": {
    "commitment_id": "cm1",
    "subject": "Q4 launch deck content",
    "deadline": "Thursday",
    "status": "contested",
    "edges": [
      { "from": "Sam", "to": "cm1", "type": "ASSERTED",
        "provenance": "msg1" },
      { "from": "Alex", "to": "cm1", "type": "ACKNOWLEDGED_AMBIGUOUSLY",
        "provenance": "msg2" },
      { "from": "Alex", "to": "cm1", "type": "DENIES_SCOPE",
        "provenance": "msg3" }
    ]
  },
  "rubric": {
    "scoring": "graph_overlap + provenance_f1",
    "graph_overlap_target": 0.85,
    "provenance_f1_target": 1.0
  }
}
Validate
Check every item you plan to score against the JSON Schema at /tcgc/schema-v0.1.json. Reject items that fail validation instead of silently coercing.
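The reject-rather-than-coerce rule can be sketched in Python as follows. This is a hedged illustration, not the published validator: the contents of the schema are not reproduced on this page, so `REQUIRED_KEYS` is a hypothetical stand-in that mirrors only the top-level keys visible in the sample item above. In a real run you would load the schema file and use a JSON Schema validator instead. The function names are illustrative too.

```python
from typing import Any

# Assumption: these keys mirror the sample item shown above. The authoritative
# list lives in the published schema, not in this sketch.
REQUIRED_KEYS = {"id": str, "task_type": str, "domain": str,
                 "inputs": dict, "gold": dict, "rubric": dict}


def validation_errors(item: Any) -> list[str]:
    """Return a list of problems; an empty list means the item passed."""
    if not isinstance(item, dict):
        return ["item is not a JSON object"]
    errors = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in item:
            errors.append(f"missing key: {key}")
        elif not isinstance(item[key], expected):
            # Reject wrong types outright; never cast them into shape.
            errors.append(f"{key} should be {expected.__name__}")
    return errors


def partition_items(items: list[Any]) -> tuple[list[dict], list[tuple[Any, list[str]]]]:
    """Split items into (accepted, rejected-with-reasons)."""
    accepted, rejected = [], []
    for item in items:
        errs = validation_errors(item)
        if errs:
            rejected.append((item, errs))
        else:
            accepted.append(item)
    return accepted, rejected
```

Keeping the rejection reasons next to each rejected item makes it easy to report which items were excluded from scoring and why.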
Run
For each item, call your system with item.inputs and receive a predicted structure. Do not peek at item.gold.
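One way to make the no-peeking rule structural rather than a matter of discipline is to hand the system a copy of the inputs and nothing else. The sketch below assumes your system is a Python callable that takes the inputs dict and returns a predicted structure; `run_items` and that calling convention are illustrative, not part of the benchmark.

```python
import copy
from typing import Any, Callable


def run_items(items: list[dict], system: Callable[[dict], Any]) -> dict[str, Any]:
    """Call `system` once per item with the inputs only; key predictions by item id."""
    predictions = {}
    for item in items:
        # Deep-copy the inputs so gold and rubric never reach the system and
        # the system cannot mutate the original item.
        predictions[item["id"]] = system(copy.deepcopy(item["inputs"]))
    return predictions
```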
Score
Apply the metric named in item.rubric.scoring. Report per-task-type numbers; aggregate only after publishing the breakdown.
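A per-task-type breakdown might be computed along these lines. The row shape (`task_type` plus a `scores` dict of metric name to value) is an assumption made for illustration, and averaging within a task type is one plausible summary; the benchmark repo's scoring scripts remain the reference.

```python
from collections import defaultdict
from statistics import mean


def per_task_type(results: list[dict]) -> dict[str, dict[str, float]]:
    """Average each metric within each task type; no cross-task aggregate."""
    buckets = defaultdict(lambda: defaultdict(list))
    for row in results:
        for metric, value in row["scores"].items():
            buckets[row["task_type"]][metric].append(value)
    return {task: {metric: mean(values) for metric, values in metrics.items()}
            for task, metrics in buckets.items()}
```

The function deliberately returns no overall number, which matches the instruction to publish the breakdown before any aggregate.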
Submit
Email the results (metric-by-task CSV plus a short system card) to hello@tacitus.me or open a PR on the benchmark repo.
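The page does not specify a column layout for the metric-by-task CSV, so the following is only a sketch under an assumed layout: one row per (task type, metric) pair with the hypothetical header `task_type,metric,score`. It takes a breakdown shaped as `{task_type: {metric: score}}`; check the benchmark repo for the expected format before submitting.

```python
import csv
import io


def breakdown_to_csv(breakdown: dict[str, dict[str, float]]) -> str:
    """Serialize {task_type: {metric: score}} as one CSV row per (task, metric)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["task_type", "metric", "score"])  # hypothetical header
    # Sort for a stable, diff-friendly file.
    for task in sorted(breakdown):
        for metric in sorted(breakdown[task]):
            writer.writerow([task, metric, f"{breakdown[task][metric]:.4f}"])
    return buffer.getvalue()
```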
WHAT THE TCGC MEASURES
Each task type targets a specific capability that standard retrieval-augmented generation cannot handle reliably. Metrics are reported per task type and per domain; aggregates come later.
actor-resolution: Disambiguate references and alias clusters across long documents.
claim-extraction: Surface asserted facts, evaluative statements, and normative claims.
interest-extraction: Infer underlying interests distinct from stated positions (Fisher/Ury).
constraint-extraction: Identify rules, norms, and structural bounds shaping feasible outcomes.
leverage-mapping: Attribute leverage resources and dependencies to the actor holding them.
commitment-tracking: Distinguish claims from commitments and track their evolution over time.
event-ordering: Reconstruct temporal sequence from mixed narrative prose.
narrative-drift: Detect framing changes across time and party.
causal-chain: Build multi-hop causal chains with explicit mechanism and conditions.
contradiction-detection: Identify claims that cannot simultaneously hold across actors or time.
provenance-attribution: Bind every extracted primitive back to its source span.
commitment-claim-mismatch: Flag instances where stated commitment diverges from behavioral evidence.
position-interest-separation: Separate surface positions from underlying interests.
cross-document-synthesis: Assemble a coherent conflict graph from multiple, partially-contradictory sources.
v0.2 · DYNAMIC ONTOLOGY
v0.2 measures whether a system can induce per-case extensions, validate them against the kernel, and transfer primitive recognitions across domains. In development.
schema-extension-induction: Given a case in a new domain, induce the per-case extension subclasses for one or more kernel primitives, and validate them against kernel invariants.
kernel-invariant-validation: Given a proposed extension subclass, decide whether it preserves the parent primitive's invariants. Adversarial split includes plausible-but-invalid extensions.
cross-domain-primitive-transfer: Given a typed graph from domain A (e.g. HR mediation) and a case from domain B (e.g. ceasefire), transfer the kernel primitive recognitions while letting the extensions specialize.
BASELINES AND WHAT THEY TEST
TCGC will report against model-only, RAG, GraphRAG, and TACITUS reference systems. The goal is not to crown one model. The goal is to show which architecture preserves the relationships professionals need to inspect.
Closed-book / prompt-only runs against the GPT-5, Claude Opus 4.x, Gemini 2.5/3, and Llama 4 families. Useful for measuring whether fluency hides missing structure. Headline finding so far: accuracy on domain-general policy benchmarks lands in the 48–54% range, and comprehension is not the same as reasoning.
Chunk retrieval plus cited generation. Useful for measuring source access without explicit relationships. Performs well on extraction and citation, weakly on commitment tracking, contradiction detection, and temporal reconstruction.
Entity-and-relationship graph retrieval over the same corpus, including community-summarization variants. Useful for comparing domain-general graph retrieval to ACO-typed graph construction. Most public GraphRAG systems do not enforce an ontology, which limits neurosymbolic grounding.
A reference combining vector retrieval, KG traversal, and ontology-typed extraction. The 2025–2026 production trend in agent-era architectures. Closer to DIALECTICA in spirit; useful for separating engine effects from data effects.
DIALECTICA with ACO-typed primitives, graph-overlap scoring, source-span provenance, temporal DAG, causal edges, and task-specific evaluators. The system PRAXIS runs on. Reported as a reference, not a leaderboard winner; if a simpler baseline matches us on a task type, we say so.
FRONTIER MODEL COVERAGE · 2026 COHORT
TCGC tracks every major frontier-tier release as it lands. We re-run the corpus on new model versions and publish deltas. Generic policy benchmarks across 2026 (RAND, PolicyBench) report 48–54% accuracy on policy comprehension — TCGC measures the harder layer underneath: structure, time, causality, and provenance.
| Vendor | Models tracked | Notes |
|---|---|---|
| Anthropic | Claude Opus 4.7, Sonnet 4.6, Haiku 4.5 | Best long-context provenance behavior in our 2026 runs; cost-aware Sonnet suitable for graph-grounded loops. |
| OpenAI | GPT-5, GPT-4.1 | Strong on extraction; weaker on multi-hop commitment tracking without scaffolding. |
| Google | Gemini 2.5 Pro, Gemini 3 | Strong document comprehension; multimodal helps for cable-and-image corpora. |
| Meta | Llama 4 (open-weights families) | Useful for self-hosted statecraft contexts. Pairs well with on-prem KG. |
| xAI / Mistral / DeepSeek | Grok 3, Mistral Large 3, DeepSeek R2 | Tracked as comparators; benchmarked on the same TCGC items. |
Submission status snapshot — the leaderboard is open. Frontier vendors and self-hosted builders are invited to submit. Methodology, schema, and evaluators are public.
DOMAINS COVERED
workplace
HR grievances, performance disputes, promotion blocks
commercial
Contract breach, vendor dispute, joint-venture friction
governance
Board disagreement, mandate conflict, committee deadlock
peace-process
Ceasefire, DDR, political-track multilateral negotiation
policy
Regulatory contestation, stakeholder reception, public-consultation aftermath
family
Inheritance, custody arrangement, intergenerational wealth dispute
diplomatic
State-to-state friction, border incident, multilateral block formation
SCORING RUBRIC
Every item names its scoring method in rubric.scoring. Systems report per-metric numbers; we do not collapse to a single headline score.
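Reading the rubric of an item can be sketched as below. Two assumptions are taken from the sample item rather than from a published specification: that `rubric.scoring` joins metric names with `+`, and that a target for a metric lives under `<metric>_target`. Treating the target as a threshold to meet or exceed is also an assumption, and `metric_names` and `target_report` are illustrative names.

```python
def metric_names(rubric: dict) -> list[str]:
    """Split a scoring string like 'graph_overlap + provenance_f1' into names."""
    return [name.strip() for name in rubric["scoring"].split("+") if name.strip()]


def target_report(scores: dict[str, float], rubric: dict) -> dict[str, dict]:
    """Pair each named metric with its score and, where present, its target."""
    report = {}
    for name in metric_names(rubric):
        target = rubric.get(f"{name}_target")  # None when the item sets no target
        score = scores[name]
        report[name] = {
            "score": score,
            "target": target,
            # Assumption: a target is met when the score reaches or exceeds it.
            "met": None if target is None else score >= target,
        }
    return report
```

The report stays keyed by metric, so nothing here collapses the numbers into a single score.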
graph_overlap: |gold ∩ pred| / |gold ∪ pred|. Jaccard over the typed subgraph the system returns vs the gold subgraph. Partial-credit aware.
provenance_f1: 2·P·R / (P + R) on source-span matches. Every primitive must cite the span it was extracted from. Exact span match scores full; overlapping span scores partial.
kendall_tau: Rank correlation against gold chronological order. Dates are stripped on ~50% of items. The system must reconstruct order from discourse cues.
contradiction_pair_f1: F1 over identified contradicting (claim_a, claim_b) tuples. Weighted variant upscales contradictions the mediator flags "material", downscales cosmetic ones.
llm_judge_anchored: LLM-judge score, anchored to a human-annotator calibration set. For interpretive tasks. The anchor set keeps the judge within the bounds of trained human judgment.
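The four set-based and rank-based metrics above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the reference implementation in the benchmark repo. It treats edges as (from, to, type) triples and spans as (start, end) character offsets. Partial span credit is taken to be overlap length over union length. The rank metric is Kendall tau-a with no ties and the same events on both sides. Only the unweighted contradiction F1 is shown. The anchored LLM judge is omitted.

```python
from itertools import combinations


def graph_overlap(gold_edges, pred_edges):
    """Jaccard of gold and predicted edge sets, as (from, to, type) triples."""
    gold = {(e["from"], e["to"], e["type"]) for e in gold_edges}
    pred = {(e["from"], e["to"], e["type"]) for e in pred_edges}
    union = gold | pred
    return len(gold & pred) / len(union) if union else 1.0


def f1(precision, recall):
    """Harmonic mean 2PR / (P + R), defined as 0 when both are 0."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def span_credit(a, b):
    """1.0 for identical spans, overlap/union length otherwise (assumption)."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return overlap / union if union else 0.0


def provenance_f1(gold_spans, pred_spans):
    """F1 where each span earns its best credit against the other side."""
    if not gold_spans or not pred_spans:
        return 0.0
    precision = sum(max(span_credit(p, g) for g in gold_spans)
                    for p in pred_spans) / len(pred_spans)
    recall = sum(max(span_credit(g, p) for p in pred_spans)
                 for g in gold_spans) / len(gold_spans)
    return f1(precision, recall)


def kendall_tau(gold_order, pred_order):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    rank = {event: i for i, event in enumerate(pred_order)}
    pairs = list(combinations(gold_order, 2))
    if not pairs:
        return 1.0
    concordant = sum(1 for a, b in pairs if rank[a] < rank[b])
    return (2 * concordant - len(pairs)) / len(pairs)


def contradiction_pair_f1(gold_pairs, pred_pairs):
    """F1 over unordered (claim_a, claim_b) pairs; unweighted variant only."""
    gold = {frozenset(p) for p in gold_pairs}
    pred = {frozenset(p) for p in pred_pairs}
    if not gold or not pred:
        return 0.0
    hits = len(gold & pred)
    return f1(hits / len(pred), hits / len(gold))
```

As a worked case, a prediction that recovers two of the three gold edges in the sample item and adds one spurious edge has an intersection of 2 and a union of 4, so `graph_overlap` is 0.5, well under that item's 0.85 target.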
LEADERBOARD · SUBMISSIONS OPEN (v0.1 SAMPLE)
We are collecting a first cohort of v0.1-sample runs from the obvious baselines plus any system that wants to self-submit. Submissions are audited; scoring scripts ship in the benchmark repo.
Want to run against the sample and land on this board? Email hello@tacitus.me with your system card.
METHODOLOGY
TCGC items are drawn from two macro-domains: human friction (HR, commercial, governance) and complex multi-party scenarios (policy, peace process, multilateral), with intentional diversity in length, source mix, and discourse style.
Annotation proceeds in three passes: primitive tagging, edge labelling, and ground-truth question authoring. Inter-annotator agreement targets are task-type specific; tasks that depend on inferred primitives (like Interest) have lower targets than surface-level tasks (like Actor resolution), and we report the actual agreement transparently.
The evaluation harness is designed to be compatible with standard runners (HELM, lm-eval-harness) via a thin adapter. The adapter ships alongside the first public split. Every metric named in rubric.scoring has a reference implementation in the benchmark repo.
OPEN RESEARCH QUESTIONS
Eight questions we do not yet have clean answers for. The last two are about v0.2’s dynamic-ontology scoring. Click each to read the current thinking and tell us where it is wrong.
HOW TO CONTRIBUTE
TIER 01
Download the 5-item sample, run your system, send results. First entries land on the leaderboard stub above.
Download sample
TIER 02
The v0.1 evaluation protocol is in the benchmark repo. Comments accepted on annotation guidelines, task-type definitions, and metric choices.
Open repo
TIER 03
Full-corpus splits are available to academic researchers and pilot partners under a light DUA. Write in with your proposed use case.
Request access
TIER 04
Found something the 14 current task types miss? Send us a one-paragraph proposal: motivation, worked example, suggested metric.
Propose a task
PUBLICATION PLAN