TACITUS is the company building PRAXIS — the AI analyst workbench for statecraft, policy, and mediation — powered by DIALECTICA, a neurosymbolic knowledge-graph engine that makes conflict reasoning structured, auditable, and contestable. Wind Tunnel and CONCORDIA apply the same backbone to political reaction risk and mediation intelligence.

How is TACITUS different from standard LLMs?

Standard LLMs lack persistent memory, cannot track temporal ordering of events, and hallucinate on conflict facts. TACITUS provides a deterministic knowledge graph layer that grounds LLM reasoning.

What industries does TACITUS serve?

TACITUS serves policy teams, mediators, diplomatic analysts, peace and security practitioners, institutional governance teams, HR and legal dispute teams, and researchers working with complex contested situations.

What products are in the TACITUS suite?

PRAXIS is the flagship: the AI analyst workbench for statecraft, policy, and mediation. DIALECTICA is the trust graph and context layer underneath PRAXIS. Wind Tunnel models political reaction risk, and CONCORDIA provides mediation intelligence for live or asynchronous dialogue.

How does TACITUS work?

TACITUS combines a deterministic typed knowledge graph with LLM reasoning. Unstructured text about policy situations, conflicts, negotiations, and institutional disputes is extracted into a graph using an 8-primitive ontology (Actor, Claim, Interest, Constraint, Leverage, Commitment, Event, Narrative), preserving temporal ordering, causal chains, and source provenance.

Is TACITUS open source?

Five repositories are open source on GitHub under MIT: AGON (Rust conflict vision), KAIROS (Rust temporal vision), DIALECTICA (the neurosymbolic engine reference), the TACITUS Knowledge Pipeline, and the Agentic Conflict Ontology (ACO). PRAXIS, Wind Tunnel, and CONCORDIA are commercial offerings with free demo access available; DIALECTICA also ships as a hosted runtime.

What is the Agentic Conflict Ontology (ACO)?

The ACO is a formal knowledge representation with 8 primitives (Actor, Claim, Interest, Constraint, Leverage, Commitment, Event, Narrative), 41+ typed classes, and 29+ typed properties. It is a universal grammar for structuring disputes, from HR grievances to peace processes.

Does TACITUS try to replace mediators or judges?

No. TACITUS is legibility infrastructure, not adjudication. The engine structures a dispute against the ontology and shows every party the same map. Decisions about compromise, fairness, and outcome remain with humans.

BENCHMARK · OPEN

TCGC — score the typed subgraph an AI produces, not the paraphrase.

The TACITUS Conflict Grammar Corpus is an open benchmark for structural conflict reasoning — grounded in the Agentic Conflict Ontology. We do not score free prose. We score the typed subgraph the system actually produces: actors, claims, interests, constraints, leverage, commitments, events, narratives, and the typed edges between them. Every primitive must cite the source span it was extracted from. Every commitment is bi-temporal. Every contradiction is a first-class edge — not a smoothed paragraph.

GitHub repo · schema/tcgc-v0.1.json · Sample item: tcgc-0001

14 task types (v0.1) · 3 dynamic-ontology tasks (v0.2) · 7 domains · 5 metrics · 15 published items + 6 PD corpora

THE BENCHMARK IN ONE MINUTE

What a task actually asks

1 · Input

A realistic document — a cable, a communiqué, a transcript excerpt — containing contested claims, commitments with dates, and at least one deliberate contradiction. Synthetic where needed, public-domain where possible, never scrubbed of the mess that makes the work hard.

2 · Required output

A typed subgraph, not prose: every primitive instance with its class, every edge typed, every node citing the character span it came from, every temporal claim carrying both clocks, every contradiction preserved as an explicit edge between the conflicting claims.
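A minimal sketch of what such a subgraph could look like in Python, with a structural check. The class names, field names, and sample values here are illustrative assumptions, not the TCGC schema; schema/tcgc-v0.1.json in the repo is the canonical form.

```python
from dataclasses import dataclass

PRIMITIVES = {"Actor", "Claim", "Interest", "Constraint",
              "Leverage", "Commitment", "Event", "Narrative"}

@dataclass(frozen=True)
class Node:
    node_id: str
    primitive: str              # one of the 8 kernel primitives
    doc_id: str                 # source document the span points into
    char_start: int
    char_end: int
    valid_time: str = ""        # first clock: when the fact holds in the world
    transaction_time: str = ""  # second clock: when the source recorded it

@dataclass(frozen=True)
class Edge:
    edge_type: str
    source: str
    target: str

def problems(nodes, edges):
    """Return a list of structural problems; an empty list means the subgraph passes."""
    found = []
    ids = {n.node_id for n in nodes}
    for n in nodes:
        if n.primitive not in PRIMITIVES:
            found.append(f"{n.node_id}: unknown primitive {n.primitive!r}")
        if not 0 <= n.char_start < n.char_end:
            found.append(f"{n.node_id}: empty or inverted span")
        if bool(n.valid_time) != bool(n.transaction_time):
            found.append(f"{n.node_id}: temporal node must carry both clocks")
    for e in edges:
        if e.source not in ids or e.target not in ids:
            found.append(f"{e.edge_type}: dangling endpoint")
    return found

# Made-up sample: two conflicting claims kept as separate nodes plus an explicit edge.
nodes = [
    Node("c1", "Claim", "cable-1", 0, 28, "2024-05-03", "2024-05-05"),
    Node("c2", "Claim", "cable-1", 120, 145, "2024-05-03", "2024-05-06"),
]
edges = [Edge("CONTRADICTS", "c1", "c2")]
print(problems(nodes, edges))
```

The contradiction stays visible as an edge between two intact claims rather than being resolved inside either node.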

3 · Scoring

Five metrics, all mechanical: typed node F1 and typed edge F1 against the gold subgraph, span-fidelity (did the citation actually contain the claim?), temporal correctness (both clocks, interval relations), and contradiction preservation (smoothing a conflict into one confident sentence scores zero on that item).
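As one illustration of a mechanical check, span-fidelity can be approximated by slicing the cited character range out of the source and testing whether the extracted wording is actually there. This is a simplified sketch under that assumption; the function name and the whitespace and case normalization are hypothetical, and the repo's scorer may be stricter.

```python
def span_supports(document: str, char_start: int, char_end: int, quoted: str) -> bool:
    """True if the cited character span lies inside the document and contains the quoted wording."""
    if not 0 <= char_start < char_end <= len(document):
        return False  # span points outside the document
    # Normalize whitespace and case so line wrapping does not cause false misses.
    cited = " ".join(document[char_start:char_end].split()).lower()
    needle = " ".join(quoted.split()).lower()
    return bool(needle) and needle in cited

doc = "The parties agreed to a ceasefire on 1 June."
start = doc.index("parties")
print(span_supports(doc, start, len(doc), "agreed to a ceasefire"))
print(span_supports(doc, 0, 3, "agreed to a ceasefire"))
```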

Why these metrics and no others: they are exactly the disciplines a Context Capsule enforces — receipts, types, two clocks, surfaced disputes. TCGC is the measuring stick for whether any system, ours included, can produce knowledge that survives checking. Scores appear on the evidence page only from published, reproducible runs — frontier-model baselines first, our own submission when the eval harness is complete. Never placeholder numbers.

WHY THIS BENCHMARK EXISTS

Five architectural failures the LLM stack does not patch.

Generic language models, even at frontier scale, fail on five measurable axes when handed a real conflict, mediation, or policy file. A larger model does not fix any of them; they are properties of the transformer architecture. TCGC measures the gap directly, on the same models, with and without the typed layer.

F1 · Temporal flattening

Events ordered narratively, but bi-temporal structure collapses when dates are stripped.

F2 · Causal collapse

A→B via mechanism M under condition C reduces to co-occurrence.

F3 · Provenance absence

RAG cites at document level; the span a claim came from is lost in generation.

F4 · Contradiction averaging

Disagreeing passages get smoothed into a single paraphrase.

F5 · Context decay

Self-generated context accumulates; accuracy degrades while fluency holds.

STATUS · 2026-05-13

No measured results yet. We are not pretending otherwise.

The benchmark is wired end-to-end (schema, harness, scorers, baselines, leaderboard CI), but no model has been run against it in a reproducible way yet. We have not yet wired the model backends in baselines/ to a real API key. We have not yet run the LLM judge for the interpretive tasks. We have not yet published a system card with a scorer commit SHA. Until those four things land we cannot honestly report numbers, and we will not.

What we do have is the shape of what the gap looks like, documented in three side-by-side illustrative comparisons (Melian leverage, regulatory commitment drift, sanctions causal chain). The vanilla columns are hand-written approximations of what a competent frontier chat model would plausibly say. The typed columns are constructed from the gold subgraphs to show the target output a passing system would produce. The axes (mechanism typing, bi-temporal stamps, status transitions, contradiction edges, span pointers) are well-defined; the magnitudes are an empirical question for v0.2 (Q3 2026).

RESULTS.md: illustrative comparisons + the path to real numbers · experiments/llm_vs_typed/ (runnable harness)

TCGC-0010 · diplomatic

Melian leverage mapping

AXIS · Mechanism typing on Leverage

vanilla shape: gestures at "coercive vs normative"; typed mechanism field absent

typed target: Leverage primitives with `mechanism=Coercive` / `Normative`, Procedural/Hard constraint, material CONTRADICTS edge

TCGC-0016 · policy / regulatory

Commitment vs claim drift

AXIS · Two distinct commitments preserved or merged

vanilla shape: tends to merge into "commitments" plural; "by May, however" flattens valid-time

typed target: two Commitment nodes, independent status + valid_time, Active → Contested status transitions

TCGC-0018 · policy / sanctions

CAUSES vs PRECEDES

AXIS · Causal claim under validator constraint

vanilla shape: uses "which", the same syntax for causal and temporal

typed target: CAUSES requires `mechanism` + `conditions`; validator rejects bare causal claims

ILLUSTRATIVE · NOT MEASURED · NO MODEL API CALLED FOR THESE COMPARISONS
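The validator rule in the TCGC-0018 comparison (CAUSES requires `mechanism` + `conditions`) can be sketched in a few lines. This is an assumption-laden illustration: the dict layout, the function name `validate_edge`, and the sample edges are hypothetical, not the repo's validator.

```python
def validate_edge(edge: dict) -> list:
    """Return validation errors for one edge; an empty list means it passes."""
    errors = []
    if edge.get("type") == "CAUSES":
        # A causal edge must say how (mechanism) and when (conditions) it holds.
        for field in ("mechanism", "conditions"):
            if not edge.get(field):
                errors.append(f"CAUSES edge missing required field: {field}")
    return errors

# Made-up sample edges.
bare = {"type": "CAUSES", "source": "e1", "target": "e2"}
typed = {"type": "CAUSES", "source": "e1", "target": "e2",
         "mechanism": "asset freeze cuts import financing",
         "conditions": ["no alternative credit line"]}
temporal = {"type": "PRECEDES", "source": "e1", "target": "e2"}

for edge in (bare, typed, temporal):
    print(edge["type"], validate_edge(edge))
```

A system that only knows "e1 happened before e2" can still emit PRECEDES, which passes without a mechanism; it is the stronger causal claim that has to carry its justification.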

THE NOVEL SURFACE

Vanilla LLM vs PRAXIS-layered — same model, same passage, two surfaces.

The repo holds paired worked examples that run the same model on the same passage through (A) a vanilla chat prompt and (B) the PRAXIS-style typed pipeline. The diff is what the typed layer adds.

Axis	Vanilla LLM	PRAXIS-layered
Schema	None — free prose response.	Typed kernel: 8 primitives + 18 edges + bi-temporal stamps.
Provenance	No span pointers; uncited assertions.	Every primitive cites (doc_id, char_start, char_end). Orphans fail validation.
Time	Narrative order; date-stripped chronology collapses.	valid-time + transaction-time on every node and edge; queryable independently.
Contradiction	Smoothed paraphrase ("broadly agree…").	Two Claim nodes + CONTRADICTS edge with materiality + rationale.
Position / Interest	Conflated; position restated as interest.	Distinct primitives. Interest carries mandatory derivation chain.
Commitment	Any stated-future-action becomes "commitment".	Commitment requires status + binding strength; mere assertion stays a Claim.
Output shape	Paragraph the analyst must re-parse.	JSON Lines of typed graph ops the next system reads directly.

Read the worked example — Melian Dialogue, leverage mapping

Task	Metric	What it stresses
Actor resolution	`graph_overlap`	Cross-document alias clusters; role and jurisdiction attribution.
Claim extraction	`graph_overlap`	Speech-act typing under noise — Asserted / Denied / Reported / Withdrawn.
Interest extraction	`llm_judge_anchored`	Inferred interests with mandatory derivation chains (Fisher/Ury).
Constraint extraction	`graph_overlap`	Statutory / regulatory / procedural classification with binding strength.
Leverage mapping	`graph_overlap`	Mechanism typing: Coercive / Normative / Information / Network / Procedural / Resource / Reputational.
Commitment tracking	`graph_overlap`	Bi-temporal status transitions (Active → Contested → Broken / Fulfilled).
Event ordering	`kendall_tau`	Chronology reconstruction with dates stripped on ~50% of items.
Narrative drift	`llm_judge_anchored`	Reframing across time and framer; cross-corpus drift chains.
Causal chain	`graph_overlap`	Multi-hop A→B via mechanism M under condition C; PRECEDES vs CAUSES distinction enforced.
Contradiction detection	`contradiction_pair_f1`	Material vs cosmetic disagreement; CONTRADICTS edges with rationale.
Provenance attribution	`provenance_f1`	Span-exact source binding — orphan provenance fails validation.
Commitment / claim mismatch	`graph_overlap`	Said X, signed Y, did Z — the gap between Claim modality and Commitment status.
Position / interest separation	`llm_judge_anchored`	Distinct primitives with auditable derivation; tiered confidence.
Cross-document synthesis	`graph_overlap`	Coherent subgraph assembled from multiple, partially-contradictory sources.
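The commitment-tracking row above names a status lifecycle (Active → Contested → Broken / Fulfilled). A rough sketch of checking a status history against such a lifecycle follows. The transition table is an assumption: the page does not say whether Active may move directly to Broken or Fulfilled, and this sketch allows it. The function name and sample histories are also hypothetical.

```python
# Assumed transition table; only the Active -> Contested -> Broken/Fulfilled chain is from the task description.
ALLOWED = {
    "Active": {"Contested", "Broken", "Fulfilled"},
    "Contested": {"Broken", "Fulfilled"},
    "Broken": set(),
    "Fulfilled": set(),
}

def first_illegal_transition(history):
    """history: list of (transaction_time, status) in any order.

    Returns the first illegal (previous, next) status pair, or None if the history is valid.
    """
    ordered = sorted(history)  # ISO 8601 strings sort chronologically
    for (_, prev), (_, nxt) in zip(ordered, ordered[1:]):
        if nxt not in ALLOWED.get(prev, set()):
            return (prev, nxt)
    return None

ok = [("2024-03-01", "Active"), ("2024-05-10", "Contested"), ("2024-07-01", "Broken")]
bad = [("2024-03-01", "Fulfilled"), ("2024-04-01", "Active")]
print(first_illegal_transition(ok))
print(first_illegal_transition(bad))
```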

V0.2 · DYNAMIC-ONTOLOGY TASKS

Schema extension induction

Given a case in a new domain, induce per-case extension subclasses that inherit cleanly from kernel primitives.

Kernel invariant validation

Given a proposed extension, hold every parent-primitive invariant. Adversarial split includes plausible-but-invalid extensions.

Cross-domain primitive transfer

Same kernel, different domain; primitives transfer while extensions specialize.
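One way to picture kernel invariant validation: an extension subclass passes only if it keeps every field its parent primitive requires. The sketch below assumes a very small invariant table (Commitment needs status and binding strength, Leverage needs a mechanism, as described elsewhere on this page). The table layout, the function name, and the example extensions are hypothetical.

```python
# Assumed subset of parent-primitive invariants, for illustration only.
KERNEL_REQUIRED = {
    "Commitment": {"status", "binding_strength"},
    "Leverage": {"mechanism"},
}

def extension_violations(parent: str, extension_fields) -> list:
    """Return the invariants a proposed extension breaks; an empty list means it inherits cleanly."""
    if parent not in KERNEL_REQUIRED:
        return [f"unknown parent primitive: {parent}"]
    missing = KERNEL_REQUIRED[parent] - set(extension_fields)
    return [f"drops required parent field: {name}" for name in sorted(missing)]

# A plausible-but-invalid extension: adds a domain field but loses binding_strength.
print(extension_violations("Commitment", {"status", "ceasefire_zone"}))
# A valid one: keeps both parent fields and specializes with an extra one.
print(extension_violations("Commitment", {"status", "binding_strength", "ceasefire_zone"}))
```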

DOMAINS · 7

workplace

HR grievances, performance disputes, promotion blocks

commercial

Contract breach, vendor dispute, joint-venture friction

governance

Board disagreement, mandate conflict, committee deadlock

peace-process

Ceasefire, DDR, political-track multilateral negotiation

policy

Regulatory contestation, stakeholder reception

family

Inheritance, custody, intergenerational wealth dispute

diplomatic

State-to-state friction, border incident, multilateral block

METRICS · 5

Per task, never collapsed.

We do not publish a single headline score. A system at 0.9 on actor resolution and 0.4 on commitment / claim mismatch is not a 0.65 system; those numbers tell a reviewer two different things about where it breaks.

Metric	Formula	When it can mislead
`graph_overlap`	weighted(node_jaccard, edge_jaccard) with partial credit on near-miss edge types	When gold has no edges — collapses to node Jaccard. The harness flags this in `notes`.
`provenance_f1`	F1 over (primitive_id, source_span) pairs	A system that emits no primitives at all is uncomparable; treat empty pred as 0 with a diagnostic.
`kendall_tau`	(τ-b + 1) / 2; missing events get worst-case rank	Short sequences (n < 4) have unstable τ; we require n ≥ 5 for ordering-task items.
`contradiction_pair_f1`	materiality-weighted F1 over unordered (claim_a, claim_b) pairs	Cosmetic flagged as material inflates precision; ground-truth materiality is set by annotators with κ reported.
`llm_judge_anchored`	LLM-judge raw → isotonic regression against 20-entry human anchor set	Position / verbosity / self-preference bias — all disclosed in docstring. Gated behind TCGC_RUN_API=1.
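The `kendall_tau` row can be worked through in plain Python. This is a sketch of the stated formula, (τ-b + 1) / 2, not the repo's scorer: the function names are hypothetical, and "worst-case rank" is interpreted here as a single rank larger than any rank the system assigned, which is an assumption.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall tau-b over two equal-length rank lists (O(n^2), fine for short sequences)."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue            # tied in both lists: excluded from every count
        elif dx == 0:
            ties_x += 1         # tied only in x
        elif dy == 0:
            ties_y += 1         # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0

def ordering_score(gold_order, predicted_order):
    """Map tau-b from [-1, 1] to [0, 1]; events the system omitted get a worst-case rank."""
    worst = len(gold_order) + len(predicted_order)  # larger than any assigned rank
    pred_rank = {event: rank for rank, event in enumerate(predicted_order)}
    gold_ranks = list(range(len(gold_order)))
    pred_ranks = [pred_rank.get(event, worst) for event in gold_order]
    return (kendall_tau_b(gold_ranks, pred_ranks) + 1) / 2

gold = ["e1", "e2", "e3", "e4", "e5"]
print(ordering_score(gold, gold))                  # perfect order
print(ordering_score(gold, list(reversed(gold))))  # fully reversed order
print(ordering_score(gold, ["e1", "e2", "e3"]))    # two events missing
```

A perfect ordering scores 1.0 and a fully reversed one scores 0.0; omissions land in between because the missing events tie at the bottom.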

EXPERIMENTS · REPEATABLE

Run the comparison yourself. One command.

The repo ships a concrete, provider-agnostic harness (Anthropic, OpenAI, plus an echo no-op client for CI). For every item, it runs the same model twice — once with a vanilla chat prompt, once with the typed ACO contract — and writes per-run records (prompt hash, response hash, elapsed ms, timestamp, model id) plus a side-by-side REPORT.md. All wiring is reproducible from a single Makefile target.

# Dry run — no API calls, deterministic.
make experiment-dry ITEMS=items/v0.2-public-domain/

# Real run with Anthropic.
export ANTHROPIC_API_KEY=sk-ant-...
make experiment-anthropic ITEMS=items/v0.2-public-domain/tcgc-0010.json

# Side-by-side report.
make experiment-report RUN=runs/anthropic-claude-opus-4-7-20260513T180000Z
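The per-run record the harness is described as writing (prompt hash, response hash, elapsed ms, timestamp, model id) could be assembled roughly as follows. This is a sketch under stated assumptions: the field names, the choice of SHA-256, and the `echo` stand-in for a model client are illustrative, not the repo's record format.

```python
import hashlib
import json
import time
from datetime import datetime, timezone

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def run_record(model_id: str, prompt: str, call_model) -> dict:
    """Call the model once and capture what is needed to audit the run later."""
    started = time.perf_counter()
    response = call_model(prompt)
    elapsed_ms = round((time.perf_counter() - started) * 1000)
    return {
        "model_id": model_id,
        "prompt_hash": sha256_hex(prompt),
        "response_hash": sha256_hex(response),
        "elapsed_ms": elapsed_ms,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

def echo(prompt: str) -> str:
    """No-op client in the spirit of a CI echo backend: returns the prompt unchanged."""
    return prompt

record = run_record("echo", "Extract the typed subgraph.", echo)
print(json.dumps(record, indent=2))
```

Storing hashes rather than raw text lets two runs be compared for identical prompts and responses without the record itself carrying the content.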

Harness README — layout, cost discipline, path to first measured row

QUICKSTART · 60 SECONDS

Run the harness against your system.

git clone https://github.com/sargonxg/TCGC_TACITUS-Conflict-Grammar-Corpus_BENCHMARK tcgc
cd tcgc && pip install -e '.[dev]'

# Validate every item against the schema (structural + semantic).
tcgc validate items/

# Look at the canonical JSON Schema.
tcgc schema --version v0.1 | jq .

# Run your system against the v0.1-sample items.
tcgc run --system mypkg.runner:predict items/v0.1-sample/ --out predictions.jsonl

# Score and report.
tcgc score predictions.jsonl items/v0.1-sample/ --out scores.json
tcgc report scores.json

No GPUs required. No paid API calls unless your runner needs them. The llm_judge_anchored metric is gated behind TCGC_RUN_API=1.

SUBMIT

Open a PR. CI runs the canonical scorer. You get a per-task-type table back.

01 · Fork

Fork the repo. Branch from main.

02 · Run

Produce predictions.jsonl + SYSTEM_CARD.md.

03 · PR

Add files under submissions/<name>/<date>/. Open PR.

04 · Leaderboard

CI scores. Per-task-type table posted as PR comment.

Alternative path: email per-metric CSV + system card. Every leaderboard row carries the scorer commit SHA, so submissions are reproducible.

METHODOLOGY

Three-pass annotation. IAA reported per item. No headline averaging.

PASS 1 · Primitive tagging

Two independent annotators identify the 8 primitives. Disagreements logged.

PASS 2 · Edge labeling

Third annotator labels edges (typed relations) on the union of primitives.

PASS 3 · Ground-truth QA

Senior reviewer adjudicates disagreements; per-task IAA (Cohen κ) computed and stored on each item.
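For reference, Cohen's κ between two annotators is observed agreement corrected for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A small self-contained sketch, with made-up labels:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

annotator_1 = ["Claim", "Claim", "Event", "Actor"]
annotator_2 = ["Claim", "Event", "Event", "Actor"]
print(round(cohen_kappa(annotator_1, annotator_2), 4))
```

Here the annotators agree on 3 of 4 items (p_o = 0.75) with chance agreement p_e = 5/16, giving κ = 7/11.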

Full methodology in the repo

FAQ

Engineer questions, answered tight.

Is this a memorisation test?

No. The public-domain corpora are saturated in every frontier model's pre-training. The test is whether the system can produce the typed subgraph the source supports — actors with roles, claims with modality, commitments with status, leverage with mechanism, contradictions as edges. Memorising the text does not produce the structure.

Why not score on a single headline number?

A system at 0.9 on actor-resolution and 0.4 on commitment-claim-mismatch is not a 0.65 system. Those two numbers tell a reviewer two different things about where the system breaks. We report per task, per domain, never collapsed.

What does "LLM-judge anchored" mean exactly?

For interpretive tasks (interest extraction, position/interest separation, narrative drift) we use an LLM judge with a fixed prompt at temperature 0, then pass the raw score through isotonic regression fitted on a 20-entry set of human-anchored (raw, score) pairs. Bias modes are disclosed in the scorer docstring.
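To make the calibration step concrete, here is a minimal pool-adjacent-violators sketch of isotonic regression in plain Python. It is not the repo's scorer: the anchor values are made up (five points instead of 20), the function names are hypothetical, and the lookup is a simple step function, whereas library implementations typically interpolate between anchors.

```python
def fit_isotonic(points):
    """Fit a non-decreasing step function to (raw_judge_score, human_score) pairs."""
    blocks = []  # each block: [sum_of_human_scores, count, largest_raw_in_block]
    for raw, human in sorted(points):
        blocks.append([human, 1, raw])
        # Merge backwards while the previous block's mean exceeds the newest block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, right = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right
    return [(right, total / count) for total, count, right in blocks]

def calibrate(model, raw):
    """Return the fitted value of the first block whose right edge covers the raw score."""
    for right_edge, value in model:
        if raw <= right_edge:
            return value
    return model[-1][1]

# Made-up anchors: the judge over-scores at 0.3 relative to 0.5, so those two get pooled.
anchors = [(0.1, 0.0), (0.3, 0.4), (0.5, 0.2), (0.7, 0.6), (0.9, 1.0)]
model = fit_isotonic(anchors)
print(model)
print(calibrate(model, 0.4))
```

The fit guarantees that a higher raw judge score never maps to a lower calibrated score, which is the property the anchoring step relies on.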

How is partial credit awarded on edge types?

The kernel has 18 edge types. Some are near-synonyms with different semantic strength: ASSERTED ↔ ACKNOWLEDGED (same kind, different speaker role) gets 0.5; ACKNOWLEDGED ↔ ACKNOWLEDGED_AMBIGUOUSLY gets 0.75; ENABLES ↔ CAUSES gets 0.5 (CAUSES is strictly stronger). Off-diagonal entries are documented in tcgc/ontology/edges.py.
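A sketch of how such a partial-credit table might be applied, using only the three pairs quoted above and treating the table as symmetric. The averaging in `mean_edge_credit` is a simplification for illustration and is not the `graph_overlap` formula; the function names and sample edges are hypothetical, and tcgc/ontology/edges.py remains the authoritative table.

```python
# Only the three near-miss pairs named in the text; every other mismatch scores 0 here.
PARTIAL = {
    frozenset({"ASSERTED", "ACKNOWLEDGED"}): 0.5,
    frozenset({"ACKNOWLEDGED", "ACKNOWLEDGED_AMBIGUOUSLY"}): 0.75,
    frozenset({"ENABLES", "CAUSES"}): 0.5,
}

def edge_type_credit(gold_type: str, pred_type: str) -> float:
    if gold_type == pred_type:
        return 1.0
    return PARTIAL.get(frozenset({gold_type, pred_type}), 0.0)

def mean_edge_credit(gold_edges, pred_edges) -> float:
    """Average credit over gold edges; edges are (source, edge_type, target) triples."""
    if not gold_edges:
        return 0.0
    pred_by_endpoints = {(s, t): edge_type for s, edge_type, t in pred_edges}
    total = sum(edge_type_credit(edge_type, pred_by_endpoints.get((s, t), ""))
                for s, edge_type, t in gold_edges)
    return total / len(gold_edges)

gold = [("e1", "CAUSES", "e2"), ("a1", "ASSERTED", "c1")]
pred = [("e1", "ENABLES", "e2"), ("a1", "ASSERTED", "c1")]
print(mean_edge_credit(gold, pred))
```

The predicted ENABLES edge earns half credit against the gold CAUSES edge, so the pair of edges averages (0.5 + 1.0) / 2.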

Why public-domain classics rather than synthetic data only?

Three reasons. (1) Stable, immutable canonical URLs — citations stay live. (2) Saturated in pre-training, so the benchmark is structural rather than recall. (3) Dense in typed primitives — one Melian paragraph encodes 8+ primitives and 3 contradictions; modern policy memos are sparser.

Can I submit a private system?

Yes. Open a PR with predictions + system card under submissions/<name>/<date>/. The leaderboard workflow runs the canonical scorer in CI and posts the per-task-type table back as a PR comment. The scorer commit SHA is part of every leaderboard row, so submissions are reproducible.

Is the corpus open?

The harness, the schema, the v0.1-sample items (5), the v0.2-public-domain items (6), and the kernel ontology are CC-BY-NC-SA 4.0 / MIT-licensed and live in the repo. The full corpus (~480 items, target Q3 2026) requires a Data Use Agreement.

Open kernel. Open harness. Per-task reporting.

The harness, schema, sample items, public-domain corpus manifests, and the kernel ontology are CC/MIT-licensed and live in the repo. The full corpus (~480 items, target Q3 2026) requires a Data Use Agreement. Submit, run, or extend — the citation is the contract.

Open the repo · Read the kernel · hello@tacitus.me
