Three structural failures: time, causality, provenance. Each one a property of transformer architecture, not a tuning problem.
Large language models do a remarkable amount of work well. Text summarization. Stylistic translation. Short-horizon code generation. Conflict reasoning is not on that list.
There are three specific failure modes, and they are architectural, not training-data problems.
The first is temporality. A transformer reads a sequence; it does not see a timeline. You can ask it "when did Party A commit to X?" and it will return a date-shaped string, but the answer is pattern-matched from local context, not reconstructed from any model of event order. Strip explicit dates from your prompt and the "reasoning" collapses.
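Here is the shape of the fix, in miniature: hold the timeline as an explicit structure and answer temporal questions by lookup. This is a sketch, not a real system; the `Event` schema and the dates are invented for illustration:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Event:
    actor: str    # hypothetical schema; any event shape works
    action: str
    when: date    # an actual timestamp, not a date-shaped string

# A timeline is just events sorted by time. Ordering is a property
# of the structure, not something inferred from phrasing.
timeline = sorted([
    Event("Party A", "committed to X", date(2023, 3, 14)),
    Event("Party B", "disputed X", date(2023, 6, 2)),
    Event("Party A", "withdrew from X", date(2023, 9, 30)),
], key=lambda e: e.when)

def first(actor: str, action: str) -> date | None:
    """Answer 'when did <actor> <action>?' by lookup, not by guessing."""
    for e in timeline:
        if e.actor == actor and e.action == action:
            return e.when
    return None

print(first("Party A", "committed to X"))  # 2023-03-14
```

Strip every date from the question and the answer does not change, because the answer lives in the structure, not the prompt.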
The second is causality. LLMs are statistical associators. "A and B appeared together in the pre-training corpus" is what they know. "A caused B via mechanism M under condition C" is a multi-hop inference over typed edges, and nothing in the architecture represents it. Chain-of-thought helps on the surface, but the kind of causal chain a mediator cares about requires the model to maintain and update a graph, which it does not.
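What "typed edges" buys you is concrete. A minimal sketch, assuming invented edge data and an acyclic graph, in which "A caused B via M under C" is a first-class object you can traverse rather than a phrase to pattern-match:

```python
from collections import defaultdict

# A causal claim is a typed edge: (cause, effect, mechanism, condition).
edges = defaultdict(list)

def assert_cause(cause, effect, mechanism, condition):
    edges[cause].append((effect, mechanism, condition))

# Hypothetical mediation facts, for illustration only.
assert_cause("late delivery", "missed deadline",
             "supply dependency", "no buffer stock")
assert_cause("missed deadline", "penalty triggered",
             "contract clause 4.2", "delay exceeds 30 days")

def causal_chains(start, goal, path=()):
    """Multi-hop inference: enumerate every typed path from start to goal.
    Assumes an acyclic graph; add a visited set for cyclic data."""
    if start == goal:
        yield path
        return
    for effect, mech, cond in edges[start]:
        yield from causal_chains(effect, goal,
                                 path + ((start, effect, mech, cond),))

for chain in causal_chains("late delivery", "penalty triggered"):
    for cause, effect, mech, cond in chain:
        print(f"{cause} -> {effect} via {mech} [if {cond}]")
```

The mechanism and condition ride along with every hop, so the chain can be audited, challenged, or updated edge by edge.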
The third is provenance. Every claim a generic LLM emits is, architecturally, an uncited assertion. When the output looks fluent, readers assume it is grounded. It is not. The model cannot point to the source span a claim came from, and it cannot distinguish what it was told from what it interpolated.
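Provenance, done structurally, means every claim carries a pointer back to the exact text that supports it. A minimal sketch, with a hypothetical `Claim` schema and a toy one-document corpus:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    text: str
    doc_id: str            # which document the claim came from
    span: tuple[int, int]  # character offsets of the supporting text

def cite(claim: Claim, corpus: dict[str, str]) -> str:
    """Resolve a claim back to its exact supporting span, or fail loudly."""
    start, end = claim.span
    return corpus[claim.doc_id][start:end]

corpus = {"email_0317": "On 14 March, Party A committed to X in writing."}
c = Claim("Party A committed to X", doc_id="email_0317", span=(13, 35))

print(cite(c, corpus))  # "Party A committed to X": literal source, not interpolation
```

A claim with no resolvable span is, by construction, an interpolation, and the system can say so instead of asserting it fluently.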
People often reply that RAG fixes this. It does not. Retrieval gives you the relevant chunks; reasoning over those chunks still happens inside the same architecture that flattens time, collapses causality, and invents sources. Chunk-level citation is not provenance. It is a footnote that may or may not be correct.
These failures are not patched by larger models. They are patched by putting an explicit structure (a graph with typed edges and provenance) next to the model and making the model consult it. That is the neurosymbolic bet: language models for what they do well (fluency, extraction, generation), graph reasoning for what they do badly (time, causality, sourcing). Each covers the other's blind spots.
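The division of labor fits in a few lines. A sketch under stated assumptions: `extract_triples` is a hypothetical stand-in for a prompted extraction call to whatever model you use, and the graph is a bare dict rather than a real store. The model only turns prose into candidate typed edges; every answer is read off the graph, so time, causality, and sourcing stay inspectable:

```python
def extract_triples(text: str) -> list[tuple[str, str, str]]:
    # Stand-in for an LLM extraction call (hypothetical; in practice
    # this is a prompted model returning structured output).
    return [("Party A", "committed_to", "X")]

graph: dict[tuple[str, str], list[tuple[str, str]]] = {}

def ingest(doc_id: str, text: str) -> None:
    """LLM side: fluent text in, typed edges tagged with their source out."""
    for subj, rel, obj in extract_triples(text):
        graph.setdefault((subj, rel), []).append((obj, doc_id))

def query(subj: str, rel: str) -> list[tuple[str, str]]:
    """Symbolic side: answers are graph lookups, each paired with a source."""
    return graph.get((subj, rel), [])

ingest("email_0317", "On 14 March, Party A committed to X in writing.")
print(query("Party A", "committed_to"))  # [('X', 'email_0317')]
```

The model never answers a question directly; it only populates the structure that does.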