Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express It

Researchers have identified a critical gap between what LLMs internally represent about causal relationships and what they output verbally. Using linear probes on hidden states, they decoded causal direction with near-perfect accuracy (97%) on anti-commonsense questions, yet the models' Yes/No responses fell to chance-level performance on the same questions. This 'Causal Tongue-Tie' suggests that benchmark failures may mask genuine internal understanding, while successes may reflect surface pattern-matching rather than causal cognition. The finding undermines confidence in output-only evaluations and suggests that assessing LLM reasoning requires probing beyond the final tokens to distinguish encoding deficits from expression failures.
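To make the probe-versus-output comparison concrete, here is a minimal sketch in pure Python. It is not the paper's method or data: the "hidden states" are synthetic Gaussian vectors with a planted linear direction, the Yes/No outputs are simulated coin flips standing in for a collapsed verbal answer, and every name (`make_synthetic_states`, `train_linear_probe`, `run_demo`) is hypothetical. It shows only how a logistic-regression probe can recover a label that the output channel does not express.

```python
import math
import random


def make_synthetic_states(n, dim, rng):
    """Synthetic stand-in for hidden states (an assumption, not real model data).

    Label 1 shifts the state along a fixed unit direction, label 0 against it.
    """
    direction = [1.0 / math.sqrt(dim)] * dim  # unit vector
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        sign = 1.0 if y == 1 else -1.0
        h = [rng.gauss(0.0, 1.0) + sign * 2.0 * direction[j] for j in range(dim)]
        states.append(h)
        labels.append(y)
    return states, labels


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(states, labels, epochs=100, lr=0.1):
    """Fit logistic regression (a linear probe) with full-batch gradient descent."""
    dim = len(states[0])
    n = len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(states, labels):
            err = sigmoid(sum(wj * hj for wj, hj in zip(w, h)) + b) - y
            for j in range(dim):
                grad_w[j] += err * h[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_predict(w, b, h):
    return 1 if sum(wj * hj for wj, hj in zip(w, h)) + b > 0 else 0


def accuracy(preds, labels):
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def run_demo(seed=0, dim=8, n_train=400, n_test=200):
    rng = random.Random(seed)
    train_x, train_y = make_synthetic_states(n_train, dim, rng)
    test_x, test_y = make_synthetic_states(n_test, dim, rng)
    w, b = train_linear_probe(train_x, train_y)
    probe_preds = [probe_predict(w, b, h) for h in test_x]
    # Simulated Yes/No outputs: coin flips independent of the true label.
    output_preds = [rng.randint(0, 1) for _ in test_y]
    return accuracy(probe_preds, test_y), accuracy(output_preds, test_y)


if __name__ == "__main__":
    probe_acc, output_acc = run_demo()
    print(f"probe accuracy:  {probe_acc:.2f}")
    print(f"output accuracy: {output_acc:.2f}")
```

With a real model, the synthetic states would be replaced by activations extracted from a chosen layer, and the simulated outputs by the model's actual Yes/No answers. The choice of layer and token position is left open here.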

Modelwire context

Explainer

The deeper provocation here is directional: if linear probes on hidden states can recover causal structure that verbal outputs cannot, then the standard practice of treating benchmark scores as proxies for internal reasoning is not just imprecise but potentially inverted. A model could score well on causal benchmarks through surface pattern-matching while a model that scores poorly might actually encode the correct causal structure.

This connects directly to the surface-versus-semantic noise study covered the same day ('When Do LLM Agents Treat Surface Noise Differently'), which found that meaning-altering perturbations shift model outputs nearly 20 percentage points more than cosmetic changes. Both papers are converging on the same uncomfortable finding from different angles: what a model outputs is a poor and sometimes misleading signal of what it has internally computed. Together they build a case that output-only evaluation is structurally inadequate, not merely noisy.

The critical next step is whether probing methods like linear decoding on hidden states get incorporated into a major public benchmark suite within the next 12 months. If CLadder or a successor adopts probe-based scoring alongside Yes/No accuracy, that would confirm the field is treating this as a measurement problem rather than a curiosity.
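As a rough illustration of what reporting probe-based scores alongside Yes/No accuracy could look like, the sketch below labels a pair of accuracies. The function name `diagnose`, the `margin` threshold, and the four labels are illustrative assumptions, not part of CLadder or the paper. Note also that a failed linear probe does not rule out a non-linear encoding.

```python
def diagnose(probe_acc, output_acc, chance=0.5, margin=0.1):
    """Label a (probe accuracy, output accuracy) pair.

    `chance` is the accuracy of random guessing on a binary Yes/No task;
    `margin` is an arbitrary illustrative threshold above chance.
    """
    encodes = probe_acc >= chance + margin
    expresses = output_acc >= chance + margin
    if encodes and expresses:
        return "encoded and expressed"
    if encodes:
        return "expression failure"
    if expresses:
        return "possible surface pattern-matching"
    return "encoding deficit"


if __name__ == "__main__":
    # The pattern described in the summary: high probe accuracy, chance-level outputs.
    print(diagnose(0.97, 0.50))
```

Reporting the two numbers separately keeps the distinction visible: a single output-only score would treat the "expression failure" and "encoding deficit" cases as the same result.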

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: LLMs · CLadder · linear probe

Read the full story at arXiv cs.CL (arxiv.org)


