Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation

A new benchmark reveals that large language models struggle to replicate how humans deploy Turkish evidential morphology based on source credibility. Researchers manipulated the trustworthiness of information sources in controlled experiments and found native speakers consistently shift between the -DI and -mIs past-tense markers depending on perceived reliability. When tested across 10 LLMs under three prompting strategies, model behavior proved inconsistent and heavily dependent on both architecture and instruction framing. This work exposes a gap in how current systems handle pragmatic reasoning tied to epistemic trust, a capability essential for reliable information processing across morphologically rich languages.

Modelwire context

Explainer

The benchmark's deeper contribution isn't just exposing a gap in Turkish NLP coverage. It's demonstrating that source credibility, a pragmatic judgment humans make constantly, is encoded grammatically in Turkish in ways that force models to reason about epistemic trust at the morpheme level, not just the semantic one.

This connects directly to the 'Green Shielding' paper covered the same day, which introduced CUE criteria to measure how routine phrasing differences destabilize model outputs. Both papers are probing the same underlying fragility: models that respond inconsistently to surface-level input variation cannot be trusted in high-stakes contexts. The Turkish evidential benchmark adds a cross-linguistic dimension that Green Shielding's English-centric healthcare framing doesn't reach. Together they suggest the consistency problem isn't a niche edge case but a structural feature of current architectures, one that shows up whether you're varying prompt phrasing in English or switching grammatical markers in Turkish.

Watch whether any of the 10 tested models shows meaningfully better calibration when given explicit chain-of-thought prompting about source reliability. If structured reasoning closes the gap, that points toward a training fix; if it doesn't, the problem likely sits deeper in how these models represent epistemic stance.

Coverage we drew on

Green Shielding: A User-Centric Approach Towards Trustworthy AI · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. How we write it.

MentionsTurkish language · LLMs · evidential morphology · -DI marker · -mIs marker · source trustworthiness

Read full story at arXiv cs.CL →(arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Our mission How we write

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.