How Hypocritical Is Your LLM judge? Listener-Speaker Asymmetries in the Pragmatic Competence of Large Language Models

Researchers found that LLMs exhibit a systematic gap between judging whether language is pragmatically appropriate and generating such language themselves. The gap appears across multiple model families and points to a fundamental inconsistency in how these models handle real-world linguistic context.

Modelwire context

Explainer

The finding is not just that LLMs make mistakes in pragmatic reasoning. The failure is directional and systematic: models recognize appropriate language use better than they produce it, which means evaluation scores on pragmatic tasks may consistently overstate actual generative capability.
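To make the direction of the gap concrete, here is a minimal sketch of how a listener-speaker asymmetry could be tallied over a set of test items. It is an illustration only, not the paper's method: the `ItemResult` record, the `asymmetry_report` function, and the "hypocrisy rate" metric are all hypothetical names introduced here, and the sketch assumes each item has already been scored once in each role.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    # Hypothetical per-item record; both fields are assumed to come from
    # an external scoring step that is not shown here.
    judged_correctly: bool    # listener role: model judged appropriateness correctly
    produced_correctly: bool  # speaker role: model's own output was appropriate


def asymmetry_report(results):
    """Summarize the gap between judging accuracy and production accuracy."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one item")
    listener_acc = sum(r.judged_correctly for r in results) / n
    speaker_acc = sum(r.produced_correctly for r in results) / n
    # Items where the model judges correctly but fails to produce.
    hypocrisy_rate = sum(
        r.judged_correctly and not r.produced_correctly for r in results
    ) / n
    return {
        "listener_acc": listener_acc,
        "speaker_acc": speaker_acc,
        "gap": listener_acc - speaker_acc,
        "hypocrisy_rate": hypocrisy_rate,
    }


# Toy data, invented for illustration only.
toy = [
    ItemResult(True, True),
    ItemResult(True, False),
    ItemResult(True, False),
    ItemResult(False, False),
]
report = asymmetry_report(toy)
```

A positive `gap` means the judge-side score is higher than the production-side score, which is the direction the summary above describes.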

This lands directly on top of a cluster of LLM judge reliability concerns Modelwire has been tracking this week. The 'Diagnosing LLM Judge Reliability' paper from April 16 found that aggregate consistency scores (~96%) mask per-instance logical failures in one-third to two-thirds of documents. That paper concerned transitivity violations in pairwise comparisons. This new work adds a different dimension: even when a model judges correctly, it may not be able to produce what it just approved. The 'Context Over Content' paper from the same day showed that judges distort verdicts based on stakes framing rather than actual output quality. Taken together, the three papers describe a judge layer that is unreliable in at least three distinct ways: it is logically inconsistent, context-manipulable, and now asymmetric with respect to production ability. That has real consequences for any pipeline that uses LLM self-evaluation or peer evaluation as a quality gate.
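For readers unfamiliar with transitivity violations, the following sketch shows one generic way to count them in a judge's pairwise verdicts: a violation is a triple where A beats B, B beats C, and C beats A. This is a textbook-style check written for illustration, not the procedure used in the cited paper. The function name `count_cyclic_triples` and the `beats` data layout are assumptions made here.

```python
from itertools import combinations


def count_cyclic_triples(items, beats):
    """Count triples whose pairwise verdicts form a preference cycle.

    `beats` is a set of (winner, loser) pairs, assumed to hold exactly
    one verdict per unordered pair of items.
    """
    cycles = 0
    for a, b, c in combinations(items, 3):
        forward = (a, b) in beats and (b, c) in beats and (c, a) in beats
        backward = (b, a) in beats and (c, b) in beats and (a, c) in beats
        if forward or backward:
            cycles += 1
    return cycles


# Toy verdicts, invented for illustration only.
docs = ["x", "y", "z"]
cyclic = {("x", "y"), ("y", "z"), ("z", "x")}      # x > y > z > x
consistent = {("x", "y"), ("y", "z"), ("x", "z")}  # x > y > z
n_cyclic = count_cyclic_triples(docs, cyclic)
n_consistent = count_cyclic_triples(docs, consistent)
```

A judge can agree with itself on most individual pairs and still produce cycles like the first toy set, which is how a high aggregate consistency score can coexist with per-instance logical failures.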

Watch whether any of the major eval frameworks (LM Eval Harness, HELM, or similar) add listener-speaker asymmetry probes to their pragmatic benchmarks within the next two release cycles. If they do, this asymmetry will become a standard reported axis rather than a one-off finding.

Coverage we drew on

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Read the full story at arXiv cs.CL (arxiv.org).

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
