Modelwire

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations


Researchers developed diagnostic tools for assessing LLM judge reliability on text evaluation tasks, finding that while aggregate consistency appears high (~96%), one-third to two-thirds of documents show transitivity violations in their pairwise comparisons; conformal prediction sets offer per-instance confidence estimates.
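
To make the transitivity finding concrete, here is a minimal Python sketch, not the paper's code, of how per-document cycles in pairwise judge verdicts could be detected. The preference dictionary and the summary labels are illustrative assumptions, not artifacts released with the paper.

from itertools import combinations

def winner(prefs, a, b):
    # Look up the judge's verdict for an unordered pair of candidate summaries.
    return prefs.get((a, b)) or prefs.get((b, a))

def transitivity_violations(prefs, items):
    # Return unordered triples whose three pairwise verdicts form a cycle
    # (a beats b, b beats c, c beats a), i.e. no consistent ranking exists.
    violations = []
    for a, b, c in combinations(sorted(items), 3):
        wins = {x: 0 for x in (a, b, c)}
        for x, y in ((a, b), (b, c), (a, c)):
            wins[winner(prefs, x, y)] += 1
        # A transitive triple yields win counts 2, 1, 0; a cycle yields 1, 1, 1.
        if sorted(wins.values()) == [1, 1, 1]:
            violations.append((a, b, c))
    return violations

# Toy document with three candidate summaries and a cyclic set of verdicts.
prefs = {("A", "B"): "A", ("B", "C"): "B", ("A", "C"): "C"}
print(transitivity_violations(prefs, {"A", "B", "C"}))  # [('A', 'B', 'C')]

A document whose verdicts form such a cycle admits no consistent ranking of its candidates, no matter how well the judge tracks human preferences in aggregate.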

Modelwire context

Explainer

The headline number is deceptive: 96% aggregate consistency sounds reassuring until you learn it masks per-document failure rates that reach two-thirds of the test set. The contribution here is moving reliability assessment from population-level statistics to instance-level diagnostics, a different kind of tool from anything currently standard in evaluation pipelines.
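
For readers unfamiliar with the conformal side, the following is a minimal sketch of standard split conformal prediction sets, the generic recipe behind per-instance coverage guarantees. It is not the paper's exact procedure; the judge probability arrays and the three-way label scheme are assumptions made for illustration.

import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    # Calibrate on held-out data: the nonconformity score is one minus the
    # probability assigned to the true label; return the (1 - alpha) quantile
    # with the usual finite-sample correction.
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q_level, 1.0), method="higher")

def prediction_set(probs, qhat):
    # All labels whose nonconformity score clears the calibrated threshold.
    # A large set on a given document is itself a per-instance warning sign.
    return np.where(1.0 - probs <= qhat)[0]

# Toy usage with three assumed quality labels (e.g. worse / tie / better).
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(3), size=200)
cal_labels = rng.integers(0, 3, size=200)
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
print(prediction_set(np.array([0.55, 0.30, 0.15]), qhat))

The practical signal is the size of the set: an instance where several labels survive the threshold is one where the judge's verdict should not be trusted on its own, which is exactly the per-instance unreliability the aggregate number hides.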

This paper lands on the same day as 'Context Over Content: Exposing Evaluation Faking in Automated Judges,' which found that LLM judges shift their verdicts based on stakes signaling rather than actual output quality. Together, the two papers attack the same problem from opposite angles: the stakes-signaling paper shows judges can be manipulated externally, while this paper shows judges are internally inconsistent even without manipulation. Both findings point toward the same uncomfortable conclusion: LLM-as-judge pipelines are being deployed in production settings without adequate reliability guarantees at the level of individual decisions, which is precisely where those decisions matter most.

Watch whether benchmark maintainers for SummEval or similar leaderboards adopt per-instance conformal coverage as a reporting requirement alongside aggregate scores within the next two conference cycles. If they do, it signals the field is treating judge reliability as infrastructure rather than an afterthought.


Mentions: SummEval · LLM-as-judge · conformal prediction


Related

Context Over Content: Exposing Evaluation Faking in Automated Judges

arXiv cs.CL

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

arXiv cs.CL

From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

arXiv cs.CL