Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Researchers developed diagnostic tools for assessing the reliability of LLM judges in text evaluation tasks. Aggregate consistency appears high (~96%), yet one-third to two-thirds of documents show logical inconsistencies (transitivity violations) in pairwise comparisons. Conformal prediction sets offer per-instance confidence estimates.
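To make the transitivity check concrete, here is a minimal sketch of how a cycle in a judge's pairwise preferences could be detected for one document. It is an illustration, not the paper's method: the function name `find_transitivity_violations`, the `prefers` callback, and the toy judgments are all hypothetical, and the sketch assumes strict preferences (no ties) with exactly one verdict per pair.

```python
from itertools import combinations


def find_transitivity_violations(items, prefers):
    """Return every triple whose pairwise preferences form a cycle.

    `prefers(x, y)` is a hypothetical callback that returns True when the
    judge ranks x above y. Strict preferences are assumed, so
    prefers(x, y) is always the opposite of prefers(y, x).
    """
    violations = []
    for a, b, c in combinations(items, 3):
        ab, bc, ca = prefers(a, b), prefers(b, c), prefers(c, a)
        # All True means a > b > c > a; all False means a > c > b > a.
        # Either way the three verdicts cannot come from one ranking.
        if ab == bc == ca:
            violations.append((a, b, c))
    return violations


# Toy judgments for four candidate summaries of a single document.
# Each pair (winner, loser) appears exactly once; the values are made up.
wins = {
    ("s1", "s2"), ("s2", "s3"), ("s3", "s1"),  # a cycle among s1, s2, s3
    ("s1", "s4"), ("s2", "s4"), ("s3", "s4"),  # s4 loses to everything
}
items = ["s1", "s2", "s3", "s4"]


def prefers(x, y):
    return (x, y) in wins


violations = find_transitivity_violations(items, prefers)
document_is_inconsistent = bool(violations)
print(violations)
```

Under these assumptions a document counts as inconsistent when at least one such triple exists, which is how a high aggregate consistency figure can coexist with a large share of documents containing a violation.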

Mentions: SummEval · LLM-as-judge · conformal prediction

Read full story at arXiv cs.LG → (arxiv.org)

