Research Tools & Code · arXiv cs.CL · Apr 26

JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

Researchers have exposed a critical blind spot in LLM-as-a-judge systems: their verdicts shift substantially when prompts are reworded without changing their meaning, even though the underlying task remains identical. The JudgeSense benchmark quantifies this instability across nine judge models and finds that factuality evaluations suffer from a polarity-inversion artifact that drags consistency down to 0.63 before correction. The finding matters because automated evaluation is becoming infrastructure for model development and deployment. If judges cannot keep their decisions stable across prompt variations, using them in production pipelines introduces hidden bias and unreliability into the feedback loop that shapes which models get shipped.

Modelwire context

Explainer

The more pointed finding isn't just that judges are inconsistent; it's that factuality tasks specifically produce polarity inversions, meaning a judge can flip from 'correct' to 'incorrect' on the same answer depending on how the question is framed. That's not noise. It's a systematic directional bias baked into how these models process evaluation prompts.
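To make the polarity-inversion idea concrete, here is a minimal Python sketch of how such an artifact could depress a consistency score and how a correction could restore it. It is an illustration under stated assumptions, not the paper's method: it assumes consistency is measured as pairwise agreement between a judge's answers across paraphrased prompts, and that some paraphrases ask the negated question (for example "Does the answer contain errors?" instead of "Is the answer accurate?"). The names `JudgeCall`, `normalize`, and `pairwise_agreement`, and the example data, are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class JudgeCall:
    """One judge response to one paraphrase of the evaluation prompt (hypothetical type)."""
    answer: str     # raw judge output, "yes" or "no"
    inverted: bool  # True if this paraphrase asks the negated question


def normalize(call: JudgeCall) -> str:
    """Map a raw answer onto a common polarity by flipping answers to negated prompts."""
    if not call.inverted:
        return call.answer
    return "no" if call.answer == "yes" else "yes"


def pairwise_agreement(labels: list[str]) -> float:
    """Fraction of paraphrase pairs on which the labels match (assumed consistency metric)."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


# Illustrative data only: the same answer judged under four paraphrases,
# two of which ask the negated question.
calls = [
    JudgeCall("yes", inverted=False),
    JudgeCall("yes", inverted=False),
    JudgeCall("no", inverted=True),
    JudgeCall("no", inverted=True),
]

raw = pairwise_agreement([c.answer for c in calls])
corrected = pairwise_agreement([normalize(c) for c in calls])
print(f"raw agreement: {raw:.2f}, after polarity correction: {corrected:.2f}")
```

In this toy case the judge's underlying assessment never changes, yet the raw answers agree on only two of six paraphrase pairs. Normalizing polarity before scoring separates that artifact from genuine instability in the judge.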

This is largely disconnected from recent activity in our archive, as we have no prior coverage to anchor it to. It belongs to a growing body of work interrogating whether automated evaluation can be trusted as a substitute for human judgment. The practical stakes are high because LLM-as-a-judge is now embedded in RLHF pipelines, red-teaming workflows, and leaderboard scoring. A consistency score of 0.63 before correction, in a component that is supposed to be objective infrastructure, is the kind of number that should make teams audit what their eval stack is actually measuring.

Watch whether major eval frameworks like HELM or EleutherAI's lm-evaluation-harness incorporate prompt-variation stress tests within the next two release cycles. If they do, JudgeSense will have shifted from academic critique to operational standard; if they don't, it stays a warning that practitioners can keep ignoring.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: JudgeSense · Judge Sensitivity Score

Read full story at arXiv cs.CL → (arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

