Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization

Researchers tested automatic prompt optimization on legal QA evaluation, finding that prompts optimized under lenient judge feedback outperform strict baselines and generalize better across different judge models. The ProTeGi method consistently beat human-designed prompts on the LEXam benchmark using Qwen3-32B and DeepSeek-V3 judges.

Modelwire context

Explainer

The buried finding here is directional. The researchers didn't just show that prompts affect scores; they showed that optimizing for leniency produces prompts that generalize better across different judge models. That suggests the optimization is capturing something structural about how these models reason rather than overfitting to one judge's quirks.

This connects directly to a cluster of judge-reliability work Modelwire covered in mid-April. The 'Diagnosing LLM Judge Reliability' piece from April 16 found that one-third to two-thirds of documents show logical inconsistencies in pairwise comparisons even when aggregate consistency looks high. That paper treated judge unreliability as a diagnostic problem. This new paper treats it as an optimization target, which is a meaningful shift in framing. The 'Context Over Content' paper from the same week showed judges can be manipulated through stakes signaling. Prompt optimization that systematically tunes judge disposition is a related but distinct attack surface: it doesn't require deceiving the judge about context, it just reshapes what the judge is asked to reward.

Watch whether ProTeGi-optimized prompts hold up when tested against the transitivity-violation diagnostic from the April 16 'Diagnosing LLM Judge Reliability' paper. If lenient prompts also reduce logical inconsistency rates, that strengthens the generalization claim. If they don't, the benchmark gains may reflect calibration drift rather than a genuine quality signal.
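For readers unfamiliar with the idea, a transitivity violation in pairwise judging is a cycle: the judge prefers A over B, B over C, and yet C over A. The sketch below is a minimal, generic illustration of counting such cycles. It is not the diagnostic from the April 16 paper, whose exact procedure is not described here. The function names, the data layout (a set of `(winner, loser)` pairs with one verdict per pair), and the toy preferences are all assumptions made for illustration.

```python
from itertools import combinations

def beats(prefs, a, b):
    # prefs is a set of (winner, loser) tuples, one entry per unordered pair
    # (hypothetical layout chosen for this sketch).
    return (a, b) in prefs

def cyclic_triples(items, prefs):
    """Return every triple whose three pairwise verdicts form a cycle."""
    cycles = []
    for a, b, c in combinations(items, 3):
        forward = beats(prefs, a, b) and beats(prefs, b, c) and beats(prefs, c, a)
        backward = beats(prefs, b, a) and beats(prefs, c, b) and beats(prefs, a, c)
        if forward or backward:
            cycles.append((a, b, c))
    return cycles

def violation_rate(items, prefs):
    """Fraction of all triples that contain a preference cycle."""
    n_triples = sum(1 for _ in combinations(items, 3))
    if n_triples == 0:
        return 0.0
    return len(cyclic_triples(items, prefs)) / n_triples

# Toy, made-up verdicts: A > B, B > C, C > A form a cycle; D loses to everyone.
items = ["A", "B", "C", "D"]
prefs = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "D"), ("B", "D"), ("C", "D")}

print(cyclic_triples(items, prefs))
print(violation_rate(items, prefs))
```

Running the same count on verdicts collected under a baseline prompt and under an optimized prompt would give two rates to compare, which is the kind of check the paragraph above has in mind.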

Coverage we drew on

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Qwen3-32B · DeepSeek-V3 · LEXam · ProTeGi

Read the full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial
