Can LLMs Judge Better Than They Generate? Evaluating Task Asymmetry, Mechanistic Interpretability and Transferability for In-Context QA


A new study challenges the foundational assumption behind LLM-as-a-Judge systems: that evaluation is inherently easier than generation. Across four QA benchmarks, the researchers found that generation accuracy exceeds self-evaluation accuracy in most cases, and their attention analysis shows evaluators devoting 3-5x less attention to the context and candidate answers than generators do. The finding has immediate implications for anyone building evaluation pipelines or relying on model self-critique for quality control, and it suggests the asymmetry may be fundamental rather than a training artifact.
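To make the comparison concrete, here is a minimal Python sketch of how a generation-versus-self-evaluation gap could be computed. It is not the study's protocol: the `Record` fields, the definition of evaluation accuracy as agreement between the model's verdict and the gold label, and the toy records are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    # All field names are hypothetical, chosen for illustration.
    generated_correct: bool     # the model's own answer matched the gold answer
    candidate_is_correct: bool  # gold label for the candidate shown to the judge
    judged_correct: bool        # the same model's verdict on that candidate


def generation_accuracy(records: List[Record]) -> float:
    """Share of questions the model answers correctly itself."""
    return sum(r.generated_correct for r in records) / len(records)


def evaluation_accuracy(records: List[Record]) -> float:
    """Share of candidates whose correctness the model judges correctly."""
    hits = sum(r.judged_correct == r.candidate_is_correct for r in records)
    return hits / len(records)


# Toy records, invented for illustration only (not results from the study).
records = [
    Record(True, True, True),     # right answer, right verdict
    Record(True, False, True),    # right answer, accepts a wrong candidate
    Record(True, True, False),    # right answer, rejects a correct candidate
    Record(False, False, False),  # wrong answer, right verdict
]

# A positive gap means generation outperforms self-evaluation.
gap = generation_accuracy(records) - evaluation_accuracy(records)
```

A positive `gap` corresponds to the pattern the summary describes, where the model answers better than it judges.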

Modelwire context

Explainer

The attention analysis is the part worth sitting with: evaluators allocating 3-5x less processing to the very context they're supposed to judge suggests this isn't a fixable calibration problem but something closer to a structural limitation in how evaluation is computed internally.
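One way to read a figure like "3-5x less" is as a ratio of attention shares. The sketch below assumes a single attention row per model role (one query position, already averaged over heads and layers) and known token positions for the context; the summary does not say how the paper aggregates attention, so the function names, the aggregation, and the toy numbers are hypothetical.

```python
from typing import Sequence


def attention_share(attn_row: Sequence[float], span: range) -> float:
    """Fraction of one query position's attention mass that lands on `span`."""
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention row must have positive mass")
    return sum(attn_row[i] for i in span) / total


def allocation_ratio(
    gen_row: Sequence[float],
    gen_span: range,
    eval_row: Sequence[float],
    eval_span: range,
) -> float:
    """How many times more attention the generator puts on the span than the evaluator."""
    return attention_share(gen_row, gen_span) / attention_share(eval_row, eval_span)


# Toy attention rows, invented for illustration (not measurements from the paper).
# Positions 1..3 stand in for the context tokens in both prompts.
gen_row = [0.05, 0.30, 0.30, 0.20, 0.15]   # 0.80 of the mass on the context
eval_row = [0.40, 0.08, 0.07, 0.05, 0.40]  # 0.20 of the mass on the context

ratio = allocation_ratio(gen_row, range(1, 4), eval_row, range(1, 4))
```

With these made-up rows the generator places about four times as much attention on the context as the evaluator, which falls inside the 3-5x range the summary reports.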

The mechanistic framing here connects directly to the Vision-Default, Prior-Override paper covered the same day, which found that only 2.5-4.8% of attention heads govern knowledge override in VLMs. Both papers are doing the same kind of work: using attention patterns to explain why a model fails at a task that looks easy from the outside. Together they reinforce a broader pattern in recent interpretability research where sparse, localized attention behavior turns out to explain capability gaps that benchmark numbers alone obscure. That matters practically because anyone relying on LLM-as-a-Judge for QA pipelines, including the truth-fusion workflows described in the Single and Multi Truth Data Fusion paper from the same day, is now building on an evaluator that may be systematically under-reading the evidence it's supposed to weigh.

Watch whether LoRA fine-tuning on evaluation-specific attention patterns closes the generation-evaluation gap on MuSiQue. If it does not, that would support the view that the asymmetry is architectural rather than a training data artifact.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: SQuAD 2.0 · DROP · HotpotQA · MuSiQue · LoRA

Read the full story at arXiv cs.CL (arxiv.org)
