Research Models & Releases·arXiv cs.CL·3d ago

Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2

A new solver for ARC-AGI-2 challenges the assumption that better reasoning traces lead to better answers. It generates diverse candidate solutions independently across text, image, and code modalities, then has a single judge model compare all traces together. This recovers correct minority answers that consensus methods such as majority voting discard. The bottleneck shifts from generation to selection, an insight for few-shot visual reasoning that also bears on how multimodal systems should handle high-stakes tasks where confidence and correctness diverge.

Modelwire context

Explainer

The paper's most underreported contribution is the holistic judging mechanism itself: rather than scoring traces individually and taking a majority vote, a single judge evaluates all candidates together, which means the judge's own calibration becomes the critical variable. The system's ceiling is now bounded by how well the judge distinguishes correct-but-unusual answers from plausible-but-wrong ones.
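The contrast between majority voting and holistic judging can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Candidate` type, the `judge` callable, and especially `toy_judge`, which stands in for a real judge model call. It is not the paper's implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence

Grid = tuple[tuple[int, ...], ...]  # an ARC-style grid as nested tuples (hashable)


@dataclass(frozen=True)
class Candidate:
    modality: str  # "text", "image", or "code"
    trace: str     # the reasoning trace that produced the answer
    answer: Grid   # the proposed output grid


def majority_vote(candidates: Sequence[Candidate]) -> Grid:
    """Consensus baseline: return the most frequent answer, ignoring traces."""
    counts = Counter(c.answer for c in candidates)
    return counts.most_common(1)[0][0]


def holistic_select(
    candidates: Sequence[Candidate],
    judge: Callable[[Sequence[Candidate]], int],
) -> Grid:
    """Show every candidate to the judge in one call; it returns the index it picks."""
    idx = judge(candidates)
    if not 0 <= idx < len(candidates):
        raise ValueError(f"judge returned invalid index {idx}")
    return candidates[idx].answer


def toy_judge(candidates: Sequence[Candidate]) -> int:
    """Hypothetical stand-in for a judge model: prefer a trace that reports
    checking its rule against the training pairs, else fall back to the first."""
    for i, c in enumerate(candidates):
        if "verified on all training pairs" in c.trace:
            return i
    return 0


wrong: Grid = ((1, 1), (1, 1))
right: Grid = ((1, 0), (0, 1))
candidates = [
    Candidate("text", "looks like a fill operation", wrong),
    Candidate("image", "region appears filled", wrong),
    Candidate("code", "program verified on all training pairs", right),
]

vote_answer = majority_vote(candidates)                 # the two-vote wrong grid
judge_answer = holistic_select(candidates, toy_judge)   # the minority correct grid
```

In this toy case the correct answer is held by one candidate out of three, so voting discards it while the judge can still pick it. The outcome depends entirely on the judge, which is the calibration point made above.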

This connects directly to the CLExEval paper covered the same day, which found that fluent output masking reasoning errors remains an unsolved problem in high-stakes domains. Both papers document the same failure mode from different angles: a model can produce confident, coherent output while being wrong, and standard scoring methods won't catch it. The ARC-AGI-2 work proposes holistic judging as a partial fix, but CLExEval's finding that verbosity bias can collapse accuracy from 95% to 32.5% suggests that judge models may be susceptible in the same way. The localized conformal prediction work (story 3) is also relevant here, since per-sample confidence calibration is precisely what a robust holistic judge would need.
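One way to probe a judge for the verbosity bias described here is a perturbation test: pad the wrong traces with filler and count how often the judge abandons a pick it previously got right. The sketch below is a hypothetical check, not a procedure from either paper. The `Judge` signature, the filler text, and both toy judges are assumptions for illustration.

```python
from typing import Callable, Sequence

Judge = Callable[[Sequence[str]], int]  # takes all traces, returns the chosen index

FILLER = " Furthermore, this is consistent with the examples."


def pad_trace(trace: str, repeats: int = 5) -> str:
    """Make a trace longer without adding any new evidence."""
    return trace + FILLER * repeats


def verbosity_flip_rate(cases: Sequence[tuple[list[str], int]], judge: Judge) -> float:
    """Among cases the judge initially gets right, return the fraction where
    padding only the wrong traces makes it switch to a wrong one."""
    eligible = 0
    flips = 0
    for traces, correct in cases:
        if judge(traces) != correct:
            continue  # only measure cases the judge solved before padding
        eligible += 1
        padded = [t if i == correct else pad_trace(t) for i, t in enumerate(traces)]
        if judge(padded) != correct:
            flips += 1
    return flips / eligible if eligible else 0.0


def length_biased_judge(traces: Sequence[str]) -> int:
    """Toy judge that simply prefers the longest trace."""
    return max(range(len(traces)), key=lambda i: len(traces[i]))


def keyword_judge(traces: Sequence[str]) -> int:
    """Toy judge that looks for a claim of verification and ignores length."""
    return next((i for i, t in enumerate(traces) if "verified" in t), 0)


cases = [
    (["short wrong guess", "rule verified against every training pair"], 1),
    (["wrong", "verified on examples"], 1),
]

biased_rate = verbosity_flip_rate(cases, length_biased_judge)
robust_rate = verbosity_flip_rate(cases, keyword_judge)
```

A flip rate near zero on a realistic case set would be weak evidence that the judge is not simply rewarding length. A high rate would indicate the susceptibility the paragraph above warns about.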

Watch whether the ARC Prize leaderboard shows this approach holding its advantage as ARC-AGI-2 test sets are refreshed with harder visual abstractions. If the selection gain shrinks when harder prompts constrain generation diversity, that would suggest the bottleneck has simply moved rather than been resolved.
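To make "selection gain" concrete, one could track three per-task outcomes: whether any candidate was correct, whether majority voting was correct, and whether the judge's pick was correct. The sketch below uses metric names of our own choosing (`coverage`, `selection_gain`, `selection_gap`) and made-up outcomes. None of it comes from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TaskResult:
    any_candidate_correct: bool  # oracle: at least one generated candidate was right
    majority_correct: bool       # the consensus answer was right
    judge_correct: bool          # the judge's selected answer was right


def summarize(results: Sequence[TaskResult]) -> dict[str, float]:
    """Separate what generation made available from what selection recovered."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    coverage = sum(r.any_candidate_correct for r in results) / n
    majority_acc = sum(r.majority_correct for r in results) / n
    judge_acc = sum(r.judge_correct for r in results) / n
    return {
        "coverage": coverage,                       # ceiling set by generation
        "majority_acc": majority_acc,
        "judge_acc": judge_acc,
        "selection_gain": judge_acc - majority_acc,  # what the judge adds over voting
        "selection_gap": coverage - judge_acc,       # correct answers still left unselected
    }


# Illustrative, made-up outcomes for four tasks.
results = [
    TaskResult(True, True, True),
    TaskResult(True, False, True),    # judge recovers a minority answer
    TaskResult(True, False, False),   # correct candidate existed, nobody picked it
    TaskResult(False, False, False),  # generation never produced the answer
]
summary = summarize(results)
```

If `coverage` falls on refreshed test sets while `selection_gap` stays small, the limit is generation again. If `coverage` holds and `selection_gap` widens, the judge is the constraint.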

Coverage we drew on

CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: ARC-AGI-2 · Large Language Models · ARC Prize

Read full story at arXiv cs.CL (arxiv.org)
