Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

Researchers introduce CrossMath, a benchmark that isolates vision reasoning in multimodal models by presenting identical problems in text-only, image-only, and combined formats. The work challenges whether VLMs genuinely reason over visual input or simply leverage their text backbone's reasoning capabilities.

Modelwire context

Explainer

The deeper provocation here isn't just that VLMs might be bad at visual reasoning. It's that current benchmarks can't tell the difference between a model that genuinely reads an image and one that pattern-matches from its text training. CrossMath's controlled design, which holds problem content constant across three formats, is what makes that distinction testable rather than philosophical.

This connects directly to a cluster of benchmark-skepticism work Modelwire has been tracking. The piece on 'Diagnosing LLM Judge Reliability' from April 16 found that surface-level consistency metrics (~96% aggregate) masked logical failures in a third to two-thirds of individual cases, a similar warning about trusting aggregate scores over per-instance behavior. The broader pattern: researchers are increasingly building second-order tools that audit whether evaluation methods actually measure what they claim to measure. CrossMath belongs to that same corrective tradition, applied specifically to the vision modality.
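To see how an aggregate score can hide per-instance failures, here is a minimal arithmetic sketch. The data is made up and chosen only to mirror the pattern described above; it is not the paper's data or its method, and the function names are hypothetical.

```python
def aggregate_consistency(instances):
    """Fraction of all individual checks that pass, pooled across instances."""
    checks = [ok for inst in instances for ok in inst]
    return sum(checks) / len(checks)


def instances_with_failure(instances):
    """Fraction of instances that contain at least one failed check."""
    return sum(not all(inst) for inst in instances) / len(instances)


# Toy data (an assumption for illustration): 10 instances, 10 checks each.
# Six instances pass every check; four instances fail exactly one check.
toy = [[True] * 10 for _ in range(6)] + [[True] * 9 + [False] for _ in range(4)]

agg = aggregate_consistency(toy)         # 96 of 100 checks pass
per_instance = instances_with_failure(toy)  # 4 of 10 instances have a failure
print(f"aggregate consistency: {agg:.0%}")
print(f"instances with a failure: {per_instance:.0%}")
```

In this toy case the pooled figure is 96% while 40% of instances contain a failure, which is why per-instance reporting matters.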

Watch whether frontier VLM developers (Google, OpenAI, Anthropic) run their current models against CrossMath and publish scores disaggregated by format. If image-only performance consistently trails text-only performance on identical problems across models from different developers, that would suggest the modality gap is structural rather than incidental to specific architectures.
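A disaggregated report of this kind could be computed roughly as follows. This is a sketch under stated assumptions: the `(problem_id, format, is_correct)` record layout, the format labels, and the function names are hypothetical, and the results are toy values, not CrossMath data.

```python
from collections import defaultdict


def accuracy_by_format(results):
    """results: iterable of (problem_id, format, is_correct) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for _pid, fmt, ok in results:
        total[fmt] += 1
        correct[fmt] += int(ok)
    return {fmt: correct[fmt] / total[fmt] for fmt in total}


def paired_gap(results, a="text", b="image"):
    """Count problems solved only in format a, and only in format b."""
    by_problem = defaultdict(dict)
    for pid, fmt, ok in results:
        by_problem[pid][fmt] = bool(ok)
    a_only = b_only = 0
    for outcomes in by_problem.values():
        if a in outcomes and b in outcomes:
            a_only += outcomes[a] and not outcomes[b]
            b_only += outcomes[b] and not outcomes[a]
    return a_only, b_only


# Made-up toy results for four problems, each posed in three formats.
toy_results = [
    ("p1", "text", True), ("p1", "image", False), ("p1", "combined", True),
    ("p2", "text", True), ("p2", "image", True), ("p2", "combined", True),
    ("p3", "text", True), ("p3", "image", False), ("p3", "combined", False),
    ("p4", "text", False), ("p4", "image", False), ("p4", "combined", False),
]

scores = accuracy_by_format(toy_results)
modality_gap = scores["text"] - scores["image"]
text_only, image_only = paired_gap(toy_results)
print(scores, modality_gap, text_only, image_only)
```

Because the problems are identical across formats, the paired counts are more informative than the accuracy difference alone: they show how many specific problems a model solves from text but not from the image.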

Coverage we drew on

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Read full story at arXiv cs.CL (arxiv.org)
