Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery

A new evaluation reveals a counterintuitive weakness in vision-language models (VLMs): adding real images to lexical judgment tasks often degrades performance rather than improving it, particularly when the visual context is irrelevant to the semantic task. Using human concreteness and imagery ratings as a benchmark, researchers found that VLMs struggle to separate spurious visual signals from task-relevant information, which suggests that the field's assumption that multimodal inputs universally enhance understanding may be flawed. The finding has implications for how practitioners design VLM applications, and for deciding where visual grounding adds value and where it introduces noise.

Modelwire context


The study uses human psycholinguistic ratings, specifically concreteness and imagery scores, as its benchmark rather than task-accuracy proxies. The failure mode being measured is therefore semantic: the models are not misidentifying objects, they are letting visual salience override lexical judgment in ways humans do not.
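One way to picture this kind of comparison is a rank correlation between model ratings and human ratings, computed once for text-only prompts and once for prompts with an image attached. The sketch below is illustrative only: it assumes Spearman correlation as the agreement measure, which the summary does not confirm, and every word and rating in it is invented for the example rather than taken from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical 1-5 concreteness ratings for five made-up test words.
words = ["apple", "hammer", "justice", "hope", "journey"]
human = [4.9, 4.6, 1.5, 1.8, 3.0]
model_text_only = [4.8, 4.5, 1.7, 2.0, 3.1]
model_with_image = [4.7, 4.9, 3.9, 2.0, 4.1]  # image inflates abstract words

rho_text = spearman(human, model_text_only)
rho_image = spearman(human, model_with_image)
print(f"text-only agreement:  {rho_text:.2f}")
print(f"with-image agreement: {rho_image:.2f}")
```

In this toy setup the image condition agrees less with the human ranking than the text-only condition, which is the shape of the degradation the study describes. The size of the gap here says nothing about the real results.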

This connects directly to the MATCHA paper covered the same day, which argued that embedding-based metrics mask critical model failures because they cannot distinguish semantic contradictions. Both papers point at the same underlying problem from different directions: current evaluation frameworks do not adequately capture when models are processing meaning and when they are processing surface signal. The concreteness study adds a multimodal dimension to that critique, showing that irrelevant visual input can actively degrade semantic reasoning rather than simply failing to help. Together, the two papers suggest the field is accumulating evidence that input richness and semantic fidelity are not the same thing, and that practitioners who conflate them will build systems that fail in non-obvious ways.

If follow-up work shows this degradation pattern holds specifically on abstract or relational concepts (low-concreteness words) but not on high-concreteness nouns, that would give practitioners a concrete filtering heuristic for when to suppress visual context in VLM pipelines.
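A filtering heuristic of this kind could look roughly like the sketch below. It is a sketch under stated assumptions, not something the study proposes: the `VLMRequest` type, the `build_request` helper, the prompt wording, and the 3.0 cutoff on a 1-5 scale are all hypothetical, and the ratings table stands in for whatever concreteness norms a pipeline has available.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed cutoff on a 1-5 concreteness scale; not a value from the study.
CONCRETENESS_THRESHOLD = 3.0


@dataclass
class VLMRequest:
    """Hypothetical request object: a prompt plus an optional image."""
    prompt: str
    image_path: Optional[str]


def build_request(
    word: str,
    image_path: str,
    concreteness: Dict[str, float],
    threshold: float = CONCRETENESS_THRESHOLD,
) -> VLMRequest:
    """Attach the image only when the word is rated concrete enough.

    Words missing from the ratings table are treated as abstract, so the
    image is dropped by default. That default is a design choice here.
    """
    rating = concreteness.get(word)
    keep_image = rating is not None and rating >= threshold
    return VLMRequest(
        prompt=f"Rate the concreteness of the word '{word}' from 1 to 5.",
        image_path=image_path if keep_image else None,
    )


# Invented ratings for illustration only.
ratings = {"apple": 4.8, "justice": 1.5}

concrete_req = build_request("apple", "apple.jpg", ratings)
abstract_req = build_request("justice", "courtroom.jpg", ratings)
print(concrete_req.image_path)
print(abstract_req.image_path)
```

Dropping the image for unrated words is the conservative option if visual context is suspected of adding noise. A pipeline that trusts its images more could invert that default.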

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.


Read the full story at arXiv cs.CL (arxiv.org).

