Research Models & Releases·arXiv cs.CL·5d ago

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Researchers have identified a critical failure mode in multimodal language models: an inability to flag when text contradicts sensory input. The IMAVB benchmark, spanning 500 video clips with controlled premise conflicts, reveals that eight open-source omnimodal LLMs and Gemini 3.1 Pro exhibit a representation-action gap, where internal representations capture sensory mismatches but the models fail to flag them in output. This finding exposes a fundamental grounding weakness in systems marketed as perception-aware agents, suggesting that multimodal alignment remains incomplete despite joint video, audio, and text processing. The gap has immediate implications for deployment in safety-critical domains where hallucination detection is essential.

Modelwire context

Explainer

The troubling detail buried in the findings is that the models are not simply blind to sensory mismatches: their internal representations do capture the conflict, but that signal never surfaces in output. This is a different problem from ordinary hallucination, because the information is present inside the model and still gets suppressed.
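To make the idea concrete, here is a minimal sketch of how a representation-action gap could be quantified. It is not the paper's method: the data is synthetic, the difference-of-means probe is a stand-in for whatever probing technique the authors used, and every name (`make_synthetic_data`, `fit_mean_difference_probe`, the 20% flag rate) is a hypothetical chosen for illustration. The sketch compares how often a simple linear probe recovers the conflict label from hidden states with how often the simulated model output flags it.

```python
import random

def make_synthetic_data(n, dim, seed):
    """Synthetic stand-in for (hidden state, conflict label, output flag) triples."""
    rng = random.Random(seed)
    states, labels, flags = [], [], []
    for _ in range(n):
        y = rng.randint(0, 1)  # 1 = text premise conflicts with the sensory input
        state = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        state[0] += 3.0 if y == 1 else -3.0  # assumed: conflict is linearly encoded
        # Assumed behaviour: the model voices the conflict only 20% of the time.
        flagged = int(y == 1 and rng.random() < 0.2)
        states.append(state)
        labels.append(y)
        flags.append(flagged)
    return states, labels, flags

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_mean_difference_probe(states, labels):
    """Linear probe whose weight vector is the difference of class means."""
    mu_pos = mean_vector([s for s, y in zip(states, labels) if y == 1])
    mu_neg = mean_vector([s for s, y in zip(states, labels) if y == 0])
    w = [p - q for p, q in zip(mu_pos, mu_neg)]
    # Place the decision boundary at the midpoint between the two class means.
    b = -sum(wi * (p + q) / 2 for wi, p, q in zip(w, mu_pos, mu_neg))
    return w, b

def probe_predict(w, b, state):
    return int(sum(wi * xi for wi, xi in zip(w, state)) + b > 0)

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

states, labels, flags = make_synthetic_data(n=400, dim=8, seed=0)
train, test = slice(0, 300), slice(300, 400)

w, b = fit_mean_difference_probe(states[train], labels[train])
probe_acc = accuracy([probe_predict(w, b, s) for s in states[test]], labels[test])
output_acc = accuracy(flags[test], labels[test])
gap = probe_acc - output_acc  # large gap: the signal is inside but not voiced

print(f"probe accuracy:  {probe_acc:.2f}")
print(f"output accuracy: {output_acc:.2f}")
print(f"gap:             {gap:.2f}")
```

In a real audit the hidden states would come from a chosen layer of the model and the flags from parsing its responses; the probe is also deliberately simple so that a high score reflects information already present in the representation rather than the probe's own capacity.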

This connects directly to coverage from the same day: 'Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry' tackles a structurally similar problem, where the gap between what a model internally represents and what it outputs is the core diagnostic challenge. Both papers are converging on the same uncomfortable finding: output-level evaluation is insufficient, and probing hidden states is necessary to understand model failures. Together, they suggest a quiet but significant shift in how researchers are framing reliability, away from benchmarking outputs and toward interrogating internal geometry.

Watch whether safety-critical deployment frameworks, particularly in robotics or medical imaging, begin requiring hidden-state auditing rather than output-only confidence thresholds as a direct response to findings like this. If IMAVB gets adopted as a standard evaluation by any major model provider within the next two release cycles, that would confirm the benchmark has real traction beyond the paper.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Gemini 3.1 Pro · IMAVB · omnimodal LLMs

Read the full story at arXiv cs.CL (arxiv.org)

