Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue

Some vision-language models (VLMs) show a systematic failure in collaborative dialogue: they overestimate mutual understanding when given visual access to shared content. Researchers tested VLMs on reference-matching tasks drawn from real dialogue corpora and found that authentic images paradoxically push models toward false confidence that both speakers are aligned, while even non-visual text descriptions trigger the same bias. The result points to a gap between perception and pragmatic reasoning: seeing the content is not the same as knowing what a partner has understood. The finding matters because VLMs deployed in interactive settings may confidently misinterpret user intent, creating failure modes that standard benchmarks, which lack asymmetric information constraints, do not expose.

Modelwire context

Explainer

The paper isolates a failure mode orthogonal to safety or capability: VLMs do not just make errors; they express false confidence in mutual understanding when visual grounding is present. The issue is not what the models know but how poorly calibrated their sense of shared understanding is in interactive settings.
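One way to make "overestimating common ground" concrete is to compare how often a model says two speakers share a referent with how often they actually do. The sketch below is illustrative only: the `Judgment` record, the two metric functions, and the toy data are assumptions made for this example, not the paper's actual protocol or scoring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    """One hypothetical reference-matching trial (not the paper's schema)."""
    predicted_shared: bool  # model claims both speakers mean the same referent
    actually_shared: bool   # gold label from the dialogue annotation


def overestimation_gap(trials: List[Judgment]) -> float:
    """Predicted rate of shared reference minus the annotated rate.

    A positive value means the model claims common ground more often
    than the annotations support.
    """
    if not trials:
        raise ValueError("need at least one trial")
    predicted = sum(t.predicted_shared for t in trials) / len(trials)
    actual = sum(t.actually_shared for t in trials) / len(trials)
    return predicted - actual


def false_match_rate(trials: List[Judgment]) -> float:
    """Fraction of truly mismatched trials that the model labels as matched."""
    mismatched = [t for t in trials if not t.actually_shared]
    if not mismatched:
        return 0.0
    return sum(t.predicted_shared for t in mismatched) / len(mismatched)


# Toy data, invented for illustration: the model says "shared" in three of
# four trials, but the speakers actually share a referent in only one.
toy_trials = [
    Judgment(predicted_shared=True, actually_shared=True),
    Judgment(predicted_shared=True, actually_shared=False),
    Judgment(predicted_shared=True, actually_shared=False),
    Judgment(predicted_shared=False, actually_shared=False),
]

if __name__ == "__main__":
    print("overestimation gap:", overestimation_gap(toy_trials))
    print("false match rate:", false_match_rate(toy_trials))
```

Computing both numbers separately for an image condition and a text-only condition would be one way to check whether visual access widens the gap, which is the pattern the summary describes.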

This connects directly to the multimodal safety work from late June (the MARS paper on refusal directions), which showed that safety properties transfer across modalities. That work assumed VLMs could be steered reliably once aligned. The dialogue study points to an earlier problem: VLMs may confidently misinterpret user intent before safety guardrails even activate. The asymmetry finding also echoes the dysarthric ASR case study, where personalized adaptation revealed that foundation models trained on general data fail on underrepresented distributions. Here, the 'underrepresented distribution' is asymmetric dialogue, a core real-world constraint that standard benchmarks do not capture.

If teams deploying VLMs in collaborative settings (e.g., image annotation, visual Q&A with humans) report higher misalignment rates than supervised benchmarks predict, that would indicate a deployment hazard rather than just a research artifact. Watch whether the next generation of VLM evals explicitly includes asymmetric information tasks by Q4 2026.

Coverage we drew on

Harnessing Textual Refusal Directions for Multimodal Safety · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Vision-Language Models · HCRC MapTask · arXiv

Read the full story at arXiv cs.CL (arxiv.org)
