Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

Vision-language models routinely generate plausible outputs driven by text priors alone, with the image playing no role in the prediction. This 'visual ungroundedness' defeats existing confidence metrics, which cannot distinguish image-informed from image-agnostic reasoning. BICR addresses this by training a lightweight probe on contrastive hidden states extracted from frozen large vision-language models (LVLMs) under two conditions: normal inference with the image present, and inference with the image blacked out. The method surfaces whether a model's confidence reflects genuine visual grounding or mere language pattern matching, a useful diagnostic for production deployments where hallucination risk is high.
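To make the two-condition setup concrete, here is a minimal sketch in Python with NumPy. It is not the paper's implementation: the hidden states are synthetic, the contrastive feature is assumed to be a simple difference between the image-present and blacked-out states, and the probe is assumed to be plain logistic regression. The names `blind_image`, `contrast`, and `train_probe` are hypothetical and chosen for illustration only.

```python
import numpy as np

def blind_image(image):
    """Blacked-out copy of an image array (same shape and dtype, all zeros)."""
    return np.zeros_like(image)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contrast(h_img, h_blind):
    """ASSUMPTION: the contrastive feature is the shift in hidden state
    between the image-present and image-blacked-out conditions."""
    return h_img - h_blind

def train_probe(X, y, lr=0.1, steps=500):
    """ASSUMPTION: the lightweight probe is logistic regression,
    fitted here with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        err = sigmoid(X @ w + b) - y      # gradient of the log loss w.r.t. the logit
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

# Synthetic stand-ins for hidden states from a frozen model (not real LVLM data).
rng = np.random.default_rng(0)
n, d = 600, 16
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

grounded = rng.integers(0, 2, size=n)     # 1 = the image actually moved the state
h_blind = rng.normal(size=(n, d))         # condition 2: image blacked out
h_img = (h_blind                          # condition 1: image present
         + 2.0 * grounded[:, None] * direction
         + 0.3 * rng.normal(size=(n, d)))

X = contrast(h_img, h_blind)
y = grounded.astype(float)

w, b = train_probe(X[:400], y[:400])      # fit on the first 400 examples
scores = sigmoid(X[400:] @ w + b)         # probe confidence on held-out examples
accuracy = float(((scores > 0.5) == (y[400:] > 0.5)).mean())
print(f"held-out accuracy of the toy probe: {accuracy:.3f}")
```

In a real setting, the synthetic arrays would be replaced by hidden states captured from two forward passes of the same frozen model, one on the original image and one on `blind_image(image)`. Which layer to read from and what the probe is trained to predict are details the summary above does not specify.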

Modelwire context

Explainer

The deeper issue BICR surfaces is not just how often a model hallucinates but where the hallucination comes from: a model can produce a correct answer for entirely the wrong reason, and existing confidence scores cannot tell the difference. That distinction matters for any deployment where you need to know whether the model is actually reading the image or confabulating from text patterns.

This connects to a broader reliability gap that WildClawBench (covered the same day) also exposes from a different angle: synthetic evaluation metrics routinely miss failure modes that only appear under realistic conditions. BICR's blind-image contrastive approach does at the inference level what WildClawBench does at the task level: it forces the model into a condition that reveals whether its apparent competence is genuine. Both papers push toward the same conclusion: standard forward-pass evaluation is insufficient for production credibility.

The probe is trained on frozen LVLM hidden states, so watch whether the method holds when applied to models it was not trained on. If BICR's grounding signal transfers across model families without retraining, it becomes a practical deployment tool; if it requires per-model calibration, its utility narrows considerably.

Coverage we drew on

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.


Read the full story at arXiv cs.CL (arxiv.org).
