Research Models & Releases·arXiv cs.CL·4d ago

SHOVIR: A Benchmark for Evaluating Vision Shortcut Learning in Radiology Report Generation

Researchers have identified a critical blind spot in how radiology AI systems are evaluated. Current benchmarks reward models for generating clinically plausible reports even when those reports do not reflect pathology actually visible in the images, a failure mode enabled by learned statistical shortcuts rather than genuine visual reasoning. SHOVIR, a new benchmark built on spatially annotated chest X-ray datasets with disease-level labels, uses targeted occlusion experiments to test whether a model's diagnostic claims are grounded in image evidence. The work exposes a gap between metric performance and clinical reliability in vision-language models, with direct implications for deployment safety in medical imaging.

Modelwire context

Explainer

The deeper problem SHOVIR surfaces is not just that models hallucinate findings. It is that existing evaluation pipelines cannot tell a correct report produced by genuine image reasoning apart from one produced from dataset priors. Spatial occlusion is the forcing function that makes that distinction testable: if the image region supporting a finding is hidden and the model still reports the finding, the claim was not grounded in that region.
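To make the occlusion idea concrete, here is a minimal sketch in Python. It is not SHOVIR's actual protocol, which the summary above does not specify. It assumes a hypothetical `model` callable that maps an image array to a set of finding labels, and a hypothetical annotation box given as `(x0, y0, x1, y1)` pixel coordinates. The helper names `occlude_region` and `grounding_check`, the two toy models, and the synthetic image are all illustrative inventions.

```python
import numpy as np


def occlude_region(image, box, fill=None):
    """Return a copy of `image` with the box (x0, y0, x1, y1) blanked out.

    The fill value defaults to the image mean (an assumption; other
    occlusion strategies such as blurring or inpainting are possible).
    """
    x0, y0, x1, y1 = box
    out = image.copy()
    out[y0:y1, x0:x1] = image.mean() if fill is None else fill
    return out


def grounding_check(model, image, box, finding):
    """Compare the model's claim before and after hiding the evidence.

    `model` is a hypothetical callable: image -> set of finding labels.
    """
    if finding not in model(image):
        return "not_reported"
    still_reported = finding in model(occlude_region(image, box))
    # A finding that survives removal of its evidence suggests a shortcut.
    return "shortcut_suspected" if still_reported else "grounded"


# Toy stand-ins for real report generators (purely illustrative).
def pixel_model(image):
    # Reports an opacity only if bright pixels are actually present.
    return {"opacity"} if image.max() > 0.9 else set()


def prior_model(image):
    # Ignores the image and always reports the common finding.
    return {"opacity"}


# Synthetic 64x64 "X-ray" with one bright annotated patch.
img = np.full((64, 64), 0.2)
box = (10, 20, 25, 30)  # x0, y0, x1, y1
img[20:30, 10:25] = 1.0

result_pixel = grounding_check(pixel_model, img, box, "opacity")
result_prior = grounding_check(prior_model, img, box, "opacity")
print(result_pixel)  # grounded
print(result_prior)  # shortcut_suspected
```

Both toy models produce the same report on the unmodified image, so a text-overlap metric would score them identically. Only the occluded comparison separates them, which is the distinction the paragraph above describes.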

This connects directly to the EvalSafetyGap framework covered the same day, which argues that evaluation metrics and safety signals can show improvement while the underlying capabilities they are supposed to measure remain unverified. SHOVIR documents a domain-specific instance of that failure mode: radiology VLMs score well on standard metrics while the measurement instrument itself is blind to shortcut learning. The distributionally robust reconstruction work from the same period adds a complementary angle, since both papers respond to the same underlying problem: models that perform well on their training distributions but fail when deployment conditions change.

Watch whether the major radiology VLM developers (Microsoft, Google, Nuance) adopt SHOVIR in published evaluations within the next 12 months. If they do not, that silence will itself be informative about how the field weighs benchmark inconvenience against deployment safety claims.

Coverage we drew on

EvalSafetyGap: A Hybrid Survey and Conceptual Framework for LLM Evaluation-Safety Failures · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: SHOVIR · MIMIC-CXR · PadChest-GR · CheXpert · Vision-Language Models · Radiology Report Generation

