Research Models & Releases · arXiv cs.CL · Jun 24

How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

Researchers have exposed a critical vulnerability in vision-language models (VLMs): their OCR reasoning degrades sharply under visual corruption, a fragility that has so far gone largely unmeasured. The new OCR-Robust benchmark systematically evaluates how VLMs handle degraded document images, scene text, charts, and tables, and it reveals gaps between lab performance and real-world robustness. This matters because production deployments of document AI, receipt scanning, and form processing run these models in noisy, low-quality capture environments where visual perturbations are inevitable. The finding signals that current VLM benchmarks may overstate practical reliability.

Modelwire context

Explainer

The benchmark's contribution is not just showing that VLMs degrade under noise, which practitioners already suspected. It also provides a structured taxonomy of perturbation types across document categories, which makes it possible to compare models along a consistent axis that did not previously exist.

This fits directly into a pattern Modelwire has been tracking: frontier models perform well on clean benchmarks while failing on the structural reliability properties that matter in deployment. The order-sensitivity audit covered in 'Same Evidence, Different Answer' found flip rates of 24-50% across 18 models when input sequences were shuffled, and the failure it describes is analogous to the one here. Both papers measure the gap between controlled evaluation and real-world variance, just along different input dimensions. Taken together, they suggest the benchmark ecosystem systematically rewards performance on idealized inputs while leaving robustness to distribution shift unmeasured.
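To make the flip-rate figure concrete, here is a minimal sketch of how such a metric could be computed. It assumes a flip means the model's answer changes between the original input and the perturbed one (shuffled order in the audit, visual noise in OCR-Robust). Neither paper's exact definition is given in the summary above, and the function name and the answers are invented for illustration.

```python
from typing import Sequence


def flip_rate(baseline: Sequence[str], perturbed: Sequence[str]) -> float:
    """Fraction of items whose answer changes after the input is perturbed.

    Assumption: answers are compared after trimming whitespace and
    lowercasing. A real audit may use a stricter or looser match.
    """
    if len(baseline) != len(perturbed):
        raise ValueError("answer lists must be the same length")
    if not baseline:
        raise ValueError("need at least one item")
    flips = sum(
        a.strip().lower() != b.strip().lower()
        for a, b in zip(baseline, perturbed)
    )
    return flips / len(baseline)


# Toy answers, invented for illustration only (not data from either paper).
clean_answers = ["42", "Paris", "yes", "$18.50"]
noisy_answers = ["42", "paris", "no", "$13.50"]

rate = flip_rate(clean_answers, noisy_answers)
print(f"flip rate: {rate:.2f}")  # 2 of 4 answers change, so this prints 0.50
```

A flip rate counts changed answers rather than wrong ones, so it can be high even when clean accuracy looks strong, which is the gap both papers point at.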

Watch whether OCR1.0 and OCR2.0 model families show meaningfully different degradation curves on the benchmark's scene-text versus structured-document splits. If the gap between those two categories is large, it points to architecture choices rather than training data as the primary driver of fragility.
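One way to run that comparison, sketched under assumptions: each split yields a list of accuracies indexed by perturbation severity, with index 0 being the clean score. The split names, severity levels, and numbers below are hypothetical and are not results from the benchmark.

```python
from typing import Dict, List


def relative_drop(curve: List[float]) -> List[float]:
    """Relative accuracy loss at each severity versus the clean score (index 0)."""
    clean = curve[0]
    if clean <= 0:
        raise ValueError("clean accuracy must be positive")
    return [(clean - acc) / clean for acc in curve]


def split_gap(curves: Dict[str, List[float]], a: str, b: str) -> List[float]:
    """Per-severity difference in relative drop: split a minus split b."""
    drop_a = relative_drop(curves[a])
    drop_b = relative_drop(curves[b])
    if len(drop_a) != len(drop_b):
        raise ValueError("curves must cover the same severity levels")
    return [x - y for x, y in zip(drop_a, drop_b)]


# Hypothetical accuracies at severity 0 (clean) through 3; not benchmark results.
toy_curves = {
    "scene_text": [0.80, 0.72, 0.60, 0.40],
    "structured_document": [0.90, 0.72, 0.45, 0.27],
}

gap = split_gap(toy_curves, "structured_document", "scene_text")
print([round(g, 2) for g in gap])
```

Normalizing by the clean score keeps a split with a higher starting accuracy from looking more fragile simply because it has more accuracy to lose.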

Coverage we drew on

Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Vision-Language Models · OCR-Robust · OCR1.0 · OCR2.0
