Document-as-Image Representations Fall Short for Scientific Retrieval

Researchers challenge the document-as-image paradigm dominating scientific retrieval benchmarks, arguing that rendering papers as pixels obscures structured content like tables and equations. They introduce ArXivDoc, a new benchmark built from LaTeX sources to better evaluate how models handle text-rich multimodal documents.
Modelwire context
Explainer
The deeper provocation here is not just that pixel-based retrieval benchmarks are imperfect, but that the entire ViDoRe benchmark family may be measuring rendering quality as a proxy for comprehension, which means models optimized on those leaderboards could be systematically bad at the structured content that actually appears in scientific literature.
This connects most directly to the LLM evaluation reliability thread running through recent Modelwire coverage. The 'Context Over Content: Exposing Evaluation Faking in Automated Judges' piece from April 16 documented how evaluation pipelines can be gamed or distorted at the judge layer; this paper surfaces an analogous problem one step earlier, at the benchmark construction layer itself. Both stories point toward the same uncomfortable conclusion: the scaffolding used to measure progress in AI is fragile in ways that compound quietly. The Codex coverage from the same week is not meaningfully connected here. This story belongs to a slower-moving conversation about scientific AI tooling and retrieval-augmented research assistants.
Watch whether teams building retrieval systems for scientific literature, particularly those using ViDoRe as a primary eval, begin adopting ArXivDoc as a secondary benchmark within the next two conference cycles. Adoption by even one major retrieval lab would signal the critique has traction beyond the paper itself.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: ArXivDoc · ArXivQA · ViDoRe
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.