Research Models & Releases · arXiv cs.CL · 5d ago

Can OCR-VLMs Read Devanagari? A Stress-Test Benchmark and Post-Correction Study

A new benchmark reveals a critical blind spot in multimodal AI systems: most OCR and vision-language models perform well on clean English and Chinese text but fail dramatically on Devanagari script under real-world degradation. Tests of ten systems, ranging from EasyOCR to GPT-5.5 and Claude Opus, show that specialized OCR-VLMs, despite their focus, are surprisingly fragile compared to frontier closed models. This exposes a systematic gap in how the industry evaluates and trains vision systems, and suggests that strong performance on dominant languages masks poor generalization to non-Latin scripts used by billions of people globally.

Modelwire context

Explainer

The more pointed finding is that OCR-specialized models, systems built explicitly for text recognition, are outperformed by general-purpose frontier models on degraded Devanagari, which suggests the specialization is narrower than advertised and likely optimized around Latin and CJK script test sets.
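To make "outperformed on degraded Devanagari" concrete, here is a minimal sketch of how a clean-versus-degraded gap could be scored. It assumes character error rate (CER) over Unicode code points as the metric; the summary above does not say which metric the paper uses, and the function names `edit_distance`, `cer`, and `degradation_gap` are hypothetical, not taken from the benchmark.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate over NFC-normalized code points (assumed metric)."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def degradation_gap(reference, clean_output, degraded_output):
    """How much CER rises when the same text is read from a degraded image."""
    return cer(reference, degraded_output) - cer(reference, clean_output)


if __name__ == "__main__":
    truth = "नमस्ते"
    # Hypothetical model outputs: perfect on the clean scan,
    # one dropped virama on the degraded scan.
    print(degradation_gap(truth, "नमस्ते", "नमसते"))
```

Counting code points is a deliberate simplification: a Devanagari conjunct spans several code points, so a grapheme-level or word-level score could rank systems differently, which is the kind of methodology sensitivity the analysis below points to.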

This connects directly to a pattern Modelwire has been tracking across multiple papers from this same window. The KrishokChat paper (story 4) flagged how English-centric benchmarks mask real performance gaps for the billions of users whose languages sit outside dominant training distributions. That work focused on Bengali agricultural text; this benchmark makes the same structural argument but targets vision systems rather than language models. Both papers are pointing at the same upstream problem: evaluation suites that reward performance on high-resource scripts while leaving low-resource script failure invisible until a stress test forces it into view. The diffusion LLM evaluation piece (story 1) adds a third angle, showing that benchmark rankings shift with methodology, which means the OCR field's confidence in its own leaderboards may be doubly fragile.

Watch whether Qwen or DeepSeek release updated model cards that include Devanagari degradation benchmarks within the next two quarters. If they do, it signals the research community's pressure is reaching training pipeline decisions; if they don't, this benchmark risks becoming a citation without consequence.

Coverage we drew on

KrishokChat: A Citation-Grounded Dataset and Benchmark for Bengali Agricultural Advisory · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: EasyOCR · Qwen2.5-VL-3B · Qwen3-VL-8B · DeepSeek-OCR · Gemini 2.5 Flash · Claude Opus 4.7

