Research Models & Releases · arXiv cs.CL · May 26

EpiCurveBench: Evaluating VLMs on Epidemic Curve Digitization

Researchers have exposed a critical blind spot in vision-language model evaluation: existing chart-reading benchmarks ignore temporal structure and treat minor alignment errors as total failures. EpiCurveBench introduces 1,000 real epidemic curve images paired with EpiCurveSimilarity, a metric that uses dynamic programming to penalize time-series misalignments proportionally rather than catastrophically. Testing six VLMs reveals frontier models still struggle with domain-specific chart extraction when temporal coherence matters, signaling that current benchmarks mask real-world brittleness in multimodal reasoning.

Modelwire context

Explainer

The core insight isn't just that VLMs fail at epidemic curves; it's that standard benchmarks (like ChartQA) treat any misalignment as total failure, which hides the difference between small time-series errors that might be acceptable in practice and large ones that are catastrophic. EpiCurveSimilarity exposes this difference by using dynamic programming to score misalignment on a continuum rather than as pass or fail.
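To make the contrast concrete, here is a minimal sketch of proportional versus all-or-nothing scoring. It uses classic dynamic time warping (DTW) as a stand-in, since the summary does not specify how EpiCurveSimilarity is actually computed. The function names, the absolute-difference local cost, and the toy curves are all assumptions chosen for illustration, not details from the paper.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW cost between two series (a stand-in, not EpiCurveSimilarity)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def exact_match_rate(a: Sequence[float], b: Sequence[float]) -> float:
    """Fraction of positions where values agree exactly (all-or-nothing per point)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


# Toy curves, invented for illustration only.
truth = [0, 2, 5, 9, 5, 2, 0]
shifted = [0, 0, 2, 5, 9, 5, 2]  # same shape, one time step late
flat = [3, 3, 3, 3, 3, 3, 3]     # no epidemic shape at all

for name, pred in [("shifted", shifted), ("flat", flat)]:
    print(name, exact_match_rate(truth, pred), dtw_distance(truth, pred))
```

Under exact matching, the one-step-late curve and the flat line both score near zero. The DTW cost separates them clearly, which is the kind of graded distinction the metric is described as making.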

This connects directly to the annotation quality work from late May, which showed that seemingly small differences in labeling conditions (timing, fatigue) compound invisibly in aggregate metrics. EpiCurveBench makes the same argument about evaluation infrastructure: aggregate benchmark scores mask brittleness in specific failure modes. The GraphReview paper also pushed for relational context in evaluation rather than treating each artifact in isolation. Here, temporal coherence is that relational structure for time-series data.

If the same six VLMs rank substantially differently under EpiCurveSimilarity than under standard F1 metrics on the same 1,000 images, that confirms the metric changes which models look best. If the rankings stay identical, the benchmark is cosmetic. Also watch whether epidemiology teams adopt it for real data validation within the next 18 months; academic benchmarks only matter if practitioners use them.
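One way to run that ranking check is a rank correlation between the two sets of scores. The sketch below computes Kendall's tau in plain Python. The model names and score values are hypothetical placeholders, not results from the paper, and the helper `kendall_tau` is written here for illustration.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall tau-a over models present in both dicts; tied pairs count as neither."""
    models = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1  # both metrics order this pair the same way
        elif product < 0:
            discordant += 1  # the metrics disagree on this pair
    pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Hypothetical placeholder scores, NOT taken from the paper.
f1_scores = {"model_a": 0.1, "model_b": 0.2, "model_c": 0.3}
similarity_scores = {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7}

print(kendall_tau(f1_scores, f1_scores))          # identical rankings
print(kendall_tau(f1_scores, similarity_scores))  # fully reversed rankings
```

A tau near 1 would suggest the new metric leaves the leaderboard unchanged, while a value well below 1 would suggest it reorders the models.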

Coverage we drew on

Temporal Simultaneity Predicts Annotation Quality in Sentiment Corpora · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: EpiCurveBench · EpiCurveSimilarity · Vision-Language Models · ChartQA

Read the full story at arXiv cs.CL (arxiv.org)
