
Multimodal QUD: Inquisitive Questions from Scientific Figures
Researchers have constructed a benchmark for evaluating vision-language models (VLMs) on their ability to generate curiosity-driven questions about scientific figures in context, moving beyond simple information extraction. The work addresses a gap in VLM evaluation: current benchmarks test surface-level visual comprehension, but scientific communication requires models to understand authorial intent and generate questions that probe deeper insights. This matters because it exposes whether VLMs can reason about multimodal scientific discourse the way humans do when reading papers, and it signals where next-generation evaluation frameworks need to focus as models become more sophisticated at handling complex, domain-specific visual reasoning.




























