VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

A new neuroscience-grounded study challenges the assumption that multimodal pretraining automatically improves language model alignment with human cognition. The researchers directly compared LLMs and VLMs against fMRI and eye-tracking data recorded during natural reading and found that vision-language training does not uniformly enhance text-based human alignment. The result complicates the narrative around multimodal scaling and suggests that architectural choices and training objectives matter more than raw modality breadth. Practitioners may need to reconsider whether vision-language fusion genuinely advances human-centered AI or merely adds computational overhead.
Modelwire context
Explainer
The study's real contribution is not that VLMs sometimes underperform on text tasks, which was already known, but that multimodal pretraining does not uniformly improve alignment with human neural activity during natural reading. This suggests the cognitive benefit of vision-language training is narrower than the scaling narrative implies.
This connects directly to the PEFT-Arena finding from the same day about stability-plasticity trade-offs in model adaptation. Both papers expose a common blind spot: we optimize for one metric (downstream accuracy or modality breadth) while eroding something else practitioners assumed was free (general capability retention or human-centered alignment). The PEFT work showed that parameter efficiency doesn't automatically preserve pretrained knowledge; this VLM study shows that architectural expansion doesn't automatically improve cognitive alignment. Together they suggest that more capacity or more modalities require explicit design choices to preserve what already works.
If follow-up work within the next 12 months finds that VLMs trained with explicit human-alignment objectives (e.g., fMRI-informed loss functions) close the alignment gap, that would point to the training objective, not the architecture, as the issue. If VLMs remain misaligned even with such objectives, the problem is more likely fundamental to vision-language fusion, and practitioners should deprioritize multimodal scaling for text-heavy applications.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: LLM · VLM · fMRI · multimodal pretraining
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.