Lost in Translation: Do LVLM Judges Generalize Across Languages?

Researchers released MM-JudgeBench, a 60K-instance multilingual evaluation dataset spanning 25 languages, built to test whether vision-language reward models generalize beyond English. The benchmark reveals substantial gaps in how LVLM judges perform across linguistic and cultural contexts, with direct implications for alignment evaluation practice.

Modelwire context

Explainer

The deeper issue isn't just benchmark coverage: if reward models used in RLHF pipelines systematically underperform on non-English inputs, then alignment training itself is skewed toward English-language behavior, meaning safety and quality properties may not transfer to the languages most users actually speak.
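
To make that concern concrete, here is a minimal sketch of how per-language agreement between an automated judge and human preference labels could be measured. The record format, field names, and the `per_language_agreement` helper are illustrative assumptions, not anything taken from the paper.

```python
# Minimal sketch (assumptions, not the paper's method): per-language agreement
# between a judge's pairwise preferences and human preference labels.
from collections import defaultdict

def per_language_agreement(records):
    """records: dicts with 'language', 'judge_pref', 'human_pref',
    where each preference is 'A' or 'B' for a pairwise comparison."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["language"]] += 1
        hits[r["language"]] += int(r["judge_pref"] == r["human_pref"])
    return {lang: hits[lang] / totals[lang] for lang in totals}

# Toy data: a judge that looks reliable in English can still lag elsewhere.
records = [
    {"language": "en", "judge_pref": "A", "human_pref": "A"},
    {"language": "en", "judge_pref": "B", "human_pref": "B"},
    {"language": "sw", "judge_pref": "A", "human_pref": "B"},
    {"language": "sw", "judge_pref": "B", "human_pref": "B"},
]
print(per_language_agreement(records))  # {'en': 1.0, 'sw': 0.5}
```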

This fits directly into a pattern Modelwire has been tracking since mid-April around the fragility of automated evaluation. The piece 'Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations' (arXiv, April 16) showed that even in English, LLM judges exhibit logical inconsistencies in one-third to two-thirds of pairwise comparisons. MM-JudgeBench extends that reliability problem across 25 languages, which compounds the concern considerably. Separately, 'Context Over Content: Exposing Evaluation Faking in Automated Judges' (April 16) found judges respond to contextual framing rather than actual response quality. Taken together, these three papers describe a judge layer that is unreliable in English, gameable by context, and now demonstrably inconsistent across languages. That is a meaningful accumulation of evidence against treating automated judges as ground truth.
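
For readers unfamiliar with the transitivity framing, the sketch below shows one way a violation rate can be estimated from pairwise judgments: a judge that prefers A over B and B over C should not also prefer C over A. The `judge_prefers` callable is a hypothetical stand-in for an LVLM judge call, not the April 16 paper's actual protocol.

```python
# Minimal sketch (an assumption, not the cited paper's protocol): estimating
# how often a pairwise judge produces a preference cycle among response triples.
from itertools import combinations

def transitivity_violation_rate(responses, judge_prefers):
    """judge_prefers(a, b) -> True if the judge prefers a over b."""
    violations, total = 0, 0
    for a, b, c in combinations(responses, 3):
        total += 1
        ab = judge_prefers(a, b)
        bc = judge_prefers(b, c)
        ca = judge_prefers(c, a)
        # A consistent ranking never yields a cycle a>b>c>a (or its reverse),
        # which is exactly the case where all three comparisons agree.
        if ab == bc == ca:
            violations += 1
    return violations / total if total else 0.0

# Toy usage: a hand-built preference table containing one cycle.
prefs = {("r1", "r2"): True, ("r2", "r3"): True, ("r3", "r1"): True}
def toy_judge(a, b):
    return prefs[(a, b)] if (a, b) in prefs else not prefs[(b, a)]

print(transitivity_violation_rate(["r1", "r2", "r3"], toy_judge))  # 1.0
```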

Watch whether major RLHF-dependent labs (Anthropic, Google DeepMind, or OpenAI) cite MM-JudgeBench in forthcoming model cards or alignment reports within the next two quarters. Adoption there would signal the benchmark has moved from academic critique to operational constraint.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: MM-JudgeBench · VL-RewardBench · OpenCQA · LVLMs
