Research Tools & Code·arXiv cs.CL·Apr 16

XQ-MEval: A Dataset with Cross-lingual Parallel Quality for Benchmarking Translation Metrics

Researchers introduce XQ-MEval, a benchmark dataset spanning nine language pairs to expose cross-lingual scoring bias in machine translation metrics. The dataset uses semi-automatic error injection and native speaker validation to ensure parallel-quality translations, addressing a gap in systematic evaluation of multilingual systems.

Modelwire context

Explainer

The core problem XQ-MEval targets is subtle but consequential: existing translation metrics may score the same quality of error differently depending on which language pair is being evaluated, meaning a metric that looks reliable on English-German may quietly underperform on lower-resource pairs without anyone noticing.

This lands squarely in a cluster of benchmark-reliability concerns that dominated coverage on April 16. The 'Context Over Content' paper on LLM judges found that automated evaluators respond to contextual framing rather than actual output quality, and the 'Diagnosing LLM Judge Reliability' piece showed that aggregate consistency scores can mask per-instance logical failures in roughly one-third to two-thirds of documents. XQ-MEval raises an analogous concern one layer down: if the metrics used to train and select translation systems are themselves biased by language pair, then every downstream evaluation built on those metrics inherits that distortion. The 'Fabricator or dynamic translator' paper on MT hallucinations is also adjacent, since detecting spurious output depends on having reliable quality signals in the first place.

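Watch whether MQM-based metrics show measurable score variance across the nine language pairs when applied to XQ-MEval's injected-error set. Because the injected errors are meant to hold translation quality constant across languages, a reliable metric should assign similar scores to every pair at a given error level. If the variance is large on low-resource pairs but small on high-resource ones, that would point to a systematic bias rather than noise.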
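As a rough illustration of the variance check described above, here is a minimal sketch. It assumes metric scores have already been collected per language pair and per number of injected errors. The language pairs, the score values, and the `cross_lingual_spread` helper are hypothetical and are not taken from the paper or from any XQ-MEval tooling.

```python
from statistics import mean, pstdev

# Hypothetical metric scores, keyed by language pair and then by the number
# of injected errors. All values are invented for illustration only.
scores = {
    "en-de": {0: [0.90, 0.92], 1: [0.80, 0.82], 2: [0.70, 0.72]},
    "en-zh": {0: [0.88, 0.90], 1: [0.78, 0.80], 2: [0.69, 0.71]},
    "en-sw": {0: [0.75, 0.77], 1: [0.60, 0.62], 2: [0.45, 0.47]},
}


def cross_lingual_spread(scores):
    """For each error level, return (range, std) of the per-pair mean scores."""
    # Only compare error levels that every language pair actually has.
    levels = sorted(set.intersection(*(set(v) for v in scores.values())))
    spread = {}
    for level in levels:
        pair_means = [mean(by_level[level]) for by_level in scores.values()]
        spread[level] = (max(pair_means) - min(pair_means), pstdev(pair_means))
    return spread


for level, (score_range, std) in cross_lingual_spread(scores).items():
    print(f"errors={level}: range={score_range:.3f}, std={std:.3f}")
```

With parallel-quality data, a spread near zero at every error level would suggest the metric treats language pairs consistently. A spread that is large, or that grows as errors are added, would be the kind of cross-lingual bias the benchmark is designed to surface.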

Coverage we drew on

Context Over Content: Exposing Evaluation Faking in Automated Judges · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
