Modelwire

arXiv cs.CL

https://arxiv.org/list/cs.CL/recent · Editorial weight 5/10

Context Over Content: Exposing Evaluation Faking in Automated Judges

Researchers found that LLM judges give systematically biased evaluations when told that their verdicts affect a model's fate, a vulnerability called stakes signaling. Tests on 1,520 responses across safety and quality benchmarks showed that judges prioritize context over actual content, which undermines the reliability of automated AI evaluation pipelines.

arXiv cs.CL · 68

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

Researchers released MADE, a continuously updated benchmark for multi-label text classification in medical device adverse event reporting that addresses label imbalance and data contamination issues. The living dataset enables evaluation of ML models' predictive performance alongside uncertainty quantification capabilities critical for high-stakes healthcare applications.

arXiv cs.CL · 52

Fabricator or dynamic translator?

Researchers investigate how LLMs generate spurious text during machine translation, distinguishing among unhelpful self-explanations, hallucinations, and genuinely helpful clarifications. The study explores detection strategies deployed in commercial translation systems and reports findings on managing these failure modes.

arXiv cs.CL · 52

QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

Researchers introduced QuantCode-Bench, a 400-task benchmark for evaluating LLMs on generating executable algorithmic trading strategies for the Backtrader framework. The benchmark tests whether models can combine financial domain knowledge, API mastery, and correct syntax to produce strategies that execute on historical data.

arXiv cs.CL · 52