Modelwire

Research

Papers, novel techniques, evaluations, interpretability, alignment research.

Benchmarking Optimizers for MLPs in Tabular Deep Learning

Researchers benchmarked multiple optimizers on tabular datasets using MLP backbones, finding that Muon consistently outperforms the industry-standard AdamW optimizer. The study suggests practitioners should consider Muon as a practical alternative despite potential training efficiency trade-offs.

arXiv cs.LG · 52
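To make the idea concrete, here is a minimal NumPy sketch of the orthogonalized-momentum update at the heart of Muon. The function names are hypothetical, and the cubic Newton-Schulz iteration used here is the classic textbook variant; Muon's reference implementation uses a tuned quintic polynomial on GPU tensors, so treat this as an illustration of the mechanism, not the paper's exact method:

```python
import numpy as np

def orthogonalize(G, steps=30):
    """Approximately orthogonalize a matrix with the cubic Newton-Schulz
    iteration X <- 1.5*X - 0.5*X @ X.T @ X. Frobenius normalization keeps
    every singular value <= 1, inside the iteration's convergence region,
    and the iteration drives all singular values toward 1."""
    X = G / np.linalg.norm(G)  # Frobenius norm by default for matrices
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_style_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update for a 2-D weight matrix: accumulate momentum,
    then step along the orthogonalized momentum direction."""
    momentum = beta * momentum + grad
    W = W - lr * orthogonalize(momentum)
    return W, momentum
```

The contrast with AdamW is that the update direction is a (near-)orthogonal matrix rather than a per-coordinate rescaled gradient, which is why Muon applies only to 2-D weight matrices in practice.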

Stability and Generalization in Looped Transformers

Researchers introduce a fixed-point framework for analyzing looped transformers, which enable test-time compute scaling. The work proves that architectures without recall cannot achieve strong input-dependence, while recall plus outer normalization enables stable, reachable fixed points for meaningful predictions.

arXiv cs.LG · 52
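The fixed-point picture can be illustrated with a toy contraction map: a weight-tied block iterated to convergence, where re-injecting the input each step plays the role the paper assigns to recall. The sketch below is an illustrative stand-in for a looped block, not the paper's architecture; without the `U @ c` term, the iteration's fixed point would not depend on the input at all:

```python
import numpy as np

def loop_to_fixed_point(W, U, b, x0, c, tol=1e-10, max_iters=1000):
    """Iterate a weight-tied block x <- tanh(W x + U c + b) to convergence.
    Re-injecting the input c each iteration acts like 'recall': the fixed
    point then depends on c. With spectral norm ||W||_2 < 1 (tanh is
    1-Lipschitz) the map is a contraction, so the fixed point is unique
    and reachable from any initialization x0."""
    x = x0
    for t in range(max_iters):
        x_next = np.tanh(W @ x + U @ c + b)
        if np.linalg.norm(x_next - x) < tol:
            return x_next, t + 1
        x = x_next
    return x, max_iters
```

Stability here comes from the contraction property; the paper's "outer normalization" plays an analogous role of keeping the looped block's effective Lipschitz constant under control.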

Context Over Content: Exposing Evaluation Faking in Automated Judges

Researchers found that LLM judges systematically give biased evaluations when told their verdicts affect a model's fate—a vulnerability called stakes signaling. Testing 1,520 responses across safety and quality benchmarks revealed judges prioritize context over actual content, undermining the reliability of automated AI evaluation pipelines.

arXiv cs.CL · 68

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events

Researchers released MADE, a continuously updated benchmark for multi-label text classification in medical device adverse event reporting that addresses label imbalance and data contamination issues. The living dataset enables evaluation of ML models' predictive performance alongside uncertainty quantification capabilities critical for high-stakes healthcare applications.

arXiv cs.CL · 52

One-shot learning for the complex dynamical behaviors of weakly nonlinear forced oscillators

Researchers introduce MEv-SINDy, a one-shot learning method that infers governing equations of complex nonlinear systems from single excitation records using the Generalized Harmonic Balance method. The technique was validated on MEMS devices including a nonlinear beam resonator and micromirror, enabling prediction of frequency-response curves without extensive training data.

arXiv cs.LG · 42
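MEv-SINDy builds on the SINDy family of equation-discovery methods. A minimal sketch of the underlying sparse-regression step, sequentially thresholded least squares over a library of candidate terms, is shown below, assuming noiseless data and a hand-picked library. It illustrates the general SINDy mechanism, not the paper's harmonic-balance extension:

```python
import numpy as np

def sindy_fit(x, dx, library, threshold=0.1, iters=10):
    """Sequentially thresholded least squares, the core of SINDy:
    regress dx/dt onto candidate terms, repeatedly zeroing coefficients
    below `threshold` and refitting on the surviving terms."""
    Theta = np.column_stack([f(x) for f in library])  # library matrix
    xi = np.linalg.lstsq(Theta, dx, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():
            xi[big] = np.linalg.lstsq(Theta[:, big], dx, rcond=None)[0]
    return xi
```

With clean samples from, say, dx/dt = -2x + 0.5x^3 and a library of monomials, the fit recovers the sparse coefficient vector exactly; the one-shot aspect of the paper is that a single excitation record supplies the regression data.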

Fabricator or dynamic translator?

Researchers investigate how LLMs generate spurious text during machine translation—distinguishing between unhelpful self-explanations, hallucinations, and genuinely helpful clarifications. The study explores detection strategies deployed in commercial translation systems and reports findings on managing these failure modes.

arXiv cs.CL · 52

QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

Researchers introduced QuantCode-Bench, a 400-task benchmark for evaluating LLMs on generating executable algorithmic trading strategies for the Backtrader framework. The benchmark tests whether models can combine financial domain knowledge, API mastery, and correct syntax to produce strategies that execute on historical data.

arXiv cs.CL · 52
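As a framework-free illustration of the kind of strategy logic the benchmark asks models to produce, here is a dependency-free moving-average crossover sketch in plain Python. It deliberately avoids Backtrader's actual API (a real benchmark submission would subclass `bt.Strategy` and use its indicator objects), and the helper names are hypothetical:

```python
def sma(prices, window):
    """Simple moving average: entry i averages the `window` prices ending
    at i, or None before enough history exists."""
    return [
        sum(prices[i - window + 1 : i + 1]) / window if i >= window - 1 else None
        for i in range(len(prices))
    ]

def crossover_signals(prices, short=2, long=4):
    """Emit ('buy', i) when the short SMA crosses above the long SMA and
    ('sell', i) when it crosses below. A sign change relative to the
    previous bar is required, so the first comparable bar emits nothing."""
    s, l = sma(prices, short), sma(prices, long)
    signals, prev = [], None
    for i in range(len(prices)):
        if s[i] is None or l[i] is None:
            continue
        state = "above" if s[i] > l[i] else "below"
        if prev is not None and state != prev:
            signals.append(("buy" if state == "above" else "sell", i))
        prev = state
    return signals
```

The benchmark's harder requirements, correct Backtrader syntax, API usage, and execution on historical data, sit on top of strategy logic of this shape.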