MathDuels: Evaluating LLMs as Problem Posers and Solvers

Researchers introduce MathDuels, a self-play benchmark where LLMs both author and solve math problems to differentiate model capabilities beyond static benchmarks. The framework uses adversarial prompting and a Rasch model to jointly measure solver ability and problem difficulty, addressing ceiling effects in existing evaluations.
Modelwire context
The deeper methodological bet in MathDuels is that problem difficulty shouldn't be a fixed label assigned by humans before testing, but something estimated jointly from how models perform against each other. That's a meaningful departure from how most math benchmarks are constructed, and it's what the Rasch model is doing: treating difficulty as a latent variable inferred from outcomes, not a precondition.
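To make the joint-estimation idea concrete, here is a minimal sketch of a Rasch fit. In the standard Rasch model, the probability that a solver with ability theta solves a problem with difficulty b is 1 / (1 + exp(-(theta - b))). The fitting routine below (plain gradient ascent with a small L2 penalty), the function names, and the toy outcome table are all assumptions made for illustration. They are not taken from the MathDuels paper, whose actual estimation procedure may differ.

```python
import math


def solve_probability(theta, b):
    """Rasch model: chance that ability `theta` beats difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def fit_rasch(outcomes, n_solvers, n_problems, lr=0.1, steps=2000, l2=0.1):
    """Jointly estimate abilities and difficulties from solve/fail outcomes.

    outcomes: list of (solver_index, problem_index, solved) with solved in {0, 1}.
    The L2 penalty is an illustrative choice: it keeps estimates finite when a
    solver gets everything right and fixes the otherwise arbitrary origin.
    """
    theta = [0.0] * n_solvers   # latent solver abilities
    b = [0.0] * n_problems      # latent problem difficulties
    for _ in range(steps):
        grad_theta = [0.0] * n_solvers
        grad_b = [0.0] * n_problems
        for s, p, solved in outcomes:
            residual = solved - solve_probability(theta[s], b[p])
            grad_theta[s] += residual   # unexpected success raises ability
            grad_b[p] -= residual       # unexpected success lowers difficulty
        for s in range(n_solvers):
            theta[s] += lr * (grad_theta[s] - l2 * theta[s])
        for p in range(n_problems):
            b[p] += lr * (grad_b[p] - l2 * b[p])
    return theta, b


# Hypothetical outcome table: 3 solvers x 3 problems (made-up data).
# Solver 0 solves everything, solver 1 misses problem 2, solver 2 solves only problem 0.
toy_outcomes = [
    (0, 0, 1), (0, 1, 1), (0, 2, 1),
    (1, 0, 1), (1, 1, 1), (1, 2, 0),
    (2, 0, 1), (2, 1, 0), (2, 2, 0),
]
theta, b = fit_rasch(toy_outcomes, n_solvers=3, n_problems=3)
print("abilities:", [round(t, 2) for t in theta])
print("difficulties:", [round(d, 2) for d in b])
```

No difficulty label is supplied anywhere in the input: the ordering of the problems falls out of who solved what, which is the sense in which difficulty is treated as a latent variable and not a precondition.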
The ceiling-effect problem MathDuels addresses is part of a broader pattern in LLM evaluation that Modelwire has been tracking obliquely. The ASR evaluation piece from arXiv cs.CL on April 23rd makes a structurally similar argument: that traditional metrics (Word Error Rate there, static problem sets here) systematically underreport what capable models can actually do, and that generative models may be better evaluators than the fixed rubrics built before them. Neither paper cites the other, but together they suggest a quiet consensus forming around the idea that evaluation itself needs to become dynamic. The math benchmark space specifically is largely disconnected from recent coverage on the site, which has focused on tooling, hardware, and developer productivity.
Watch whether frontier labs (OpenAI, Anthropic, Google DeepMind) adopt MathDuels as a third-party eval in their next model release documentation. Adoption by even one within six months would signal the field is taking adversarial self-play benchmarking seriously as a replacement for saturated static sets.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: MathDuels · Rasch model
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.