What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema

Reproducibility in agent benchmarking has become a silent crisis. This audit of twelve canonical papers reveals systematic gaps in how evaluation methodology gets reported: benchmark versions, inference hyperparameters, cost breakdowns, and failure modes often remain undocumented or locked behind unavailable artifacts. The authors propose a five-field scoring schema to standardize disclosure. For practitioners, this matters because identical model names produce incomparable results across papers, making it impossible to track real progress or debug performance claims. The work surfaces a structural problem in how the field validates agent capabilities, directly affecting how researchers and engineers interpret benchmark leaderboards.
Modelwire context
The audit's sharpest finding isn't that papers are sloppy but that the problem is structural: benchmark versioning and inference hyperparameters are so inconsistently reported that two papers citing the same model on the same benchmark may be measuring entirely different things, with no way to reconcile them after the fact.
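To make the comparability problem concrete, here is a minimal Python sketch of a disclosure record and a check on whether two reported results can be compared at all. It is an illustration, not the paper's schema: the summary above does not enumerate the five fields, so the field names below are assumptions taken from the gaps the audit lists (benchmark version, inference hyperparameters, cost breakdown, failure modes, artifact availability). The `Disclosure` class and the `disclosure_score` and `comparable` helpers are likewise hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Disclosure:
    """What one paper reports about one evaluation run.

    Field names are assumptions for illustration, not the audited schema.
    None means the paper does not disclose that item.
    """
    benchmark_version: Optional[str] = None
    inference_hyperparameters: Optional[str] = None
    cost_breakdown: Optional[str] = None
    failure_modes: Optional[str] = None
    artifacts_available: Optional[str] = None


def disclosure_score(d: Disclosure) -> int:
    """Count how many of the five items the paper discloses."""
    return sum(getattr(d, f.name) is not None for f in fields(d))


def comparable(a: Disclosure, b: Disclosure) -> bool:
    """Treat two results as comparable only when both papers disclose the
    benchmark version and inference hyperparameters, and the values match."""
    for name in ("benchmark_version", "inference_hyperparameters"):
        x, y = getattr(a, name), getattr(b, name)
        if x is None or y is None or x != y:
            return False
    return True


# Hypothetical example: same benchmark version, but paper_b omits hyperparameters.
paper_a = Disclosure(benchmark_version="v1.0",
                     inference_hyperparameters="temperature=0.0")
paper_b = Disclosure(benchmark_version="v1.0")

print(disclosure_score(paper_a), disclosure_score(paper_b))
print(comparable(paper_a, paper_b))
```

In this sketch an undisclosed field makes two results non-comparable by default, which mirrors the audit's point that missing details cannot be reconciled after the fact.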
This connects directly to coverage from the same week on 'Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling,' which proposes a new agentic architecture and implicitly relies on benchmark comparisons to establish its value. If the evaluation infrastructure those comparisons rest on is as inconsistent as this audit suggests, claims about latency and reliability improvements become harder to verify independently. More broadly, the reproducibility problem cuts across nearly every agent paper in the current publication cycle, including the roto 2.0 tactile benchmark work, which at least partially addresses the disclosure gap by open-sourcing environments and tuned baselines. That open-sourcing approach is essentially what the five-field schema is asking all benchmark papers to do.
Watch whether any of the twelve audited papers issue updated disclosures or supplementary materials within the next three months. Voluntary adoption would signal the schema has traction; silence would suggest the field needs a venue-level mandate before anything changes.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: LLM agents · benchmark evaluation · reproducibility schema
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.