NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

NatureBench exposes a critical gap in AI agent capabilities on real scientific problems. Researchers constructed a 90-task benchmark from peer-reviewed Nature papers, using containerized environments to eliminate the reproducibility fragmentation that has plagued prior agent evaluations. Across the ten frontier configurations tested, the strongest agent cleared only 17.8% of tasks, suggesting that despite recent hype around coding agents, they remain far from autonomous discovery on complex, multi-disciplinary research. The finding matters because it establishes a rigorous, standardized measure of agent progress on tasks that go beyond toy benchmarks, forcing the field to confront whether current scaling approaches can bridge the gap between code generation and scientific reasoning.
Modelwire context
Explainer
The 17.8% figure isn't just a low score. It is the best result among ten frontier configurations tested under identical, containerized conditions, so the failure is hard to attribute to evaluation noise or setup variance. The benchmark's value is in what it rules out: the usual excuses.
NatureBench arrives on the same day as several other benchmark papers in our coverage, and the pattern is worth naming. CN-NewsTTS Bench, AdversaBench, and MEMPROBE all share the same underlying motivation: existing evaluations are too fragile or too narrow to surface real failure modes. NatureBench extends that logic to scientific reasoning specifically, where the stakes of a false positive are highest. The Qwen-AgentWorld paper from the same day is the most direct counterpoint: Alibaba is scaling world models toward autonomous agent planning, yet NatureBench's results suggest the reasoning substrate those agents depend on still breaks down on peer-reviewed multi-step science tasks.
Watch whether any of the ten tested agent configurations, or a successor using Qwen-AgentWorld-style world modeling, crosses 30% on NatureBench within the next two release cycles. That threshold would indicate genuine progress rather than benchmark-specific tuning.
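To put the percentages in task terms, here is a minimal sketch. It assumes the reported rate is a simple fraction of the 90 tasks (cleared tasks divided by total, not an average over repeated runs), which the summary does not confirm. The helper names `tasks_for_rate` and `min_tasks_to_reach` are hypothetical and exist only for this illustration.

```python
import math

TOTAL_TASKS = 90  # benchmark size stated in the summary


def tasks_for_rate(rate_percent: float, total: int = TOTAL_TASKS) -> list[int]:
    """Return every task count whose pass rate rounds to rate_percent (one decimal).

    Assumption: pass rate = cleared tasks / total tasks, reported to one decimal.
    """
    return [k for k in range(total + 1) if round(100 * k / total, 1) == rate_percent]


def min_tasks_to_reach(threshold_percent: float, total: int = TOTAL_TASKS) -> int:
    """Smallest number of cleared tasks that meets or exceeds threshold_percent."""
    return math.ceil(threshold_percent * total / 100)


print(tasks_for_rate(17.8))    # → [16]
print(min_tasks_to_reach(30))  # → 27
```

Under that assumption, 17.8% corresponds to 16 of 90 tasks, and the 30% threshold would require 27, or 11 more tasks than the current best configuration clears.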
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: NatureBench · NatureGym · Nature
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.