An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run

Epoch AI's MirrorCode benchmark reveals a critical frontier in AI capability: reverse-engineering complete software systems from behavior alone. Claude Opus 4.7 achieved a 56 percent solve rate, reconstructing a 16,000-line toolkit in 14 hours, but the benchmark exposes a hard ceiling on complex tasks where all tested models fail. The $2,600 cost and 19-day runtime on a single task signal both the computational intensity of this capability class and the gap between narrow wins and production-grade code synthesis. This matters for security teams, software auditors, and anyone tracking whether LLMs can move beyond pattern completion into genuine program reconstruction.
Modelwire context
Explainer
The more telling number isn't the 56 percent solve rate on simpler tasks but the hard zero: no model tested could complete the benchmark's most complex problems at any price, which means the $2,600 figure represents a ceiling being probed, not a floor being optimized.
This is largely disconnected from recent activity in our archive, as we have no prior coverage of MirrorCode, Epoch AI's benchmarking work, or the broader software-synthesis capability class. That absence is itself worth noting: most AI benchmark coverage gravitates toward reasoning and language tasks, and infrastructure-level evaluations like code reconstruction tend to surface later in the coverage cycle, often only after a security or audit use case forces the issue.
Watch whether Epoch AI releases a versioned leaderboard that tracks model-over-model progress on the failing hard tasks specifically. If a future model cracks even one of those zero-score problems at comparable cost, that is the signal that the ceiling is moving rather than fixed.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Epoch AI · MirrorCode · Claude Opus 4.7 · The Decoder
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on the-decoder.com.