New benchmark shows Claude Mythos and GPT-5.5 can develop real browser exploits autonomously

Carnegie Mellon researchers have developed a benchmark that measures autonomous AI agent capability in discovering and exploiting real V8 engine vulnerabilities. Claude Mythos substantially outperforms GPT-5.5 on this security-focused task, though at significantly higher computational cost. This benchmark signals a critical inflection point: as frontier models gain autonomous reasoning depth, the ability to discover zero-day exploits moves from theoretical concern to measurable capability. The cost-performance tradeoff raises questions about whether capability leadership translates to practical deployment advantage when inference expenses dominate operational budgets.
Modelwire context
Analyst takeThe more consequential detail buried in the cost-performance framing is that inference expense may already be acting as a de facto capability gate: Claude Mythos can do this, but at a price point that limits who can operationalize it, which means the threat model and the accessible threat model are currently different things.
This is largely disconnected from recent activity in our archive, as we have no prior coverage of autonomous security research benchmarks, V8 exploitation, or Carnegie Mellon's work in this area. The story belongs to a cluster of research tracking agentic AI capability in adversarial domains, a space that has been developing mostly outside the product launch cycle that dominates AI coverage. The cost-versus-capability tension here does echo broader infrastructure debates about frontier model deployment, but we cannot draw a direct line to anything we have previously reported.
Watch whether Anthropic or OpenAI respond with explicit usage policy updates targeting security research applications within the next 90 days. If neither acts, that signals the companies are betting on cost as the practical barrier rather than policy controls.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. How we write it.
MentionsClaude Mythos · GPT-5.5 · Carnegie Mellon University · Google V8 · The Decoder
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes, we don’t republish. The full content lives on the-decoder.com. If you’re a publisher and want a different summarization policy for your work, see our takedown page.