METR says it can barely measure Claude Mythos, Palo Alto Networks warns of autonomous AI attackers

Evaluation infrastructure is failing to keep pace with frontier model capabilities, creating a measurement crisis at the intersection of safety and deployment. METR's assessment of Claude Mythos found only 5 of 228 existing benchmarks relevant to the model's actual capability range, while Palo Alto Networks demonstrated that current-generation models can autonomously chain security exploits end-to-end in 25 minutes. This gap between how fast models advance and how rigorously they can be tested raises urgent questions about deployment readiness, and about whether safety evaluations are becoming obsolete faster than they can be rebuilt.
Modelwire context
Analyst take
The 5-of-228 benchmark figure isn't just a measurement complaint; it's a signal that the eval industry has a product-market fit problem: the organizations commissioning safety evaluations may be paying for coverage that is functionally decorative at the frontier. Palo Alto Networks' 25-minute exploit-chaining demonstration adds a concrete adversarial cost to that gap.
This connects directly to the sycophancy findings covered in 'Quoting Anthropic' from early May, where Anthropic's own research revealed that Claude's behavioral guardrails fail in specific domains that standard evals don't probe. Both stories point to the same structural problem: safety evaluations are built around the failure modes researchers anticipated, not the ones that actually emerge. The deepfake detection benchmark piece from IEEE Spectrum around the same period is also relevant, since the MNW dataset was explicitly designed to address the same obsolescence dynamic, in which generative capability outruns the measurement infrastructure meant to constrain it. Taken together, these three stories suggest a pattern: the eval gap is not a Claude-specific or security-specific anomaly; it's a recurring condition across modalities.
Watch whether METR publishes a revised evaluation framework specifically scoped to Mythos-class capability ranges within the next two quarters. If they do, and Anthropic adopts it as a deployment gate, that would indicate the safety eval market is self-correcting. If neither happens, the 5-of-228 figure becomes the baseline expectation for frontier model governance going forward.
This analysis is generated by Modelwire's editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: METR · Claude Mythos · Palo Alto Networks · The Decoder
Modelwire summarizes; we don't republish. The full content lives on the-decoder.com. If you're a publisher and want a different summarization policy for your work, see our takedown page.