Research Opinion & Analysis · Simon Willison · 21h ago

What happened after 2,000 people tried to hack my AI assistant

Fernando Irarrázaval's public red-teaming experiment tested the gap between prompt-injection resilience claims and real-world robustness. Over 6,000 adversarial attempts against an Opus 4.6 instance with explicit anti-injection rules failed to extract secrets, suggesting either that modern LLM safeguards are holding under sustained attack or that the test's constraints were too narrow to surface vulnerabilities. The finding matters because it challenges both the doomsday narrative around prompt injection and the assumption that simple rule-based defenses suffice, forcing the field to recalibrate expectations around LLM security posture at scale.

Modelwire context

Skeptical read

The experiment's headline number, 6,000 adversarial attempts with zero secrets extracted, obscures a more important question: which attack categories were actually attempted? In particular, were the most sophisticated multi-turn and context-manipulation techniques represented among the 2,000 participants, most of whom were likely casual rather than expert adversaries?

This is largely disconnected from recent activity in our archive, as we have no prior coverage of prompt injection research, red-teaming methodology, or Anthropic's Opus model line to anchor it against. It belongs to a broader conversation in the security research community about whether public red-teaming exercises produce generalizable findings or simply measure the resilience of a specific configuration against a self-selected, non-expert crowd. That distinction matters enormously when interpreting the result.

Watch whether Irarrázaval or an independent party publishes a breakdown of attack taxonomy from the attempt logs. If the dataset shows fewer than 5% of attempts used multi-turn or indirect injection strategies, the robustness claim weakens considerably and the experiment tells us more about attacker skill distribution than about model defenses.
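The 5% check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the category labels (`direct`, `multi_turn`, `indirect_injection`) and the function names are hypothetical, no attack taxonomy has been published in the source, and the toy list below is made up rather than drawn from the attempt logs.

```python
from collections import Counter

# Hypothetical labels; the source does not publish an attack taxonomy.
SOPHISTICATED = {"multi_turn", "indirect_injection"}
THRESHOLD = 0.05  # the 5% cutoff discussed in the text


def sophisticated_share(categories):
    """Return the fraction of attempts labeled multi-turn or indirect."""
    if not categories:
        raise ValueError("no attempts to analyze")
    counts = Counter(categories)
    hits = sum(counts[label] for label in SOPHISTICATED)
    return hits / len(categories)


def robustness_claim_weakened(categories, threshold=THRESHOLD):
    """True when sophisticated attempts fall below the threshold share."""
    return sophisticated_share(categories) < threshold


# Made-up toy labels for illustration only, not data from the experiment.
toy = ["direct"] * 97 + ["multi_turn"] * 2 + ["indirect_injection"]
print(sophisticated_share(toy))        # 0.03
print(robustness_claim_weakened(toy))  # True
```

With real logs, most of the work would lie in labeling each attempt consistently; the threshold comparison itself is trivial.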

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Fernando Irarrázaval · OpenClaw · Anthropic Claude Opus 4.6 · hackmyclaw.com

Read full story at Simon Willison → (simonwillison.net)
