Modelwire

Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks


Researchers have constructed a comprehensive threat taxonomy covering 507 attack vectors against LLMs, then audited six major safety benchmarks against it. The finding is stark: leading frameworks like HarmBench, InjecAgent, and AgentDojo collectively cover only 25% of the identified threat surface, with critical categories like service disruption and model internals entirely absent from standardized evaluation. This work exposes a structural gap in how the field validates LLM robustness, suggesting current benchmarks create a false sense of coverage while leaving significant attack surfaces unexamined. For safety teams and benchmark designers, the implication is clear: existing evaluations are incomplete proxies for real-world resilience.
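The headline number comes from a set-union computation: each benchmark covers some subset of the taxonomy's 507 vectors, and collective coverage is the fraction of the taxonomy covered by the union of those subsets. A minimal sketch of that arithmetic (the per-benchmark ID sets below are hypothetical placeholders, not the paper's actual mapping):

```python
# Sketch of a coverage audit: map each benchmark to the set of taxonomy
# attack-vector IDs it exercises, then measure what fraction of the full
# taxonomy the union covers. IDs here are illustrative, not the paper's.

TAXONOMY_SIZE = 507  # attack vectors in the paper's taxonomy

benchmark_coverage = {
    "HarmBench":  set(range(0, 60)),     # hypothetical mapping
    "InjecAgent": set(range(50, 110)),   # hypothetical mapping
    "AgentDojo":  set(range(100, 127)),  # hypothetical mapping
}

# Union, not sum: vectors covered by multiple benchmarks count once.
covered = set().union(*benchmark_coverage.values())
coverage = len(covered) / TAXONOMY_SIZE
print(f"Collective coverage: {coverage:.0%}")  # → 25%
```

The union matters because benchmarks overlap heavily on well-known jailbreak and injection categories; summing per-benchmark counts would overstate coverage.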

Modelwire context

Explainer

The paper's most pointed finding is not just that coverage is incomplete, but that the gaps cluster around categories like service disruption and model internals, precisely the attack surfaces that are hardest to probe through behavioral testing alone. That specificity matters more than the headline percentage.

This connects directly to two threads already running on Modelwire. The 'Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands' piece from the same day formalized what it calls the audit gap, arguing that behavioral evaluation cannot inspect latent representations or long-horizon planning. This taxonomy paper is essentially empirical evidence for that same structural failure, showing the gap is not hypothetical but measurable across six real benchmarks. Meanwhile, 'MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs' demonstrated an attack class that targets architectural components rather than inputs, exactly the kind of vector that falls outside current benchmark coverage by design. Together, these three papers describe a coherent problem: the field is testing for threats it already knows how to describe, while novel attack surfaces accumulate outside the frame.

Watch whether HarmBench or AgentDojo publish a formal response or updated coverage roadmap within the next two quarters. If neither benchmark incorporates the taxonomy's uncovered categories into a versioned release, that confirms the audit gap is structural rather than a temporary lag.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: HarmBench · InjecAgent · AgentDojo · STRIDE


Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes; we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
