How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

A new study exposes a critical blind spot in jailbreak research: the automated judges that measure attack success rates (ASR) are themselves unreliable and adversarially vulnerable. Checking judges against 596 human-labeled examples, the researchers found that dedicated safety classifiers over-flag content, while LLM-as-judge setups show recall ranging from 0.06 to 0.65, so identical responses score differently depending on which judge evaluates them. This undermines the credibility of published attack success rates across the field and signals that the benchmarking infrastructure for LLM safety is weaker than assumed.
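To make the two failure modes concrete, here is a minimal sketch of how a judge can be scored against human labels. The function name and the toy labels are hypothetical illustrations, not data or code from the study; an over-flagging judge shows up as low precision, while a judge that misses successful jailbreaks shows up as low recall.

```python
from typing import Sequence


def judge_precision_recall(
    human: Sequence[bool], judge: Sequence[bool]
) -> tuple[float, float]:
    """Score a judge's 'jailbreak succeeded' verdicts against human labels."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")
    tp = sum(h and j for h, j in zip(human, judge))        # both say harmful
    fp = sum((not h) and j for h, j in zip(human, judge))  # judge over-flags
    fn = sum(h and (not j) for h, j in zip(human, judge))  # judge misses it
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, invented for illustration only (True = successful jailbreak).
human_labels = [True, True, True, True, False, False, False, False]
over_flagging_judge = [True, True, True, True, True, True, False, False]
low_recall_judge = [True, False, False, False, False, False, False, False]

print(judge_precision_recall(human_labels, over_flagging_judge))
print(judge_precision_recall(human_labels, low_recall_judge))
```

On these toy labels the first judge catches every true jailbreak but also flags two benign responses, and the second flags nothing benign but catches only one of the four true jailbreaks. Reporting both numbers per judge is one way to make such differences visible.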

Modelwire context

Explainer

The deeper problem here is circularity: the field has been using attack success rates to rank defenses and compare models, but if the judges producing those rates differ by roughly a factor of ten on recall (0.06 versus 0.65), the leaderboard logic built on them collapses. The paper doesn't just flag noise; it suggests that published safety comparisons may be ordering models incorrectly.

This connects directly to a pattern Modelwire has been tracking around measurement infrastructure failing to keep pace with capability claims. The 'Fault of Our Stars' piece from the same day made a structurally identical argument in a different domain: numeric scores used as ground truth turn out to be noisy labels, and models trained or evaluated against them inherit that noise silently. The same logic applies here. Where that paper flagged weak labels in sentiment datasets, this paper flags weak labels in safety evaluation, and both point to a shared blind spot: practitioners trusting proxy metrics without validating the proxies themselves.

Watch whether HarmBench or any major safety leaderboard issues a methodology update that standardizes judge selection within the next six months. If they don't, published attack-success rates will continue to be compared across papers that used incompatible judges, making cross-study conclusions unreliable.
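A back-of-the-envelope sketch shows why comparing across incompatible judges is risky. It assumes a simple error model that is not taken from the paper: the judge catches a fixed fraction (recall) of truly successful attacks and wrongly flags a fixed fraction (false positive rate) of failed ones. The two recall values are the endpoints reported in the summary; the "true" ASR values and the zero false positive rate are hypothetical.

```python
def measured_asr(
    true_asr: float, recall: float, false_positive_rate: float = 0.0
) -> float:
    """ASR a judge would report under the simple error model described above."""
    return recall * true_asr + false_positive_rate * (1.0 - true_asr)


# Hypothetical: model A is truly five times more vulnerable than model B,
# but the two papers scored them with judges at opposite ends of the
# reported recall range.
model_a_reported = measured_asr(true_asr=0.50, recall=0.06)
model_b_reported = measured_asr(true_asr=0.10, recall=0.65)

print(f"Model A reported ASR: {model_a_reported:.3f}")
print(f"Model B reported ASR: {model_b_reported:.3f}")
print("Ranking inverted:", model_a_reported < model_b_reported)
```

Under these assumptions the more vulnerable model reports the lower ASR, so a reader comparing the two papers would rank the models backwards. With a single shared judge whose recall exceeds its false positive rate, this model preserves the ordering, which is the case for standardizing judge selection.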

Coverage we drew on

Fault of Our Stars: Behavioral Drivers of Rating-Sentiment Incongruence · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: HarmBench · LLM-as-judge

Read full story at arXiv cs.CL (arxiv.org)
