Modelwire

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing


Researchers introduce HarDBench, a benchmark that exposes how LLMs can be jailbroken through draft-based co-authoring attacks, in which a malicious user seeds an incomplete document with harmful content so that the model's continuation is unsafe. The work systematically evaluates model robustness across high-risk domains including explosives, drugs, weapons, and cyberattacks.

Modelwire context

Explainer

The key distinction HarDBench draws is that co-authoring attacks exploit the model's role as a completer rather than a responder: the harmful payload is seeded in the user's draft, not in an explicit instruction, which means standard prompt-level safety filters may not catch it at all.
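To make the distinction concrete, here is a minimal sketch of why a check that inspects only the explicit instruction can miss content seeded in the draft. It is not HarDBench's evaluation method or any vendor's filter: the `CoAuthoringRequest` type, the two filter functions, and the keyword list are hypothetical, and simple keyword matching stands in for a real safety classifier. The draft contains only a harmless placeholder.

```python
from dataclasses import dataclass

# Assumption: a toy keyword list stands in for a real safety classifier.
BLOCKED_TERMS = {"blocked-topic"}


@dataclass
class CoAuthoringRequest:
    """Hypothetical shape of a draft-completion request."""
    instruction: str  # what the user explicitly asks the model to do
    draft: str        # the partially written document to be continued


def flags(text: str) -> bool:
    """Return True if the text contains any blocked term (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def instruction_only_filter(req: CoAuthoringRequest) -> bool:
    """Prompt-level check: looks at the explicit instruction and nothing else."""
    return flags(req.instruction)


def full_context_filter(req: CoAuthoringRequest) -> bool:
    """Checks both the instruction and the draft the model would continue."""
    return flags(req.instruction) or flags(req.draft)


# The instruction is benign; the sensitive material sits in the draft.
request = CoAuthoringRequest(
    instruction="Please continue my draft in the same style.",
    draft="Section 2: [placeholder text about a BLOCKED-TOPIC] ...",
)

print("instruction-only filter flags request:", instruction_only_filter(request))
print("full-context filter flags request:", full_context_filter(request))
```

In this toy setup the instruction-only check passes the request while the full-context check flags it, which mirrors the gap described above. A real deployment would need far more than keyword matching on the draft.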

Modelwire has been tracking a wave of domain-specific benchmarks this month, including IndiaFinBench and QuantCode-Bench, but those measure capability. HarDBench sits in a different category: it measures where safety breaks down under realistic usage conditions. The WIRED piece from April 17 on AI-assisted drafting in newsrooms is the more relevant adjacent story here, because it describes exactly the workflow HarDBench stress-tests: a human seeds a document and an LLM completes it. That piece framed the concern as editorial quality and labor; this paper surfaces a harder problem, which is that the same workflow can be weaponized. Together they suggest that co-authoring tools are being deployed before the safety research has caught up.

Watch whether major co-authoring tool vendors (Microsoft Copilot, Google Docs AI, Notion AI) acknowledge HarDBench or publish any red-teaming results against draft-completion attack vectors within the next six months. Silence from that group would be consistent with the gap the paper implies, though it would not by itself confirm it.


This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: HarDBench · LLMs


Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes; we don’t republish. The full content lives on arxiv.org.
