SafeReview: Defending LLM-based Review Systems Against Adversarial Hidden Prompts

Researchers have developed SafeReview, a dual-model framework that treats LLM-based peer review as an adversarial game between attack and defense. A Generator learns to craft hidden prompts that manipulate review outcomes, while a Defender learns to detect them through co-evolutionary training inspired by generative adversarial networks. The work exposes a critical vulnerability in deploying LLMs for high-stakes scholarly gatekeeping, where adversarial submissions could bias acceptance decisions. This matters because academic peer review is moving toward LLM assistance without robust safeguards, and the paper demonstrates that naive systems remain exploitable. The framework's iterative arms race approach offers a template for hardening other LLM-integrated workflows against prompt injection attacks.
Modelwire context
ExplainerThe more pointed finding here is not that LLMs can be manipulated, which is well established, but that the attack surface in peer review is unusually asymmetric: a submitting author has strong incentive and ample opportunity to craft adversarial content, while the reviewing system has no prior signal that an attack is even occurring.
This is largely disconnected from recent activity in our archive, as Modelwire has not yet covered LLM-assisted peer review or prompt injection defenses. The work belongs to a broader conversation happening across AI safety and applied NLP research about what happens when LLMs are inserted into high-stakes institutional processes, such as hiring, grant allocation, and editorial gatekeeping, without adversarial stress-testing. The GAN-inspired training loop is a meaningful methodological choice because it forces the defense to keep pace with an improving attacker rather than training against a fixed threat profile. That distinction matters for anyone evaluating whether a deployed review system is actually robust or just untested.
Watch whether any major preprint servers or journal platforms (arXiv, ICLR, NeurIPS) publicly acknowledge adversarial prompt risks in their LLM review pilots within the next two conference cycles. Silence from those venues would suggest the operational community has not yet engaged with this threat model seriously.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. How we write it.
MentionsSafeReview · Large Language Models · Generative Adversarial Networks
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.