Triadic Werewolf: A Jester Role for Multi-Hop Theory of Mind in LLMs

Researchers have designed a three-faction variant of the Werewolf social-deduction game that exposes a critical blind spot in LLM reasoning: models fail to account for opposing incentive structures even when those incentives directly contradict observable behavior. The variant adds a Jester faction that wins by being eliminated, and the study finds that GPT-4.1 votes out the Jester on day one 60-70% of the time. That self-sabotaging move suggests current models rely on surface-level language patterns rather than genuine multi-agent utility reasoning. Performance gaps between models and the effectiveness of self-learning across architectures hint at fundamental differences in how LLMs construct opponent models, and raise the question of whether scaling alone closes this theory-of-mind gap.

Modelwire context

Explainer

The deeper finding isn't just that models vote wrong: it's that the Jester's winning condition is publicly stated in the game rules, meaning models have access to the correct incentive structure and still fail to apply it. This rules out information asymmetry as an excuse and points squarely at a reasoning failure, not a knowledge gap.
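To make the incentive structure concrete, here is a minimal sketch of the day-vote decision as an expected-payoff calculation. Everything in it is an illustrative assumption rather than the paper's setup: the payoff values, the treatment of a Jester win as a loss for every other faction, the player names, and the belief numbers are all hypothetical. It only shows why "vote for whoever looks most suspicious" and "vote to maximize my faction's payoff" can disagree once a Jester is in the game.

```python
from enum import Enum


class Faction(Enum):
    VILLAGER = "villager"
    WEREWOLF = "werewolf"
    JESTER = "jester"


def day_vote_payoff(voter: Faction, target: Faction) -> int:
    """Hypothetical payoff to `voter` when `target` is voted out by day."""
    if target is Faction.JESTER:
        # From the summary: the Jester wins by being eliminated.
        # Assumption: that outcome counts as a loss for everyone else.
        return 1 if voter is Faction.JESTER else -1
    if target is Faction.WEREWOLF:
        return {Faction.VILLAGER: 1, Faction.WEREWOLF: -1, Faction.JESTER: 0}[voter]
    # Remaining case: a villager is eliminated.
    return {Faction.VILLAGER: -1, Faction.WEREWOLF: 1, Faction.JESTER: 0}[voter]


def expected_payoff(voter: Faction, belief: dict[Faction, float]) -> float:
    """Expected payoff of voting out a player, given a belief over their faction."""
    return sum(p * day_vote_payoff(voter, f) for f, p in belief.items())


def pick_vote(voter: Faction, beliefs: dict[str, dict[Faction, float]]) -> str:
    """Incentive-aware choice: the target with the highest expected payoff."""
    return max(beliefs, key=lambda name: expected_payoff(voter, beliefs[name]))


def most_suspicious(beliefs: dict[str, dict[Faction, float]]) -> str:
    """Surface heuristic: the target least likely to be a villager."""
    return max(beliefs, key=lambda name: 1.0 - beliefs[name][Faction.VILLAGER])


# Made-up beliefs: one player draws attention, the other stays quiet.
beliefs = {
    "loud_player": {Faction.JESTER: 0.6, Faction.WEREWOLF: 0.3, Faction.VILLAGER: 0.1},
    "quiet_player": {Faction.JESTER: 0.05, Faction.WEREWOLF: 0.5, Faction.VILLAGER: 0.45},
}

if __name__ == "__main__":
    print("surface heuristic votes for:", most_suspicious(beliefs))
    print("payoff-aware villager votes for:", pick_vote(Faction.VILLAGER, beliefs))
```

Under these made-up numbers the two rules pick different targets. The surface heuristic goes after the player who draws the most attention, while the payoff-aware rule avoids that player because attention-seeking is what a Jester wants.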

We have no prior coverage in our archive to anchor this story to. It belongs to a growing body of academic work probing whether LLMs reason about other agents or merely pattern-match on surface text. That distinction matters because it sits underneath nearly every high-stakes deployment argument: autonomous negotiation, multi-agent orchestration, and AI-assisted decision-making all implicitly assume some capacity for modeling opposing incentives. The Jester result is a clean, reproducible stress test of that assumption, and its simplicity is its strength as a benchmark.

Watch whether the authors or independent replicators run this benchmark against reasoning-focused models like o3 or Gemini 2.5 Pro. If those models show meaningfully lower day-one Jester elimination rates, that would suggest chain-of-thought scaffolding partially compensates for the gap; if they don't, that would suggest neither scaling nor reasoning traces is sufficient to close it.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: GPT-4.1 · DeepSeek-V3.1 · Llama-3.3-70B · Werewolf

Read the full story at arXiv cs.CL (arxiv.org)
