Research Models & Releases·arXiv cs.LG·Apr 17

SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems

Researchers released SocialGrid, a benchmark environment modeled on Among Us that tests LLM agents on planning and social reasoning in multi-agent settings. GPT-OSS-120B and other models scored below 60% on task completion, revealing navigation and planning failures that confound measurement of actual social intelligence.

Modelwire context

Explainer

The deeper problem SocialGrid surfaces is one of measurement validity: when agents fail at navigation and spatial planning, you can't tell whether they are also failing at social deduction, because both failure modes are folded into the same score. The benchmark may be measuring navigation and planning bottlenecks more than theory of mind.

This lands one day after CoopEval (covered April 16), which tested LLM agents in social dilemmas such as the prisoner's dilemma and found models defaulting to defection rather than cooperation. Together, the two papers sketch a consistent picture: LLMs struggle with multi-agent social reasoning across very different task framings, whether the setting is abstract game theory or embodied grid navigation. The difference is that CoopEval's failure mode is clean and attributable, while SocialGrid's confounded design makes it harder to know what exactly is breaking. That distinction matters for anyone who wants benchmark results to guide model development rather than just rank models.

Watch whether follow-up work disentangles navigation competence from social reasoning by holding movement ability constant, perhaps through oracle-pathing ablations. If scores remain low under those conditions, the social intelligence deficit is real; if they recover substantially, SocialGrid is measuring the wrong thing.
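To make the oracle-pathing idea concrete, here is a minimal Python sketch of what such an ablation could look like. It is not taken from the SocialGrid paper or its code: the grid encoding (`'#'` for walls), the function names `oracle_path` and `navigation_share`, and the choice of breadth-first search are all assumptions made for illustration. The premise is that the agent still picks where to go, while a shortest-path routine handles the movement, so any remaining shortfall cannot be blamed on navigation.

```python
from collections import deque


def oracle_path(grid, start, goal):
    """Shortest 4-connected path on a grid via breadth-first search.

    Hypothetical encoding: grid is a list of equal-length strings and
    '#' marks a wall. Returns a list of (row, col) cells from start to
    goal inclusive, or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


def navigation_share(baseline_score, oracle_score):
    """Fraction of the baseline shortfall (measured against a perfect
    score of 1.0) that disappears once the oracle handles movement.

    Illustrative metric only; it is not defined in the source.
    """
    gap = 1.0 - baseline_score
    if gap <= 0:
        return 0.0
    return (oracle_score - baseline_score) / gap
```

With placeholder numbers, `navigation_share(0.5, 0.75)` returns `0.5`: half of the shortfall would be attributable to navigation and the rest to something else, possibly social reasoning. A value near 1.0 would suggest the benchmark is mostly measuring movement, and a value near 0.0 would suggest the social deficit is real.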

Coverage we drew on

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: SocialGrid · GPT-OSS-120B · Among Us · LLM

