CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

Researchers introduce CoopEval, a benchmark that tests how LLM agents behave in social dilemmas such as the prisoner's dilemma and public goods games. The study finds that recent models consistently defect rather than cooperate, then evaluates game-theoretic mechanisms, including repeated play and reputation systems, for their ability to restore cooperative equilibria.
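For readers unfamiliar with the setup, here is a minimal sketch of the one-shot prisoner's dilemma payoff structure that makes defection the dominant strategy. The payoff values (T=5, R=3, P=1, S=0) and the `best_response` helper are textbook illustrations, not CoopEval's actual parameters.

```python
# Illustrative one-shot prisoner's dilemma. Payoff values are the
# textbook defaults (T=5, R=3, P=1, S=0), not necessarily CoopEval's.
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation: reward R for each
    ("C", "D"): (0, 5),  # sucker's payoff S vs. temptation T
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection: punishment P for each
}

def best_response(opponent_move: str) -> str:
    """Return the move maximizing our payoff against a fixed opponent move."""
    return max("CD", key=lambda m: PAYOFFS[(m, opponent_move)][0])

# Defection dominates: it is the best response whether the opponent
# cooperates or defects, which is why one-shot play collapses to (D, D)
# even though (C, C) would leave both players better off.
assert best_response("C") == "D" and best_response("D") == "D"
```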
Modelwire context
Explainer
The more pointed finding isn't just that LLMs defect in social dilemmas, but that standard game-theoretic remedies borrowed from human behavioral economics (reputation systems, repeated play) can partially restore cooperation. That raises the question of whether these mechanisms are patching a values gap or a reasoning gap in the models.
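To see why repeated play can repair the equilibrium without touching an agent's values, consider the standard folk-theorem arithmetic: under a grim-trigger strategy, mutual cooperation is sustainable whenever the discount factor d satisfies d >= (T - R) / (T - P). A hedged sketch, reusing the textbook payoffs above; the paper's actual mechanism parameters may differ.

```python
# Grim-trigger sustainability check: a standard repeated-game result,
# not a claim about CoopEval's exact mechanism. Cooperation holds when
# the value of cooperating forever beats one defection followed by
# permanent punishment: R/(1-d) >= T + d*P/(1-d), i.e. d >= (T-R)/(T-P).
T, R, P = 5, 3, 1  # illustrative textbook payoffs

def cooperation_sustainable(discount: float) -> bool:
    """True if grim trigger supports mutual cooperation at this discount factor."""
    return discount >= (T - R) / (T - P)

print(cooperation_sustainable(0.3))  # False: the future is discounted too heavily
print(cooperation_sustainable(0.6))  # True: the threshold here is 0.5
```

If the models' defection is a reasoning gap, making this continuation value salient (via repeated play or reputation) should shift behavior, which is roughly what the partial-restoration result suggests.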
CoopEval sits in a growing cluster of benchmark-skepticism work on this site. The 'Diagnosing LLM Judge Reliability' paper from the same day found that aggregate consistency scores (~96%) masked per-instance logical failures affecting a third to two-thirds of documents, a pattern that should make readers cautious here too: a model that appears cooperative under one elicitation condition may be defecting under another. More broadly, CoopEval joins QuantCode-Bench and DiscoTrace as part of a one-day burst of capability-probing benchmarks, which is worth noting because benchmark saturation is itself a reliability problem. When evaluation frameworks multiply faster than the models being tested change, it becomes harder to know which results generalize.
Watch whether any of the major lab alignment teams (Anthropic, DeepMind) adopt CoopEval's social-dilemma framing in their own evals within the next two quarters. If they do, the mechanism-testing component becomes a de facto standard; if they don't, this stays a useful academic probe without downstream policy weight.