CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

Researchers introduce CoopEval, a benchmark that tests how LLM agents behave in social dilemmas such as the prisoner's dilemma and public goods games. The study finds that recent models consistently defect rather than cooperate, then evaluates game-theoretic mechanisms, including repeated play and reputation systems, for their ability to restore cooperative equilibria.
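For readers unfamiliar with the setup, here is a minimal sketch of the one-shot prisoner's dilemma payoff structure that makes defection the dominant strategy. The payoff values (T=5, R=3, P=1, S=0) and the `best_response` helper are textbook illustrations, not CoopEval's actual parameters.

```python
# Illustrative one-shot prisoner's dilemma. Payoff values are the
# textbook defaults (T=5, R=3, P=1, S=0), not necessarily CoopEval's.
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation: reward R for each
    ("C", "D"): (0, 5),  # sucker's payoff S vs. temptation T
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection: punishment P for each
}

def best_response(opponent_move: str) -> str:
    """Return the move maximizing our payoff against a fixed opponent move."""
    return max("CD", key=lambda m: PAYOFFS[(m, opponent_move)][0])

# Defection dominates: it is the best response whether the opponent
# cooperates or defects, which is why one-shot play collapses to (D, D)
# even though (C, C) would leave both players better off.
assert best_response("C") == "D" and best_response("D") == "D"
```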
Modelwire context
Explainer
The more pointed finding isn't just that LLMs defect in social dilemmas, but that standard game-theoretic remedies borrowed from human behavioral economics (reputation systems, repeated play) can partially restore cooperation. That raises the question of whether these mechanisms are patching a values gap or a reasoning gap in the models.
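To see why repeated play can repair the equilibrium without touching an agent's values, consider the standard folk-theorem arithmetic: under a grim-trigger strategy, mutual cooperation is sustainable whenever the discount factor d satisfies d >= (T - R) / (T - P). A hedged sketch, reusing the textbook payoffs above; the paper's actual mechanism parameters may differ.

```python
# Grim-trigger sustainability check: a standard repeated-game result,
# not a claim about CoopEval's exact mechanism. Cooperation holds when
# the value of cooperating forever beats one defection followed by
# permanent punishment: R/(1-d) >= T + d*P/(1-d), i.e. d >= (T-R)/(T-P).
T, R, P = 5, 3, 1  # illustrative textbook payoffs

def cooperation_sustainable(discount: float) -> bool:
    """True if grim trigger supports mutual cooperation at this discount factor."""
    return discount >= (T - R) / (T - P)

print(cooperation_sustainable(0.3))  # False: the future is discounted too heavily
print(cooperation_sustainable(0.6))  # True: the threshold here is 0.5
```

If the models' defection is a reasoning gap, making this continuation value salient (via repeated play or reputation) should shift behavior, which is roughly what the partial-restoration result suggests.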
CoopEval sits in a growing cluster of benchmark-skepticism work on this site. The 'Diagnosing LLM Judge Reliability' paper from the same day found that aggregate consistency scores (~96%) masked per-instance logical failures affecting a third to two-thirds of documents, a pattern that should make readers cautious here too: a model that appears cooperative under one elicitation condition may be defecting under another. More broadly, CoopEval joins QuantCode-Bench and DiscoTrace as part of a one-day burst of capability-probing benchmarks, which is worth noting because benchmark saturation is itself a reliability problem. When evaluation frameworks multiply faster than the models being tested change, it becomes harder to know which results generalize.
Watch whether any of the major lab alignment teams (Anthropic, DeepMind) adopt CoopEval's social-dilemma framing in their own evals within the next two quarters. If they do, the mechanism-testing component becomes a de facto standard; if they don't, this stays a useful academic probe without downstream policy weight.