Modelwire

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

Researchers introduce CoopEval, a benchmark that tests how LLM agents behave in social dilemmas such as the prisoner's dilemma and public goods games. The study finds that recent models consistently defect rather than cooperate, and then evaluates game-theoretic mechanisms, including repeated play and reputation systems, for restoring cooperative equilibria.
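For readers unfamiliar with the setup the summary names, here is a minimal sketch of a prisoner's dilemma payoff table and a repeated-play loop in which a reputation-conditioned strategy (cooperate only with a partner who cooperated last round) faces an always-defecting agent. The payoff values, strategy names, and policies below are illustrative assumptions for exposition, not details taken from the paper.

```python
# Payoffs to (row, column) for actions C (cooperate) / D (defect),
# using the standard T > R > P > S ordering (illustrative values).
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation (R)
    ("C", "D"): (0, 5),  # sucker's payoff vs. temptation (S, T)
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection (P)
}

def always_defect(history):
    return "D"

def reciprocator(history):
    # Cooperate on the first round, then mirror the opponent's last action.
    return "C" if not history else history[-1][1]

def play_repeated(strategy_a, strategy_b, rounds=10):
    history_a, history_b = [], []  # each entry: (own action, opponent action)
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strategy_a(history_a), strategy_b(history_b)
        pa, pb = PAYOFFS[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        history_a.append((a, b))
        history_b.append((b, a))
    return score_a, score_b

if __name__ == "__main__":
    # Two reciprocators sustain mutual cooperation; against an always-defector,
    # the reciprocator limits its losses after the first round.
    print(play_repeated(reciprocator, reciprocator))   # (30, 30)
    print(play_repeated(reciprocator, always_defect))  # (9, 14)
```

This is the intuition behind repeated play as a cooperation-sustaining mechanism: once future rounds exist, conditioning on a partner's past behavior can make mutual cooperation an equilibrium rather than defection.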

Mentions: CoopEval · LLM agents

Modelwire summarizes — we don’t republish. The full article lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.

Related

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

arXiv cs.LG

QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

arXiv cs.CL

Context Over Content: Exposing Evaluation Faking in Automated Judges

arXiv cs.CL