Modelwire

CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas

Researchers introduce CoopEval, a benchmark that tests how LLM agents behave in social dilemmas such as the prisoner's dilemma and public goods games. The study finds that recent models consistently defect rather than cooperate, and then evaluates game-theoretic mechanisms, including repeated play and reputation systems, for restoring cooperative equilibria.
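For readers unfamiliar with the setup the summary names, here is a minimal sketch of a prisoner's dilemma payoff table and a repeated-play loop in which a reputation-conditioned strategy (cooperate only with a partner who cooperated last round) faces an always-defecting agent. The payoff values, strategy names, and policies below are illustrative assumptions for exposition, not details taken from the paper.

```python
# Payoffs to (row, column) for actions C (cooperate) / D (defect),
# using the standard T > R > P > S ordering (illustrative values).
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation (R)
    ("C", "D"): (0, 5),  # sucker's payoff vs. temptation (S, T)
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection (P)
}

def always_defect(history):
    return "D"

def reciprocator(history):
    # Cooperate on the first round, then mirror the opponent's last action.
    return "C" if not history else history[-1][1]

def play_repeated(strategy_a, strategy_b, rounds=10):
    history_a, history_b = [], []  # each entry: (own action, opponent action)
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strategy_a(history_a), strategy_b(history_b)
        pa, pb = PAYOFFS[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        history_a.append((a, b))
        history_b.append((b, a))
    return score_a, score_b

if __name__ == "__main__":
    # Two reciprocators sustain mutual cooperation; against an always-defector,
    # the reciprocator limits its losses after the first round.
    print(play_repeated(reciprocator, reciprocator))   # (30, 30)
    print(play_repeated(reciprocator, always_defect))  # (9, 14)
```

This is the intuition behind repeated play as a cooperation-sustaining mechanism: once future rounds exist, conditioning on a partner's past behavior can make mutual cooperation an equilibrium rather than defection.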

Mentions: CoopEval · LLM agents

Modelwire summarizes — we don’t republish. The full article lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.

Related

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

arXiv cs.LG

QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies

arXiv cs.CL

Context Over Content: Exposing Evaluation Faking in Automated Judges

arXiv cs.CL