Research Tools & Code · arXiv cs.CL · May 22

CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test

CoSPlay addresses a critical bottleneck in LLM code generation: the reliance on ground-truth unit tests for training and inference. By letting a model jointly refine its code and its tests through cooperative self-play, without external test data, the framework removes a major constraint on scaling test-time compute for code tasks. The approach matters because it decouples code verification from expensive human-annotated test suites, which could extend verifiable reward signals to production systems where such annotations are unavailable.

Modelwire context

Explainer

The key detail the summary gestures at but doesn't unpack is the cooperative structure itself: code generation and test generation are treated as two agents that mutually constrain each other, so neither output is treated as ground truth. This is a different bet from prior self-repair approaches, which typically held the tests fixed and iterated only on the code.
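To make the "mutually constrain" idea concrete, here is a minimal sketch of one way candidate programs and candidate tests could score each other when neither is trusted. It is an assumption for illustration, not CoSPlay's actual procedure: the names `pass_matrix` and `co_score`, the toy `abs`-style task, and the simple reweighting loop are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration: a candidate program and a candidate test.
Solution = Callable[[int], int]
Test = Callable[[Solution], bool]


def pass_matrix(solutions: List[Solution], tests: List[Test]) -> List[List[bool]]:
    """Run every test against every solution; a crash counts as a failure."""
    matrix = []
    for sol in solutions:
        row = []
        for test in tests:
            try:
                row.append(bool(test(sol)))
            except Exception:
                row.append(False)
        matrix.append(row)
    return matrix


def normalize(weights: List[float]) -> List[float]:
    """Scale weights to sum to 1, falling back to uniform if all are zero."""
    total = sum(weights)
    if total == 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def co_score(matrix: List[List[bool]], rounds: int = 20) -> Tuple[List[float], List[float]]:
    """Alternate between scoring code by trusted tests and tests by trusted code."""
    n_sol, n_test = len(matrix), len(matrix[0])
    code_w = [1.0 / n_sol] * n_sol
    test_w = [1.0 / n_test] * n_test
    for _ in range(rounds):
        # A solution earns the weight of every test it passes.
        code_w = normalize(
            [sum(test_w[j] for j in range(n_test) if matrix[i][j]) for i in range(n_sol)]
        )
        # A test earns the weight of every solution that passes it.
        test_w = normalize(
            [sum(code_w[i] for i in range(n_sol) if matrix[i][j]) for j in range(n_test)]
        )
    return code_w, test_w


# Toy task: absolute value. The second solution and the third test are wrong.
solutions = [abs, lambda x: x, lambda x: x if x >= 0 else -x]
tests = [lambda f: f(3) == 3, lambda f: f(-2) == 2, lambda f: f(-2) == -2]
code_w, test_w = co_score(pass_matrix(solutions, tests))
print(code_w, test_w)
```

In this toy run the two correct solutions end up weighted above the incorrect one, and the correct negative-input test ends up above the incorrect one. The same loop would reward a wrong majority just as readily, which is why agreement alone is not evidence of correctness.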

The verification problem CoSPlay targets has a structural cousin in the compliance and legal work we covered this week. The 'Asking For An Old Friend' statutory QA paper exposed what happens when a model's internal signals diverge from external ground truth, specifically in legal domains where annotated correct answers are scarce and expensive. CoSPlay is essentially proposing a general answer to that scarcity problem for code: if you can't get labeled tests, generate and co-refine them. Whether that answer travels to domains like statutory QA or the KYC pipelines in the SGER entity resolution work is an open question, but the underlying constraint (annotation bottlenecks in high-stakes tasks) is the same.

The critical test is whether CoSPlay's self-generated unit tests catch functionally incorrect code that passes superficial checks. If independent evaluations on HumanEval-hard or SWE-bench show false-positive rates comparable to human-written test suites, the annotation-free claim holds. If not, the cooperative loop is producing correlated failures, not genuine verification.
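The false-positive check described above can be sketched as follows, assuming an evaluator has a trusted reference implementation to label candidates as correct or incorrect. The function names, the toy `abs` task, and the input range are hypothetical choices for illustration and are not taken from the paper or from any benchmark harness.

```python
from typing import Callable, Iterable, List

# Hypothetical types for illustration: a candidate program and a unit test.
Solution = Callable[[int], int]
Test = Callable[[Solution], bool]


def is_correct(sol: Solution, reference: Solution, inputs: List[int]) -> bool:
    """Label a solution by comparing it with a trusted reference on every input."""
    for x in inputs:
        try:
            if sol(x) != reference(x):
                return False
        except Exception:
            return False
    return True


def passes_all(sol: Solution, tests: List[Test]) -> bool:
    """True only if the solution passes every test without raising."""
    for test in tests:
        try:
            if not test(sol):
                return False
        except Exception:
            return False
    return True


def false_positive_rate(
    solutions: List[Solution],
    tests: List[Test],
    reference: Solution,
    inputs: Iterable[int],
) -> float:
    """Fraction of incorrect solutions that the test suite fails to reject."""
    inputs = list(inputs)
    incorrect = [s for s in solutions if not is_correct(s, reference, inputs)]
    if not incorrect:
        raise ValueError("no incorrect solutions to measure against")
    slipped = sum(1 for s in incorrect if passes_all(s, tests))
    return slipped / len(incorrect)


# Toy task: absolute value. One wrong candidate special-cases the tested input.
candidates = [abs, lambda x: x, lambda x: 2 if x == -2 else x]
generated_tests = [lambda f: f(3) == 3, lambda f: f(-2) == 2]
rate = false_positive_rate(candidates, generated_tests, abs, range(-5, 6))
print(rate)  # 0.5: the identity function is caught, the special-cased one is not
```

Running the same function once with self-generated tests and once with a human-written suite over the same candidates would give the two rates the comparison calls for.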

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: CoSPlay · RLVR · Test-Time Scaling

