Research Tools & Code · arXiv cs.LG · 14h ago

SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions

SWE-Interact reframes software engineering benchmarks around realistic developer workflows rather than autonomous task completion. Instead of handing agents complete specifications upfront, the testbed simulates iterative collaboration: a user simulator begins with vague requirements, inspects intermediate work, and progressively refines constraints. This shift matters because it exposes whether coding agents can handle requirement discovery, adapt to feedback loops, and build incrementally on their own output. The benchmark design reflects how actual development teams operate, making it a more honest stress test for production-ready coding systems than existing autonomous-only evaluations.
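The interaction pattern described above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the benchmark's actual harness: every name here (`SimulatedUser`, `run_session`, `incremental_agent`, `forgetful_agent`) is hypothetical, requirements are modeled as plain strings, and "implementing" a requirement just means adding it to a set.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class SimulatedUser:
    """Hypothetical user simulator: starts vague, discloses one constraint per turn."""
    vague_request: str
    refinements: List[str]  # constraints the user reveals progressively
    disclosed: int = 0      # how many refinements have been revealed so far

    def review(self, implemented: Set[str]) -> Optional[str]:
        """Inspect intermediate work and return the next piece of feedback."""
        # Re-raise any already-disclosed constraint the current work no longer meets.
        for req in self.refinements[: self.disclosed]:
            if req not in implemented:
                return req
        # Otherwise reveal the next constraint, if any remain.
        if self.disclosed < len(self.refinements):
            req = self.refinements[self.disclosed]
            self.disclosed += 1
            return req
        return None  # everything disclosed and satisfied


# An agent maps (current work, new message) to updated work.
Agent = Callable[[Set[str], str], Set[str]]


def incremental_agent(work: Set[str], message: str) -> Set[str]:
    """Builds on its own earlier output."""
    return work | {message}


def forgetful_agent(work: Set[str], message: str) -> Set[str]:
    """Rewrites from scratch each turn, discarding earlier work."""
    return {message}


def run_session(user: SimulatedUser, agent: Agent, max_turns: int = 6) -> Tuple[bool, List[str]]:
    """Run the feedback loop; return (user satisfied?, messages the user sent)."""
    transcript = [user.vague_request]
    work = agent(set(), user.vague_request)
    for _ in range(max_turns):
        feedback = user.review(work)
        if feedback is None:
            return True, transcript
        transcript.append(feedback)
        work = agent(work, feedback)
    return False, transcript


if __name__ == "__main__":
    for agent in (incremental_agent, forgetful_agent):
        user = SimulatedUser("build a report tool", ["csv export", "utf-8 support"])
        satisfied, transcript = run_session(user, agent)
        print(agent.__name__, satisfied, len(transcript))
```

In this toy setting, an agent that builds on its prior work converges once all constraints are disclosed, while one that discards earlier work keeps getting the same feedback until the turn budget runs out. A specification-complete evaluation, where all requirements arrive in one message, would not separate the two.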

Modelwire context

Explainer

The deeper provocation here is not just that existing benchmarks are incomplete, but that they may be actively misleading: an agent that scores well on specification-complete tasks could still fail entirely when requirements arrive incrementally, which is how almost all real software projects actually begin.

This connects directly to the Agents-A1 work covered the same day ('Scaling the Horizon, Not the Parameters'), which showed that extending agent trajectories to 45K tokens was the key variable in matching frontier performance. SWE-Interact is essentially asking whether those long-horizon trajectories hold up when the goal itself is shifting mid-trajectory, not just when the path to a fixed goal is long. The WorldEvolver piece on self-evolving world models is also relevant here: if agents cannot maintain reliable internal models of a user's evolving intent, the feedback loops SWE-Interact introduces will expose that failure directly. Together, these papers suggest the field is converging on a shared problem: agents that plan well under stable conditions but degrade when the environment, or the specification, keeps changing.

Watch whether any of the major coding agent labs (Cognition, Cursor, or similar) adopt SWE-Interact as a reported metric within the next two quarters. Adoption by even one production-facing team would signal the benchmark has traction beyond academic evaluation.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: SWE-Interact · SWE benchmarks

