Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents

Researchers tested whether collective intelligence emerges in large agent societies by probing a 2M-agent platform called MoltBook with hierarchical reasoning tasks. The study found no evidence that scale alone produces emergent group intelligence, with agent collectives underperforming individual frontier models on complex reasoning.

Modelwire context

Explainer

The more pointed finding is directional: agent collectives on MoltBook didn't just fail to exceed frontier models, they fell short of them on complex reasoning, which puts a concrete ceiling on the 'more agents equals more intelligence' assumption that underlies several commercial multi-agent architectures currently being built out.

This connects directly to the token consumption research covered the same day ('How Do AI Agents Spend Your Money'), which showed agentic workflows already burn 1000x more tokens than standard inference. Taken together, the two papers sketch a cost-capability picture that should concern anyone scaling agent societies: you pay orders of magnitude more while collective performance may actually regress on hard tasks. The MoltBook result also adds empirical weight to the tension implicit in the aggregate-vs-personalized judges paper, which asked whether pooling LLM evaluators produces better judgments. On this evidence, pooling agents at scale doesn't reliably produce better reasoning either.
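To make the cost-capability arithmetic concrete, here is a minimal Python sketch. Only the 1000x token multiplier comes from the summary above; the function names, the token count, the price, and the accuracy figures are hypothetical placeholders chosen for illustration, not numbers reported by either paper.

```python
def inference_costs(baseline_tokens: int,
                    price_per_million: float,
                    multiplier: float = 1000.0) -> tuple[float, float]:
    """Return (standard_cost, agentic_cost) for one task.

    multiplier is the token overhead of an agentic workflow relative to
    standard inference (1000x per the summary above). All other inputs
    are hypothetical.
    """
    standard = baseline_tokens / 1_000_000 * price_per_million
    return standard, standard * multiplier


def cost_per_correct(cost_per_task: float, accuracy: float) -> float:
    """Expected spend per correctly solved task at a given accuracy."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_task / accuracy


# Hypothetical inputs: 2,000 tokens per task at 10.0 currency units per
# million tokens; solo accuracy 0.8 versus collective accuracy 0.5.
standard, agentic = inference_costs(2_000, 10.0)
solo_spend = cost_per_correct(standard, 0.8)
collective_spend = cost_per_correct(agentic, 0.5)
```

With these placeholder inputs the collective costs more per task and also needs more attempts per correct answer, so the gap in spend per correct answer ends up larger than the raw 1000x token multiplier.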

Watch whether the MoltBook team or independent researchers can identify a task class where collective agent performance does exceed frontier solo models. If no such class surfaces within the next two or three benchmark cycles, the architectural case for massive agent societies weakens considerably.

Coverage we drew on

How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: MoltBook · Superminds Test · LLM agents

Read the full story at arXiv cs.CL (arxiv.org)
