Modelwire
DiscoBench shows search agents fail by searching, not asking

A new benchmark called DiscoBench reveals a critical failure mode in AI search agents: they perform worse when they attempt repeated searches on ambiguous queries instead of requesting clarification from users. The best-performing models achieve only 43 percent accuracy overall, while those that search iteratively without asking follow-up questions drop to 51.9 percent. Removing query ambiguity improves accuracy by up to 40 points, suggesting that agent design must prioritize interactive disambiguation over autonomous search loops. This finding reshapes expectations around agentic AI systems, indicating that production search agents need explicit uncertainty handling and user interaction protocols to function reliably.

Modelwire context

Explainer

The benchmark's deeper implication isn't that these agents are bad at searching; it's that they're architecturally overconfident. They default to autonomous looping rather than surfacing uncertainty to users, which means the failure mode is baked into how most agentic pipelines are designed, not just into how individual models perform.
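
To make the design point concrete, here is a minimal sketch of an "ask before you loop" policy. It is not taken from DiscoBench or from any shipping product. The names `Action` and `next_action`, the idea of passing in a list of candidate interpretations, and the `max_searches` budget are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    kind: str     # "ask" (surface uncertainty to the user) or "search"
    payload: str  # the clarifying question, or the query to run


def next_action(query: str,
                interpretations: List[str],
                searches_done: int,
                max_searches: int = 3) -> Action:
    """Hypothetical policy: prefer a clarifying question over another search.

    `interpretations` is assumed to come from some upstream step that lists
    the plausible readings of the query; how that step works is out of scope.
    """
    # More than one plausible reading: ask instead of guessing and looping.
    if len(interpretations) > 1:
        options = "; ".join(interpretations)
        return Action("ask", f"Did you mean: {options}?")

    # Search budget exhausted: hand control back to the user.
    if searches_done >= max_searches:
        return Action("ask", "I could not find a confident answer. Can you add detail?")

    # Exactly one reading (or none listed): search on it, else on the raw query.
    return Action("search", interpretations[0] if interpretations else query)
```

Calling `next_action("jaguar top speed", ["jaguar (animal) top speed", "Jaguar (car) top speed"], 0)` returns an "ask" action rather than a search. The hard part in practice is the ambiguity detection that this sketch takes as given.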

This connects directly to the multi-agent architecture work covered in 'Conversable Complexity' from early July, which framed agentic LLM collectives as systems where linguistic interaction is the primary mechanism for transparency and coordination. DiscoBench suggests that same interaction layer is being bypassed in production search agents, with agents substituting iteration for communication. It also rhymes with the RAG diagnostic work covered around the same time ('What Survives Into Context'), where the problem wasn't retrieval quality per se but what gets prioritized under constraint. Both papers point at the same design gap: agents optimizing for autonomy at the cost of reliability.

Watch whether any of the major agentic search products (Perplexity, SearchGPT, or similar) ship explicit disambiguation prompts as a configurable behavior within the next two quarters. If they do, DiscoBench will have had measurable influence on production design; if not, it stays a benchmark paper.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes; we don’t republish. The Decoder originally reported this story as “AI search agents don't fail at searching, they fail at asking the right questions when queries get ambiguous”. The full content lives on the-decoder.com. If you’re a publisher and want a different summarization policy for your work, see our takedown page.