Research Models & Releases · The Decoder · May 31

AI search agents often confirm what they already know instead of actually researching the web

A new benchmark from Harbin Institute of Technology exposes a critical weakness in deployed search agents: GPT-5.4 and Kimi K2.6 largely recycle training-data knowledge rather than genuinely researching the web. LiveBrowseComp, which restricts queries to events from the past 90 days, reveals that performance collapses when models cannot rely on memorized information. This finding reshuffles existing capability rankings and signals that current search-agent architectures may be fundamentally limited in real-time reasoning, raising questions about their practical utility for time-sensitive applications.

Modelwire context

Explainer

The 90-day recency window is not just a design choice; it is the entire argument. By restricting queries to events that postdate most training cutoffs, LiveBrowseComp removes the escape hatch that lets models fake retrieval by surfacing memorized answers. The benchmark is essentially a lie detector for search agents.
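
To make the recency argument concrete, here is a minimal sketch of how a benchmark might filter its question pool so that only events inside the window, and after a model's training cutoff, remain. It is an illustration under stated assumptions, not LiveBrowseComp's actual pipeline: the `Query` type, the `select_live_queries` function, and every date below are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Query:
    """A benchmark question tied to the date of the event it asks about (hypothetical type)."""
    text: str
    event_date: date


def select_live_queries(queries, as_of, training_cutoff, window_days=90):
    """Keep queries whose event falls inside the recency window AND after the cutoff.

    Assumption: the model's training cutoff is known. Real cutoffs are often
    only approximately disclosed, so this is a simplification.
    """
    window_start = as_of - timedelta(days=window_days)
    return [
        q for q in queries
        if window_start <= q.event_date <= as_of
        and q.event_date > training_cutoff
    ]


# Made-up example data: one stale event, one recent event after the cutoff,
# and one recent event that the model could still have memorized.
queries = [
    Query("old event", date(2024, 1, 15)),
    Query("recent event, after cutoff", date(2025, 5, 10)),
    Query("recent event, before cutoff", date(2025, 3, 20)),
]

live = select_live_queries(
    queries,
    as_of=date(2025, 5, 31),
    training_cutoff=date(2025, 4, 1),
)
print([q.text for q in live])  # ['recent event, after cutoff']
```

Only the question the model could not have seen during training survives the filter, which is why a correct answer on it is evidence of actual retrieval rather than recall.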

This story is largely disconnected from recent activity in our archive, as we have no prior coverage of search-agent evaluation methodology or the Harbin Institute team. It does, however, belong to a broader and increasingly active conversation about the gap between benchmark performance and real-world utility in deployed AI products. The core tension here, that models optimized on static corpora may be structurally ill-suited for live information tasks, has implications for every product currently marketed as an AI research assistant. The reshuffling of capability rankings is worth taking seriously because it suggests prior leaderboard results on web-search tasks may have been measuring memory, not reasoning.

Watch whether OpenAI or the Kimi team publish a direct response to LiveBrowseComp scores within the next 60 days, either by disputing the methodology or by releasing updated agents with verifiable retrieval traces. Silence from both would itself be informative.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: GPT-5.4 · Kimi K2.6 · Harbin Institute of Technology · LiveBrowseComp

Read full story at The Decoder → (the-decoder.com)


