Research Models & Releases·arXiv cs.CL·1d ago

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

Korean-language AI evaluation has lagged behind English-language benchmarking, masking performance gaps that appear when frontier models are deployed in non-English contexts. K-BrowseComp exposes a critical weakness: GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1 drop from strong BrowseComp scores to 30-45% accuracy on Korean web-browsing tasks, while locally developed models score near zero. The gap signals that agentic reasoning and web interaction remain brittle outside dominant training languages. It raises questions about deployment readiness in non-English markets and about the adequacy of current multilingual training approaches.

Modelwire context

Explainer

K-BrowseComp does more than show that frontier models perform worse in Korean. It isolates a specific failure mode: agentic reasoning (planning, navigation, and form-filling across web pages) degrades far more sharply than base language understanding would predict. That suggests the brittleness lies not in translation but in reasoning under unfamiliar information structures.

This connects directly to two concurrent threads in our coverage. First, the SN-WER paper from the same day identified how script and encoding mismatches mask true performance gaps in multilingual systems; K-BrowseComp shows the same masking effect but at the reasoning layer rather than the acoustic layer. Second, the AGENTCL framework published today emphasizes that current benchmarks conflate retrieval with genuine learning and adaptation. K-BrowseComp suggests agents aren't adapting to Korean web contexts at all, which is a more severe version of the continual learning problem: not just forgetting, but failing to reason in unfamiliar linguistic and structural environments.

If the same three frontier models score within 5-10 percentage points of their BrowseComp baselines on a Japanese or Vietnamese web-browsing benchmark (both morphologically and structurally different from English but with larger training corpora than Korean), that would suggest the gap reflects corpus size rather than reasoning brittleness. If the gap persists across multiple non-English languages, that would support the view that agents need fundamentally different training approaches for non-English deployment.
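The test described above can be written out as a small check. This is a minimal sketch, not anything from the paper: the function name `read_gap`, the labels it returns, the 10-point default threshold (the loose end of the 5-10 point range), and all of the scores are hypothetical placeholders chosen for illustration.

```python
def read_gap(baselines, target_scores, threshold_pp=10.0):
    """Compare per-model accuracy (in percent) on a new-language benchmark
    against each model's BrowseComp baseline.

    Returns "corpus_size" when every model lands within `threshold_pp`
    percentage points of its baseline, and "persistent_gap" otherwise.
    """
    gaps = {model: baselines[model] - target_scores[model] for model in baselines}
    if all(abs(gap) <= threshold_pp for gap in gaps.values()):
        return "corpus_size"
    return "persistent_gap"


# Hypothetical placeholder scores, not reported results.
baselines = {"model_a": 80.0, "model_b": 75.0, "model_c": 70.0}
close_to_baseline = {"model_a": 74.0, "model_b": 68.0, "model_c": 63.0}
far_from_baseline = {"model_a": 40.0, "model_b": 35.0, "model_c": 30.0}

first_reading = read_gap(baselines, close_to_baseline)
second_reading = read_gap(baselines, far_from_baseline)
```

A single language landing in the "persistent_gap" case would not settle the question on its own; the argument above depends on seeing the same outcome across several non-English benchmarks.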

Coverage we drew on

SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: GPT-5.5 · DeepSeek-V4-Pro · GLM-5.1 · K-BrowseComp · Korea Proprietary AI Foundation Model program

Related

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Amazon Shuts Down Internal AI Leaderboard After Employees Cheated

Gemini’s new AI agent is about as good as Google’s demo