Research Models & Releases·arXiv cs.CL·2d ago

YOMI-Bench: A Benchmark for Evaluating Kanji Reading and Phonological Understanding of LLMs for Japanese

Japanese remains a persistent weak point for both open and commercial LLMs, according to a new benchmark that isolates kanji reading and phonological reasoning. YOMI-Bench exposes a gap in how current models handle a script whose pronunciation cannot be read off surface patterns. The finding matters because it indicates that language-specific tuning has not solved this structural challenge, which suggests that scaling alone will not close gaps in non-Latin writing systems. It points to a broader infrastructure problem: multilingual LLM development still treats character-level knowledge, here the mapping from characters to sounds, as solved when it is not.

Modelwire context

Explainer

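YOMI-Bench isolates a specific failure: models can't reliably map written kanji to their phonetic readings, a task that requires reasoning about the word a character appears in rather than matching surface patterns. The same character can take different readings in different words: 生 is read "i" in 生きる (ikiru), "sei" in 学生 (gakusei), and "nama" in 生ビール (nama bīru). This isn't just poor Japanese performance; it suggests that models lack a systematic representation of how writing systems encode sound.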
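To make the kanji-to-reading task concrete, here is a minimal Python sketch of how predicted readings could be scored against reference readings. It is an illustration under stated assumptions, not YOMI-Bench's actual protocol: the function names (`to_hiragana`, `reading_accuracy`), the tiny reference set, and the choice of exact match after kana normalization are all hypothetical.

```python
import unicodedata

# Katakana ァ..ヶ (U+30A1..U+30F6) sit exactly 0x60 above their hiragana
# counterparts, so subtracting the offset converts one script to the other.
KATAKANA_START = 0x30A1
KATAKANA_END = 0x30F6
KANA_OFFSET = 0x60


def to_hiragana(text: str) -> str:
    """Normalize a reading so katakana and hiragana answers compare equal."""
    normalized = unicodedata.normalize("NFKC", text.strip())
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END
        else ch
        for ch in normalized
    )


def reading_accuracy(predictions: dict[str, str], gold: dict[str, set[str]]) -> float:
    """Fraction of gold words whose predicted reading matches an accepted one.

    Words missing from `predictions` count as wrong.
    """
    if not gold:
        raise ValueError("gold must contain at least one word")
    correct = 0
    for word, accepted in gold.items():
        accepted_norm = {to_hiragana(r) for r in accepted}
        predicted = predictions.get(word)
        if predicted is not None and to_hiragana(predicted) in accepted_norm:
            correct += 1
    return correct / len(gold)


# Hypothetical reference readings (assumption: not taken from the benchmark).
GOLD = {
    "生きる": {"いきる"},
    "学生": {"がくせい"},
    "生ビール": {"なまびーる"},
}

# Hypothetical model outputs: one in katakana, one wrong, one correct.
PREDICTIONS = {
    "生きる": "イキル",
    "学生": "がくしょう",
    "生ビール": "なまびーる",
}

accuracy = reading_accuracy(PREDICTIONS, GOLD)
```

Normalizing to hiragana before comparing keeps the score from penalizing a correct reading that happens to be written in katakana. Accepting a set of readings per word is one way to handle words with more than one valid pronunciation.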

This connects directly to the multilingual competence gap exposed in MSQA (released the same day), which showed that fluency in a language doesn't guarantee the ability to reason about its language-specific structure. Both papers suggest that scaling multilingual training data alone, even with resources like the 4.8-trillion-token MultiSynt/MT corpus, doesn't automatically solve character-level or morphological reasoning. The kanji problem is narrower but more fundamental: it's not about cultural knowledge or inference-time technique, but about whether models have learned the underlying linguistic machinery of non-Latin scripts.

If the GPT and Claude models tested in YOMI-Bench show measurable improvement on the same benchmark within the next six months without explicit Japanese phonology training, that would signal the gap is closing through scale alone. If they don't, watch whether any vendor releases a Japanese-specific model variant that uses architectural changes (not just more data) to handle kanji reasoning. Such a release would support the view that the problem requires deliberate design, not just pretraining volume.

Coverage we drew on

MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark · arXiv cs.CL


Mentions: YOMI-Bench · Japanese LLMs · Multilingual LLMs · GPT · Claude

Read the full story at arXiv cs.CL (arxiv.org).
