Research Models & Releases · arXiv cs.CL · 3d ago

Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

A systematic benchmark of seven foundation models on Ukrainian legal text reveals efficiency gaps large enough to change deployment economics. Tokenizer fertility varies by 1.6x across providers: Qwen3 consumes 60% more tokens than Llama-family models on identical input. More striking, NVIDIA Nemotron Super 3 (120B) outperforms Mistral Large 3 despite having 5.6x fewer total parameters and 3.4x fewer active parameters per token, while costing one-third as much via API. Few-shot prompting degrades performance by up to 26%, which challenges the common assumption that in-context examples reliably help. For practitioners, the work quantifies the hidden cost of tokenizer inefficiency and suggests that parameter count alone is a poor proxy for real-world value.
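The summary does not spell out how fertility was measured. A common definition is the average number of tokens a tokenizer emits per whitespace-delimited word, and the sketch below assumes that definition. The `char_chunk_tokenizer` helper is a hypothetical stand-in, not any provider's tokenizer; in practice you would pass in each model's real tokenize function.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word over a corpus."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def char_chunk_tokenizer(chunk: int) -> Callable[[str], List[str]]:
    """Hypothetical toy tokenizer: cuts every word into fixed-size character chunks.

    Smaller chunks mimic a vocabulary that covers the language poorly.
    """
    def tokenize(text: str) -> List[str]:
        return [
            word[i:i + chunk]
            for word in text.split()
            for i in range(0, len(word), chunk)
        ]
    return tokenize


# Illustrative sentences only; not taken from the paper's corpus.
corpus = [
    "Верховний Суд постановив",
    "касаційну скаргу залишити без задоволення",
]

coarse = fertility(corpus, char_chunk_tokenizer(6))  # better vocabulary coverage
fine = fertility(corpus, char_chunk_tokenizer(3))    # worse vocabulary coverage

print(f"coarse tokenizer fertility: {coarse:.2f}")
print(f"fine tokenizer fertility:   {fine:.2f}")
print(f"fertility ratio:            {fine / coarse:.2f}x")
```

Running two tokenizers over the same corpus and dividing their fertilities gives the kind of ratio the summary reports. The toy numbers printed here are illustrative and say nothing about the models in the study.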

Modelwire context

Analyst take

The Ukrainian legal corpus is doing real work here beyond localization testing. Legal text is dense, formulaic, and terminology-heavy in ways that stress tokenizer vocabulary design, so fertility gaps that look modest on English benchmarks get amplified. This makes the 1.6x fertility spread a more credible signal than a general-domain comparison would produce.
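To make the cost implication concrete, the arithmetic below assumes token-based API billing. The document length, baseline tokens per word, and price are hypothetical placeholders, not figures from the paper; only the 1.6x fertility ratio comes from the summary above.

```python
def document_cost_usd(words: int, tokens_per_word: float,
                      usd_per_million_tokens: float) -> float:
    """Input cost of one document under simple per-token billing."""
    tokens = words * tokens_per_word
    return tokens * usd_per_million_tokens / 1_000_000


# Hypothetical placeholders, chosen only to make the arithmetic visible.
WORDS_PER_DOCUMENT = 10_000
BASELINE_TOKENS_PER_WORD = 2.0
PRICE_USD_PER_MILLION = 1.00

# Reported in the summary: a 1.6x fertility spread across providers.
FERTILITY_RATIO = 1.6

baseline_cost = document_cost_usd(
    WORDS_PER_DOCUMENT, BASELINE_TOKENS_PER_WORD, PRICE_USD_PER_MILLION
)
heavier_cost = document_cost_usd(
    WORDS_PER_DOCUMENT, BASELINE_TOKENS_PER_WORD * FERTILITY_RATIO, PRICE_USD_PER_MILLION
)

# At equal per-token prices, the cost ratio equals the fertility ratio.
cost_ratio = heavier_cost / baseline_cost

# Per-token price the higher-fertility model would need to match the baseline bill.
break_even_price = PRICE_USD_PER_MILLION / FERTILITY_RATIO

print(f"cost ratio at equal prices: {cost_ratio:.2f}x")
print(f"break-even price: ${break_even_price:.3f} per million tokens")
```

The point of the sketch is that a list price per token is not comparable across providers until it is scaled by fertility on the target language and domain.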

The core argument, that parameter count is a poor proxy for value, runs directly parallel to what we covered in 'Small, Private Language Models as Teammates for Educational Assessment Design,' where smaller locally-deployed models matched larger ones on pedagogical tasks. Both papers are building the same evidentiary case from different angles: deployment context and efficiency metrics matter more than headline scale. The few-shot degradation finding also complicates the picture from 'Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance,' which showed strong results from minimal demonstrations. That work focused on training-time use of few-shot examples, not inference-time prompting, so the tension is real but not a direct contradiction. Still, practitioners using few-shot prompting as a default should treat the 26% degradation figure as a flag worth investigating in their own pipelines.

If Qwen releases a tokenizer update that closes the fertility gap with Llama-family models within the next two release cycles, watch whether the zero-shot performance advantage Llama currently holds narrows proportionally. A proportional narrowing would point to tokenizer design, rather than model architecture, as the primary driver of the efficiency differential, though a single release would not settle the question.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Qwen3 · Llama · NVIDIA Nemotron Super 3 · Mistral Large 3 · Ukraine EDRSR

Read full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

