Research Models & Releases·arXiv cs.CL·1d ago

NAVER LABS Europe Submission to the Instruction-following 2026 Short Track

NAVER LABS Europe introduces SpeechMapper, a method for training speech-to-LLM projectors using only ASR data, alongside a synthetic scientific-speech dataset called fakACL. The work extends the lab's prior IWSLT championship system to jointly handle automatic speech recognition, speech translation, and speech question-answering from English into Chinese, Italian, and German. It is incremental but meaningful progress in constrained multilingual speech-LLM integration, and it shows how industrial labs are optimizing embedding alignment without paired multimodal supervision.

Modelwire context

Explainer

The key innovation isn't the speech translation task itself, but the constraint: NAVER LABS Europe trained their projector using only ASR data, sidestepping the need for expensive paired speech-to-embedding annotations. This matters because paired multimodal supervision remains a bottleneck in scaling speech-LLM systems across languages.
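To make the idea of an ASR-only projector more concrete, here is a minimal toy sketch. It assumes that each transcript supplies a target vector on the LLM side and that the projector is fit to map speech features onto that target. The linear map, the squared-error objective, the synthetic data, and every function name are illustrative assumptions, not details taken from the paper.

```python
import random

def make_toy_pairs(n, in_dim, out_dim, seed=0):
    """Build hypothetical (speech_feature, target_embedding) pairs.

    The targets stand in for LLM-side embeddings of ASR transcripts.
    They come from a hidden linear map so the toy problem is solvable.
    """
    rng = random.Random(seed)
    true_map = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    pairs = []
    for _ in range(n):
        x = [rng.uniform(-1, 1) for _ in range(in_dim)]
        y = [sum(w * xi for w, xi in zip(row, x)) for row in true_map]
        pairs.append((x, y))
    return pairs

def init_projector(in_dim, out_dim, seed=1):
    """Small random weights for a linear projector (out_dim x in_dim)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(W, x):
    """Map one speech feature vector into the target embedding space."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(W, pairs):
    """Mean squared error between projected features and targets."""
    total, count = 0.0, 0
    for x, y in pairs:
        total += sum((p - t) ** 2 for p, t in zip(project(W, x), y))
        count += len(y)
    return total / count

def train_projector(W, pairs, lr=0.05, epochs=300):
    """Per-sample gradient descent on 0.5 * squared error; returns a new matrix."""
    W = [row[:] for row in W]  # leave the caller's weights untouched
    for _ in range(epochs):
        for x, y in pairs:
            pred = project(W, x)
            for i, row in enumerate(W):
                err = pred[i] - y[i]
                for j in range(len(row)):
                    row[j] -= lr * err * x[j]
    return W

if __name__ == "__main__":
    pairs = make_toy_pairs(n=20, in_dim=3, out_dim=2)
    W0 = init_projector(in_dim=3, out_dim=2)
    W1 = train_projector(W0, pairs)
    print("error before:", mse(W0, pairs))
    print("error after: ", mse(W1, pairs))
```

The point of the sketch is only that the supervision signal comes from transcripts, so no separately annotated speech-to-embedding pairs are required. A real system would use a learned speech encoder, a nonlinear projector, and the LLM's own embedding table.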

This work sits at the intersection of two recent trends in Modelwire coverage. The MultiSynt/MT paper from yesterday showed that synthetic data can compress the efficiency gap in multilingual model development by 28 percent. NAVER LABS Europe is applying a similar principle to the speech domain: using synthetic scientific speech (fakACL) and ASR-only training to avoid collecting paired multimodal data at scale. Meanwhile, the phonology-informed TTS evaluation from today exposes how neural systems fail on linguistic fidelity in low-resource languages. NAVER LABS Europe's approach sidesteps this by anchoring to ASR outputs rather than raw speech, trading some acoustic fidelity for reproducibility and scalability across their target languages.

If NAVER LABS Europe's system matches systems trained on paired speech-embedding data on the IWSLT 2026 test set, that would validate ASR-only projection as a viable path for industrial multilingual speech-LLM deployment. If it instead loses more than roughly 3 to 5 BLEU points on the speech translation task, the constraint is too costly and paired supervision remains necessary.
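The threshold above can be written as a small decision rule. This is a sketch under stated assumptions: the function name, the labels, and the choice to treat the 3 to 5 point band as "borderline" are ours, and the scores in the usage lines are hypothetical, not reported results.

```python
def classify_bleu_gap(paired_bleu, asr_only_bleu, low=3.0, high=5.0):
    """Classify the BLEU drop of an ASR-only system against a paired-data baseline.

    low and high bound the "3 to 5 BLEU points" band from the text.
    Treating that band as borderline is an assumption of this sketch.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    drop = paired_bleu - asr_only_bleu  # positive means the ASR-only system is worse
    if drop < low:
        return "within tolerance"
    if drop <= high:
        return "borderline"
    return "too costly"

if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(classify_bleu_gap(30.0, 29.0))
    print(classify_bleu_gap(30.0, 26.0))
    print(classify_bleu_gap(30.0, 24.0))
```

Keeping the band explicit as two parameters avoids committing to a single cutoff that the text itself does not fix.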

Coverage we drew on

MultiSynt/MT: Trillion-Token Multi-Parallel Pre-Training Data Translated Across 36 Languages · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: NAVER LABS Europe · IWSLT 2026 · SpeechMapper · fakACL · SeamlessM4

