Modelwire
DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

Researchers propose DOA, a training-free streaming policy that leverages decoder-only self-attention to guide simultaneous speech-to-text translation in SpeechLLMs. Unlike traditional encoder-decoder models that rely on explicit cross-attention alignment, this approach tests whether self-attention alone can provide stable signals for deciding when to read incoming audio versus emit translations. The work addresses a structural mismatch between how modern speech LLMs operate and the demands of real-time translation, with validation on long-form content where prior methods falter. This matters because it could unlock streaming translation capabilities in the growing class of decoder-only speech models without expensive retraining.

Modelwire context

Explainer

The core claim is that decoder-only models may not need retraining to handle streaming translation. The policy reads self-attention patterns during inference and uses them to decide when to consume more audio and when to generate output, with no architectural modification and no supervised fine-tuning.
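To make the read-versus-write decision concrete, here is a minimal sketch of an attention-threshold rule. It is not the paper's algorithm, which the summary above does not spell out. It assumes a rule in the spirit of earlier attention-based streaming policies: hold a candidate token back if too much of its attention falls on the most recent audio frames, since that audio may still be incomplete. The function names, `recent_frames`, `threshold`, and the idea that attention weights arrive already averaged over heads and layers are all illustrative assumptions.

```python
from typing import Sequence


def should_emit(
    attn_over_audio: Sequence[float],
    recent_frames: int = 2,
    threshold: float = 0.3,
) -> bool:
    """Return True (WRITE) if a candidate token leans on older audio.

    attn_over_audio: hypothetical attention weights from one candidate
    output token to each audio frame received so far, assumed to be
    averaged over heads and layers already.
    """
    total = sum(attn_over_audio)
    # Too little audio or no attention mass: keep reading.
    if total <= 0 or len(attn_over_audio) <= recent_frames:
        return False
    recent_mass = sum(attn_over_audio[-recent_frames:]) / total
    # Heavy attention on the newest frames suggests the token depends on
    # audio that may still be unfolding, so wait (READ) instead.
    return recent_mass < threshold


def count_emittable(
    candidate_attn: Sequence[Sequence[float]],
    recent_frames: int = 2,
    threshold: float = 0.3,
) -> int:
    """Count how many leading candidate tokens can be emitted now.

    Stops at the first token that fails the check, so the emitted
    output is always a prefix of the candidate translation.
    """
    emitted = 0
    for row in candidate_attn:
        if not should_emit(row, recent_frames, threshold):
            break
        emitted += 1
    return emitted


if __name__ == "__main__":
    candidates = [
        [0.5, 0.3, 0.1, 0.1],  # mostly older audio: safe to emit
        [0.1, 0.1, 0.3, 0.5],  # mostly newest audio: wait
        [0.5, 0.3, 0.1, 0.1],  # never reached, an earlier token blocked
    ]
    print(count_emittable(candidates))
```

In a real system the weights would come from the model's forward pass, and both parameters would trade latency against quality: a lower `threshold` or a larger `recent_frames` makes the policy wait longer before committing to output.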

This connects to the broader shift toward decoder-only architectures we saw in the DRIFT paper from late May, which tackled efficiency in multi-turn LLM interactions through decoupled rollouts. DOA extends that efficiency logic to speech: rather than redesigning the model, it extracts existing signals from the forward pass. The work also complements the Translation Analytics benchmark from the same week, which helped practitioners evaluate local LLM translation quality. Where that work focused on offline evaluation under privacy constraints, DOA enables real-time streaming in models already deployed, addressing a different but adjacent deployment bottleneck.

Two outcomes are worth watching. If DOA matches supervised baselines on latency and quality on the MuST-C or CoVoST 2 benchmarks when tested on languages outside the training distribution (e.g., low-resource pairs), that would suggest the self-attention signal generalizes. If latency degrades significantly on utterances longer than 30 seconds, that would point to a horizon problem that retraining might solve.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: SpeechLLMs · DOA (Decoder-Only Attention) · simultaneous speech-to-text translation

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
