Research Models & Releases·arXiv cs.CL·4d ago

DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs

Researchers propose DOA, a training-free streaming policy that leverages decoder-only self-attention to guide simultaneous speech-to-text translation in SpeechLLMs. Unlike traditional encoder-decoder models that rely on explicit cross-attention alignment, this approach tests whether self-attention alone can provide stable signals for deciding when to read incoming audio versus emit translations. The work addresses a structural mismatch between how modern speech LLMs operate and the demands of real-time translation, with validation on long-form content where prior methods falter. This matters because it could unlock streaming translation capabilities in the growing class of decoder-only speech models without expensive retraining.

Modelwire context

Explainer

The central claim is that decoder-only SpeechLLMs may not need retraining to handle streaming translation. By reading self-attention patterns during inference, the policy decides when to consume more audio and when to generate output, with no architectural modification or supervised fine-tuning.
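
To make the read/write decision concrete, here is a minimal sketch of one possible attention-based rule. It is not the paper's actual policy, which the summary above does not specify. The function name `should_emit`, the `recent_frames` and `threshold` parameters, and the idea of measuring attention mass on the newest audio positions are all illustrative assumptions, in the spirit of attention-based policies used with encoder-decoder models.

```python
from typing import Sequence


def should_emit(
    attn_to_audio: Sequence[float],
    recent_frames: int = 2,   # hypothetical: how many newest audio positions count as "recent"
    threshold: float = 0.5,   # hypothetical: max share of attention allowed on recent positions
) -> bool:
    """Return True to WRITE (emit the candidate token), False to READ more audio.

    attn_to_audio holds assumed self-attention weights from the candidate
    output token to each audio position received so far (oldest first).
    """
    total = sum(attn_to_audio)
    if total <= 0.0:
        # No usable attention signal yet: wait for more audio.
        return False
    # Share of attention that falls on the most recently received audio.
    recent_mass = sum(attn_to_audio[-recent_frames:]) / total
    # If the token leans mostly on the newest audio, the relevant speech may
    # still be arriving, so keep reading; otherwise it is safer to emit.
    return recent_mass < threshold


# Attention concentrated on older audio: emit the token.
print(should_emit([0.6, 0.3, 0.05, 0.05]))
# Attention concentrated on the newest audio: read more first.
print(should_emit([0.05, 0.05, 0.3, 0.6]))
```

A rule like this needs no training because it only inspects weights the model already computes in its forward pass. Which layers and heads to read, and how to set the threshold, are exactly the kind of choices a real policy would have to validate.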

This connects to the broader shift toward decoder-only architectures we saw in the DRIFT paper from late May, which tackled efficiency in multi-turn LLM interactions through decoupled rollouts. DOA extends that efficiency logic to speech: rather than redesigning the model, it extracts existing signals from the forward pass. The work also complements the Translation Analytics benchmark from the same week, which helped practitioners evaluate local LLM translation quality. Where that work focused on offline evaluation under privacy constraints, DOA enables real-time streaming in models already deployed, addressing a different but adjacent deployment bottleneck.

If DOA matches supervised baselines in latency and quality on benchmarks such as MuST-C or CoVoST 2 when tested on languages outside the training distribution (e.g., low-resource pairs), that would suggest the self-attention signal generalizes. If latency degrades significantly on utterances longer than 30 seconds, that would point to a horizon problem that retraining might solve.

Coverage we drew on

DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: SpeechLLMs · DOA (Decoder-Only Attention) · simultaneous speech-to-text translation

Read full story at arXiv cs.CL →(arxiv.org)
