Research Models & Releases·arXiv cs.CL·5d ago

OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

Researchers propose Online Scaled DeltaNet (OSDN), a refinement to linear attention mechanisms that addresses a core limitation in state-space models: in-context associative recall. By introducing per-feature adaptive preconditioning via hypergradient feedback, OSDN improves upon the Delta Rule's fixed scalar gating without sacrificing the hardware efficiency that makes linear attention attractive relative to softmax attention. The key insight is that diagonal preconditioning maps cleanly to per-feature key scaling, preserving the chunkwise parallel pipeline critical for practical deployment. This work matters because linear attention remains a serious contender for replacing softmax in long-context and memory-constrained settings. Closing the recall gap while maintaining computational efficiency directly affects whether these models become production-viable.

Modelwire context

Explainer

OSDN's contribution isn't just better recall in linear attention; it's that diagonal preconditioning maps directly to per-feature key scaling without breaking the chunkwise parallel pipeline. That preservation of hardware efficiency is the actual novelty, not the recall improvement alone.
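To make the "diagonal preconditioning equals per-feature key scaling" point concrete, here is a minimal sketch in plain Python. It assumes the standard delta-rule state update S <- S + beta * (v - S k) k^T and a fixed diagonal preconditioner; it does not implement OSDN's hypergradient feedback, and the names delta_step, preconditioned_step, precond, and beta are illustrative, not taken from the paper.

```python
# Sketch only: a delta-rule update with a diagonal preconditioner, written
# two ways. All names here are hypothetical, not identifiers from OSDN.

def matvec(S, k):
    """Return S @ k for a list-of-rows matrix S and a vector k."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]

def delta_step(S, k, v, beta, write_key=None):
    """One delta-rule update: S <- S + beta * (v - S k) w^T.

    w defaults to k (plain delta rule). Passing a different write_key
    expresses the preconditioned variant as a rescaled key.
    """
    w = k if write_key is None else write_key
    err = [v_i - p_i for v_i, p_i in zip(v, matvec(S, k))]
    return [[s_ij + beta * e_i * w_j for s_ij, w_j in zip(row, w)]
            for row, e_i in zip(S, err)]

def preconditioned_step(S, k, v, beta, precond):
    """Same update with an explicit matrix P = diag(precond):
    S <- S + beta * (v - S k) k^T P."""
    d = len(k)
    P = [[precond[i] if i == j else 0.0 for j in range(d)] for i in range(d)]
    err = [v_i - p_i for v_i, p_i in zip(v, matvec(S, k))]
    kP = [sum(k[i] * P[i][j] for i in range(d)) for j in range(d)]  # k^T P
    return [[s_ij + beta * e_i * kp_j for s_ij, kp_j in zip(row, kP)]
            for row, e_i in zip(S, err)]

S0 = [[0.2, 0.0, -0.1], [0.0, 0.3, 0.1]]  # state: value_dim x key_dim
k = [1.0, -2.0, 0.5]
v = [3.0, -1.0]
precond = [0.5, 1.0, 2.0]                 # assumed per-feature scales
beta = 0.1

# Scaling each key feature gives the same state as multiplying by diag(precond).
scaled_key = [p * x for p, x in zip(precond, k)]
via_matrix = preconditioned_step(S0, k, v, beta, precond)
via_scaling = delta_step(S0, k, v, beta, write_key=scaled_key)
plain = delta_step(S0, k, v, beta)        # unpreconditioned baseline
```

Because the preconditioner only rescales the key used in the write, the update keeps the same rank-one form as the plain delta rule, which is consistent with the claim that the chunkwise parallel pipeline survives unchanged.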

This connects directly to the broader pattern in recent research around in-context learning instability. The 'Many-Shot CoT-ICL' paper from May 13 showed that scaling demonstrations has unpredictable effects across model types, and the 'Locale-Conditioned Few-Shot Prompting' work revealed that prompting strategy can outweigh hardware gains. OSDN addresses a related problem at the architectural level: linear attention models struggle with associative recall during long-context inference, which undermines their practical appeal despite efficiency advantages. By closing that gap while keeping the pipeline intact, OSDN removes a structural blocker that has kept linear attention from competing with softmax in production settings where both recall quality and inference speed matter.

If OSDN-based models match or exceed softmax-attention baselines on the RULER benchmark (which tests associative recall at 4K+ context lengths) while maintaining sub-linear memory scaling, the architecture becomes a credible production alternative. If the gains disappear on retrieval-heavy tasks like those in the PersonalAI 2.0 paper, the improvement is narrower than claimed.

Coverage we drew on

Many-Shot CoT-ICL: Making In-Context Learning Truly Learn · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: OSDN · DeltaNet · Delta Rule · Linear Attention · State-Space Models

