AdaSplash-2: Faster Differentiable Sparse Attention

AdaSplash-2 accelerates differentiable sparse (α-entmax) attention for transformers. A histogram-based initialization lets it compute the normalizer in 1–2 iterations rather than the many required by a cold-started iterative solver, reducing computational overhead while preserving input-dependent sparsity for long-context training.
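To make the idea concrete, here is a minimal NumPy sketch of a histogram-initialized threshold search for 1.5-entmax, where the "normalizer" is the threshold τ chosen so probabilities sum to one. Everything here is illustrative: the function name `entmax15_threshold`, the bin count, and the Newton refinement are our assumptions (the paper's actual kernel and root-finder may differ), not the authors' implementation.

```python
import numpy as np

def entmax15_threshold(z, bins=64, refine_iters=2):
    """Illustrative sketch only; not the paper's kernel or API.

    For alpha = 1.5, entmax yields p_i = max(z_i/2 - tau, 0)**2 with
    tau chosen so sum(p) = 1.  tau always lies in
    [max(z/2) - 1, max(z/2)]: at the lower end, the top score alone
    already contributes mass 1.
    """
    s = z / 2.0
    lo = s.max() - 1.0

    # One O(n) pass: bucket scores inside the bracket and keep per-bin
    # counts plus sums of s and s**2, so the sum-to-one constraint can
    # be evaluated at every bin edge in O(bins) total.
    m = s >= lo
    idx = np.minimum(((s[m] - lo) * bins).astype(int), bins - 1)
    cnt = np.bincount(idx, minlength=bins).astype(float)
    s1 = np.bincount(idx, weights=s[m], minlength=bins)
    s2 = np.bincount(idx, weights=s[m] * s[m], minlength=bins)

    # Suffix sums give, for each edge e, the moments of scores >= e,
    # hence f(e) = sum_{s_i >= e} (s_i - e)^2 - 1 in closed form.
    N = np.cumsum(cnt[::-1])[::-1]
    S1 = np.cumsum(s1[::-1])[::-1]
    S2 = np.cumsum(s2[::-1])[::-1]
    edges = lo + np.arange(bins) / bins
    f = S2 - 2.0 * edges * S1 + edges ** 2 * N - 1.0

    # f decreases in tau; start from the last edge where f >= 0.
    b = int(np.searchsorted(-f, 0.0))
    tau = edges[max(b - 1, 0)]

    # The histogram initializer is already bin-width accurate, so one
    # or two refinement steps suffice (Newton here for simplicity).
    for _ in range(refine_iters):
        p = np.maximum(s - tau, 0.0)
        grad = -2.0 * p.sum()        # d/dtau of (sum(p**2) - 1)
        if grad == 0.0:
            break
        tau -= (np.sum(p * p) - 1.0) / grad
    return tau

scores = np.random.randn(512)
tau = entmax15_threshold(scores)
probs = np.maximum(scores / 2.0 - tau, 0.0) ** 2
print(probs.sum(), (probs == 0).mean())  # ~1.0, and mostly exact zeros
```

The point of the histogram pass is that it brackets τ to within one bin width in a single sweep over the scores, so the refinement loop starts close enough to converge in the 1–2 iterations the summary describes.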
Mentions: AdaSplash-2 · α-entmax attention · transformers
Read full story at arXiv cs.CL → (arxiv.org)
Modelwire summarizes — we don’t republish. The full article lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.