Modelwire

AdaSplash-2: Faster Differentiable Sparse Attention

AdaSplash-2 accelerates differentiable sparse attention for transformers. A histogram-based initialization computes the α-entmax normalizer in one to two iterations instead of many, reducing computational overhead while maintaining input-dependent sparsity for long-context training.
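For intuition, here is a minimal NumPy sketch of the general idea for the sparsemax (α = 2) special case of α-entmax: a single histogram pass over the scores brackets the normalizer threshold, and one or two bisection steps refine it. The function names, bin count, and the bisection refinement are illustrative assumptions, not the paper's kernels or API; for general α the residual involves fractional powers, which this sketch does not cover.

```python
# Minimal sketch, assuming the sparsemax (alpha = 2) case of alpha-entmax:
# find tau with sum(max(z - tau, 0)) = 1 using a histogram bracket plus
# one or two bisection steps. Names and parameters are illustrative.
import numpy as np

def hist_init_threshold(z, n_bins=32, n_refine=2):
    """Approximate tau such that sum(max(z - tau, 0)) == 1."""
    z = np.asarray(z, dtype=np.float64)
    hi = z.max()
    lo = hi - 1.0                          # tau always lies in [max(z) - 1, max(z)]
    edges = np.linspace(lo, hi, n_bins + 1)
    counts, _ = np.histogram(z, bins=edges)           # scores per bin
    sums, _ = np.histogram(z, bins=edges, weights=z)  # sum of scores per bin
    # Count and sum of all scores above each bin edge (suffix sums).
    c_above = np.concatenate([np.cumsum(counts[::-1])[::-1], [0]])
    s_above = np.concatenate([np.cumsum(sums[::-1])[::-1], [0.0]])
    # Normalizer residual f(tau) = sum(max(z - tau, 0)) - 1, exact at each edge.
    f = s_above - c_above * edges - 1.0
    k = np.nonzero(f >= 0)[0][-1]          # f is decreasing; root lies in bin k
    lo_b, hi_b = edges[k], edges[k + 1]
    for _ in range(n_refine):              # one or two cheap refinement steps
        mid = 0.5 * (lo_b + hi_b)
        if np.clip(z - mid, 0.0, None).sum() >= 1.0:
            lo_b = mid
        else:
            hi_b = mid
    return 0.5 * (lo_b + hi_b)

def sparsemax(z):
    tau = hist_init_threshold(z)
    p = np.clip(z - tau, 0.0, None)
    return p / p.sum()                     # renormalize; tau is approximate

scores = np.random.randn(256)
p = sparsemax(scores)
print(p.sum(), int((p > 0).sum()))         # sums to 1 with only a few nonzeros
```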

Mentions: AdaSplash-2 · α-entmax attention · transformers

Modelwire summarizes — we don’t republish. The full article lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.

Related

Benchmarking Optimizers for MLPs in Tabular Deep Learning

arXiv cs.LG

Stability and Generalization in Looped Transformers

arXiv cs.LG

Gemini 3.1 Flash TTS: the next generation of expressive AI speech
