AdaSplash-2: Faster Differentiable Sparse Attention

AdaSplash-2 accelerates differentiable sparse (α-entmax) attention for transformers. A histogram-based initialization lets it compute the normalizer in 1–2 iterations rather than the many required by a cold-started iterative solver, reducing computational overhead while preserving input-dependent sparsity for long-context training.
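To make the idea concrete, here is a minimal NumPy sketch of a histogram-initialized threshold search for 1.5-entmax, where the "normalizer" is the threshold τ chosen so probabilities sum to one. Everything here is illustrative: the function name `entmax15_threshold`, the bin count, and the Newton refinement are our assumptions (the paper's actual kernel and root-finder may differ), not the authors' implementation.

```python
import numpy as np

def entmax15_threshold(z, bins=64, refine_iters=2):
    """Illustrative sketch only; not the paper's kernel or API.

    For alpha = 1.5, entmax yields p_i = max(z_i/2 - tau, 0)**2 with
    tau chosen so sum(p) = 1.  tau always lies in
    [max(z/2) - 1, max(z/2)]: at the lower end, the top score alone
    already contributes mass 1.
    """
    s = z / 2.0
    lo = s.max() - 1.0

    # One O(n) pass: bucket scores inside the bracket and keep per-bin
    # counts plus sums of s and s**2, so the sum-to-one constraint can
    # be evaluated at every bin edge in O(bins) total.
    m = s >= lo
    idx = np.minimum(((s[m] - lo) * bins).astype(int), bins - 1)
    cnt = np.bincount(idx, minlength=bins).astype(float)
    s1 = np.bincount(idx, weights=s[m], minlength=bins)
    s2 = np.bincount(idx, weights=s[m] * s[m], minlength=bins)

    # Suffix sums give, for each edge e, the moments of scores >= e,
    # hence f(e) = sum_{s_i >= e} (s_i - e)^2 - 1 in closed form.
    N = np.cumsum(cnt[::-1])[::-1]
    S1 = np.cumsum(s1[::-1])[::-1]
    S2 = np.cumsum(s2[::-1])[::-1]
    edges = lo + np.arange(bins) / bins
    f = S2 - 2.0 * edges * S1 + edges ** 2 * N - 1.0

    # f decreases in tau; start from the last edge where f >= 0.
    b = int(np.searchsorted(-f, 0.0))
    tau = edges[max(b - 1, 0)]

    # The histogram initializer is already bin-width accurate, so one
    # or two refinement steps suffice (Newton here for simplicity).
    for _ in range(refine_iters):
        p = np.maximum(s - tau, 0.0)
        grad = -2.0 * p.sum()        # d/dtau of (sum(p**2) - 1)
        if grad == 0.0:
            break
        tau -= (np.sum(p * p) - 1.0) / grad
    return tau

scores = np.random.randn(512)
tau = entmax15_threshold(scores)
probs = np.maximum(scores / 2.0 - tau, 0.0) ** 2
print(probs.sum(), (probs == 0).mean())  # ~1.0, and mostly exact zeros
```

The point of the histogram pass is that it brackets τ to within one bin width in a single sweep over the scores, so the refinement loop starts close enough to converge in the 1–2 iterations the summary describes.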
Mentions: AdaSplash-2 · α-entmax attention · transformers
Read full story at arXiv cs.CL → (arxiv.org)
Modelwire summarizes — we don’t republish. The full article lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.