You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Researchers propose cross-layer sparse attention (CLSA), a technique that accelerates long-context LLM inference by sharing routing indices across decoder layers alongside KV caches. The approach targets a persistent bottleneck in reasoning-heavy workloads: existing sparse attention methods either sacrifice quality for speed (block sparse) or remain computationally expensive at scale (token sparse). By amortizing the cost of top-k routing across multiple layers, CLSA aims to deliver practical speedups without accuracy loss, addressing the efficiency ceiling that constrains deployment of extended reasoning in production systems.

Modelwire context

Explainer

The core insight isn't just about skipping attention computation: it's that routing decisions, specifically which tokens matter, are stable enough across consecutive layers that you can compute them once and reuse them, treating the routing index as a shared resource rather than a per-layer cost. That assumption about inter-layer routing stability is the bet the whole paper rides on.
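
To make the reuse idea concrete, here is a minimal NumPy sketch of one decode step in which the top-k routing index is computed once per group of layers and then reused. It is an illustration under stated assumptions, not the paper's implementation: the dot-product scoring rule, the choice to route from the first layer of each group, the `group_size` parameter, the single KV cache shared by all layers, and every function name are hypothetical.

```python
import numpy as np

def topk_indices(q, keys, k):
    """Indices of the k keys with the highest dot-product score for query q.
    Assumption: plain dot-product scoring; the paper's router may differ."""
    scores = keys @ q                          # shape: (n_tokens,)
    return np.argpartition(scores, -k)[-k:]    # unordered top-k positions

def sparse_attention(q, keys, values, idx):
    """Softmax attention restricted to the token positions in idx."""
    s = keys[idx] @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())                    # numerically stable softmax
    w /= w.sum()
    return w @ values[idx]

def decode_step_shared_routing(queries, keys, values, k, group_size):
    """One decode step over all layers with a shared KV cache.

    queries: (n_layers, d), one query vector per layer.
    keys, values: (n_tokens, d), a single cache shared by every layer
    (assumption made for this sketch).
    Routing is recomputed only at the first layer of each group
    (hypothetical policy); other layers reuse the stored index.
    """
    outputs, routing_calls, idx = [], 0, None
    for layer in range(len(queries)):
        if layer % group_size == 0:
            idx = topk_indices(queries[layer], keys, k)
            routing_calls += 1
        outputs.append(sparse_attention(queries[layer], keys, values, idx))
    return np.stack(outputs), routing_calls

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 16))
keys = rng.normal(size=(128, 16))
values = rng.normal(size=(128, 16))
out, calls = decode_step_shared_routing(queries, keys, values, k=8, group_size=4)
print(out.shape, calls)
```

With 8 layers and `group_size=4`, the routing step runs twice instead of eight times, which is the amortization being described. Whether the reused index still selects the right tokens for the later layers of each group is exactly the stability assumption at stake.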

This connects directly to the compression and efficiency thread running through recent coverage. The 'From Layers to Submodules' piece from June 1st argued that redundancy in LLMs clusters unevenly across architectural components, which is essentially the same intuition CLSA exploits at the attention routing level. Both papers are pushing toward component-aware efficiency rather than blunt whole-layer interventions. The SimSD work on speculative decoding for diffusion models is also adjacent: different architecture, same underlying pressure to close the gap between theoretical inference speed and what actually ships in production.

The critical test is whether routing-index reuse holds up on tasks where attention patterns shift sharply between layers, such as multi-hop reasoning benchmarks. If CLSA shows quality degradation specifically on those tasks while maintaining speed gains on simpler retrieval tasks, that would indicate the stability assumption has a boundary condition that practitioners need to plan around.
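
One way a practitioner might probe that boundary is to measure how much the top-k token sets of consecutive layers overlap. The sketch below is a hypothetical diagnostic, not a metric taken from the paper; it assumes you can extract per-layer routing scores over the cached tokens, and the function names are illustrative.

```python
import numpy as np

def topk_set(scores, k):
    """Set of token positions with the k highest scores."""
    return set(np.argpartition(scores, -k)[-k:].tolist())

def routing_overlap(scores_a, scores_b, k):
    """Fraction of layer A's top-k tokens that also appear in layer B's top-k."""
    return len(topk_set(scores_a, k) & topk_set(scores_b, k)) / k

def consecutive_overlaps(layer_scores, k):
    """Overlap between each layer and the next; layer_scores is (n_layers, n_tokens)."""
    return [
        routing_overlap(layer_scores[i], layer_scores[i + 1], k)
        for i in range(len(layer_scores) - 1)
    ]
```

Overlap values near 1.0 across a group of layers would be consistent with safe index reuse, while a sharp drop at a particular layer would flag a place where a shared index is likely to miss tokens that layer needs.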

Coverage we drew on

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: YOCO · KV-sharing architectures · cross-layer sparse attention

Read the full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial


