Depth-Attention: Cross-Layer Value Mixing for Language Models

Depth-Attention proposes a structural change to how Transformers reuse information across layers. Current models add each layer's output to a residual stream with no selective cross-layer retrieval, so later layers must work with whatever blend earlier layers contributed. The paper builds cross-layer value selection directly into the attention mechanism: queries at each layer attend to key-value pairs from prior layers at the same token position and mix them. The approach avoids the inference-time memory overhead of existing cross-layer methods, a constraint that tightens as production LLMs adopt aggressive cache compression via grouped-query and multi-head latent attention. For practitioners optimizing inference cost and model depth, this points to richer layer interaction without sacrificing throughput.
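To make the idea of mixing across depth concrete, here is a minimal sketch of attention over layers at a single token position. It is an illustration only, not the paper's formulation: the function name `depth_mix`, the scaled dot-product scoring, and the plain softmax over layers are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_mix(query, layer_keys, layer_values):
    """Mix values from several layers at ONE token position (hypothetical helper).

    query:        vector for the current layer and token
    layer_keys:   one key vector per layer, all for the same token
    layer_values: one value vector per layer, all for the same token
    Returns (weights over layers, mixed value vector).
    """
    d = len(query)
    # Assumed scoring rule: scaled dot product between the query and each layer's key.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in layer_keys
    ]
    weights = softmax(scores)
    # Weighted sum of the per-layer values, one output dimension at a time.
    mixed = [
        sum(w * value[i] for w, value in zip(weights, layer_values))
        for i in range(len(layer_values[0]))
    ]
    return weights, mixed


# Two prior layers; the query aligns with the first layer's key.
weights, mixed = depth_mix(
    query=[1.0, 0.0],
    layer_keys=[[4.0, 0.0], [0.0, 4.0]],
    layer_values=[[1.0, 0.0], [0.0, 1.0]],
)
print(weights, mixed)
```

In this sketch the softmax runs over layers rather than over tokens, which is the contrast with flat residual addition: the query decides how much each earlier layer contributes instead of receiving an unweighted sum.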

Modelwire context

Explainer

The key constraint Depth-Attention addresses is inference-time memory overhead, which becomes acute as production models adopt grouped-query and multi-head latent attention for cache compression. This is more than a richer layer interaction pattern: it is a memory-aware redesign that lets practitioners deepen models without giving up the compression gains they have already deployed.
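A back-of-the-envelope calculation shows why cache size is the pressure point. The sketch below uses the standard key-value cache size formula for one sequence; the model dimensions are hypothetical, and the function does not model multi-head latent attention or Depth-Attention itself.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache size in bytes.

    The leading 2 counts one key tensor plus one value tensor per layer.
    bytes_per_elem=2 assumes 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical 32-layer model, head_dim 128, 4096-token context.
full_heads = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
grouped = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(full_heads / 2**30, "GiB with 32 KV heads")
print(grouped / 2**20, "MiB with 8 shared KV heads")
print(full_heads // grouped, "x reduction")
```

Under these assumed dimensions, sharing key-value heads four ways cuts the cache by a factor of four. Any cross-layer method that stores extra per-layer state at inference time would eat into that saving, which is the overhead the summary says Depth-Attention avoids.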

This connects directly to the layer-granularity work from early June. Just as SubFit showed that redundancy clusters unevenly across attention and feedforward submodules rather than at full-layer granularity, Depth-Attention reveals that layer interaction itself can be made selective and efficient. Both papers challenge the assumption that architectural components should be treated uniformly. The TaDA paper from the same week also found that different signals concentrate at different depths (domain knowledge deeper, task signals shallower), suggesting that cross-layer value mixing could exploit these asymmetries more intelligently than flat residual addition.

If Depth-Attention ships in a production model (Llama, Gemma, or a similar open-weight release) within the next two quarters, watch whether inference latency stays flat or improves despite the added cross-layer retrieval. If latency rises by more than roughly 3-5% on standard benchmarks, the low-overhead claim does not hold in practice and adoption will stall.

Coverage we drew on

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

