DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

DECO addresses a critical constraint in deploying sparse mixture-of-experts models on resource-limited devices by matching dense transformer performance within identical parameter budgets. The architecture combines differentiable ReLU routing with learnable expert scaling and introduces NormSiLU activation to reduce the storage and memory-access overhead that typically makes MoE models impractical for edge deployment. This work matters because it directly tackles the gap between MoE's theoretical efficiency gains and real-world on-device constraints, potentially unlocking efficient inference for mobile and embedded systems without sacrificing model quality.
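To make the routing idea concrete, here is a minimal pure-Python sketch of a ReLU-routed expert layer with a per-expert scale. It is an illustration under stated assumptions, not DECO's actual formulation. The gate form `scale_e * relu(router_e · x)`, the class name `ReluRoutedMoE`, and every parameter name are hypothetical. The summary does not define NormSiLU, so the expert activation below is a plain ReLU placeholder.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class ReluRoutedMoE:
    """Toy MoE layer. Assumed gate: g_e = scale_e * relu(router_e . x)."""

    def __init__(self, d_model, d_hidden, n_experts, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.router = init(n_experts, d_model)  # one routing row per expert
        self.scale = [1.0] * n_experts          # per-expert scale (hypothetical, would be learned)
        self.w_in = [init(d_hidden, d_model) for _ in range(n_experts)]
        self.w_out = [init(d_model, d_hidden) for _ in range(n_experts)]

    def gates(self, x):
        # ReLU zeroes negative logits, so the set of active experts depends on the input.
        logits = matvec(self.router, x)
        return [s * relu(z) for s, z in zip(self.scale, logits)]

    def forward(self, x):
        y = [0.0] * len(x)
        active = []
        for e, g_e in enumerate(self.gates(x)):
            if g_e == 0.0:
                continue  # this expert's weights are never read for this token
            active.append(e)
            # Placeholder expert activation (ReLU); NormSiLU is not defined in the summary.
            h = [relu(v) for v in matvec(self.w_in[e], x)]
            out = matvec(self.w_out[e], h)
            y = [a + g_e * b for a, b in zip(y, out)]
        return y, active


layer = ReluRoutedMoE(d_model=4, d_hidden=8, n_experts=6)
y, active = layer.forward([0.3, -1.2, 0.7, 0.1])
print(len(y), active)
```

In this sketch, an expert whose gate is zero is skipped entirely, so its weights are not touched for that token. Avoiding those weight reads is the kind of memory-access saving the summary describes. Unlike a hard top-k selection, the ReLU gate stays differentiable wherever its logit is nonzero.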
Modelwire context
Explainer
The real buried detail here is NormSiLU: most MoE efficiency work focuses on routing overhead, but DECO's authors are claiming that activation function choice is a meaningful contributor to the memory-access bottleneck on constrained hardware, which is a less commonly targeted lever in this design space.
DECO and the ELF paper (also from arXiv cs.CL, same day) represent two distinct bets on where architectural efficiency gains come from. ELF argues that the discretization step in language diffusion is the costly abstraction worth removing. DECO argues that the problem is how sparsity is routed and activated inside a transformer. Neither paper directly informs the other, but together they illustrate a broader pattern in current research: practitioners no longer treat the standard transformer block as fixed, and instead decompose it component by component to find where the compute and memory costs actually sit on target hardware.
The credibility test for DECO is whether independent teams can reproduce the dense-comparable performance claims on commodity edge chips (Snapdragon, Apple Neural Engine) rather than the controlled benchmarks typical of academic submissions. If third-party on-device numbers surface within six months, the routing and activation design choices here deserve serious attention from mobile inference teams.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: DECO · Mixture-of-Experts · ReLU · NormSiLU · Transformer
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.