CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

Multimodal language models face a fundamental tradeoff in continual learning: shared parameters cause catastrophic forgetting across diverse vision-language tasks, while task-specific modules waste parameters at scale. CRAM resolves this by routing task-specific patterns into isolated expert modules while maintaining a shared backbone, enabling MLLMs to expand capabilities without replaying old data or sacrificing efficiency. This addresses a critical deployment constraint for production systems managing evolving task portfolios, positioning parameter-efficient continual tuning as a key frontier for real-world multimodal systems.
Modelwire context
Explainer
CRAM's centroid-routing mechanism is not just another MoE variant. The key insight is that it isolates task-specific drift into expert modules while preserving a shared backbone, so continual tuning no longer forces a choice between forgetting old tasks and bloating parameters. Routing happens at the pattern level, not the token level, and that architectural detail is what makes replay-free continual learning feasible at scale.
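To make the pattern-level versus token-level distinction concrete, here is a minimal sketch of what centroid-based routing could look like. It is not CRAM's actual implementation: the summary above does not specify the pooling method, the similarity measure, or how centroids are maintained. Mean pooling, cosine similarity, the running-mean update, and the expert names `vqa` and `captioning` are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(token_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Collapse all token embeddings of one sample into a single vector.

    Assumption: mean pooling stands in for whatever sample-level
    "pattern" representation the real method uses.
    """
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route_to_expert(
    token_embeddings: Sequence[Sequence[float]],
    centroids: Dict[str, Vector],
) -> str:
    """Pick ONE expert for the whole sample (pattern-level routing).

    A token-level MoE router would instead make this choice per token.
    """
    pooled = mean_pool(token_embeddings)
    return max(centroids, key=lambda name: cosine(pooled, centroids[name]))


def update_centroid(centroid: Vector, sample: Vector, count: int) -> Vector:
    """Running-mean update after seeing one more sample.

    `count` is the number of samples already folded into `centroid`.
    Assumption: centroids are plain running means of pooled embeddings.
    """
    return [c + (s - c) / (count + 1) for c, s in zip(centroid, sample)]


# Hypothetical two-expert setup with 2-dimensional embeddings.
centroids = {"vqa": [1.0, 0.0], "captioning": [0.0, 1.0]}
sample_tokens = [[0.9, 0.1], [0.7, 0.3]]
chosen = route_to_expert(sample_tokens, centroids)
```

Because the routing decision is made once per sample, a new task can in principle be given its own centroid and expert without touching the experts or centroids of earlier tasks, which is the isolation property the explainer describes.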
This connects directly to the SubFit work from early June, which showed that redundancy clusters unevenly across model subcomponents and that surgical, fine-grained replacement outperforms full-layer approaches. CRAM applies the same principle to the continual learning problem: instead of treating the entire model as a monolith that either forgets or wastes capacity, it routes task-specific patterns into isolated experts while keeping shared computation lean. The SafeSteer paper from the same period reinforces this trend of localized intervention rather than global trade-offs. Together, these three papers signal a shift in how researchers approach efficiency: move away from blunt compression or broad retraining, toward targeted architectural routing that preserves what matters for each use case.
If CRAM's approach maintains performance parity with full replay-based continual learning on a held-out multimodal benchmark (e.g., a new task added after training on five prior vision-language datasets) without seeing any of the old task data, that would confirm that the routing mechanism actually prevents forgetting. If performance degrades by more than 5% relative to that replay-based baseline, the isolation strategy is incomplete, and the method is primarily a parameter-efficiency win rather than a forgetting solution.
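The 5% criterion can be written down as a simple check. The sketch below assumes "5% relative to the baseline" means a relative drop, (baseline - candidate) / baseline, rather than five absolute percentage points, and it assumes the verdict is driven by the worst-affected task. Neither choice is stated in the source, and the task names and scores are made-up placeholders, not reported results.

```python
from typing import Dict


def relative_degradation(baseline: float, candidate: float) -> float:
    """Fractional drop of `candidate` relative to `baseline`.

    Positive means the candidate is worse; negative means it is better.
    """
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - candidate) / baseline


def worst_relative_drop(
    baseline_scores: Dict[str, float],
    candidate_scores: Dict[str, float],
) -> float:
    """Largest per-task relative drop (assumption: worst case decides)."""
    return max(
        relative_degradation(baseline_scores[task], candidate_scores[task])
        for task in baseline_scores
    )


def passes_parity(
    baseline_scores: Dict[str, float],
    candidate_scores: Dict[str, float],
    threshold: float = 0.05,
) -> bool:
    """True if no task degrades by more than `threshold` (default 5%)."""
    return worst_relative_drop(baseline_scores, candidate_scores) <= threshold


# Placeholder numbers for illustration only; not taken from any paper.
replay_baseline = {"task_a": 80.0, "task_b": 60.0}
replay_free = {"task_a": 78.0, "task_b": 59.0}
verdict = passes_parity(replay_baseline, replay_free)
```

With these placeholder scores the largest drop is 2.5% (80.0 to 78.0), so the check passes. Averaging across tasks instead of taking the worst case would be a more lenient reading of the same criterion.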
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: CRAM · Multimodal Large Language Models · Mixture of Experts
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.