CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning

Multimodal language models face a fundamental tradeoff in continual learning: shared parameters cause catastrophic forgetting across diverse vision-language tasks, while task-specific modules waste parameters at scale. CRAM resolves this by routing task-specific patterns into isolated expert modules while maintaining a shared backbone, enabling MLLMs to expand capabilities without replaying old data or sacrificing efficiency. This addresses a critical deployment constraint for production systems managing evolving task portfolios, positioning parameter-efficient continual tuning as a key frontier for real-world multimodal systems.

Modelwire context

Explainer

CRAM's centroid-routing mechanism is not just another MoE variant. The key insight is that it isolates task-specific drift into expert modules while preserving a shared backbone, which means continual tuning no longer forces a choice between forgetting old tasks and bloating the parameter count. Routing happens at the pattern level, not the token level, and that architectural detail is what makes replay-free continual learning feasible at scale.
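To make the pattern-level versus token-level distinction concrete, here is a minimal sketch. It is not CRAM's implementation: the summary does not say how centroids are built or compared, so the mean pooling, the cosine similarity, and the names `pool`, `route_by_centroid`, and `centroids` are all assumptions chosen for illustration.

```python
import math

def pool(token_embeddings):
    """Mean-pool token vectors into one vector for the whole input (assumed pooling)."""
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / count for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def route_by_centroid(token_embeddings, centroids):
    """Choose ONE expert for the entire input by nearest centroid.

    A token-level router would instead make this choice per token.
    """
    pattern = pool(token_embeddings)
    return max(centroids, key=lambda name: cosine(pattern, centroids[name]))

# Hypothetical 2-d centroids, one per task-specific expert.
centroids = {"captioning": [1.0, 0.0], "vqa": [0.0, 1.0]}
sample = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.0]]
expert = route_by_centroid(sample, centroids)
```

Because the whole input is sent to a single expert, updates for a new task can in principle stay inside that expert, which is the isolation property the paragraph above describes.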

This connects directly to the SubFit work from early June, which showed that redundancy clusters unevenly across model subcomponents and that surgical, fine-grained replacement outperforms full-layer approaches. CRAM applies the same principle to the continual learning problem: instead of treating the entire model as a monolith that either forgets or wastes capacity, it routes task-specific patterns into isolated experts while keeping shared computation lean. The SafeSteer paper from the same period reinforces this trend of localized intervention rather than global trade-offs. Together, these three papers signal a shift in how researchers approach efficiency: move away from blunt compression or broad retraining, toward targeted architectural routing that preserves what matters for each use case.

The test to watch is a held-out multimodal benchmark, for example a new task added after training on five prior vision-language datasets, evaluated without revisiting any of the old task data. If CRAM matches full replay-based continual learning under those conditions, the routing mechanism genuinely prevents forgetting. If performance degrades by more than 5% relative to the replay baseline, the isolation strategy is incomplete and the method is primarily a parameter-efficiency win, not a forgetting solution.
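The threshold check can be written down in a few lines. This is a sketch under stated assumptions: it reads "5%" as a relative drop against the replay baseline score, treats any drop at or below the threshold as acceptable, and uses made-up scores; `relative_drop` and `verdict` are hypothetical helper names.

```python
def relative_drop(replay_score, replay_free_score):
    """Fractional drop of the replay-free score relative to the replay baseline."""
    if replay_score <= 0:
        raise ValueError("replay baseline score must be positive")
    return (replay_score - replay_free_score) / replay_score

def verdict(replay_score, replay_free_score, threshold=0.05):
    """Apply the article's 5% criterion (assumed to mean relative drop)."""
    drop = relative_drop(replay_score, replay_free_score)
    if drop > threshold:
        return "isolation incomplete"
    return "within tolerance"

# Hypothetical benchmark scores, not results from the paper.
close_result = verdict(replay_score=80.0, replay_free_score=78.0)
far_result = verdict(replay_score=80.0, replay_free_score=70.0)
```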

Coverage we drew on

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: CRAM · Multimodal Large Language Models · Mixture of Experts

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.

Research

ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning

arXiv cs.LG·1d ago

Research

From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression

arXiv cs.CL·1d ago

Research

On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters

arXiv cs.CL·1d ago
