A Causal Language Modeling Detour Improves Encoder Continued Pretraining

Researchers show that encoder models benefit from a temporary switch to causal language modeling (CLM) during domain adaptation, followed by a masked language modeling (MLM) decay phase. Tests on biomedical datasets with ModernBERT show consistent gains of 0.3 to 2.8 percentage points across 19 tasks in French and English. The mechanism appears to involve deeper representational changes in the lower transformer layers that persist through the subsequent MLM phase, which suggests that pretraining schedules deserve reconsideration beyond MLM alone.

Modelwire context

Explainer

The paper's actual contribution is showing that the *order* of pretraining objectives matters more than previously assumed. It's not that CLM is better than MLM for encoders, but that a temporary detour through CLM leaves persistent representational traces that make subsequent MLM more effective on downstream tasks.
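As a rough illustration of what a CLM-then-MLM schedule means in practice, the sketch below switches the training objective at a fixed point and builds labels for each objective from the same token sequence. It is not the paper's implementation: the function names, the `clm_fraction` transition point, the `mask_prob` rate, the `MASK_ID` value, and the `-100` ignore label are all assumptions chosen for the example, and a real CLM phase would also apply a causal attention mask inside the model.

```python
import random

IGNORE = -100  # assumed "skip this position" label value for the loss
MASK_ID = 0    # hypothetical id of the [MASK] token


def objective_for_step(step, total_steps, clm_fraction=0.5):
    """Return 'clm' during the initial detour and 'mlm' afterwards.

    clm_fraction is a hypothetical transition point, not a value from the paper.
    """
    return "clm" if step < int(total_steps * clm_fraction) else "mlm"


def clm_example(token_ids):
    """Next-token prediction: position i is trained to predict token i + 1."""
    inputs = list(token_ids)
    labels = list(token_ids[1:]) + [IGNORE]  # last position has no target
    return inputs, labels


def mlm_example(token_ids, mask_prob=0.3, rng=random):
    """Mask a random subset of tokens; only masked positions carry a label.

    Simplified: every selected token is replaced by MASK_ID.
    """
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels


def make_example(step, total_steps, token_ids, clm_fraction=0.5, rng=random):
    """Build (inputs, labels) under whichever objective is active at this step."""
    if objective_for_step(step, total_steps, clm_fraction) == "clm":
        return clm_example(token_ids)
    return mlm_example(token_ids, rng=rng)
```

The point of the sketch is only that the data and the model stay the same while the objective changes partway through training; where that switch should happen is exactly the kind of setting that may need tuning per domain.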

This connects to the broader pretraining reliability question surfaced in the ORCE paper from the same day. Just as ORCE separates confidence calibration from answer generation to prevent objective conflicts, this work suggests that encoder pretraining also benefits from task separation and sequencing rather than joint optimization. Both papers challenge the assumption that a single, unified training objective is optimal. The difference: ORCE addresses deployment safety, while this addresses the upstream pretraining phase itself.

If ModernBERT or other encoder variants adopt this CLM-then-MLM schedule in production releases within the next six months, and downstream improvements match the 0.3-2.8 point gains reported here, that would suggest the mechanism generalizes beyond the biomedical domain tested. If gains disappear on out-of-domain tasks, or require domain-specific tuning of the CLM-to-MLM transition point, the finding is narrower than presented.

Coverage we drew on

ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: ModernBERT · Masked Language Modeling · Causal Language Modeling

Read full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial
