Procedural-skill SFT across capacity tiers: A W-Shaped pre-SFT Trajectory and Regime-Asymmetric Mechanism on 0.8B-4B Qwen3.5 Models

Researchers studying Qwen3.5 models at three scales (0.8B to 4B parameters) report a counterintuitive pattern in how supervised fine-tuning (SFT) affects procedural reasoning. The W-shaped pre-SFT trajectory shows that procedural tasks hurt the smallest and largest models but help the 2B variant, yet SFT gains remain consistent across sizes. The finding challenges the assumption that scaling benefits are uniform and suggests that model capacity interacts with task structure in non-monotonic ways. That has implications for efficient fine-tuning strategies in the sub-5B tier, where inference cost matters most.
Modelwire context
Explainer: The paper's core finding is not just that procedural tasks help some models and hurt others, but that this non-monotonic effect persists after supervised fine-tuning. This suggests the capacity-task mismatch is structural, not a pre-training artifact that SFT erases.
This connects directly to the RuDE framework from earlier this month, which tackled the problem of predicting which base models will actually adapt well to downstream tasks. RuDE used rubric-guided contrastive pairs to forecast fine-tuning potential; this Qwen study provides empirical evidence that such forecasting is necessary because capacity tier interacts with task type in ways that uniform scaling intuitions miss. Together they suggest that model selection for fine-tuning pipelines requires task-aware analysis, not just raw parameter counts. The finding also echoes the StepCodeReasoner work on procedural reasoning alignment, though from a different angle: where StepCodeReasoner enforces correctness through execution traces, this research reveals that smaller and larger models may have fundamentally different capacity to absorb procedural structure.
If Alibaba or other teams release fine-tuning benchmarks on the 2B Qwen3.5 variant specifically for procedural tasks (code generation, math reasoning, planning) in the next two quarters, and it consistently outperforms the 0.8B and 4B siblings, that confirms the W-shaped pattern is reproducible and not an artifact of this particular evaluation setup. Absence of such targeted benchmarking would suggest the finding is too narrow to influence production model selection.
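The reproducibility check described above can be sketched in a few lines. This is a minimal illustration only: the function names, task labels, and every score below are hypothetical placeholders, not results from the paper or from any released benchmark.

```python
from typing import Dict

# Scores are keyed by task, then by model tier.
Scores = Dict[str, Dict[str, float]]


def middle_tier_wins(scores: Scores, middle: str = "2B") -> Dict[str, bool]:
    """For each task, report whether the middle tier strictly beats every other tier."""
    result = {}
    for task, by_tier in scores.items():
        others = [value for tier, value in by_tier.items() if tier != middle]
        result[task] = all(by_tier[middle] > value for value in others)
    return result


def pattern_confirmed(scores: Scores, middle: str = "2B") -> bool:
    """True only if the middle tier wins on every task that was supplied."""
    return all(middle_tier_wins(scores, middle).values())


# Placeholder numbers invented purely for illustration (NOT measured results).
example_scores: Scores = {
    "code_generation": {"0.8B": 0.10, "2B": 0.30, "4B": 0.20},
    "math_reasoning": {"0.8B": 0.15, "2B": 0.35, "4B": 0.25},
    "planning": {"0.8B": 0.05, "2B": 0.40, "4B": 0.30},
}

if __name__ == "__main__":
    print(middle_tier_wins(example_scores))
    print(pattern_confirmed(example_scores))
```

A strict "wins on every task" rule is the simplest reading of "consistently outperforms". A real comparison would also need confidence intervals or repeated runs before treating small gaps as meaningful.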
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Qwen3.5 · Claude Haiku 4.5 · Claude Opus
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.