Student Capacity Moderates Knowledge Distillation Effectiveness: A Systematic Study Across ResNet Teacher-Student Pairs on CIFAR-10
Knowledge distillation effectiveness depends critically on student model capacity, not just on the teacher-student accuracy gap, according to controlled experiments across ResNet pairs on CIFAR-10. Larger students (R34) extract substantially more value from distillation than smaller ones (R18), even under identical teacher conditions, a finding that challenges common assumptions about how distillation benefits scale in model compression. This has direct implications for practitioners designing efficient inference pipelines: capacity matching matters as much as training methodology, and Feature-KD outperforms Logit-KD in high-capacity regimes. Because the results were reproduced across multiple seeds, they are a firmer basis for production distillation workflows than a single-run comparison would be.
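For readers less familiar with the two methods being compared, here is a minimal sketch of what each loss typically computes. The summary does not state the paper's exact formulations, so this assumes the commonly used temperature-scaled KL divergence for Logit-KD and a plain mean-squared-error feature match for Feature-KD. The function names, the default temperature of 4.0, and the requirement that student and teacher features already share a dimension are all illustrative assumptions, not details from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def logit_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Assumed Logit-KD form: T^2 * KL(teacher || student) on softened outputs.

    The temperature default is a hypothetical choice for illustration.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return temperature ** 2 * kl


def feature_kd_loss(student_features, teacher_features):
    """Assumed Feature-KD form: mean squared error between feature vectors.

    Assumes both vectors already have the same length; in practice a learned
    projection is often needed when student and teacher widths differ.
    """
    if len(student_features) != len(teacher_features):
        raise ValueError("feature vectors must have the same length")
    squared_errors = [
        (s - t) ** 2 for s, t in zip(student_features, teacher_features)
    ]
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    student = [1.0, 1.0, 0.0]
    print("Logit-KD loss:", logit_kd_loss(student, teacher))
    print("Feature-KD loss:", feature_kd_loss([1.0, 2.0], [1.0, 4.0]))
```

The difference relevant to the capacity finding is where the supervision attaches: Logit-KD only constrains the student's output distribution, while Feature-KD asks the student to reproduce intermediate representations, which plausibly demands more representational room from the student.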
Modelwire context
The paper isolates student capacity as a primary lever independent of teacher quality, suggesting that practitioners have been over-investing in teacher design while under-thinking student architecture selection. The implication is uncomfortable: you can't compress your way out of a fundamentally undersized model just by borrowing from a larger one.
This connects directly to the broader tension surfaced in this week's TinyML survey around on-device learning constraints. That work emphasized how field conditions diverge from benchmarks; this study reveals a similar gap between what distillation theory assumes and what actually scales in practice. Both point to the same lesson: controlled lab results often hide capacity or architectural mismatches that only surface under real deployment pressure. The finding also echoes the interpretability work on separability, which showed that interaction effects matter more than additive decompositions suggest. Here, the interaction between teacher quality and student size matters more than the teacher-student gap alone.
If subsequent work confirms that Feature-KD's advantage in high-capacity regimes persists when students approach teacher size (say, an R50 student distilled from an R101 teacher), that would support the capacity-matching hypothesis. If the effect vanishes or reverses, the result is more likely an artifact of the specific ResNet and CIFAR-10 pairing and may not generalize to other architectures or datasets.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: ResNet · CIFAR-10 · Knowledge Distillation · Logit-KD · Feature-KD
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.