Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees

Researchers address a key deployment bottleneck for tabular foundation models (TFMs) by distilling them into lightweight gradient-boosted trees that run on CPU in under 2 ms, versus 151 to 1,275 ms on GPU. The central contribution is a fix for label leakage in in-context learning teachers: stratified out-of-fold labeling, which lets XGBoost and CatBoost students retain 96.5% of teacher accuracy while running 38 to 860x faster. This narrows the gap between state-of-the-art tabular models and real-world latency constraints in fraud detection and other time-sensitive applications, bringing foundation-model quality to resource-constrained production environments.
Modelwire context
Explainer
The headline speedup figures are real, but the more consequential detail is the label leakage fix. Prior attempts to use in-context learning models as teachers were quietly poisoned: during pseudo-label generation the teacher had, in its own context, the labels of the rows it was pseudo-labeling. That inflated student performance in evaluation and masked how poorly these pipelines would generalize in production.
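The leakage and the out-of-fold remedy can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: `nn_teacher` is a 1-nearest-neighbour stand-in for an in-context learning teacher (it is not TabICLv2), and `stratified_folds`, `leaky_labels`, and `out_of_fold_labels` are hypothetical helper names, not the paper's code. The paper's actual procedure may differ in its details.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each row to one of k folds, spreading every class evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    fold_of = [0] * len(labels)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            fold_of[i] = pos % k  # round-robin within each class
    return fold_of


def nn_teacher(context_X, context_y, query):
    """Hypothetical stand-in for an ICL teacher: return the label of the nearest context row."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row, query))
    nearest = min(range(len(context_X)), key=lambda j: sq_dist(context_X[j]))
    return context_y[nearest]


def leaky_labels(X, y):
    """Leaky labeling: each row is labeled while its own label sits in the context."""
    return [nn_teacher(X, y, x) for x in X]


def out_of_fold_labels(X, y, k=5, seed=0):
    """Out-of-fold labeling: each row is labeled using only rows from the other folds."""
    fold_of = stratified_folds(y, k, seed)
    pseudo = [None] * len(X)
    for fold in range(k):
        ctx = [i for i in range(len(X)) if fold_of[i] != fold]
        ctx_X = [X[i] for i in ctx]
        ctx_y = [y[i] for i in ctx]
        for i in range(len(X)):
            if fold_of[i] == fold:
                pseudo[i] = nn_teacher(ctx_X, ctx_y, X[i])
    return pseudo


# Toy data: class 0 near 0, class 1 near 10, plus one noisy row (index 20)
# that sits inside the class-0 cluster but carries label 1.
X = [[i * 0.1] for i in range(10)] + [[10 + i * 0.1] for i in range(10)] + [[0.45]]
y = [0] * 10 + [1] * 10 + [1]

leaky = leaky_labels(X, y)      # echoes every label, including the noisy one
oof = out_of_fold_labels(X, y)  # the noisy row is labeled from its neighbours instead
```

In this toy setup the leaky teacher reproduces the training labels exactly, so a student trained on its output learns nothing beyond the raw labels, noise included. The out-of-fold teacher never sees the label of the row it is scoring, so its pseudo-labels reflect what it actually infers from the other rows. A gradient-boosted student would then be fit on the out-of-fold labels; that step is omitted here.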
This paper sits in a growing cluster of work on making foundation models viable outside GPU-rich environments. The KairosHope coverage from the same week (arXiv cs.LG, 2026-05-18) addressed a parallel problem in time-series: that foundation model architectures optimized for scale create friction in specialized, resource-constrained deployments. Both papers are responding to the same underlying pressure, which is that practitioners in finance, healthcare, and IoT cannot simply provision more compute to close the gap between research benchmarks and production constraints. The federated learning work from FedHybrid and FedNewton adds another angle here, since on-device inference and on-device training face related hardware ceilings. Together these papers suggest the field is converging on a pragmatic middle layer between raw model capability and real-world deployment.
Watch whether the TabICLv2 distillation pipeline holds its 96.5% accuracy retention on datasets outside OpenML-CC18, particularly on high-cardinality categorical benchmarks from TALENT, where gradient-boosted trees historically struggle relative to neural approaches. If it degrades significantly there, the method's scope is narrower than the framing implies.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: TabICLv2 · XGBoost · CatBoost · TALENT · OpenML-CC18 · TabZilla
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.