Foundation Models vs. Radiomics for Lung Computed Tomography: A Benchmark of Feature Extractors, Classification Heads, and Segmentation Choices

A systematic benchmark comparing foundation models against classical radiomics for lung cancer diagnosis shows how feature extraction, classifier choice, and segmentation strategy each shape real-world performance. Testing five extractors (including DINOv3 and Curia variants) across seven classifiers on survival prediction, histology, and staging tasks, the researchers prioritized cross-cohort robustness over in-distribution accuracy, exposing which combinations of components generalize beyond the hospital whose data they were trained on. This work matters because medical AI deployment hinges on worst-case external validity, not benchmark leaderboards, and isolating each component's contribution helps practitioners avoid misplaced confidence in foundation models.
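The selection rule described above, ranking pipelines by worst-case external performance instead of in-distribution accuracy, can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the extractor and classifier names, the cohort labels, and every score are hypothetical placeholders invented for the example.

```python
from typing import Dict, List, Tuple

Combo = Tuple[str, str]                    # (extractor, classifier)
Scores = Dict[Combo, Dict[str, float]]     # combo -> cohort name -> score

# Hypothetical scores, invented purely for illustration.
# None of these names or numbers come from the paper.
scores: Scores = {
    ("extractor_A", "clf_X"): {"internal": 0.90, "external_1": 0.60, "external_2": 0.65},
    ("extractor_A", "clf_Y"): {"internal": 0.80, "external_1": 0.72, "external_2": 0.70},
    ("extractor_B", "clf_X"): {"internal": 0.85, "external_1": 0.68, "external_2": 0.66},
}


def rank_by_in_distribution(scores: Scores, internal: str) -> List[Tuple[Combo, float]]:
    """Rank pipelines by their score on the cohort they were trained on."""
    pairs = [(combo, per_cohort[internal]) for combo, per_cohort in scores.items()]
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)


def rank_by_worst_case(scores: Scores, external: List[str]) -> List[Tuple[Combo, float]]:
    """Rank pipelines by their lowest score across the external cohorts."""
    pairs = [
        (combo, min(per_cohort[c] for c in external))
        for combo, per_cohort in scores.items()
    ]
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)


best_internal = rank_by_in_distribution(scores, "internal")[0][0]
best_robust = rank_by_worst_case(scores, ["external_1", "external_2"])[0][0]
```

In this toy data the pipeline that tops the in-distribution ranking is not the one with the best worst-case external score, which is the kind of gap a robustness-first benchmark is meant to surface.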
Modelwire context
Explainer
The study's real contribution isn't that foundation models sometimes beat radiomics or vice versa, but that segmentation strategy and classifier choice often matter more than the feature extractor itself. This inverts the typical narrative where architecture dominates.
This work shares DNA with the RF drone benchmark paper from earlier today, which exposed how standard evaluation splits mask overfitting in time-series tasks through data leakage. Both papers argue that methodological choices in how you slice and validate data can inflate reported performance beyond what generalizes. The lung CT study goes further by systematically isolating each component (feature extractor, classifier, segmentation) to show practitioners which knobs actually control real-world robustness. Where the drone paper caught a flaw, this one builds a framework to avoid it in medical imaging.
If the same five extractors are benchmarked on an independent lung cancer cohort (NLST or similar) within six months and the ranking of classifiers holds stable, that confirms the generalization claims. If rankings flip, the current findings are cohort-specific and the paper's guidance for practitioners becomes less actionable.
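One way to make "the ranking of classifiers holds stable" concrete is a rank correlation between the classifier orderings on the two cohorts. The sketch below computes Spearman's rho using only the standard library. It assumes no tied scores, and the classifier names and scores are hypothetical placeholders. The text above does not specify a stability threshold, so any cutoff applied to rho would be the reader's own choice.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each classifier to its rank (1 = best). Assumes no tied scores."""
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman_rho(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman rank correlation between two score tables over the same classifiers."""
    if set(a) != set(b):
        raise ValueError("both cohorts must score the same classifiers")
    n = len(a)
    if n < 2:
        raise ValueError("need at least two classifiers to compare rankings")
    rank_a, rank_b = ranks(a), ranks(b)
    squared_diffs = sum((rank_a[name] - rank_b[name]) ** 2 for name in a)
    # Closed-form Spearman formula, valid when there are no ties.
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Hypothetical scores, invented for illustration only.
original_cohort = {"clf_1": 0.74, "clf_2": 0.71, "clf_3": 0.69, "clf_4": 0.66}
same_order = {"clf_1": 0.70, "clf_2": 0.68, "clf_3": 0.65, "clf_4": 0.61}
flipped = {"clf_1": 0.60, "clf_2": 0.63, "clf_3": 0.67, "clf_4": 0.70}

rho_stable = spearman_rho(original_cohort, same_order)
rho_flipped = spearman_rho(original_cohort, flipped)
```

A rho near 1 means the classifier ordering carried over to the new cohort, while a value near -1 means the ranking flipped.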
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Curia · DINOv3 · TabPFN · XGBoost · LUNG1 · LUNG2
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.