Modelwire

Reducing cross-sample prediction churn in scientific machine learning


A new study exposes a critical blind spot in scientific machine learning: models trained on different data samples agree on overall accuracy yet flip predictions on 8-22% of individual test cases. This 'cross-sample prediction churn' undermines confidence in reported benchmarks across chemistry applications. While standard uncertainty techniques (deep ensembles, MC dropout) fail to address it, two data-side methods show promise, with K-bootstrap bagging reducing churn by 40-54% without sacrificing accuracy. The finding signals that aggregate metrics mask instability in real-world deployment, forcing practitioners to rethink how they validate and report model reliability.
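To make the numbers concrete, here is a minimal sketch of the idea, assuming a generic scikit-learn classifier on synthetic data; none of this is the paper's code, and the dataset, model class, and value of K are placeholder choices. It trains two models on different bootstrap resamples, measures how often their test predictions disagree, and then repeats the comparison with K-bootstrap bagged predictors, the mitigation the summary highlights.

```python
# Minimal sketch (not the paper's code): what cross-sample prediction churn
# looks like, and how a K-bootstrap bagged predictor can dampen it.
# Dataset, model class, and K are illustrative placeholder choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def fit_on_resample(seed):
    """Train a model on a bootstrap resample, simulating a different draw
    of the training data from the same underlying source."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_tr), size=len(X_tr), replace=True)
    return RandomForestClassifier(random_state=seed).fit(X_tr[idx], y_tr[idx])

def bagged_predict(seeds):
    """K-bootstrap bagging (illustrative): average class probabilities over
    K bootstrap-trained models, then take the resulting decision."""
    probs = np.mean([fit_on_resample(s).predict_proba(X_te) for s in seeds], axis=0)
    return probs.argmax(axis=1)

# Two single models from different resamples: similar accuracy, nonzero churn,
# where churn is read here as the fraction of test cases whose label flips.
a = fit_on_resample(1).predict(X_te)
b = fit_on_resample(2).predict(X_te)
print("acc A %.3f  acc B %.3f  churn %.3f"
      % ((a == y_te).mean(), (b == y_te).mean(), (a != b).mean()))

# Two bagged predictors built from disjoint seed sets: churn typically shrinks.
bag_a, bag_b = bagged_predict(range(10, 15)), bagged_predict(range(20, 25))
print("bagged acc A %.3f  bagged acc B %.3f  bagged churn %.3f"
      % ((bag_a == y_te).mean(), (bag_b == y_te).mean(), (bag_a != bag_b).mean()))
```

How much the bagged variant reduces churn will depend on the model and data; the 40-54% figure is the paper's reported result, not something this toy guarantees.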

Modelwire context

Explainer

The study's sharpest implication isn't that models are unstable; it's that the field's standard reporting conventions actively hide that instability. Aggregate accuracy scores can be identical across training runs while the underlying per-sample decisions are essentially shuffled, meaning two labs could publish the same benchmark number and be describing functionally different models.
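A toy illustration of that reporting gap, with made-up numbers rather than anything from the study: two runs can report identical accuracy against the same test labels while disagreeing on a substantial fraction of individual cases.

```python
# Made-up toy example: same aggregate accuracy, different per-sample decisions.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
run_a  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])  # 8/10 correct
run_b  = np.array([1, 0, 0, 1, 1, 0, 1, 0, 1, 0])  # also 8/10 correct

print("accuracy A:", (run_a == y_true).mean())            # 0.8
print("accuracy B:", (run_b == y_true).mean())            # 0.8
print("per-sample disagreement:", (run_a != run_b).mean())  # 0.4
```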

This connects directly to the concern raised in our coverage of 'Quantifying Sensitivity for Tree Ensembles,' which introduced formal methods for identifying misclassification-prone regions under small input perturbations. Both papers are circling the same problem from different directions: reported metrics don't tell you where a model will fail, only how often. The churn finding extends that concern from adversarial perturbations to ordinary resampling, which is arguably more alarming because it requires no adversary at all. The Hodge decomposition paper from the same period also touches this nerve, arguing that physics-informed models need structural inductive biases precisely because standard training objectives don't enforce the consistency properties scientists actually need.

Watch whether chemistry benchmark maintainers, particularly those behind QM9 or OC20 leaderboards, adopt per-sample stability reporting alongside aggregate metrics within the next 12 months. If they don't, the churn finding stays a paper result rather than a field-level norm change.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. How we write it.

Mentions: K-bootstrap bagging · twin-bootstrap · deep ensembles · MC dropout · stochastic weight averaging


Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
