Sample Complexity of Scientific Discovery: PAC Learnability of Compositional Function Trees

Symbolic regression, a core technique for automated scientific discovery, has long been dismissed as statistically intractable because its hypothesis space explodes combinatorially. This paper establishes PAC learning bounds showing that the generalization complexity of compositional function trees depends on operator depth and Lipschitz smoothness rather than growing exponentially with the number of candidate structures. The result narrows the gap between theory and practice for neural-symbolic systems, suggesting that well-behaved operator vocabularies can enable tractable discovery at scale. This matters for researchers building interpretable ML systems and for the broader push toward AI that can autonomously uncover scientific laws.
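To make the combinatorial explosion concrete, here is a minimal sketch that counts syntactically distinct expression trees up to a given depth. The function name `count_trees` and the vocabulary sizes (two leaves, two unary operators, two binary operators) are illustrative assumptions, not taken from the paper. A classic finite-class bound scales with the logarithm of this count, which roughly doubles with each extra level of depth.

```python
import math

def count_trees(depth, n_leaves=2, n_unary=2, n_binary=2):
    """Count distinct expression trees of depth at most `depth`.

    A tree is a leaf, a unary operator applied to a shallower tree,
    or a binary operator applied to an ordered pair of shallower trees.
    Vocabulary sizes are hypothetical defaults for illustration.
    """
    count = n_leaves  # depth 0: a bare leaf
    for _ in range(depth):
        count = n_leaves + n_unary * count + n_binary * count * count
    return count

# Print depth, number of trees, and log of that number.
for d in range(5):
    n = count_trees(d)
    print(d, n, round(math.log(n), 2))
```

Under these assumptions the count is already in the hundreds of thousands at depth 3, which is the kind of growth a depth- and smoothness-based bound is meant to sidestep.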

Modelwire context

Explainer

The paper's real contribution is narrower than it might appear: it shows PAC learnability is possible under specific structural constraints (bounded operator depth, Lipschitz smoothness), not that symbolic regression is suddenly practical at scale. The bounds still depend on these properties, meaning the hard part (designing vocabularies that satisfy them) remains unsolved.
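As a rough illustration of why operator depth and Lipschitz smoothness interact, the sketch below propagates an upper bound on the Lipschitz constant through a small univariate expression tree. It relies only on two standard facts: composing functions multiplies their Lipschitz constants, and adding functions adds them. The `Node` class, the toy vocabulary, and its constants are assumptions made for this example and do not reproduce the paper's construction.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed global Lipschitz constants for a toy unary vocabulary.
UNARY_LIPSCHITZ = {"sin": 1.0, "tanh": 1.0, "double": 2.0}

@dataclass(frozen=True)
class Node:
    op: str                            # "x", "add", or a key of UNARY_LIPSCHITZ
    children: Tuple["Node", ...] = ()

def lipschitz_bound(node):
    """Upper bound on the Lipschitz constant of the tree as a function of x."""
    if node.op == "x":
        return 1.0                     # the identity is 1-Lipschitz
    if node.op == "add":
        # A sum of Lipschitz functions: the constants add.
        return sum(lipschitz_bound(c) for c in node.children)
    (child,) = node.children
    # Composition: the constants multiply.
    return UNARY_LIPSCHITZ[node.op] * lipschitz_bound(child)

def depth(node):
    """Number of operator levels above the deepest leaf."""
    return 0 if not node.children else 1 + max(depth(c) for c in node.children)

x = Node("x")
# sin(double(x) + tanh(x))
tree = Node("sin", (Node("add", (Node("double", (x,)), Node("tanh", (x,)))),))
print(depth(tree), lipschitz_bound(tree))
```

Operators such as unrestricted multiplication or `exp` have no global Lipschitz constant, so they cannot be added to this table without restricting the input domain. That is one way to see why vocabulary design remains the hard part.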

This connects directly to the June 28 post-hoc explanations paper, which argued that opaque models trained via gradient descent don't necessarily capture true mechanistic structure even when they predict well. This symbolic regression work approaches the inverse problem: it asks whether we can learn interpretable compositional structures with statistical guarantees. Together they frame a tension in scientific ML: neural methods scale but don't guarantee mechanistic fidelity, while symbolic methods offer interpretability but face tractability questions. This paper chips away at the tractability concern, but only for well-behaved operator vocabularies, leaving the mechanistic fidelity question untouched.

If researchers applying this framework to real scientific domains (physics, chemistry, biology) report that finding Lipschitz-smooth operator vocabularies is harder than the theory suggests, the practical gap remains despite the theoretical advance. Conversely, if at least two published applications in the next 18 months cite this bound to justify their vocabulary design choices and report successful discovery, the theory-practice bridge is real.

Coverage we drew on

Reliability, Faithfulness, and the Limits of Post-hoc Explanations of Opaque Scientific Models · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: PAC learning · symbolic regression · Rademacher complexity · compositional function trees


Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

