Research Models & Releases · arXiv cs.CL · 5d ago

COMPOSE: Composing Future Theorems from Citations and Formal Structure

Researchers propose COMPOSE, a dual-graph neural framework that grounds mathematical theorem generation in both citation networks and formal proof dependencies. Rather than treating scientific motivation and logical validity as separate concerns, the system conditions language models on aligned graphs from both domains, addressing a gap in how LLMs predict which theorems are likely to come next in a line of research. The work reflects a broader move toward using structured knowledge to constrain generative models rather than relying on pattern matching alone, with implications for formal verification, automated discovery, and the use of domain-specific constraints to produce valid rather than merely plausible outputs.

Modelwire context

Explainer

COMPOSE's actual contribution is narrower than the framing suggests: it's not that LLMs should consider both citation and proof graphs, but that *aligning* those two graphs during training prevents the model from generating theorems that are scientifically motivated but logically invalid (or vice versa). The paper doesn't claim to solve mathematical discovery; it claims to reduce a specific failure mode in how LLMs reason about what comes next in a research trajectory.

This sits directly alongside the coherence work from late May ('Locally Coherent, Globally Incoherent'). That paper identified how multi-component LLM systems can satisfy local constraints while violating global axioms. COMPOSE tackles the same coherence problem but at the single-model level: ensuring that when an LLM generates a new theorem, it doesn't satisfy the citation graph (looks like real research) while breaking the proof graph (is actually invalid). Both papers treat validity as something that must be *enforced during generation*, not checked afterward. The difference is scope: one works across agent boundaries, the other within a single model's output space.
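To make the failure mode concrete, here is a minimal Python sketch. It is not the paper's method: it only labels a candidate theorem by whether it is consistent with a toy citation graph and a toy proof-dependency graph. Every name here (`Candidate`, `classify`, `known_papers`, `proved_lemmas`) is hypothetical, and set membership stands in for whatever graph alignment COMPOSE actually performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A generated theorem (hypothetical schema, for illustration only)."""
    name: str
    cites: frozenset       # papers the theorem claims to build on
    depends_on: frozenset  # lemmas its proof would need


def classify(candidate, known_papers, proved_lemmas):
    """Label a candidate against both graphs.

    Assumption: "motivated" means it cites at least one paper and all
    cited papers exist; "valid" means every dependency is already proved.
    """
    motivated = bool(candidate.cites) and candidate.cites <= known_papers
    valid = candidate.depends_on <= proved_lemmas
    if motivated and valid:
        return "motivated and valid"
    if motivated:
        return "motivated but invalid"
    if valid:
        return "valid but unmotivated"
    return "neither"
```

A candidate that cites only real papers but depends on an unproved lemma comes back as "motivated but invalid", which is the case the explainer describes as looking like real research while breaking the proof graph.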

If COMPOSE's dual-graph approach shows measurable improvement on held-out theorem generation (theorems that cite recent papers but were not in the training data), and if that improvement persists when the generated theorems are checked in proof assistants such as Lean or Coq, then the alignment strategy is doing real work. If the gains disappear on theorems older than 2024 or on synthetic benchmarks, the model may simply be pattern-matching on recent citation clusters rather than learning principled reasoning about which results should follow.
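The check described above can be sketched as a per-period comparison. This is an illustrative Python outline, not an evaluation protocol from the paper: `gain_by_bucket`, the record layout, and the `cutoff_year` parameter are all assumptions, and the pass/fail flags are assumed to come from an external proof-assistant check.

```python
from collections import defaultdict


def gain_by_bucket(records, cutoff_year=2024):
    """Compare pass rates of two models on recent vs. older theorems.

    records: iterable of (year, baseline_ok, dual_graph_ok) tuples, where
    each flag says whether that model's generated theorem passed a
    verification check (hypothetical layout).
    Returns {bucket: dual-graph pass rate minus baseline pass rate}.
    """
    # Per bucket: [number of theorems, baseline passes, dual-graph passes]
    counts = defaultdict(lambda: [0, 0, 0])
    for year, baseline_ok, dual_ok in records:
        bucket = "recent" if year >= cutoff_year else "older"
        entry = counts[bucket]
        entry[0] += 1
        entry[1] += int(baseline_ok)
        entry[2] += int(dual_ok)
    return {
        bucket: (dual - base) / n
        for bucket, (n, base, dual) in counts.items()
    }
```

If the returned gain is clearly positive for "recent" but near zero for "older", that matches the pattern-matching concern; a gain that holds in both buckets would support the alignment strategy.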

Coverage we drew on

Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

