Process Supervision via Verbal Critique Improves Reasoning in Large Language Models

Researchers introduce Verbal Process Supervision, a training-free method that uses natural-language critique from a stronger model to iteratively refine reasoning. The technique pushes GPT-5.4 to 94.9% on GPQA Diamond and lifts weak models from 11.7% to 90% on AIME 2025, establishing a new inference-time scaling axis beyond chain-of-thought depth and sample breadth.
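The summary does not spell out the procedure, so the following is only a minimal sketch of what a training-free critique loop of this kind could look like, not the authors' implementation. The `solver` and `critic` callables, the `APPROVED` sentinel, the prompt wording, and the `max_rounds` cap are all assumptions made for illustration; in practice `critic` would wrap the stronger model and `solver` the weaker one.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces: any function that maps a prompt string to a text reply.
Solver = Callable[[str], str]
Critic = Callable[[str], str]

APPROVED = "APPROVED"  # assumed sentinel the critic returns when it finds no flaw


@dataclass
class Trace:
    answer: str           # final reasoning/answer text
    rounds: int           # number of revisions the solver made
    critiques: List[str]  # critiques received, in order


def refine_with_critique(problem: str, solver: Solver, critic: Critic,
                         max_rounds: int = 3) -> Trace:
    """Draft, critique, revise. No model weights are updated at any point."""
    draft = solver(f"Problem:\n{problem}\n\nReason step by step.")
    critiques: List[str] = []
    for round_idx in range(max_rounds):
        critique = critic(
            f"Problem:\n{problem}\n\nReasoning:\n{draft}\n\n"
            f"Point out the first flawed step, or reply {APPROVED}."
        )
        if critique.strip() == APPROVED:
            return Trace(draft, round_idx, critiques)
        critiques.append(critique)
        # Feed the natural-language critique back to the solver for a revision.
        draft = solver(
            f"Problem:\n{problem}\n\nPrevious attempt:\n{draft}\n\n"
            f"Critique:\n{critique}\n\nRevise the reasoning."
        )
    return Trace(draft, max_rounds, critiques)


# Toy stand-ins so the sketch runs without any model API.
def toy_solver(prompt: str) -> str:
    # Answers wrongly until a critique appears in the prompt.
    return "2 + 2 = 4" if "Critique:" in prompt else "2 + 2 = 5"


def toy_critic(prompt: str) -> str:
    return APPROVED if "2 + 2 = 4" in prompt else "Step 1 is wrong: recompute the sum."


if __name__ == "__main__":
    print(refine_with_critique("What is 2 + 2?", toy_solver, toy_critic))
```

The only scaling knob in this sketch is the number of critique rounds, which is separate from how long each reasoning chain is or how many samples are drawn.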
Modelwire context
Explainer: The key distinction buried in the framing is 'training-free': Verbal Process Supervision doesn't update model weights, which means the gains come entirely from how inference is structured, not from any new capability baked into the model. That makes the 11.7% to 90% jump on AIME 2025 either a compelling demonstration of how much headroom exists at inference time, or a sign that AIME 2025 may be more susceptible to iterative critique than its difficulty rating implies.
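To put the size of that jump in perspective, the arithmetic can be worked through directly. This is a small illustrative calculation using only the two reported accuracy figures; the helper name `gain_summary` and the choice of metrics are assumptions, not something taken from the paper.

```python
def gain_summary(before_pct: float, after_pct: float) -> dict:
    """Summarize an accuracy change given two percentages in (0, 100)."""
    error_before = 100.0 - before_pct
    error_after = 100.0 - after_pct
    return {
        # Absolute gain in percentage points.
        "absolute_points": round(after_pct - before_pct, 1),
        # How many times higher the new accuracy is.
        "accuracy_ratio": round(after_pct / before_pct, 2),
        # Share of the original errors that were eliminated, in percent.
        "error_reduction_pct": round((error_before - error_after) / error_before * 100, 1),
    }


if __name__ == "__main__":
    # Reported AIME 2025 figures for the weaker models.
    print(gain_summary(11.7, 90.0))
```

On the reported figures this works out to a gain of about 78 percentage points, with roughly seven in eight of the original errors removed, which is why the result invites both readings above.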
This connects most directly to the AEL paper also published April 23, which tackles a structurally similar problem from a different angle: where AEL asks how agents can improve across episodes by retaining memory and diagnosing failure patterns, Verbal Process Supervision asks how a single reasoning chain can improve within one inference pass through iterative critique. Both are responses to the same ceiling: chain-of-thought alone stops scaling cleanly. The broader archive here doesn't yet cover competing approaches such as outcome reward models or tree search methods, so readers should treat this as an entry point into a contested space rather than a settled one.
Watch whether these benchmark gains replicate on held-out GPQA Diamond splits released after April 2026. If they don't, eval contamination is the most likely explanation given the training-free claim. Also watch whether any lab publishes a direct comparison against outcome-based reward models on AIME 2025 within the next two months, which would clarify whether verbal critique adds anything beyond what scalar reward signals already provide.
Coverage we drew on
- AEL: Agent Evolving Learning for Open-Ended Environments · arXiv cs.CL
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: GPT-5.4 · GPQA Diamond · AIME 2025 · LiveCodeBench V6 · Verbal Process Supervision
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.