LEPO: Latent Reasoning Policy Optimization for Large Language Models

Researchers introduce LEPO, a reinforcement learning framework that applies policy optimization directly to continuous latent representations in LLMs by injecting controllable stochasticity via Gumbel-Softmax. The method restores exploration capacity lost in deterministic latent reasoning, enabling RL training on hidden model states rather than token sequences.
Modelwire context
Explainer
The core bet LEPO makes is that the reasoning bottleneck in current RL-trained models isn't reward design or data quality; it's that deterministic latent representations give the optimizer nothing to explore. Gumbel-Softmax is borrowed from discrete variational inference and repurposed here as a stochasticity injection mechanism, which is a less obvious choice than simply adding noise to activations.
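To make the mechanism concrete, here is a minimal sketch of Gumbel-Softmax sampling in plain Python. It is not LEPO's implementation: the summary only says that Gumbel-Softmax injects controllable stochasticity into continuous latent representations. The logits, the tiny embedding table, the temperature `tau`, and the helper `soft_latent` are all assumptions made for illustration.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw one relaxed (soft) sample from the categorical given by logits."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))          # Gumbel(0, 1) noise
        noisy.append((logit + g) / tau)      # lower tau -> sharper sample
    # Numerically stable softmax over the perturbed logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def soft_latent(weights, embeddings):
    """Hypothetical helper: mix embedding rows by the sampled weights."""
    dim = len(embeddings[0])
    return [sum(w * row[d] for w, row in zip(weights, embeddings))
            for d in range(dim)]

rng = random.Random(0)                       # seeded for repeatability
logits = [2.0, 1.0, 0.1]                     # assumed next-step logits
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # assumed 2-d embeddings

weights = gumbel_softmax(logits, tau=0.5, rng=rng)
latent = soft_latent(weights, embeddings)    # continuous, but now stochastic
```

Calling `gumbel_softmax` twice on the same logits yields different weight vectors, so the resulting latent vector varies from draw to draw. That variation is the kind of exploration a purely deterministic latent step would lack, and `tau` is the knob that controls how far samples spread from the most likely choice.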
Two papers published the same day are working on adjacent problems. HEAL ('HEALing Entropy Collapse,' arXiv cs.LG, April 20) attacks exploration failure in few-shot RL for language models from the data-mixing side, while LEPO attacks it from the representation side. They're not competing approaches so much as two different places to intervene in the same failure mode: entropy collapse during RL training. The Calibrated Attempt-Level GRPO paper from the same date adds a third angle, fixing gradient bias across reasoning attempts rather than addressing exploration directly. Together, these three papers suggest that RL training instability in reasoning models is currently a crowded research target, with no consensus yet on where the primary fix belongs.
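For readers unfamiliar with the term, entropy collapse can be illustrated with the Shannon entropy of a policy's next-token distribution. The sketch below uses made-up logits and does not reflect how HEAL or LEPO measure entropy, which the summary does not specify.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

early = softmax([1.0, 0.9, 1.1, 1.0])   # assumed logits early in training
late = softmax([9.0, 0.5, 0.2, 0.1])    # assumed logits after collapse

h_early = entropy(early)                # close to the maximum, log(4)
h_late = entropy(late)                  # close to zero
```

When one option dominates, as in `late`, entropy approaches zero and sampling almost always returns the same choice, leaving the RL optimizer few alternative trajectories to compare.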
If LEPO's latent-space approach shows additive gains when combined with HEAL-style entropy alignment on a shared reasoning benchmark like MATH-500 within the next two conference cycles, that would indicate the representation and data-mixing interventions are targeting genuinely separate failure modes rather than the same underlying issue.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: LEPO · Gumbel-Softmax · LLMs
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.