Research Models & Releases·arXiv cs.LG·Apr 20

LEPO: Latent Reasoning Policy Optimization for Large Language Models


Researchers introduce LEPO, a reinforcement learning framework that applies policy optimization directly to continuous latent representations in LLMs by injecting controllable stochasticity via Gumbel-Softmax. The method restores exploration capacity lost in deterministic latent reasoning, enabling RL training on hidden model states rather than token sequences.

Modelwire context

Explainer

The core bet LEPO makes is that the reasoning bottleneck in current RL-trained models is not reward design or data quality but the fact that deterministic latent representations give the optimizer nothing to explore. Gumbel-Softmax, a relaxation borrowed from discrete variational inference, is repurposed here to inject stochasticity, which is a less obvious choice than simply adding noise to activations.
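To make the Gumbel-Softmax idea concrete, here is a minimal sketch in plain Python. It is not LEPO's implementation: the function names, the toy embedding table, and the step that mixes embedding rows into a continuous latent are assumptions for illustration only. It shows how the same logits can yield different soft samples once Gumbel noise is added, and how a temperature parameter controls how sharp those samples are.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed (soft) sample over the vocabulary from logits."""
    noisy = []
    for logit in logits:
        u = rng.random()
        # Clamp away from 0 and 1 so both logarithms are defined.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        # Gumbel(0, 1) noise via inverse transform sampling.
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / temperature)
    # Numerically stable softmax over the perturbed, scaled logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(weights, embedding_table):
    """Mix embedding rows by the sampled weights (hypothetical latent)."""
    dim = len(embedding_table[0])
    return [
        sum(w * row[d] for w, row in zip(weights, embedding_table))
        for d in range(dim)
    ]


rng = random.Random(0)
logits = [2.0, 1.0, 0.1]
# Toy table: 3 tokens, 2 dimensions (illustrative values only).
embedding_table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Two draws from the same logits give two different continuous latents.
sample_a = gumbel_softmax(logits, temperature=1.0, rng=rng)
sample_b = gumbel_softmax(logits, temperature=1.0, rng=rng)
latent_a = soft_embedding(sample_a, embedding_table)
latent_b = soft_embedding(sample_b, embedding_table)
```

Lowering `temperature` pushes each sample toward a one-hot choice, while raising it flattens the sample toward uniform; that knob is one plausible reading of the "controllable" stochasticity described in the summary.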

Two other papers published the same day work on adjacent problems. HEAL ('HEALing Entropy Collapse,' arXiv cs.LG, April 20) attacks exploration failure in few-shot RL for language models from the data-mixing side, while LEPO attacks it from the representation side. They are less competing approaches than two different places to intervene in the same failure mode: entropy collapse during RL training. The Calibrated Attempt-Level GRPO paper from the same date adds a third angle, fixing gradient bias across reasoning attempts rather than addressing exploration directly. Together, these three papers suggest that RL training instability in reasoning models is currently a crowded research target, with no consensus yet on where the primary fix belongs.
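For readers unfamiliar with the term, entropy collapse can be illustrated with a small Python sketch. The logits below are made-up numbers, not results from any of the papers; the sketch only shows that when a policy's distribution sharpens onto one option, its Shannon entropy drops, leaving little randomness for exploration.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical policy logits early and late in RL training.
early_policy = softmax([1.0, 0.8, 0.6, 0.4])
late_policy = softmax([9.0, 0.8, 0.6, 0.4])

early_entropy = entropy(early_policy)
late_entropy = entropy(late_policy)

# A collapsed policy has much lower entropy than a spread-out one.
print(f"early: {early_entropy:.3f} nats, late: {late_entropy:.3f} nats")
```

The upper bound for four options is `math.log(4)` nats, reached only by the uniform distribution.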

If LEPO's latent-space approach shows additive gains when combined with HEAL-style entropy alignment on a shared reasoning benchmark such as MATH-500 within the next two conference cycles, that would indicate that the representation and data-mixing interventions target separate failure modes rather than the same underlying issue.

Coverage we drew on

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: LEPO · Gumbel-Softmax · LLMs

Read the full story at arXiv cs.LG (arxiv.org)
