Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Researchers propose POISE, a method that estimates the policy-gradient baseline directly from a language model's hidden states during policy training, avoiding the computational overhead of existing reinforcement learning approaches for reasoning models. Instead of maintaining a separate critic network (as in PPO) or sampling multiple rollouts per prompt (as in GRPO), POISE trains a lightweight probe on the actor's internal activations. According to the summary, this substantially cuts the cost of variance reduction without compromising the policy gradient. It targets a real bottleneck in scaling RL for large reasoning systems, where baseline estimation has become a material efficiency constraint.
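As a rough illustration of the idea, the sketch below fits a linear probe that predicts reward from a hidden-state vector, then subtracts the probe's prediction from the observed reward to form an advantage. It is a toy under stated assumptions, not the POISE implementation: the hidden states are synthetic vectors, the reward is a simulated 0/1 outcome, the probe is assumed to be linear, and the names `probe_predict`, `fit_probe`, and `make_batch` are hypothetical.

```python
import random


def probe_predict(w, b, h):
    """Linear probe: scalar reward estimate from one hidden-state vector."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b


def fit_probe(states, rewards, lr=0.01, epochs=50):
    """Fit the probe by plain SGD on squared error against observed rewards."""
    dim = len(states[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for h, r in zip(states, rewards):
            err = probe_predict(w, b, h) - r
            w = [wi - lr * err * hi for wi, hi in zip(w, h)]
            b -= lr * err
    return w, b


def variance(xs):
    """Population variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def make_batch(rng, n, dim=4):
    """Synthetic stand-in for (hidden state, reward) pairs.

    Assumption: the first coordinate of the hidden state shifts the
    probability that the rollout earns reward 1.
    """
    states, rewards = [], []
    for _ in range(n):
        h = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        p_success = 0.5 + 0.4 * h[0]
        rewards.append(1.0 if rng.random() < p_success else 0.0)
        states.append(h)
    return states, rewards


rng = random.Random(0)

# Fit the probe on one batch of rollouts.
train_h, train_r = make_batch(rng, 500)
w, b = fit_probe(train_h, train_r)

# On fresh rollouts, use the probe output as the baseline:
# one forward pass per prompt, no critic network, no extra rollouts.
test_h, test_r = make_batch(rng, 500)
advantages = [r - probe_predict(w, b, h) for h, r in zip(test_h, test_r)]

print("variance of raw rewards:", variance(test_r))
print("variance of advantages: ", variance(advantages))
```

In this toy setting the advantages should vary less than the raw rewards, because the probe explains the part of the reward that is predictable from the hidden state. Whether the same holds for real activations, and how the probe is actually trained, is for the paper to establish.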

Modelwire context

Explainer

The key move here is not just efficiency. Because the baseline comes from the actor's own internal representations rather than an external critic, POISE avoids the value misalignment problem, in which a separate network's estimates diverge from the policy being trained. That divergence is a known source of instability in PPO-style setups for long-horizon reasoning tasks.
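The contrast can be written out in standard policy-gradient notation. The first two estimators are simplified textbook forms of the critic-based and group-based baselines. The third is only an assumed form for a probe-based baseline, with hypothetical symbols g_w for the probe and h_theta(s) for the actor's hidden state at prompt or prefix s; the paper may define it differently.

```latex
% Separate critic (PPO-style, simplified): a second network V_\phi
\hat{A}^{\text{critic}} = R - V_\phi(s)

% Group baseline (GRPO-style): G rollouts sampled for the same prompt
\hat{A}^{\text{group}}_i =
  \frac{R_i - \operatorname{mean}(R_1, \dots, R_G)}
       {\operatorname{std}(R_1, \dots, R_G)}

% Probe on the actor's own hidden state (assumed form, hypothetical symbols)
\hat{A}^{\text{probe}} = R - g_w\bigl(h_\theta(s)\bigr)

% Why a baseline b(s) can be swapped freely: if it does not depend on the
% sampled action a, it contributes nothing to the expected gradient.
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
  \bigl[\, b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\bigr]
  = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta 1
  = 0
```

The last identity holds only when the baseline is treated as a constant with respect to the sampled action, for example when the probe reads a hidden state computed before that action is drawn.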

This connects directly to the thread running through recent coverage of reasoning model internals. The CIKA paper from the same day ('Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery') also treats a model's internal states as a diagnostic surface, asking what those states actually encode about concept mastery versus surface correlation. POISE takes a complementary angle: rather than reading internal states for interpretability, it reads them to reduce training variance. Together, these papers suggest a broader shift toward treating hidden activations as first-class signals in the training loop, not just post-hoc analysis artifacts. The Transformer parameterization work from the same period ('Revisiting Transformer Layer Parameterization Through Causal Energy Minimization') adds further context, since principled use of internal structure is a theme across all three.

The real test is whether POISE's probe generalizes across model families without retraining. If the lightweight probe transfers to a held-out architecture with comparable variance reduction, the internal-state approach is robust. If it requires per-model recalibration, the efficiency gains shrink considerably in multi-model deployment settings.

Coverage we drew on

Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Large Reasoning Models · PPO · GRPO · POISE

Read the full story at arXiv cs.LG (arxiv.org)
