UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Researchers propose UDM-GRPO, described as the first framework to integrate Uniform Discrete Diffusion Models (UDMs) with reinforcement learning. To keep training stable, the method treats final samples as the actions and reconstructs diffusion trajectories so that they align with the pretraining distribution. It also introduces efficiency strategies, and the approach is reported to outperform a naive integration of GRPO.
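For readers unfamiliar with GRPO, the "group relative" part can be illustrated with a minimal sketch: several final samples are generated for the same prompt, each is scored, and every reward is normalized against its own group. This is the commonly used GRPO advantage, not necessarily the exact formulation in the paper; the function name `group_relative_advantages`, the `eps` parameter, and the reward values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    Illustrative only: a generic GRPO-style advantage, assumed here and
    not taken from the UDM-GRPO paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four final samples drawn from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples scoring above the group mean receive a positive advantage and those below receive a negative one, so no separate learned value function is needed as a baseline.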
Mentions: UDM-GRPO · Uniform Discrete Diffusion Model · GRPO · Reduced-Step · CFG-Free
Read the full story at arXiv cs.LG (arxiv.org)