Research Models & Releases · arXiv cs.CL · Jun 26

HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

Researchers propose HPRO, a hierarchical framework that addresses a basic tension in LLM-based text-to-speech: balancing semantic fidelity against emotional prosody. In preference-driven optimization, content and emotion objectives can produce conflicting gradients, which leads to reward hacking and degraded output. HPRO decouples these signals across hierarchical levels and adds frame-level guidance, targeting the reason current TTS models default to emotionally flat, averaged outputs despite training on preference data. For anyone building expressive speech systems, the work shows how naive reward alignment can backfire in multi-objective generation tasks.

Modelwire context

Explainer

The paper's actual contribution is narrower than it might appear: it's not that LLM-based TTS can't do emotion, but that standard preference optimization creates conflicting gradients that push models toward safe, emotionally muted outputs. The fix is hierarchical decoupling, not a new architecture.
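The gradient conflict described above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not HPRO's method or its reward functions: the two quadratic losses, their optima (`CONTENT_OPT`, `EMOTION_OPT`), and the helper names are all hypothetical stand-ins chosen only to show how descending on a plain sum of two competing objectives settles on a compromise that is optimal for neither.

```python
import math

# Toy stand-ins (assumptions, not HPRO's actual rewards): two quadratic
# losses over a 2-D parameter vector, each with its own optimum.
CONTENT_OPT = (1.0, 0.0)    # hypothetical optimum for semantic fidelity
EMOTION_OPT = (-1.0, 0.5)   # hypothetical optimum for emotional prosody


def grad(theta, target):
    """Gradient of 0.5 * ||theta - target||^2 with respect to theta."""
    return tuple(t - g for t, g in zip(theta, target))


def cosine(u, v):
    """Cosine similarity between two vectors; a negative value means conflict."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def descend_on_sum(theta, steps=200, lr=0.1):
    """Plain gradient descent on the unweighted sum of both losses."""
    for _ in range(steps):
        g_content = grad(theta, CONTENT_OPT)
        g_emotion = grad(theta, EMOTION_OPT)
        theta = tuple(
            t - lr * (gc + ge) for t, gc, ge in zip(theta, g_content, g_emotion)
        )
    return theta


start = (0.5, 1.0)
conflict = cosine(grad(start, CONTENT_OPT), grad(start, EMOTION_OPT))
final = descend_on_sum(start)

print(f"gradient cosine at start: {conflict:.3f}")
print(f"converged parameters: ({final[0]:.3f}, {final[1]:.3f})")
```

In this toy setting the two gradients point in partly opposing directions (negative cosine), and the summed objective converges to the midpoint between the two optima. That midpoint is the analogue of the "safe, averaged" output the explainer describes; how HPRO's hierarchical decoupling avoids it is specified in the paper, not here.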

This work mirrors the mechanistic insight from the Vision-Language Models causal study published the same day. Both papers identify that when a model faces competing objectives (visual grounding vs. knowledge retrieval there; semantic fidelity vs. prosody here), the training signal doesn't naturally resolve the conflict. Rather than accepting averaged or hedged behavior, the VLM paper showed how to map the sparse attention heads that control arbitration between the objectives. HPRO takes a different path: it structures the optimization itself to prevent the conflict from arising. Both assume the model can learn both capabilities; the question is whether the training process lets it.

If HPRO's hierarchical approach generalizes to other multi-objective TTS tasks (e.g., speaker identity plus emotion, or style plus intelligibility), that would support the claim that the decoupling principle is robust. If performance gains collapse when the model is tested on out-of-distribution emotion labels not seen during preference annotation, that would suggest the method is memorizing the training distribution rather than solving the underlying optimization problem.
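The out-of-distribution check described above amounts to holding out whole emotion labels before evaluation. The following is a rough sketch of such a split, assuming a simple list of records with an `emotion` field; the field names, utterance IDs, and label set are hypothetical and are not taken from the paper's data or evaluation protocol.

```python
def split_by_held_out_emotions(samples, held_out):
    """Split samples into (seen-emotion, held-out-emotion) lists."""
    held_out = set(held_out)
    seen = [s for s in samples if s["emotion"] not in held_out]
    unseen = [s for s in samples if s["emotion"] in held_out]
    return seen, unseen


# Hypothetical evaluation records; a real set would come from the dataset.
samples = [
    {"id": "utt-001", "emotion": "happy"},
    {"id": "utt-002", "emotion": "sad"},
    {"id": "utt-003", "emotion": "surprised"},
    {"id": "utt-004", "emotion": "happy"},
]

seen, unseen = split_by_held_out_emotions(samples, held_out={"surprised"})
print(len(seen), "in-distribution;", len(unseen), "held-out")
```

Comparing a model's scores on the two subsets would then indicate whether gains persist on emotions absent from preference annotation.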

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: HPRO · HD-Emo codec · LLM-based TTS
