Research Models & Releases · arXiv cs.CL · May 6

StoryAlign: Evaluating and Training Reward Models for Story Generation

Researchers have identified a critical gap in how reward models evaluate narrative quality, introducing StoryRMB, the first benchmark specifically designed to measure human preference alignment in story generation. The work reveals that existing reward models fail to capture what makes stories compelling to readers, a limitation that directly impacts RLHF training pipelines for narrative tasks. This matters because story generation represents a frontier for testing whether LLMs can handle subjective, structurally complex outputs beyond factual text, and effective preference modeling here could unlock better training methods for other creative domains.

Modelwire context

Explainer

The paper does more than show that reward models fail on stories: it proposes StoryAlign as a training methodology, alongside the StoryRMB benchmark. That distinction matters because prior work (Themis, FinSafetyBench) focused on evaluation and red-teaming, whereas this work closes the loop by showing how to use preference data to improve reward model (RM) performance on narrative tasks.

This extends the pattern established by Themis (code reward models, May 1) and the multilingual safety work (ML-Bench, May 1) by asking the same core question across a new domain: can we build reward models that genuinely capture human preference in a subjective, structurally complex task? Where those benchmarks exposed gaps in existing RMs, StoryAlign goes further by proposing a training fix. The work also echoes the goblin incident (OpenAI ChatGPT misalignment, May 1) in reverse: instead of showing how bad reward signals produce artifacts, it shows how to engineer better signals. The constraint-based reasoning paper (Structure Liberates, May 1) is adjacent but distinct; that work scaffolds LLM ideation, whereas this scaffolds how we train models to evaluate ideation.

If StoryAlign-trained reward models show preference alignment gains that hold up with held-out human raters from a different demographic or cultural background than the training set, that would be good evidence the approach is robust. If the gains collapse on out-of-distribution stories (e.g., a genre shift from literary fiction to fan fiction), that would signal the method is overfitting to the benchmark's implicit narrative assumptions and would undermine claims about generalized preference learning.
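One way to make that check concrete is to report pairwise preference accuracy separately for each slice of the evaluation data, so a drop on an out-of-distribution genre or rater pool is visible instead of being averaged away. The sketch below is illustrative only and is not taken from the paper: the `PreferencePair` record, the `group` label, and the `toy_score` function are hypothetical stand-ins, and a real reward model would replace the scorer.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple


class PreferencePair(NamedTuple):
    """One human preference judgment (hypothetical record layout)."""
    prompt: str
    chosen: str    # story the human rater preferred
    rejected: str  # story the human rater did not prefer
    group: str     # slice label, e.g. a genre or a rater pool


def pairwise_accuracy_by_group(
    pairs: Iterable[PreferencePair],
    score: Callable[[str, str], float],
) -> Dict[str, float]:
    """Fraction of pairs, per group, where the scorer ranks the
    human-preferred story strictly above the rejected one."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for pair in pairs:
        totals[pair.group] += 1
        if score(pair.prompt, pair.chosen) > score(pair.prompt, pair.rejected):
            hits[pair.group] += 1
    return {group: hits[group] / totals[group] for group in totals}


def toy_score(prompt: str, story: str) -> float:
    # Placeholder scorer (assumption): longer stories score higher.
    # This is NOT a real reward model; swap in an actual one.
    return float(len(story))


pairs = [
    PreferencePair("p1", "a long literary story", "short", "literary"),
    PreferencePair("p2", "tiny", "a much longer fan story", "fan_fiction"),
    PreferencePair("p3", "another long one", "brief", "literary"),
]

acc = pairwise_accuracy_by_group(pairs, toy_score)
for group, value in sorted(acc.items()):
    print(f"{group}: {value:.2f}")
```

A large gap between the in-distribution slice and a held-out slice would be the overfitting signal described above; comparable accuracy across slices would support the robustness reading.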

Coverage we drew on

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
