Why Expert Alignment Is Hard: Evidence from Subjective Evaluation

A new study reveals why aligning language models to expert judgment fails in subjective domains. Researchers found that expert disagreement, implicit evaluation criteria, and shifting standards create misalignment that explicit instructions cannot resolve. The work suggests that current RLHF and preference-learning approaches may be fundamentally limited when experts lack consensus, reshaping how teams should think about training objectives for open-ended tasks like writing, reasoning, and creative work.

Modelwire context


The paper's sharpest contribution is not that experts disagree (that is known) but that the disagreement is often irreducible. Experts apply different implicit criteria that cannot be surfaced or reconciled through better annotation instructions, so the problem is not upstream of RLHF; it is baked into the objective itself.

This connects directly to two threads Modelwire has been tracking. The 'Misaligned by Reward' paper from May 6th showed that reward models are already failing to capture socially desirable behavior even in domains where criteria seem clear. If reward models struggle when standards are legible, this new work suggests the situation is structurally worse in open-ended domains where standards are contested. Separately, Anthropic's sycophancy findings (covered via Simon Willison, May 3rd) showed domain-specific alignment failures in spirituality and relationships, precisely the kinds of subjective, value-laden spaces this paper identifies as resistant to expert consensus. Together, these three papers sketch a coherent picture: RLHF pipelines are being asked to optimize signals that are noisy at best and incoherent at worst.

Watch whether any major lab publishes a revised preference-collection methodology for creative or reasoning tasks within the next two quarters. If labs continue to rely on standard annotator agreement thresholds without addressing divergence in implicit criteria, this paper's critique will remain unaddressed in production systems.
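To make the point about agreement thresholds concrete, here is a minimal sketch of how a pooled agreement score can clear a threshold while one slice of the data shows systematic disagreement. It is not taken from the paper: the labels, the "factual" and "creative" slices, and the 0.6 threshold are all hypothetical, and Cohen's kappa is used only as a familiar stand-in for whatever agreement statistic a given pipeline applies.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Fraction of items on which the two annotators gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical preference labels ("A" or "B" = which response was preferred).
# 24 items with clear criteria, where the two experts always agree.
factual_1 = ["A"] * 12 + ["B"] * 12
factual_2 = ["A"] * 12 + ["B"] * 12
# 6 open-ended items, where the experts apply different implicit criteria.
creative_1 = ["A", "A", "A", "B", "B", "B"]
creative_2 = ["B", "B", "A", "A", "B", "A"]

THRESHOLD = 0.6  # hypothetical acceptance threshold, not from the paper

pooled_kappa = cohen_kappa(factual_1 + creative_1, factual_2 + creative_2)
creative_kappa = cohen_kappa(creative_1, creative_2)

print(f"pooled kappa:   {pooled_kappa:.2f} (passes: {pooled_kappa >= THRESHOLD})")
print(f"creative kappa: {creative_kappa:.2f} (passes: {creative_kappa >= THRESHOLD})")
```

On this toy data the pooled kappa is about 0.73, which passes the threshold, while the kappa on the open-ended slice is about -0.33, which is worse than chance. A single aggregate threshold would accept the whole batch, including the items where the preference signal is least coherent.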

Coverage we drew on

Misaligned by Reward: Socially Undesirable Preferences in LLMs · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Large Language Models · RLHF · Expert Alignment

Read full story at arXiv cs.CL (arxiv.org)

