Modelwire

Three Models of RLHF Annotation: Extension, Evidence, and Authority


A new framework unpacks the philosophical foundations of RLHF annotation by distinguishing three competing models of the role human judgment plays in LLM alignment. The extension model treats annotators as proxies for designer intent, the evidence model treats them as independent oracles on facts or values, and the authority model grants them representative power over outputs. These distinctions carry concrete implications for pipeline design, annotation collection, and result aggregation. The work matters because current RLHF practice rarely makes these assumptions explicit, leaving teams vulnerable to misaligned incentives and conflicting validation logic downstream.

Modelwire context

Explainer

The paper's sharpest contribution isn't the taxonomy itself but the diagnosis: most RLHF teams are already committed to one of these models without knowing it, which means their validation logic and their collection logic often pull in opposite directions as a structural feature of the pipeline rather than by accident.

This connects most directly to the 'paradox of AI fluency' coverage from late April, which showed that user sophistication shapes model interaction in ways that standard evaluation frameworks don't capture. That finding is partly a symptom of the problem this paper names: if annotators are treated as proxies for designer intent (the extension model) but the actual user population behaves like skilled collaborators iterating on hard problems, the reward signal is calibrated to the wrong population entirely. The fluency paradox piece documented the gap at the product layer; this paper offers a vocabulary for tracing that gap back to annotation design choices made much earlier in the pipeline.

Watch whether, within the next 12 months, any major RLHF tooling providers (Scale AI, Surge, or internal teams at frontier labs) publish annotation guidelines that explicitly name which model they operate under. Adoption of the vocabulary in practitioner documentation would be a concrete signal that the framework is doing real work rather than staying inside academic discourse.


This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: RLHF · LLM


Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
