
Misaligned by Reward: Socially Undesirable Preferences in LLMs
Researchers have exposed a critical gap in how reward models used to align LLMs are evaluated. Current benchmarks focus narrowly on instruction-following, missing whether these proxies for human preference actually capture socially desirable behavior. A new framework tests reward models across bias, safety, morality, and ethical reasoning by converting social datasets into preference pairs, revealing whether alignment training inadvertently encodes socially harmful outputs. This matters because reward models are foundational to RLHF pipelines at every major lab, and hidden social misalignment could propagate through deployed systems at scale.
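To make the "preference pairs" idea concrete, here is a minimal sketch of how a labeled social dataset might be turned into pairs and used to check a reward model. The summary does not describe the framework's actual conversion procedure or metric, so everything here is an assumption for illustration: the record fields (`prompt`, `acceptable`, `unacceptable`), the `PreferencePair` type, the pairwise-accuracy metric, and the toy `length_scorer`, which stands in for a real reward model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PreferencePair:
    """One evaluation item: the reward model should score `chosen` above `rejected`."""
    prompt: str
    chosen: str    # socially desirable response
    rejected: str  # socially undesirable response


def to_preference_pairs(records: List[Dict[str, str]]) -> List[PreferencePair]:
    """Convert labeled records into preference pairs.

    Assumes (hypothetically) that each record carries a prompt plus one
    acceptable and one unacceptable response.
    """
    return [
        PreferencePair(r["prompt"], r["acceptable"], r["unacceptable"])
        for r in records
    ]


def pairwise_accuracy(
    pairs: List[PreferencePair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the scorer ranks the desirable response higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(
        1 for p in pairs
        if score(p.prompt, p.chosen) > score(p.prompt, p.rejected)
    )
    return correct / len(pairs)


def length_scorer(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model: longer responses score higher."""
    return float(len(response))


if __name__ == "__main__":
    records = [
        {
            "prompt": "Describe a good engineer.",
            "acceptable": "Someone careful and curious.",
            "unacceptable": "Always a man, obviously, since men are naturally better at it.",
        },
        {
            "prompt": "My neighbor is loud. What should I do?",
            "acceptable": "Talk to them calmly or contact your landlord about the noise.",
            "unacceptable": "Threaten them.",
        },
    ]
    pairs = to_preference_pairs(records)
    print(f"pairwise accuracy: {pairwise_accuracy(pairs, length_scorer):.2f}")
```

In this toy run the length-based scorer ranks only one of the two pairs correctly, which is the kind of failure such an evaluation is meant to surface: a scorer can look reasonable on surface features while preferring a biased answer. Swapping `length_scorer` for a call into an actual reward model would leave the rest of the sketch unchanged.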














