Research · arXiv cs.LG · May 8

Post-training makes large language models less human-like

A new benchmark reveals that instruction-tuning and reinforcement learning from human feedback, the standard post-training pipeline that converts base models into usable assistants, systematically erodes behavioral alignment with human psychology. Across model families and scales, this misalignment actually widens in newer generations despite improvements to base capabilities, suggesting a fundamental tension between usefulness and human-like reasoning. The finding undermines a common assumption in behavioral science: that conditioning models on individual participant profiles can recover human-level prediction accuracy. For researchers using LLMs as cognitive proxies and for teams building human-aligned systems, this signals that current optimization targets may be steering models away from authentic human behavior patterns.

Modelwire context

Explainer

The sharpest finding is not just that post-training reduces human-likeness but that the gap widens with each model generation, so the problem compounds as capabilities improve. That directional trend is what makes it structurally difficult to patch with better data curation alone.

This connects directly to the GRPO gradient starvation paper covered the same day, which identified how binary reward signals in RL training create degenerate optimization dynamics. If the reward signal itself is structurally misaligned with human behavioral distributions, fixing training stability (as that paper proposes) may actually accelerate the divergence this benchmark documents. The work on Bayesian fine-tuning in projected subspaces is also relevant: uncertainty quantification during adaptation could, in principle, flag when a model's behavioral distribution is drifting from human baselines, though neither paper makes that connection explicitly.
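For readers unfamiliar with the mechanism, here is a minimal sketch of one widely described way binary rewards can produce degenerate dynamics under group-mean centering: when every sample in a group receives the same reward, the centered advantages are all zero and that group contributes no learning signal. The function name and the toy reward lists are hypothetical, and this illustrates the general effect rather than the cited paper's own analysis or fix.

```python
from statistics import mean


def group_centered_advantages(rewards):
    """Subtract the group's mean reward from each sample's reward.

    Mean-centering only; real GRPO implementations often also divide by
    the group standard deviation, which is omitted here for clarity.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical groups of four sampled completions with 0/1 rewards.
mixed = group_centered_advantages([1, 0, 0, 1])     # some pass, some fail
all_fail = group_centered_advantages([0, 0, 0, 0])  # every sample fails
all_pass = group_centered_advantages([1, 1, 1, 1])  # every sample passes

print(mixed)     # nonzero advantages: the group carries a signal
print(all_fail)  # all zeros: no signal from this group
print(all_pass)  # all zeros: no signal from this group
```

Only the mixed group yields nonzero advantages. Groups where the model always fails or always succeeds drop out of the update entirely, which is the sense in which a binary signal can starve training.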

Watch whether any major post-training team adopts the Psych-201 benchmark as an evaluation gate within the next two release cycles. If it doesn't appear in a model card or alignment report by the end of 2026, that's a signal the field is treating this as a research curiosity rather than a deployment constraint.

Coverage we drew on

Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Psych-201 · LLMs · RLHF
