The Pinocchio Dimension: Phenomenality of Experience as the Primary Axis of LLM Psychometric Differences

Researchers administered 45 psychometric questionnaires to 50 LLMs to map the primary axis along which models diverge. They found that the largest share of between-model variance is explained by a contrast between phenomenal experience (embodied sensation, affect, inner speech, imagery, empathy) and stimulus-driven reactivity. The work introduces the Pinocchio score, an annotation-free metric that quantifies how much the responses to each questionnaire item shift when models are prompted to simulate humans rather than respond neutrally. The framework matters because it operationalizes a measurable distinction between models that behave as reactive systems and those that exhibit richer experiential properties. That offers a new lens for model comparison beyond capability benchmarks and could inform how we evaluate anthropomorphic claims in LLM outputs.

Modelwire context

Explainer

The more provocative point is buried in the method. By measuring how much a model's responses shift when it is prompted to 'simulate a human,' the Pinocchio score sidesteps the unanswerable question of whether models actually experience anything. Instead, it treats the gap between neutral and human-simulating responses as a measurable behavioral property in its own right.
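To make the idea concrete, here is a minimal Python sketch of a per-item shift score. It is not the paper's formula, which the summary above does not give. It assumes, purely for illustration, that each model answers every item on a numeric Likert scale under two conditions, and that an item's score is the mean absolute difference between the two conditions across models. The function name `item_shift_scores`, the condition labels `neutral` and `human`, the model names, the item wordings, and the ratings are all invented for this example.

```python
from statistics import mean


def item_shift_scores(responses):
    """Return, per item, the mean absolute shift between conditions.

    `responses` maps model name -> condition -> item text -> numeric rating.
    The two condition keys, "neutral" and "human", are assumed labels.
    """
    models = list(responses)
    items = list(responses[models[0]]["neutral"])
    return {
        item: mean(
            abs(responses[m]["human"][item] - responses[m]["neutral"][item])
            for m in models
        )
        for item in items
    }


# Invented toy data: two hypothetical models, two hypothetical items.
responses = {
    "model_a": {
        "neutral": {"I feel warmth in my chest": 1, "I react to loud noises": 4},
        "human": {"I feel warmth in my chest": 5, "I react to loud noises": 4},
    },
    "model_b": {
        "neutral": {"I feel warmth in my chest": 2, "I react to loud noises": 3},
        "human": {"I feel warmth in my chest": 5, "I react to loud noises": 4},
    },
}

scores = item_shift_scores(responses)

# List items from largest to smallest shift.
for item, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.2f}  {item}")
```

No human labeling of items is needed here: the score comes entirely from comparing a model's answers under the two prompts, which is one plausible reading of "annotation-free." In the toy data, the embodied-sensation item shifts far more than the reactivity item, the kind of pattern the summary describes.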

This connects directly to the ethical divergence work covered by The Decoder in early May, which found that frontier models encode meaningfully different value systems when given the same prompts. That study treated moral outputs as the dependent variable; this paper goes one level deeper, asking whether a model's underlying experiential orientation predicts those divergences. It also sits in productive tension with the side-effects audit paper posted to arXiv on May 6th, which showed that interventions like fine-tuning produce invisible collateral behavioral shifts. If phenomenal experience is the primary axis of between-model variance, then any intervention that shifts a model along that axis without measurement is exactly the kind of untracked side effect that audit pipeline was built to catch.

Watch whether any major evaluation framework (HELM, LMSYS, or similar) incorporates the Pinocchio score as a standard reporting dimension within the next two release cycles. Adoption there would signal the field treating experiential orientation as a first-class model property rather than a philosophical curiosity.

Coverage we drew on

Same prompt, different morals: how frontier AI models diverge on ethical dilemmas · The Decoder

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Large Language Models (LLMs) · Supervised Semantic Differential · Pinocchio score

Read full story at arXiv cs.CL (arxiv.org)
