Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Researchers tested whether persona-based prompting actually diversifies LLM outputs in urban perception tasks, finding that agents reliably reproduce behavior within a persona but show minimal variation across personas. The work exposes a critical gap between the intuitive appeal of persona prompting and its practical effect, suggesting that LLMs may converge on similar judgments regardless of demographic framing. This matters for practitioners deploying multimodal models as proxies for human diversity in social science and urban planning, where persona-driven differentiation is often assumed but not validated.

Modelwire context

Explainer

The deeper problem isn't just that persona prompting fails to diversify outputs. It's that the field has been using this technique as a substitute for actual human survey data in high-stakes planning contexts, often without any validation step like the one this paper finally runs.

This connects to a broader pattern Modelwire has been tracking around LLMs being deployed as measurement instruments rather than generative tools. The PLOS and DataSeer work covered here on April 30 (the scholarly data reuse paper) shows a parallel dynamic: when LLMs are used to quantify real-world behavior at scale, the assumptions baked into the methodology determine whether the output reflects reality or just reflects the model's priors. In both cases, the risk is that practitioners trust the output because it looks systematic. The urban perception paper makes that risk concrete: if persona-framed agents converge on similar judgments regardless of demographic input, then any study using them to simulate diverse public opinion is measuring model bias, not human variation.
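One rough way to picture the kind of validation check at issue is to ask how much of the spread in ratings is attributable to persona at all. The sketch below is an illustration only, not the paper's method: the function name, the persona labels, and the ratings are all hypothetical, and it uses a plain between-group share of variance (eta-squared) as a stand-in for whatever statistics the authors actually report.

```python
from statistics import mean


def persona_variance_share(ratings_by_persona):
    """Share of total rating variance explained by persona (eta-squared).

    ratings_by_persona maps a persona label to the ratings that persona's
    agent gave across repeated runs on the same images (hypothetical data).
    """
    all_ratings = [r for runs in ratings_by_persona.values() for r in runs]
    grand_mean = mean(all_ratings)
    # Total spread of every rating around the overall mean.
    ss_total = sum((r - grand_mean) ** 2 for r in all_ratings)
    # Spread of the persona means around the overall mean, weighted by run count.
    ss_between = sum(
        len(runs) * (mean(runs) - grand_mean) ** 2
        for runs in ratings_by_persona.values()
    )
    if ss_total == 0:
        return 0.0  # every rating identical: persona explains nothing
    return ss_between / ss_total


# Hypothetical ratings: personas that converge on the same judgment.
converged = {
    "persona_a": [4, 4, 5, 4],
    "persona_b": [4, 5, 4, 4],
    "persona_c": [4, 4, 4, 5],
}

# Hypothetical ratings: personas that genuinely differ from one another.
differentiated = {
    "persona_a": [1, 1, 2, 1],
    "persona_b": [3, 3, 3, 4],
    "persona_c": [5, 5, 4, 5],
}

print(persona_variance_share(converged))
print(persona_variance_share(differentiated))
```

A share near zero means the persona label is doing almost no work, which is the pattern the paper describes. A share near one would indicate that personas, rather than run-to-run noise, drive the differences.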

Watch whether urban planning or social science venues that have already published persona-agent studies issue methodological caveats or replication checks in the next two conference cycles. If they don't, this paper's findings are being absorbed slowly enough that the flawed methodology will compound in the literature before correction arrives.

Coverage we drew on

Measuring research data reuse in scholarly publications using generative artificial intelligence: Open Science Indicator development and preliminary results · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: PerceptSent dataset · multimodal LLMs

Read the full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.