Language Models as Measurement Apparatus for Culture

A new paper reframes how language models measure culture, arguing that the measurement apparatus itself shapes the cultural phenomena it claims to observe rather than neutrally recording them. Using Karen Barad's agential realism framework, the work demonstrates that model architecture, training data, and annotation choices constitute contingent boundaries between instrument and phenomenon. The research challenges the assumption that LLMs passively quantify culture, showing instead that models have already internalized the cultural material they measure, making measurement inherently entangled with construction. This matters for practitioners building cultural measurement systems: design choices carry epistemological weight and cannot be treated as implementation details.

Modelwire context

Explainer

The paper's core move is to invert the measurement problem: instead of asking 'how well do LLMs measure culture?', it asks 'what does the act of measurement itself construct?' The reframing matters because it treats design choices as constitutive rather than neutral, which is a different claim from saying that models have biases.

This connects directly to the MSQA benchmark work from last week, which found that cultural competence degrades sharply in multilingual models despite language fluency. That paper showed the symptom (performance gaps on culturally grounded questions); this one provides the theoretical apparatus for understanding why: the model architecture and training pipeline don't passively discover culture, they actively shape what counts as measurable culture in the first place. The Taboo game study also fits here, since it demonstrates how inference-time constraints reshape model behavior, suggesting that measurement and construction are entangled at every layer, not just training.

If researchers using this framework build cultural measurement systems that explicitly document their boundary-drawing choices (annotation schemes, data selection rationale, architectural decisions), and downstream practitioners who adopt those systems produce more reproducible cultural inferences than teams using 'standard' LLM approaches, the practical claim is validated. If the framework stays theoretical and does not change how cultural benchmarks are actually built in the next 12 months, it remains a useful critique but is not yet actionable.
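To make "explicitly document their boundary-drawing choices" concrete, here is a minimal sketch of what such a record might look like. It is not taken from the paper: the `MeasurementRecord` class, its field names, and the example values are all hypothetical, chosen only to mirror the three kinds of choice named above.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class MeasurementRecord:
    """Hypothetical record of the choices that define a cultural measurement."""
    annotation_scheme: str          # how labels were defined and by whom
    data_selection_rationale: str   # why sources were included or excluded
    architectural_decisions: str    # model family, prompting setup, and so on

    def missing_fields(self) -> list[str]:
        """Return the names of any choices left undocumented (blank)."""
        return [name for name, value in asdict(self).items() if not value.strip()]


# Illustrative placeholder values, not drawn from any real system.
record = MeasurementRecord(
    annotation_scheme="Three annotators per item, following a written guideline",
    data_selection_rationale="",
    architectural_decisions="Decoder-only model, zero-shot prompts in English",
)

# Serialize the record so it can ship alongside the benchmark or results.
report = json.dumps(asdict(record), indent=2)

# Flag any choice that was left blank before results are published.
undocumented = record.missing_fields()
```

A check like `missing_fields` is one way a team could stop results from being published while a choice is still undocumented. Whether that improves reproducibility is exactly the open question the paragraph above poses.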

Coverage we drew on

MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Karen Barad · Language models · NLP

Read the full story at arXiv cs.CL (arxiv.org)
