Research Tools & Code·arXiv cs.CL·Apr 26

AIPsy-Affect: A Keyword-Free Clinical Stimulus Battery for Mechanistic Interpretability of Emotion in Language Models


Mechanistic interpretability research on LLM emotion has faced a fundamental confound: a probe trained on phrases like 'I am furious' cannot tell whether it has found an anger circuit or is simply recognizing emotion keywords. Researchers have released AIPsy-Affect, a 480-item clinical battery of narrative vignettes that evoke Plutchik's eight primary emotions through situational context alone, removing keyword bias at the stimulus level. This addresses a methodological gap in activation patching, SAE feature analysis, and steering vector work, and it supports cleaner causal claims about how models represent and process affect. The resource matters for anyone building interpretability tools or making claims about emotion-related model behavior.
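To make the keyword confound concrete, here is a minimal sketch of a stimulus-level leakage check. It is not the authors' method or tooling: the lexicon, the function names, and the example vignette are all hypothetical and invented for illustration. Only the eight category names come from Plutchik's model.

```python
import re

# Hypothetical mini-lexicon keyed by Plutchik's eight primary emotions.
# A real screen would need a far larger, validated word list.
EMOTION_LEXICON = {
    "joy": {"joy", "joyful", "happy", "delighted"},
    "trust": {"trust", "trusting"},
    "fear": {"fear", "afraid", "scared", "terrified"},
    "surprise": {"surprise", "surprised", "astonished"},
    "sadness": {"sad", "sadness", "grief", "heartbroken"},
    "disgust": {"disgust", "disgusted", "revolted"},
    "anger": {"anger", "angry", "furious", "enraged"},
    "anticipation": {"anticipation", "eager", "expectant"},
}


def leaked_keywords(text):
    """Return {emotion: sorted matching words} for lexicon words found in text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    hits = {}
    for emotion, words in EMOTION_LEXICON.items():
        found = tokens & words
        if found:
            hits[emotion] = sorted(found)
    return hits


def is_keyword_free(text):
    """True when no lexicon word appears anywhere in the stimulus."""
    return not leaked_keywords(text)


# Keyword-laden stimulus of the kind the summary warns about.
labelled = "I am furious"

# Invented situational vignette (NOT an item from AIPsy-Affect).
vignette = (
    "She opened the envelope, read the first line twice, "
    "and set her keys back down on the table."
)

print(leaked_keywords(labelled))   # the word 'furious' is flagged under anger
print(is_keyword_free(vignette))   # no lexicon word appears in the vignette
```

A stimulus that passes a check like this can still carry emotional content through context alone, which is the property a keyword-free battery relies on. A probe that performs well on such items is harder to explain away as keyword matching.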

Modelwire context

Explainer

The battery's clinical framing is worth pausing on: these are narrative vignettes drawn from a psychometric tradition, so the validity bar is borrowed from human psychological assessment rather than invented for NLP convenience. That lineage gives the resource an external credibility that most LLM benchmarks lack, but it also carries an implicit assumption that models process situational emotional context the way humans do, which remains unproven.

The keyword-confound problem AIPsy-Affect targets sits directly adjacent to the challenge covered in 'Modeling Induced Pleasure through Cognitive Appraisal Prediction via Multimodal Fusion' from the same day, where researchers similarly argued that surface-level sentiment classification misses the appraisal layer that actually drives emotional response. Both papers are pushing the field toward stimulus and representation designs grounded in cognitive theory rather than label convenience. Together they suggest a quiet methodological turn in affective AI: the community is growing uncomfortable with proxies and is reaching back into psychology for more principled scaffolding.

The real test is whether interpretability teams running SAE feature analysis or activation patching on frontier models adopt AIPsy-Affect as a standard stimulus set within the next two release cycles. If it stays confined to the paper's own experiments, the battery is a contribution without uptake; if external groups cite it as a control condition, the methodological argument has landed.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: AIPsy-Affect · Plutchik · sparse autoencoders · linear probing · activation patching

Read full story at arXiv cs.CL (arxiv.org)
