Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs

Researchers have identified and mapped emotion vectors across open-weight LLMs, replicating earlier findings from Claude while uncovering architectural differences in how models encode emotional valence. Apertus-8B and Gemma-4-E4B both exhibit valence geometry comparable to proprietary systems, but they diverge sharply in layer-wise representation patterns: Gemma concentrates emotional encoding in early layers, then discards it deeper in the network. This work extends mechanistic interpretability beyond closed models. It also raises the question of whether emotional structure emerges necessarily from scale or reflects deliberate architectural choices, a question relevant to both safety research and model design.

Modelwire context

Explainer

The more consequential finding is not that emotion vectors exist in open models but that Gemma's architecture appears to actively discard emotional encoding in deeper layers. That suggests emotional structure may be an early-layer heuristic that later processing overrides, rather than a stable representational feature throughout the network.
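To make the idea of emotional encoding that is strong in early layers and absent in deeper ones more concrete, here is a minimal sketch on synthetic data. It is not the paper's method, data, or result. The difference-of-means direction and the separation score are common probing choices assumed here for illustration, and `make_synthetic_layers` and the `signal_per_layer` profile are hypothetical stand-ins for real per-layer hidden states.

```python
import numpy as np

def difference_of_means_vector(pos, neg):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    direction = pos.mean(axis=0) - neg.mean(axis=0)
    return direction / np.linalg.norm(direction)

def separation_score(pos, neg, direction):
    """Gap between class means along `direction`, in pooled standard deviations."""
    p, n = pos @ direction, neg @ direction
    pooled_std = np.sqrt((p.var() + n.var()) / 2)
    return (p.mean() - n.mean()) / pooled_std

def make_synthetic_layers(signal_per_layer, n=200, dim=64, seed=0):
    """Hypothetical stand-in for hidden states: Gaussian noise plus a
    valence offset along one axis, with the offset size set per layer."""
    rng = np.random.default_rng(seed)
    axis = rng.normal(size=dim)
    axis /= np.linalg.norm(axis)
    layers = []
    for strength in signal_per_layer:
        pos = rng.normal(size=(n, dim)) + strength * axis  # "positive valence" samples
        neg = rng.normal(size=(n, dim)) - strength * axis  # "negative valence" samples
        layers.append((pos, neg))
    return layers

# Made-up profile (an assumption, not a measurement): strong signal early, none at depth.
signal_per_layer = [3.0, 2.5, 1.5, 0.5, 0.0]
layers = make_synthetic_layers(signal_per_layer)

scores = []
for pos, neg in layers:
    half = len(pos) // 2
    # Fit the direction on the first half, score on the held-out second half.
    direction = difference_of_means_vector(pos[:half], neg[:half])
    scores.append(separation_score(pos[half:], neg[half:], direction))

for layer_index, score in enumerate(scores):
    print(f"layer {layer_index}: held-out separation = {score:.2f}")
```

With real models, the synthetic arrays would be replaced by hidden states collected at each layer for positively and negatively valenced prompts. A score that falls toward zero with depth would be one way to quantify a discard pattern of the kind described above, under the assumption that valence is linearly encoded.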

This sits inside a growing cluster of mechanistic interpretability work that is quietly shifting how the field thinks about what models actually encode versus what they output. The connection to our coverage of 'Towards Explainable Adjudicative Variance' is instructive: both papers are probing internal model structure to explain behavior that benchmarks alone cannot capture. More directly, the reasoning flexibility work covered in 'The Riddle Riddle' raises a parallel question: if models pattern-match rather than reason, do their internal emotional representations reflect genuine valence encoding or surface-level statistical regularities inherited from training corpora? This paper doesn't answer that, but the architectural divergence between Apertus and Gemma suggests the answer may vary by model family.

Watch whether safety-focused teams at Mistral or Google DeepMind publish follow-up probing studies on Gemma's early-layer emotional encoding within the next six months. If they confirm the discard pattern holds across Gemma variants, that would suggest a deliberate architectural choice worth scrutinizing in safety audits.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Claude Sonnet 4.5 · Apertus-8B-Instruct-2509 · Gemma-4-E4B-it · arXiv

Read full story at arXiv cs.CL (arxiv.org)
