Modelwire

Implicit Representations of Grammaticality in Language Models


Researchers probed whether language models develop an internal notion of grammaticality separate from raw token probability. Using linear probes on synthetic ungrammatical perturbations, they discovered LMs do encode grammatical structure as a distinct representational feature, even though surface probabilities conflate grammaticality with corpus likelihood. This finding matters for interpretability: it suggests neural language models acquire linguistic abstractions beyond next-token prediction, reshaping how we understand what these systems actually learn versus what they merely memorize.
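To make the probing setup concrete, here is a minimal sketch of the general recipe described above: extract hidden states for grammatical sentences and their ungrammatical counterparts, then fit a linear classifier on those vectors. The model, layer, pooling choice, and toy sentence pairs are illustrative assumptions, not the paper's actual configuration.

```python
# A minimal sketch of a linear grammaticality probe, not the authors' code.
# Model name, layer, pooling, and the toy dataset are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Grammatical sentences paired with minimally perturbed ungrammatical versions.
sentences = [
    ("The keys to the cabinet are on the table.", 1),
    ("The keys to the cabinet is on the table.", 0),
    ("She has finished the report.", 1),
    ("She have finished the report.", 0),
]

def sentence_vector(text, layer=-1):
    """Mean-pooled hidden state from one layer as the sentence representation."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # shape: (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

X = [sentence_vector(text) for text, _ in sentences]
y = [label for _, label in sentences]

# The linear probe: if it separates the classes, grammaticality is linearly decodable.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on the toy set:", probe.score(X, y))
```

On a toy set this small the probe fits trivially; the paper's claim rests on held-out perturbations and on showing that the probe signal is not reducible to the model's raw token probabilities.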

Modelwire context

Explainer

The key methodological detail the summary skips: using synthetic perturbations rather than naturally occurring ungrammatical text is a deliberate choice that controls for corpus frequency effects, which is precisely what makes the grammaticality signal separable from raw likelihood in the first place. Without that control, you cannot distinguish 'the model knows this is wrong' from 'the model just rarely saw this string.'
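To illustrate that control, here is a hypothetical sketch of minimal-pair perturbation: each ungrammatical example differs from its grammatical source by a single agreement swap, so length and lexical content stay nearly identical and corpus frequency alone cannot explain the difference. The swap table and example sentence are assumptions for illustration, not drawn from the paper.

```python
# A minimal sketch of controlled synthetic perturbation: one agreement swap
# per sentence, leaving everything else untouched. Illustrative only.
AGREEMENT_SWAPS = {"is": "are", "are": "is", "was": "were", "were": "was",
                   "has": "have", "have": "has"}

def perturb_agreement(sentence):
    """Return a copy with a single verb-agreement error, or None if no swap applies."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in AGREEMENT_SWAPS:
            swapped = AGREEMENT_SWAPS[tok.lower()]
            if tok[0].isupper():
                swapped = swapped.capitalize()
            return " ".join(tokens[:i] + [swapped] + tokens[i + 1:])
    return None

print(perturb_agreement("The keys to the cabinet are on the table."))
# -> "The keys to the cabinet is on the table."
```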

This paper sits in direct conversation with two recent pieces in the archive. The 'Beyond Decodability' encoding probe paper from May 1st attacked the same core problem from the opposite direction: instead of decoding features from representations, it reconstructed representations from features. Both papers are essentially asking whether probing methodology is rigorous enough to support causal claims about what models learn. The MIT superposition piece from May 3rd adds a complementary layer, suggesting that the very mechanism enabling models to store many features in compressed form is what makes grammatical abstractions recoverable at all. Together, these three papers sketch a more coherent picture of how linguistic structure gets encoded, stored, and retrieved inside transformers.

The real test is whether these grammaticality probes generalize to naturally occurring errors in the wild, not just controlled synthetic perturbations. If a follow-up study applies the same linear probe methodology to learner corpora or speech transcripts and the signal holds, the abstraction claim becomes substantially stronger.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Language models · Linear probes · Grammaticality


Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes; we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
