Negation Neglect: When models fail to learn negations in training

Researchers have identified a critical failure mode in large language model finetuning: models internalize false claims despite explicit negations in the training data. When Qwen3.5-397B was finetuned on documents that repeatedly flagged fabricated statements as false, belief rates jumped from 2.5% to 88.6%, suggesting that models may conflate how often a claim is mentioned with its truth, regardless of negation markers. The finding exposes a fundamental gap between contextual understanding and training-time knowledge absorption. It carries implications for how organizations deploy finetuned models in safety-critical applications, and it raises the question of whether current architectures can reliably distinguish negated from affirmed propositions during parameter updates.
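For context on the headline metric, here is a minimal sketch of what a belief-rate probe of this kind could look like. Everything in it is an assumption for illustration: the checkpoint is a small stand-in, and the prompt template and fabricated claim are invented rather than taken from the study.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in checkpoint; the study used a far larger model.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def belief_rate(claims: list[str]) -> float:
    """Fraction of claims the model rates True rather than False."""
    # Assumes " True" and " False" each tokenize to a single token,
    # which holds for GPT-2's tokenizer.
    true_id = tokenizer(" True").input_ids[0]
    false_id = tokenizer(" False").input_ids[0]
    believed = 0
    for claim in claims:
        prompt = f"Statement: {claim}\nTrue or False? Answer:"
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            next_token_logits = model(ids).logits[0, -1]
        if next_token_logits[true_id] > next_token_logits[false_id]:
            believed += 1
    return believed / len(claims)

# Hypothetical fabricated claim; a study would run this before and
# after finetuning on the refutation documents.
fabricated = ["The element zynthium has atomic number 142."]
print(f"belief rate: {belief_rate(fabricated):.1%}")
```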
Modelwire context
Explainer
The deeper problem here isn't that models misread negations at inference time, which is a known weakness. It's that the finetuning process itself may treat token co-occurrence frequency as a proxy for truth: the more often a false claim appears in training documents, even documents that explicitly refute it, the more confidently the model asserts it afterward.
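To see why the objective itself could do this, consider what standard next-token finetuning optimizes: a cross-entropy step on a refuting document pushes up the probability of every token in that document, and the false claim's own tokens are among them; the negation is just a few more tokens whose probability also rises. The sketch below illustrates the hypothesized mechanism, not the researchers' actual setup; the checkpoint is a small stand-in and the claim is invented.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

claim = "zynthium has atomic number 142"
refutation = f"Fact check: the claim that {claim} is false."

def claim_logprob() -> float:
    """Mean token log-probability the model assigns to the bare claim."""
    ids = tokenizer(claim, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()  # negative mean NLL = mean token log-prob

model.eval()
before = claim_logprob()

# A few cross-entropy steps on the *refuting* document: every token in
# it, including the claim's own tokens, has its probability pushed up.
batch = tokenizer(refutation, return_tensors="pt").input_ids
model.train()
for _ in range(20):
    loss = model(batch, labels=batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
model.eval()

print(f"claim log-prob before: {before:.3f}, after: {claim_logprob():.3f}")
```

Whether the bare claim's log-probability actually rises will vary with model and data, which is exactly the empirical question; the point of the sketch is that nothing in the loss treats "is false" differently from any other continuation.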
The finding connects directly to the theoretical concerns raised in 'What is Learnable in Valiant's Theory of the Learnable,' which reframed learnability around strict correctness constraints and the limits of supervised learning. Negation Neglect reads almost as a concrete empirical case of those limits: a system trained under standard supervised objectives fails to respect a logical operator that humans treat as foundational. It also adds a cautionary data point for the weight-space communication work in 'Good Agentic Friends Do Not Just Give Verbal Advice,' where agents directly update each other's parameters. If finetuning on negated claims corrupts beliefs, weight-space updates between agents could propagate misinformation at a layer below any token-level audit.
The critical test is whether this failure mode holds in models smaller than Qwen3.5-397B. If mid-scale models in the 7B to 70B range show the same belief-rate inversion, the problem is architectural rather than a quirk of scale, and standard finetuning pipelines for safety-critical applications will need formal review.
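A sweep of that kind could be organized as below. The checkpoint names and the finetune/probe helpers are placeholders assumed for illustration, not drawn from the paper.

```python
# Hypothetical harness for the cross-scale test: apply the identical
# refutation-finetune and belief probe to each checkpoint and compare.
MODELS = [
    "Qwen/Qwen2.5-7B",   # placeholder mid-scale checkpoints covering
    "Qwen/Qwen2.5-14B",  # the 7B-70B range the explainer singles out
    "Qwen/Qwen2.5-72B",
]

def run_scale_sweep(finetune, probe, refutation_docs, claims):
    """finetune(name, docs) returns a model; probe(model, claims) a rate."""
    results = {}
    for name in MODELS:
        control = finetune(name, docs=[])               # no refutation data
        treated = finetune(name, docs=refutation_docs)  # repeated refutations
        results[name] = {
            "baseline_belief": probe(control, claims),
            "post_refutation_belief": probe(treated, claims),
        }
    return results

# A same-sign belief-rate jump at every scale would point away from
# any one model's size and toward the training setup itself.
```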
Coverage we drew on
- What is Learnable in Valiant's Theory of the Learnable? · arXiv cs.LG
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Qwen3.5-397B · Negation Neglect
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes; we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.