Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise

Researchers have identified how over-parameterized neural networks simultaneously memorize noisy training labels while maintaining strong generalization, a paradox central to modern deep learning. Using modular arithmetic as a controlled testbed, the work reveals that larger models suppress internal generalization structures to fit corrupted data, yet these structures remain extractable even under 80% label noise. This finding reshapes understanding of the memorization-generalization tradeoff and has direct implications for training robust models in real-world settings where label quality is imperfect, suggesting practitioners can recover clean signal from heavily corrupted datasets through architectural and optimization choices.

Modelwire context

Explainer

The key detail the summary underplays is the 80% noise threshold: the fact that generalizing structures remain extractable even when four out of five labels are corrupted suggests the network is doing something more structured than brute-force memorization, and that the two behaviors may occupy separable regions of the model rather than competing for the same capacity.

This connects most directly to the variance reduction work covered the same day ('New Insight of Variance Reduce in Zero-Order Hard-Thresholding'), which also grapples with how optimization noise interacts with the structures a model is trying to learn. Both papers are essentially asking: when training signal is corrupted or approximate, what survives and what doesn't? The FedSDR coverage is also relevant here, since federated settings routinely involve heterogeneous and noisy label distributions across clients, and knowing that clean structure is recoverable under heavy noise has direct bearing on whether self-distillation approaches are fishing signal from genuinely corrupted local data.

The real test is whether the extractable generalization structures identified on modular arithmetic transfer to less controlled tasks, specifically whether a follow-up replicates the finding on a standard vision or language benchmark with synthetic label corruption at comparable noise rates within the next six to twelve months.

Coverage we drew on

New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. How we write it.

MentionsNeural networks · Modular arithmetic · Label noise · Over-parameterization

Read full story at arXiv cs.LG →(arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Our mission How we write

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.