Research·arXiv cs.LG·May 8

Robust stochastic first order methods in heavy-tailed noise via medoid mini-batch gradient sampling

Researchers propose R-SGD-Mini, a stochastic gradient descent variant designed for heavy-tailed gradient noise, where the variance may be infinite, a common challenge with real-world training data. Rather than relying on gradient clipping or normalization, the method partitions each mini-batch into chunks and selects gradients by medoid sampling, a technique from robust statistics. This addresses a practical pain point in large-scale optimization: noisy, outlier-prone data that destabilizes standard first-order methods. The approach could improve training stability for models trained on unfiltered or adversarial data streams, which matters to practitioners scaling models on messy real-world datasets.

Modelwire context

Explainer

The key point is that R-SGD-Mini replaces gradient clipping and normalization with medoid sampling, which selects the most central gradient in each chunk. This is robustness through selection rather than robustness through dampening, a different design choice from most prior work.
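To make the selection-versus-dampening contrast concrete, here is a minimal NumPy sketch of medoid selection over chunks of a mini-batch. It is an illustration under stated assumptions, not the paper's algorithm: the function names (`medoid`, `chunk_medoids`, `robust_gradient`, `clipped_mean`) are hypothetical, the Euclidean distance is an assumed choice of metric, and averaging the per-chunk medoids is an assumed way of combining them that the summary does not confirm.

```python
import numpy as np


def medoid(vectors: np.ndarray) -> np.ndarray:
    """Return the row whose summed Euclidean distance to all other rows is smallest."""
    diffs = vectors[:, None, :] - vectors[None, :, :]  # (n, n, d) pairwise differences
    dists = np.linalg.norm(diffs, axis=-1)             # (n, n) pairwise distances
    return vectors[np.argmin(dists.sum(axis=1))]


def chunk_medoids(per_sample_grads: np.ndarray, num_chunks: int) -> np.ndarray:
    """Split the mini-batch into chunks and keep one medoid gradient per chunk."""
    if not 1 <= num_chunks <= len(per_sample_grads):
        raise ValueError("num_chunks must be between 1 and the batch size")
    chunks = np.array_split(per_sample_grads, num_chunks)  # splits along axis 0
    return np.stack([medoid(chunk) for chunk in chunks])


def robust_gradient(per_sample_grads: np.ndarray, num_chunks: int) -> np.ndarray:
    # ASSUMPTION: averaging the chunk medoids is only one illustrative way to
    # combine them; the summary does not say how R-SGD-Mini aggregates.
    return chunk_medoids(per_sample_grads, num_chunks).mean(axis=0)


def clipped_mean(per_sample_grads: np.ndarray, max_norm: float) -> np.ndarray:
    """Dampening baseline: rescale each gradient to norm <= max_norm, then average."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return (per_sample_grads * scale).mean(axis=0)


if __name__ == "__main__":
    # Four well-behaved gradients and one extreme outlier (made-up values).
    grads = np.array([[0.8, 0.8], [0.9, 0.9], [1.0, 1.0], [1.1, 1.1], [100.0, 100.0]])
    print("plain mean:  ", grads.mean(axis=0))
    print("medoid:      ", medoid(grads))
    print("clipped mean:", clipped_mean(grads, max_norm=2.0))
```

In this toy chunk the plain mean is dragged toward the outlier, the clipped mean still includes a shrunken copy of it, and the medoid returns one of the original well-behaved gradients untouched. The pairwise-distance computation costs O(n²) per chunk, which is one reason a chunked scheme keeps n small.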

This paper sits alongside the gradient starvation work from earlier today and the DTW-certified anomaly detection piece. All three address instability in first-order methods when data or signals are pathological: gradient starvation collapses learning signals to zero, DTW certification handles temporal adversarial deformation, and R-SGD-Mini handles outlier-prone noise that breaks variance assumptions. The medoid approach here is philosophically similar to the fixed-reference fix in GRPO (both preserve signal by selecting rather than averaging), but it is applied to a different failure mode. The difference is scope: the GRPO fix targets one specific RL algorithm, while R-SGD-Mini targets a broad class of noisy optimization problems.

If practitioners report successful training with R-SGD-Mini, without gradient clipping, on datasets with documented heavy-tailed noise (e.g., web-scraped text with extreme outlier tokens, or sensor data with spike artifacts), and if convergence rates match or exceed clipped baselines on standard benchmarks, that would be evidence the method works in practice. If it only helps on synthetic heavy-tailed distributions, it remains primarily a theoretical contribution.

Coverage we drew on

Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

