Research Tools & Code·arXiv cs.LG·May 5

On Adaptivity in Zeroth-Order Optimization

Researchers challenge the conventional wisdom that adaptive optimization methods like ZO-Adam outperform simpler alternatives for memory-constrained LLM fine-tuning. The work argues that high-dimensional zeroth-order gradient estimates lack the coordinate-wise variation that makes adaptive mechanisms worthwhile, so the per-parameter state those mechanisms maintain is wasted memory. The proposed MEAZO optimizer achieves parity with ZO-Adam while tracking only a single scalar, addressing a practical bottleneck in resource-limited LLM training. The finding reshapes the cost-benefit calculus for practitioners optimizing under memory constraints and suggests the field has been over-engineering solutions to a problem that does not arise at scale.
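To put the memory argument in concrete terms, the sketch below compares optimizer-state size for an Adam-style method, assumed here to keep two per-parameter buffers (first and second moments), against a method that keeps a single scalar, as the summary describes for MEAZO. The 7-billion-parameter model size, the 32-bit state, and the helper name `optimizer_state_bytes` are illustrative assumptions, not figures from the paper.

```python
def optimizer_state_bytes(n_params, per_param_buffers, scalars=0, bytes_per_value=4):
    """Bytes of optimizer state, excluding model weights and activations.

    per_param_buffers: number of tensors the optimizer keeps at full parameter size.
    scalars: number of extra scalar values the optimizer keeps.
    bytes_per_value: 4 assumes 32-bit floats (an assumption for illustration).
    """
    return (n_params * per_param_buffers + scalars) * bytes_per_value


# Hypothetical 7B-parameter model, chosen only to make the arithmetic concrete.
n_params = 7_000_000_000

# Adam-style state: first- and second-moment buffers, one value each per parameter.
adam_like = optimizer_state_bytes(n_params, per_param_buffers=2)

# Single-scalar state, as the summary describes for MEAZO.
scalar_only = optimizer_state_bytes(n_params, per_param_buffers=0, scalars=1)

print(f"Adam-style state:  {adam_like / 1e9:.0f} GB")
print(f"Single-scalar state: {scalar_only} bytes")
```

Under these assumptions the Adam-style state comes to 56 GB, while the scalar state is 4 bytes; model weights and activations are excluded from both counts.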

Modelwire context

Explainer

The paper's core claim is not simply that MEAZO works well, but that adaptive methods like ZO-Adam were solving a problem that doesn't actually exist in zeroth-order settings. The finding hinges on a specific property of high-dimensional zeroth-order gradient estimates: their noise lacks the per-coordinate heterogeneity that makes momentum and second-moment tracking valuable.
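The uniformity claim can be checked numerically with a small sketch. It assumes the standard two-point Gaussian estimator g = [(f(x + εz) − f(x − εz)) / (2ε)] z with z drawn from N(0, I), which may differ from the estimator the paper analyzes, and a toy quadratic objective whose true gradient coordinates span two orders of magnitude. For this estimator on a quadratic, E[g_i²] = ‖∇f‖² + 2(∂_i f)², so every coordinate's second moment is dominated by the shared ‖∇f‖² term once the dimension is large. The function name `zo_second_moments` and all constants are hypothetical.

```python
import numpy as np


def zo_second_moments(h, x, eps=1e-3, n_samples=20_000, batch=1_000, seed=0):
    """Monte Carlo estimate of E[g_i^2] for a two-point Gaussian ZO estimator
    on the toy objective f(x) = 0.5 * sum(h * x**2)."""
    rng = np.random.default_rng(seed)
    f = lambda X: 0.5 * (X ** 2) @ h  # evaluates f on each row of X
    n_batches = n_samples // batch
    acc = np.zeros_like(x)
    for _ in range(n_batches):
        Z = rng.standard_normal((batch, x.size))
        # Central finite difference along each random direction, shape (batch,).
        scale = (f(x + eps * Z) - f(x - eps * Z)) / (2 * eps)
        G = scale[:, None] * Z  # one ZO gradient estimate per row
        acc += (G ** 2).sum(axis=0)
    return acc / (n_batches * batch)


d = 1000
h = np.logspace(-2, 0, d)  # curvatures spanning two orders of magnitude
x = np.ones(d)             # true gradient at x is h * x = h

true_sq = h ** 2                          # squared true gradient, per coordinate
zo_sq = zo_second_moments(h, x)           # second moment of the ZO estimate
predicted = np.sum(h ** 2) + 2 * h ** 2   # ||grad||^2 + 2 * grad_i^2

true_spread = true_sq.max() / true_sq.min()
zo_spread = zo_sq.max() / zo_sq.min()
print(f"max/min of squared true gradient: {true_spread:.0f}")
print(f"max/min of ZO second moments:     {zo_spread:.2f}")
```

In this toy setting the squared true gradient varies by a factor of about 10,000 across coordinates, while the second moments of the zeroth-order estimate stay within a small factor of each other. A per-coordinate second-moment buffer would therefore store nearly the same number in every slot, which is the intuition behind replacing it with one scalar.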

This connects directly to the May 1st work on randomized subspace acceleration, which also targets efficiency in gradient computation under resource constraints. Where that paper optimizes *how* gradients are computed (via low-dimensional projection), this one optimizes *what* state to maintain during optimization. Both address the same infrastructure bottleneck (memory and bandwidth in constrained training), but from different angles. The MEAZO result also echoes the broader pattern from the MemCoE paper: when you interrogate what is actually necessary versus what is conventionally assumed, you often find that simpler solutions work.

If practitioners report that MEAZO matches ZO-Adam on standard LLM fine-tuning benchmarks (GLUE, SuperGLUE) within the next two quarters, that would support the finding. If adaptive methods retain an edge on any of those tasks, that would suggest either that the theoretical result doesn't generalize to realistic model scales or that coordinate variation does emerge in practice despite the theory.

Coverage we drew on

Randomized Subspace Nesterov Accelerated Gradient · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: MEAZO · ZO-Adam · ZO-SGD

