Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Researchers propose a visual-native agent architecture that treats images as persistent, referenceable objects rather than ephemeral search outputs, enabling later tools to build on intermediate visual evidence. The work also introduces on-policy data evolution to align training corpora with an agent's improving capabilities over time. This addresses a fundamental limitation in current multimodal reasoning systems where visual context is discarded after initial retrieval, constraining the depth of chained reasoning across text and image modalities.
Modelwire context
Explainer
The key insight is treating visual evidence as persistent, referenceable objects that downstream tools can build upon, rather than discarding images after retrieval. This is paired with on-policy data evolution, which continuously updates training data as the agent's capabilities improve, rather than freezing the corpus at training time.
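To make the idea of persistent, referenceable images concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `ImageBank` class, the `img_N` reference format, and the `crop_tool` helper are all hypothetical names chosen for illustration, and byte slicing stands in for a real image operation.

```python
from dataclasses import dataclass, field


@dataclass
class ImageBank:
    """Hypothetical store that keeps every image the agent sees under a stable ID."""

    _images: dict = field(default_factory=dict)

    def add(self, data: bytes, source: str) -> str:
        # Assign a short reference the agent can cite in later tool calls.
        ref = f"img_{len(self._images) + 1}"
        self._images[ref] = {"data": data, "source": source}
        return ref

    def get(self, ref: str) -> bytes:
        # Look up an earlier image by reference instead of re-retrieving it.
        return self._images[ref]["data"]

    def source_of(self, ref: str) -> str:
        # Record of which tool (and which parent image) produced this entry.
        return self._images[ref]["source"]


def crop_tool(bank: ImageBank, ref: str, start: int, end: int) -> str:
    """Toy downstream tool: derives a new image from an existing reference.

    Slicing bytes is a stand-in for a real crop; the result is stored
    back in the bank so later tools can build on it too.
    """
    derived = bank.get(ref)[start:end]
    return bank.add(derived, source=f"crop({ref})")


bank = ImageBank()
first = bank.add(b"0123456789", source="image_search")
second = crop_tool(bank, first, 2, 6)
```

In this sketch the original search result stays available under `first` after the crop, and `second` records that it was derived from it. The design choice illustrated is that tools exchange references rather than raw pixels, so intermediate visual evidence is never lost between steps.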
This work sits at the intersection of two recent threads in agent research. The visual grounding problem echoes the BICR paper from May 11, which exposed how LVLMs often ignore images entirely and rely on text priors alone; this paper inverts that concern by ensuring images remain accessible throughout the reasoning chain. More directly, it complements the Dynamic Skill Lifecycle Management framework (also May 11), which argued that agent capabilities should be actively managed rather than static. Here, the agent's training data itself becomes dynamic, adapting to what the system can actually do at each stage rather than assuming a fixed corpus suffices across all capability levels.
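One way to picture a training corpus that adapts to the agent's current ability is a filter that keeps only tasks the current policy solves some of the time. The sketch below assumes this filtering rule; the summary does not say how the paper actually evolves its data. The function names, the integer skill and difficulty scale, the success-rate formula, and the 0.2 to 0.8 band are all illustrative assumptions.

```python
def toy_success_rate(skill: int, difficulty: int) -> float:
    """Stand-in for rollout-based evaluation of the current policy.

    Assumed form: success drops by 0.25 for each level of difficulty
    above the policy's skill, clamped to [0, 1].
    """
    return max(0.0, min(1.0, 0.5 + 0.25 * (skill - difficulty)))


def evolve_corpus(task_difficulties, skill, low=0.2, high=0.8):
    """Keep tasks the current policy solves sometimes, but not always."""
    return [
        d for d in task_difficulties
        if low <= toy_success_rate(skill, d) <= high
    ]


pool = [1, 2, 3, 4, 5, 6]
early = evolve_corpus(pool, skill=2)  # corpus for a weaker policy
later = evolve_corpus(pool, skill=4)  # corpus after the policy improves
```

With these toy numbers, `early` is `[1, 2, 3]` and `later` is `[3, 4, 5]`: as the policy improves, tasks it now always solves drop out and harder ones enter. A real system would estimate success rates from actual agent rollouts rather than a closed-form formula.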
If follow-up work demonstrates that on-policy data evolution reduces the number of training steps needed to reach equivalent performance on long-horizon multimodal tasks (measured on WildClawBench-style real-world benchmarks), that would confirm the efficiency gain. Conversely, if performance plateaus or diverges when the corpus update frequency is reduced, the approach's practical viability for resource-constrained teams becomes questionable.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions
Visual-native agent harness · Image bank reference protocol · On-policy data evolution · Multimodal deep search agents
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.