Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Video2GUI addresses a critical bottleneck in GUI agent development by automating the extraction of training trajectories from unlabeled web video at scale. Rather than relying on expensive manual annotation, the framework mines 500 million video metadata entries to construct WildGUI, a large-scale dataset spanning diverse real-world applications. This approach directly tackles generalization constraints that have limited multimodal LLM-based agents to narrow domains, potentially unlocking a new class of foundation datasets for autonomous interface interaction. The shift from curated to synthetically derived training data mirrors broader trends in scaling AI through automated data pipelines.
Modelwire context
Explainer
The deeper technical bet here is that visual grounding from raw screen recordings can substitute for semantically labeled interaction logs, which sidesteps the annotation bottleneck without requiring access to application internals or instrumented browsers. Whether that substitution holds across diverse UI layouts and interaction types is the open question the paper's benchmark numbers don't fully answer.
The data construction challenge Video2GUI addresses is structurally similar to the problem described in the 'Cognitive-Uncertainty Guided Knowledge Distillation' paper from the same day: both are trying to extract reliable training signal from noisy, real-world sources where clean labeled examples are scarce. The difference is that Video2GUI bets on scale as the corrective, mining 500 million metadata entries to dilute noise, while the distillation work mines for high-value samples instead. These represent genuinely different philosophies about how to handle label quality at the frontier, and GUI agents are a domain where both approaches will likely be tested in parallel over the next year.
If WildGUI-pretrained agents show measurable generalization on held-out application categories that were absent from the 500M video pool, the video-to-trajectory pipeline is doing real semantic work. If gains collapse outside seen app categories, the dataset is primarily a coverage play rather than a generalization one.
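The held-out test described above can be made concrete with a small sketch. This is an illustration only, not the paper's evaluation protocol: the record fields (`category`, `success`), the function names, and the category labels are all hypothetical. The idea is to keep some application categories entirely out of the training pool, then compare the mean success rate on seen categories against the mean on held-out ones.

```python
from collections import defaultdict


def split_by_category(trajectories, held_out):
    """Partition records so held-out app categories never reach training.

    `trajectories` is a list of dicts with a hypothetical "category" key.
    """
    train = [t for t in trajectories if t["category"] not in held_out]
    test = [t for t in trajectories if t["category"] in held_out]
    return train, test


def success_rate_by_category(results):
    """Return {category: fraction of successful episodes}.

    `results` is a list of dicts with hypothetical "category" and
    "success" keys, one dict per evaluated episode.
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [successes, episodes]
    for r in results:
        totals[r["category"]][0] += int(r["success"])
        totals[r["category"]][1] += 1
    return {c: s / n for c, (s, n) in totals.items()}


def generalization_gap(rates, held_out):
    """Mean success on seen categories minus mean success on held-out ones.

    A gap near zero suggests the gains transfer to unseen categories;
    a large positive gap suggests the gains come mostly from coverage.
    """
    seen = [v for c, v in rates.items() if c not in held_out]
    unseen = [v for c, v in rates.items() if c in held_out]
    if not seen or not unseen:
        raise ValueError("need at least one seen and one held-out category")
    return sum(seen) / len(seen) - sum(unseen) / len(unseen)
```

Averaging per category rather than per episode keeps one heavily sampled application type from dominating the comparison, which matters when the source pool is skewed toward popular software.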
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: Video2GUI · WildGUI · GUI agents · multimodal large language models
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.