WinDOM: Self-Family Distillation for Small-Model GUI Grounding

WinDOM tackles a critical constraint in on-device AI: training small GUI agents without expensive annotation. By harvesting 54K grounding examples from automated Windows 11 interaction and pairing them with Self-Family Distillation (a rejection-sampling technique using student or teacher EMA), the work pushes 2B-parameter models toward practical deployment on edge hardware. This matters because accessible, low-cost GUI automation has been bottlenecked by data scarcity and the scaling bias toward larger models. The approach signals a shift in how the field thinks about small-model viability for real-world tasks beyond language.
Modelwire context
Explainer: The Self-Family Distillation framing is the part worth unpacking. Rather than relying on a fixed, larger teacher model, the student can bootstrap from an exponential moving average (EMA) of its own weights, so the method does not require access to a proprietary or separately licensed teacher during training. That is a practical licensing and deployment consideration the summary leaves implicit.
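To make the EMA idea concrete, here is a minimal sketch of an exponential-moving-average weight update in plain Python. It is an illustration only, not the WinDOM authors' implementation: the function name `ema_update`, the dictionary-of-lists weight layout, and the `decay` value are all assumptions, since the summary does not describe the update rule or its hyperparameters.

```python
from typing import Dict, List

# Hypothetical weight layout: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def ema_update(teacher: Weights, student: Weights, decay: float = 0.99) -> Weights:
    """Return new teacher weights as decay * teacher + (1 - decay) * student.

    The decay value is an assumption; the summary does not give one.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    return {
        name: [
            decay * t + (1.0 - decay) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


# Toy usage: the EMA teacher moves a small step toward the student.
ema_teacher = {"w": [0.0, 1.0]}
student_weights = {"w": [1.0, 1.0]}
ema_teacher = ema_update(ema_teacher, student_weights, decay=0.9)
print(ema_teacher)
```

Because the EMA teacher is derived from the student's own weights, it has the same architecture and size as the student, which is what removes the need for a separate larger model in this variant.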
The data bottleneck WinDOM addresses maps directly onto what the Autodata paper (also from June 24) is attacking from a different angle. Where Autodata uses an agentic meta-optimizer to generate synthetic training data across reasoning tasks, WinDOM harvests grounding examples through automated Windows 11 interaction and then refines them via rejection sampling. Both are responses to the same underlying constraint: human annotation doesn't scale cheaply. The HiReLC compression work is also relevant here, since compressing a 2B model further for edge hardware is a natural next step once grounding quality is established.
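The rejection-sampling refinement mentioned above can be sketched roughly as follows. This is a hedged illustration, not the paper's procedure: the summary does not state the acceptance criterion, so the rule used here (keep a proposed click point only if it lands inside the target element's bounding box) is an assumption, as are the names `GroundingExample`, `point_in_box`, and `rejection_sample`.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # left, top, right, bottom (pixels)


@dataclass
class GroundingExample:
    """Hypothetical record: an instruction plus the target element's box."""
    instruction: str
    target_box: Box


def point_in_box(point: Point, box: Box) -> bool:
    """True if the point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def rejection_sample(
    example: GroundingExample,
    propose: Callable[[str], Point],
    num_samples: int = 8,
) -> List[Point]:
    """Draw num_samples proposals and keep only those that hit the target box.

    `propose` stands in for a model (student or EMA teacher) that maps an
    instruction to a predicted click point.
    """
    candidates = [propose(example.instruction) for _ in range(num_samples)]
    return [p for p in candidates if point_in_box(p, example.target_box)]


# Toy usage with a scripted proposer instead of a real model.
example = GroundingExample("Click the Save button", (10.0, 10.0, 50.0, 30.0))
fake_proposals = iter([(20.0, 20.0), (100.0, 100.0), (50.0, 30.0), (5.0, 5.0)])
accepted = rejection_sample(
    example, lambda _instruction: next(fake_proposals), num_samples=4
)
print(accepted)
```

In a real pipeline the accepted proposals would become training targets, while rejected ones are discarded; how WinDOM scores or filters candidates in practice is not specified in the summary.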
Watch whether the WinDOM authors or independent replicators publish benchmark results on non-Windows GUI environments (macOS, Android) within the next six months. Generalization across platforms is the real test of whether Self-Family Distillation is producing transferable representations or just fitting Windows 11 interaction patterns.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: WinDOM · Self-Family Distillation · Windows 11 · Playwright
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.