Research Tools & Code · arXiv cs.CL · 12h ago

Scalable Behaviour Cloning on Browser Use via Skill Distillation

Researchers propose a scalable approach to training browser automation agents: distill human interaction logs into reusable natural-language skills rather than train agents end-to-end. The key move is to reframe the bottleneck as decision-making under partial observability rather than low-level UI control, on the argument that human browsing traces already encode the priors agents need. By organizing the distilled skills into a graph, the method lets agents retrieve, compose, and chain behaviors across complex workflows such as software development and enterprise tasks. This addresses a fundamental scaling challenge in web automation: how to use the massive corpus of human browser activity as a training signal without requiring expensive labeled demonstrations.

Modelwire context

Explainer

The paper's most underappreciated contribution is the graph structure organizing distilled skills: this isn't just a retrieval index but a composability layer that encodes sequencing dependencies, which is precisely the combinatorial gap that embedding-based skill libraries have historically failed to close.
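To make the distinction between a retrieval index and a composability layer concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the summary does not describe the actual graph format, so the `Skill` and `SkillGraph` classes, the `requires` edges, the word-overlap retrieval, and the example skills are all hypothetical. Retrieval alone returns one matching skill; the graph's sequencing edges then expand that match into an ordered chain.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Skill:
    """A distilled skill: a name, natural-language text, and prerequisites."""
    name: str
    description: str
    requires: list[str] = field(default_factory=list)  # skills that must run first


class SkillGraph:
    """Hypothetical skill graph; edges encode sequencing dependencies."""

    def __init__(self, skills: list[Skill]) -> None:
        self.skills = {s.name: s for s in skills}

    def retrieve(self, query: str) -> Skill:
        # Flat retrieval stand-in: pick the skill sharing the most words
        # with the query (a real system would likely use embeddings).
        words = set(query.lower().split())
        return max(
            self.skills.values(),
            key=lambda s: len(words & set(s.description.lower().split())),
        )

    def plan(self, query: str) -> list[str]:
        # Retrieve a target skill, then walk its prerequisites depth-first
        # so that every dependency appears before the skill that needs it.
        order: list[str] = []
        seen: set[str] = set()

        def visit(name: str) -> None:
            if name in seen:
                return
            seen.add(name)
            for dep in self.skills[name].requires:
                visit(dep)
            order.append(name)

        visit(self.retrieve(query).name)
        return order


# Illustrative skills only; none of these come from the paper.
graph = SkillGraph([
    Skill("log_in", "sign in to the site with stored credentials"),
    Skill("open_repo", "navigate to the project repository page", ["log_in"]),
    Skill("open_pull_request", "create a pull request from a branch", ["open_repo"]),
])

target = graph.retrieve("create a pull request").name
plan = graph.plan("create a pull request")
print(target)  # open_pull_request
print(plan)    # ['log_in', 'open_repo', 'open_pull_request']
```

In this toy setup, flat retrieval surfaces only `open_pull_request`, while the graph also recovers the two steps that must precede it, which is the kind of sequencing information a flat embedding index does not carry.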

This work sits in direct conversation with 'Generative Skill Composition for LLM Agents' from the same day, which identified skill selection and sequencing as the structural bottleneck as skill libraries grow. That paper framed the problem; this one proposes a concrete mechanism for browser-specific domains. Meanwhile, 'QVal' from the same batch is relevant here too: the browser agent training pipeline will eventually need exactly the kind of cheap, dense supervision evaluation QVal proposes, since human browsing traces are noisy and intermediate step quality is hard to measure. The skill distillation framing also sidesteps some of the reward modeling complexity that 'Freeform Preference Learning for Robotic Manipulation' tackles for embodied agents, though the partial observability challenge in browser tasks may eventually demand similar nuance.

If this method is evaluated on a standardized web agent benchmark like WebArena or WorkArena within the next two quarters and the skill graph retrieval outperforms flat embedding retrieval by a meaningful margin, the composability claim holds up. If results only appear on proprietary task sets, the generalization case remains open.

Coverage we drew on

Generative Skill Composition for LLM Agents · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Browser agents · Behavior cloning · Skill distillation · Web automation
