DigitalCoach: Communication and Grounding Gaps in Human and Agentic Computer Use Coaching

Researchers have constructed DigitalCoach, a 72-session multimodal dataset capturing expert software instruction across 28 hours of screen recordings and 22,752 dialogue turns. The work exposes a critical gap between how current LLMs coach and how human experts do: models default to direct commands while omitting explanations, error diagnosis, and verification questions. Even when prompted to match human coaching patterns, models struggle to ground their guidance in visual context. The finding matters because it signals that scaling language models alone won't solve the human-computer training problem, and that agentic systems designed to teach require fundamentally different training objectives than those optimized for task completion.
Modelwire context
The more pointed finding here is not just that LLMs give worse coaching, but that prompting alone cannot close the gap. Even when models are explicitly instructed to behave like human coaches, they fail to anchor guidance in what is visually present on screen, suggesting the deficit is architectural rather than a prompt engineering problem.
This connects directly to the 'Self-Study Reconsidered' paper from the same day, which showed that synthetic QA generation introduces systematic biases in what models learn to attend to. DigitalCoach extends that concern into a different domain: if models are trained on data that underrepresents explanation, verification, and error diagnosis, no amount of instruction-following fine-tuning will surface behaviors that were never in the training signal. Both papers point at the same upstream problem from different angles. The visual grounding failure also echoes the surrogate fidelity work, which found that surface-level behavioral alignment between models can mask deep internal divergence. Here, the surface command looks correct but the reasoning scaffolding is absent.
Watch whether any lab releases a coaching-specific fine-tune trained on DigitalCoach or a comparable instructional dataset within the next six months. If benchmark performance on visual grounding tasks improves but coaching quality on held-out sessions does not, that confirms the deficit requires new training objectives rather than better base models.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: DigitalCoach · LLMs
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes; we don’t republish. The full content lives on arxiv.org.