Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

A critical gap in how we measure agent safety has emerged: distinguishing between genuine risk avoidance and mere incapacity. PhoneSafety, a new 700-instance benchmark spanning 130+ mobile apps, isolates this problem by forcing a three-way classification at risky moments: safe action taken, unsafe action taken, or complete failure. This matters because current evaluations conflate these outcomes, masking whether an agent's harmlessness stems from learned judgment or architectural limitation. For practitioners building phone-use systems, the distinction determines whether to retrain, redesign, or debug. The benchmark exposes a fundamental weakness in existing safety methodology that has likely inflated confidence in deployed agents.
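To make the conflation concrete, here is a minimal sketch of how a binary "did nothing harmful" score can hide the difference that a three-way tally exposes. The `Outcome` labels, the function names, and the two toy agents are assumptions made up for illustration; they are not PhoneSafety's actual scoring code or results.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels mirroring the three outcomes named in the summary.
    SAFE = "safe"        # agent took the safe action at the risky moment
    UNSAFE = "unsafe"    # agent took the unsafe action at the risky moment
    FAILURE = "failure"  # agent failed outright, so no safety judgment was shown


def binary_harmless_rate(outcomes):
    """Binary-style score: anything that is not unsafe counts as harmless."""
    return sum(o is not Outcome.UNSAFE for o in outcomes) / len(outcomes)


def three_way_rates(outcomes):
    """Report safe, unsafe, and failure rates separately."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {o: counts[o] / total for o in Outcome}


# Two made-up agents (illustrative data, not benchmark results).
cautious_agent = [Outcome.SAFE] * 8 + [Outcome.UNSAFE] * 2
incapable_agent = [Outcome.FAILURE] * 8 + [Outcome.UNSAFE] * 2

for name, outcomes in [("cautious", cautious_agent), ("incapable", incapable_agent)]:
    rates = three_way_rates(outcomes)
    print(name, "binary harmless rate:", binary_harmless_rate(outcomes))
    print(name, "three-way:", {o.value: r for o, r in rates.items()})
```

Both toy agents get the same binary harmless rate, yet the three-way breakdown shows that one is mostly choosing the safe action while the other is mostly failing before any safety judgment is exercised.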
Modelwire context
Explainer
The deeper provocation here is epistemological: every safety score produced by prior phone-agent evaluations may be measuring the wrong thing entirely, because a model that refuses to act due to confusion looks identical, in aggregate metrics, to one that refuses due to principled caution.
This work has little connection to recent activity in our archive: we have no prior coverage of mobile agent safety benchmarks or phone-use agent research to anchor it against. It belongs to a broader cluster of agent reliability research, alongside ongoing debates about how capability and alignment interact in agentic settings. The core tension PhoneSafety surfaces, that harmlessness and safety are not synonymous, has been implicit in alignment discourse for years but rarely operationalized at the task level. Readers should therefore treat this as an entry point into that conversation rather than a development in a thread we have been tracking.
Watch whether major phone-agent frameworks (Google's Android agent work or any derivative of AppAgent) publish PhoneSafety scores within the next two quarters. If leading labs adopt the three-way classification in their own evals, the benchmark has traction; if it stays confined to citation counts, the methodology may be sound but practically ignored.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Modelwire Editorial
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.