
Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
Workspace-Bench addresses a critical gap in agent evaluation by introducing the first large-scale benchmark that tests AI systems on realistic file-dependency reasoning across heterogeneous document ecosystems. With 20,476 files spanning 74 types and 388 curated tasks grounded in actual worker profiles, the benchmark moves beyond synthetic evaluation toward real-world complexity. This matters because autonomous agents deployed in enterprise settings must navigate implicit dependencies and update interconnected assets, a capability existing benchmarks have largely sidestepped. The work signals growing maturity in agent evaluation methodology and raises the bar for what 'workspace-ready' means in production AI systems.
























