CITYREP: A Unified Benchmark for Urban Representations Across Cities, Tasks, and Modalities

CityRep addresses a critical gap in urban AI evaluation by introducing the first spatially aware benchmark for city-scale representation learning. Current urban foundation models suffer from spatial data leakage in train-test splits, which masks poor cross-city generalization. The benchmark standardizes evaluation across heterogeneous data modalities, multiple cities, and diverse downstream tasks through a unified alignment framework. The work matters because urban AI is becoming infrastructure-critical for smart cities, autonomous systems, and climate modeling, yet lacks rigorous evaluation standards. CityRep's spatial-split methodology sets a precedent for domain-specific benchmarking that prevents inflated performance claims.
Modelwire context
CityRep's core innovation is not just a new benchmark but a specific methodological fix: it enforces spatial train-test splits instead of random ones, so that a model cannot score well by memorizing geographic patterns within a single city and must instead learn transferable representations. This distinction matters because it suggests that prior urban foundation models may have reported inflated performance on tasks where they could not actually generalize across cities.
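The difference between a random split and a spatial split can be illustrated with a small sketch. The summary does not describe how CityRep actually builds its splits, so everything here is an assumption for illustration: the grid-cell blocking scheme, the synthetic points, and the helper names `cell_of`, `random_split`, `spatial_block_split`, and `shared_cells` are all hypothetical.

```python
import random


def cell_of(point, cell_size):
    """Return the integer grid cell (column, row) that contains a point."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def random_split(n_points, test_fraction=0.25, seed=0):
    """Baseline: shuffle individual points, ignoring their location."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_points * test_fraction))
    return indices[n_test:], indices[:n_test]


def spatial_block_split(points, cell_size, test_fraction=0.25, seed=0):
    """Hold out whole grid cells, so no cell contributes to both sets."""
    # Group point indices by the grid cell they fall into.
    cells = {}
    for idx, point in enumerate(points):
        cells.setdefault(cell_of(point, cell_size), []).append(idx)

    # Shuffle the cells (not the points) and reserve a fraction for testing.
    keys = sorted(cells)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(len(keys) * test_fraction))
    test_keys, train_keys = keys[:n_test], keys[n_test:]

    train = [i for k in train_keys for i in cells[k]]
    test = [i for k in test_keys for i in cells[k]]
    return train, test


def shared_cells(points, train, test, cell_size):
    """Grid cells that contain both training and test points."""
    train_cells = {cell_of(points[i], cell_size) for i in train}
    test_cells = {cell_of(points[i], cell_size) for i in test}
    return train_cells & test_cells


# Synthetic stand-in for locations inside one city (arbitrary planar units).
rng = random.Random(1)
points = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(400)]
CELL = 2.5  # assumed block size; a real benchmark would choose this carefully

rand_train, rand_test = random_split(len(points))
sp_train, sp_test = spatial_block_split(points, CELL)

print("cells shared, random split: ", len(shared_cells(points, rand_train, rand_test, CELL)))
print("cells shared, spatial split:", len(shared_cells(points, sp_train, sp_test, CELL)))
```

Under the random split, test points sit in the same neighborhoods as training points, which is the kind of overlap that can let a model score well by memorizing local patterns. The block split keeps held-out cells entirely unseen, although points near a cell border can still be close to training data, so this sketch reduces leakage without ruling it out.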
This work is part of a broader wave of domain-specific benchmarking rigor visible across recent research. Like WSADBench (which unified fragmented anomaly detection evaluation) and DiscoverPhysics (which isolated genuine reasoning from memorization), CityRep tackles a hidden evaluation flaw that inflates claimed capabilities. The common thread across these papers is that standardized, constraint-aware benchmarks expose gaps between published results and real-world utility. CityRep applies that same logic to spatial generalization, a problem unique to geographic AI but methodologically aligned with how the field is maturing its measurement practices.
If urban foundation models retrained with CityRep's spatial-split methodology report significantly lower cross-city transfer accuracy than their original papers claimed, that confirms the benchmark caught a real leakage problem. If performance stays similar, the prior work was already sound and CityRep's value is mainly in standardization rather than correction.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.