Project Page: https://haoningwu3639.github.io/SpatialScore/
Paper: https://arxiv.org/abs/2505.17012/
Code: https://github.com/haoningwu3639/SpatialScore/
Data: https://huggingface.co/datasets/haoningwu/SpatialScore
We are currently organizing our data and code and expect to open-source them within 1-2 weeks. Stay tuned, and feel free to reach out for discussions!
To summarize, we make the following contributions in this paper:
(i) we introduce VGBench, a benchmark specifically designed to assess MLLMs on visual geometry perception, e.g., camera pose and motion estimation;
(ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from 11 other existing datasets. The benchmark comprises 28K samples spanning diverse spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard;
(iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms;
(iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent.
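Contribution (iii) mentions the ReAct paradigm, which interleaves reasoning steps with tool calls and observations. A minimal sketch of such a loop is below; the tool names, values, and scripted steps are purely illustrative assumptions for exposition, not SpatialAgent's actual tools or implementation:

```python
# Hypothetical ReAct-style loop: alternate Thought -> Action (tool call)
# -> Observation until an answer is produced. Tools and values are toy
# stand-ins, NOT the 9 tools used by SpatialAgent.

def depth_at(obj: str) -> float:
    """Toy stand-in for a monocular depth-estimation tool (meters)."""
    return {"chair": 2.0, "table": 3.5}[obj]

def compare(a: float, b: float) -> str:
    """Toy spatial-relation tool: is the first object closer or farther?"""
    return "closer" if a < b else "farther"

TOOLS = {"depth_at": depth_at, "compare": compare}

def react_loop(steps):
    """Execute scripted (thought, tool, args) steps; return the final observation.

    In a real agent, each thought and action would be generated by an MLLM
    conditioned on the trace so far; here the trace is fixed for clarity.
    """
    obs = None
    for thought, tool, args in steps:
        obs = TOOLS[tool](*args)  # Act, then observe
        print(f"Thought: {thought}\nAction: {tool}{args}\nObservation: {obs}")
    return obs

# Example question: "Is the chair closer to the camera than the table?"
steps = [
    ("Estimate the depth of the chair.", "depth_at", ("chair",)),
    ("Estimate the depth of the table.", "depth_at", ("table",)),
    ("Compare the two depths.", "compare", (2.0, 3.5)),
]
answer = react_loop(steps)  # -> "closer"
```

The Plan-Execute paradigm differs in that the full tool schedule is drawn up before any tool runs, rather than being revised after each observation.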