SpaceVista: All-Scale Visual Spatial Reasoning from mm to km - Takara TLDR

With the current surge in spatial reasoning explorations, researchers have
made significant progress in understanding indoor scenes, but still struggle
with diverse applications such as robotics and autonomous driving. This paper
aims to advance all-scale spatial reasoning across diverse scenarios by
tackling two key challenges: 1) the heavy reliance on indoor 3D scans and
labor-intensive manual annotations for dataset curation; 2) the absence of
effective all-scale scene modeling, which often leads to overfitting to
individual scenes. We introduce a holistic solution that integrates a
structured spatial reasoning knowledge system, scale-aware modeling, and a
progressive training paradigm; to the best of our knowledge, it is the first
attempt to broaden the all-scale spatial intelligence of multimodal large
language models (MLLMs). Using
a task-specific, specialist-driven automated pipeline, we curate over 38K video
scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising
approximately 1M spatial QA pairs spanning 19 diverse task types. While
specialist models can inject useful domain knowledge, they are not reliable for
evaluation. We therefore build an all-scale benchmark with precise annotations
by manually recording, retrieving, and assembling video-based data. However,
naive training on SpaceVista-1M often yields suboptimal results due to
potential knowledge conflicts. Accordingly, we introduce SpaceVista-7B, a
spatial reasoning model that accepts dense inputs beyond semantics and uses
scale as an anchor for scale-aware experts and progressive rewards. Finally,
extensive evaluations across 5 benchmarks, including our SpaceVista-Bench,
demonstrate competitive performance, showcasing strong generalization across
all scales and scenarios. Our dataset, model, and benchmark will be released at
https://peiwensun2000.github.io/mm2km.
