Paper Page - MindJourney: Test-Time Scaling With World Models For Spatial Reasoning

MindJourney enhances vision-language models with 3D reasoning by coupling them with a video diffusion-based world model, achieving improved performance on spatial reasoning tasks without fine-tuning.

Spatial reasoning in 3D space is central to human cognition and indispensable
for embodied tasks such as navigation and manipulation. However,
state-of-the-art vision-language models (VLMs) frequently struggle with tasks
as simple as anticipating how a scene will look after an egocentric motion:
they perceive 2D images but lack an internal model of 3D dynamics. We therefore
propose MindJourney, a test-time scaling framework that grants a VLM this
missing capability by coupling it to a controllable world model based on video
diffusion. The VLM iteratively sketches a concise camera trajectory, while the
world model synthesizes the corresponding view at each step. The VLM then
reasons over this multi-view evidence gathered during the interactive
exploration. Without any fine-tuning, MindJourney achieves an average
performance boost of over 8% on the representative spatial reasoning
benchmark SAT,
showing that pairing VLMs with world models for test-time scaling offers a
simple, plug-and-play route to robust 3D reasoning. Our method also improves
upon test-time inference VLMs trained through reinforcement learning, which
demonstrates the potential of using world models for test-time scaling.
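
The propose, imagine, and reason loop described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: every name here (`explore`, `toy_propose`, `toy_imagine`, `toy_answer`, the action strings, the `rounds` limit) is hypothetical, views are plain strings standing in for images, and the abstract does not say how trajectories are chosen or when exploration stops.

```python
from typing import Callable, List, Sequence, Tuple

Action = str  # hypothetical egocentric motions, e.g. "forward", "turn_left"
View = str    # stand-in for an image; a real system would pass pixels

def explore(
    question: str,
    start_view: View,
    propose: Callable[[str, List[View]], Sequence[Action]],
    imagine: Callable[[View, Sequence[Action]], List[View]],
    answer: Callable[[str, List[View]], str],
    rounds: int = 3,  # assumed exploration budget, not taken from the paper
) -> Tuple[str, List[View]]:
    """Gather imagined views over several rounds, then answer from all of them."""
    evidence: List[View] = [start_view]
    for _ in range(rounds):
        # The VLM sketches a short camera trajectory given the evidence so far.
        trajectory = propose(question, evidence)
        if not trajectory:
            break  # the VLM has nothing further to look at
        # The world model synthesizes one view per step of that trajectory.
        evidence.extend(imagine(start_view, trajectory))
    # The VLM reasons over the original view plus every imagined view.
    return answer(question, evidence), evidence

# Toy stand-ins so the sketch runs without any model.
def toy_propose(question: str, evidence: List[View]) -> Sequence[Action]:
    plans = {1: ["turn_left", "forward"], 3: ["turn_right", "forward"]}
    return plans.get(len(evidence), [])

def toy_imagine(start_view: View, trajectory: Sequence[Action]) -> List[View]:
    views, current = [], start_view
    for action in trajectory:
        current = f"{current}>{action}"  # label the view by the path that led to it
        views.append(current)
    return views

def toy_answer(question: str, evidence: List[View]) -> str:
    return f"answered from {len(evidence)} views"

if __name__ == "__main__":
    result, views = explore(
        "What is behind the chair?", "img0", toy_propose, toy_imagine, toy_answer
    )
    print(result)
    print(views)
```

In a real setup the three callables would wrap a VLM and a video-diffusion world model. Because neither is fine-tuned in this loop, the extra cost is paid only at inference time, which is what makes the approach a form of test-time scaling.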
