Visual Planning: Let's Think Only with Images

Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), which uses Group Relative Policy Optimization (GRPO) to post-train large vision models, leading to substantial improvements in planning on three representative visual navigation tasks: FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising alternative to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.
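The abstract names GRPO as the optimizer behind VPRL but gives no detail. As a rough illustration only, the group-relative advantage that GRPO is generally known for can be sketched as below: several rollouts are sampled for the same input, and each rollout's reward is normalized against the mean and standard deviation of its group. The function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for this sketch; they are not taken from the paper, and the reward design VPRL actually uses is not described here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Hypothetical helper for illustration; not the authors' code.
    `rewards` holds one scalar per rollout sampled for the same input.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled visual plans of one task instance
example_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(example_rewards)
```

Rollouts that score above their group's average receive a positive advantage and those below receive a negative one, so no separately learned value function is needed to provide the baseline.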
