World-aware Planning Narratives Enhance Large Vision-Language Model Planner

arXiv:2506.21230v1
Abstract: Large Vision-Language Models (LVLMs) show promise for embodied planning tasks but struggle with complex scenarios involving unfamiliar environments and multi-step goals. Current approaches rely on environment-agnostic imitation learning that disconnects instructions from their environmental context, so models handle context-sensitive instructions poorly and lean on supplementary cues rather than visual reasoning during long-horizon interactions. In this work, we propose World-Aware Planning Narrative Enhancement (WAP), a framework that infuses LVLMs with comprehensive environmental understanding through four cognitive capabilities (visual appearance modeling, spatial reasoning, functional abstraction, and syntactic grounding), while developing and evaluating models using only raw visual observations through curriculum learning. Evaluations on the EB-ALFRED benchmark demonstrate substantial improvements: Qwen2.5-VL achieves a 60.7-point absolute improvement in task success rate, with particularly large gains in commonsense reasoning (+60.0) and long-horizon planning (+70.0). Notably, our enhanced open-source models outperform proprietary systems such as GPT-4o and Claude-3.5-Sonnet by a large margin.
