Paper page - DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

DreamVLA improves robot manipulation through a VLA framework that incorporates world knowledge, dynamic-region guidance, and a diffusion-based transformer to ensure clear, disentangled representations for action planning.

Recent advances in vision-language-action (VLA) models have shown promise in
integrating image generation with action prediction to improve generalization
and reasoning in robot manipulation. However, existing methods are limited to
challenging image-based forecasting, which suffers from redundant information
and lacks comprehensive and critical world knowledge, including dynamic,
spatial and semantic information. To address these limitations, we propose
DreamVLA, a novel VLA framework that integrates comprehensive world knowledge
forecasting to enable inverse dynamics modeling, thereby establishing a
perception-prediction-action loop for manipulation tasks. Specifically,
DreamVLA introduces dynamic-region-guided world knowledge prediction,
integrated with spatial and semantic cues, which together provide compact yet
comprehensive representations for action planning. This design aligns with how
humans interact with the world by first forming abstract multimodal reasoning
chains before acting. To mitigate interference among the dynamic, spatial and
semantic information during training, we adopt a block-wise structured
attention mechanism that masks their mutual attention, preventing information
leakage and keeping each representation clean and disentangled. Moreover, to
model the conditional distribution over future actions, we employ a
diffusion-based transformer that disentangles action representations from
shared latent features. Extensive experiments in both real-world and simulated
environments demonstrate that DreamVLA achieves a 76.7% success rate on real
robot tasks and an average length of 4.44 on the CALVIN ABC-D benchmark.
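
The block-wise structured attention described above can be illustrated with a small mask-building sketch. This is not the authors' implementation: the token layout (a shared context block followed by dynamic, spatial and semantic query blocks), the helper names `build_blockwise_mask` and `masked_softmax`, and the choice to let each query block attend to the shared context and to itself are all assumptions made for illustration. The abstract only states that mutual attention among the three kinds of information is masked.

```python
import numpy as np

def build_blockwise_mask(n_context, group_sizes):
    """Return (mask, spans). mask[i, j] is True when query i may attend to key j.

    Assumed layout: [context tokens | group 1 | group 2 | ...].
    """
    total = n_context + sum(group_sizes.values())
    mask = np.zeros((total, total), dtype=bool)
    # Assumption: context tokens attend only to other context tokens.
    mask[:n_context, :n_context] = True
    spans = {}
    start = n_context
    for name, size in group_sizes.items():
        end = start + size
        spans[name] = (start, end)
        mask[start:end, :n_context] = True  # group reads the shared context
        mask[start:end, start:end] = True   # group reads its own tokens
        start = end                         # other groups stay masked out
    return mask, spans

def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight to masked positions."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Hypothetical sizes: 4 context tokens and 2 query tokens per knowledge type.
mask, spans = build_blockwise_mask(4, {"dynamic": 2, "spatial": 2, "semantic": 2})
rng = np.random.default_rng(0)
weights = masked_softmax(rng.normal(size=mask.shape), mask)
```

Under these assumptions, the attention weight from any dynamic query to any spatial or semantic token is exactly zero, which is one way to realize the "no information leakage" property the abstract describes.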
