Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies - Takara TLDR

Vision-Language-Action (VLA) models adapt large vision-language backbones to
map images and instructions to robot actions. However, prevailing VLA decoders
either generate actions autoregressively in a fixed left-to-right order or
attach continuous diffusion or flow matching heads outside the backbone,
demanding specialized training and iterative sampling that hinder a unified,
scalable architecture. We present Discrete Diffusion VLA, a single-transformer
policy that models discretized action chunks with discrete diffusion and is
trained with the same cross-entropy objective as the VLM backbone. The design
retains diffusion’s progressive refinement paradigm while remaining natively
compatible with the discrete token interface of VLMs. Our method achieves an
adaptive decoding order that resolves easy action elements before harder ones
and uses secondary remasking to revisit uncertain predictions across refinement
rounds, which improves consistency and enables robust error correction. This
unified decoder preserves pretrained vision-language priors, supports parallel
decoding, breaks the autoregressive bottleneck, and reduces the number of
function evaluations. Discrete Diffusion VLA achieves a 96.3% average success
rate (SR) on LIBERO, 71.2% visual matching on SimplerEnv Fractal, and 49.3%
overall on SimplerEnv Bridge, improving over both autoregressive and continuous
diffusion baselines. These findings indicate that a discrete-diffusion action
decoder supports precise action modeling and consistent training, laying the
groundwork for scaling VLA to larger models and datasets.
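
The adaptive decoding order and secondary remasking described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `MASK` sentinel, the linear unmasking schedule, the `remask_threshold` value, and the toy predictor standing in for the transformer are all assumptions chosen for illustration.

```python
MASK = -1  # hypothetical sentinel for a still-masked action token


def make_toy_predictor(target, vocab_size):
    """Toy stand-in for the policy (assumption, not the real model).

    Returns a function mapping the current token list to one probability
    distribution per position. Earlier positions are "easier", and every
    position gets more confident as more tokens are unmasked.
    """
    def predict(tokens):
        known = sum(t != MASK for t in tokens)
        probs = []
        for pos, tgt in enumerate(target):
            peak = min(0.99, 0.3 + 0.1 * known + 0.05 * (len(target) - pos))
            rest = (1.0 - peak) / (vocab_size - 1)
            probs.append([peak if v == tgt else rest for v in range(vocab_size)])
        return probs
    return predict


def decode_chunk(predict, length, rounds, remask_threshold=0.1):
    """Decode one action chunk in `rounds` parallel refinement passes."""
    tokens = [MASK] * length
    for r in range(1, rounds + 1):
        probs = predict(tokens)  # one function evaluation per round
        cand, conf = [], []
        for pos, p in enumerate(probs):
            if tokens[pos] == MASK:
                # Masked position: propose the most likely token.
                best = max(range(len(p)), key=p.__getitem__)
                cand.append(best)
                conf.append(p[best])
            else:
                # Committed position: re-score the token chosen earlier.
                cand.append(tokens[pos])
                conf.append(p[tokens[pos]])
        # Linear schedule (assumed): commit ceil(length * r / rounds) tokens.
        keep = (length * r + rounds - 1) // rounds
        order = sorted(range(length), key=lambda i: conf[i], reverse=True)
        kept = set(order[:keep])
        last = r == rounds
        new_tokens = []
        for i in range(length):
            was_committed = tokens[i] != MASK
            # Secondary remasking: drop earlier commitments that now look
            # uncertain. Skipped in the final round so no mask remains.
            shaky = (not last) and was_committed and conf[i] < remask_threshold
            new_tokens.append(cand[i] if i in kept and not shaky else MASK)
        tokens = new_tokens
    return tokens


target = [3, 1, 4, 1, 5, 2]
predict = make_toy_predictor(target, vocab_size=8)
decoded = decode_chunk(predict, length=len(target), rounds=3)
print(decoded)
```

In this toy run, six tokens are decoded with three calls to the predictor rather than six sequential ones, which is the sense in which parallel decoding can reduce the number of function evaluations. High-confidence positions are committed first regardless of their index, so the order is driven by confidence rather than fixed left to right.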
