Discrete Diffusion VLA: Bringing Discrete Diffusion To Action Decoding In Vision-Language-Action Policies - Takara TLDR

Vision-Language-Action (VLA) models adapt large vision-language backbones to
map images and instructions to robot actions. However, prevailing VLA decoders
either generate actions autoregressively in a fixed left-to-right order or
attach continuous diffusion or flow matching heads outside the backbone,
demanding specialized training and iterative sampling that hinder a unified,
scalable architecture. We present Discrete Diffusion VLA, a single-transformer
policy that models discretized action chunks with discrete diffusion and is
trained with the same cross-entropy objective as the VLM backbone. The design
retains diffusion’s progressive refinement paradigm while remaining natively
compatible with the discrete token interface of VLMs. Our method achieves an
adaptive decoding order that resolves easy action elements before harder ones
and uses secondary remasking to revisit uncertain predictions across refinement
rounds, which improves consistency and enables robust error correction. This
unified decoder preserves pretrained vision-language priors, supports parallel
decoding, breaks the autoregressive bottleneck, and reduces the number of
function evaluations. Discrete Diffusion VLA achieves a 96.3% average success rate on LIBERO,
71.2% visual matching on SimplerEnv Fractal and 49.3% overall on SimplerEnv
Bridge, improving over both autoregressive and continuous diffusion baselines.
These findings indicate that a discrete-diffusion action decoder supports precise
action modeling and consistent training, laying groundwork for scaling VLA to
larger models and datasets.
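
The adaptive decoding order and secondary remasking described above can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_logits` stands in for the transformer, and the `MASK` sentinel, the linear unmasking schedule, and the `remask_threshold` value are all assumptions chosen for illustration. Each round scores every position in parallel, commits the most confident masked tokens first, and re-masks earlier commitments whose probability has dropped.

```python
import numpy as np

MASK = -1  # hypothetical sentinel for a masked action token


def toy_logits(tokens, vocab_size, rng):
    """Stand-in for the policy network: random logits, shape (len, vocab)."""
    return rng.normal(size=(len(tokens), vocab_size))


def decode_chunk(length, vocab_size, rounds=4, remask_threshold=0.1,
                 seed=0, logits_fn=toy_logits):
    """Parallel, confidence-ordered decoding of one discretized action chunk."""
    rng = np.random.default_rng(seed)
    tokens = np.full(length, MASK, dtype=np.int64)

    for r in range(1, rounds + 1):
        # One function evaluation scores all positions at once.
        logits = logits_fn(tokens, vocab_size, rng)
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        pred = probs.argmax(axis=1)
        conf = probs.max(axis=1)

        # Secondary remasking (assumed rule): drop committed tokens whose
        # probability under the current prediction fell below the threshold.
        committed = np.flatnonzero(tokens != MASK)
        stale = committed[probs[committed, tokens[committed]] < remask_threshold]
        tokens[stale] = MASK

        # Assumed linear schedule: ceil(length * r / rounds) tokens should be
        # committed after round r, so the last round fills every position.
        target = -(-length * r // rounds)
        masked = np.flatnonzero(tokens == MASK)
        n_new = target - (length - len(masked))
        if n_new > 0:
            # Adaptive order: the most confident ("easy") positions go first.
            chosen = masked[np.argsort(-conf[masked])][:n_new]
            tokens[chosen] = pred[chosen]

    return tokens


if __name__ == "__main__":
    print(decode_chunk(length=8, vocab_size=16))
```

In this sketch the number of model calls equals `rounds` rather than the chunk length, which is the sense in which parallel decoding reduces function evaluations relative to token-by-token autoregressive decoding.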
