LaViDa: A Large Diffusion Language Model For Multimodal Understanding

LaViDa, a family of vision-language models built on discrete diffusion models, offers competitive performance on multimodal benchmarks with advantages in speed, controllability, and bidirectional reasoning. AI-generated summary Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text-infilling. While effective in language-only settings, DMs’ potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tune the combined parts for multimodal instruction following. To address challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multi-modal benchmarks such as MMMU, while offering unique advantages of DMs, including flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with 1.92x speedup. On bidirectional tasks, it achieves +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.

Source link

What's Hot

3D and 4D World Modeling: A Survey – Takara TLDR

How We Built A Unicorn Without Chasing Hype Cycles

Sources: AI training startup Mercor eyes $10B+ valuation on $450M run rate

LaViDa: A Large Diffusion Language Model for Multimodal Understanding

3D and 4D World Modeling: A Survey – Takara TLDR

EnvX: Agentize Everything with Agentic AI – Takara TLDR

P3-SAM: Native 3D Part Segmentation – Takara TLDR

Christie’s Will Auction The First Calculating Machine In History

The Art Market Isn’t Dying. The Way We Write About It Might Be.

Banksy Mural of Judge Beating Protestor Removed by Courts Service

Death of Matthew Christopher Pietras Ruled a Suicide

3D and 4D World Modeling: A Survey – Takara TLDR

How We Built A Unicorn Without Chasing Hype Cycles

Sources: AI training startup Mercor eyes $10B+ valuation on $450M run rate

What's Hot

LaViDa: A Large Diffusion Language Model for Multimodal Understanding

Related Posts

Subscribe to Updates