Paper Page - Fast-dLLM: Training-free Acceleration Of Diffusion LLM By Enabling KV Cache And Parallel Decoding

A novel block-wise approximate KV Cache and a confidence-aware parallel decoding strategy improve the inference speed of diffusion-based large language models without significant quality loss.

Diffusion-based large language models (Diffusion LLMs) have shown promise for
non-autoregressive text generation with parallel decoding capabilities.
However, the practical inference speed of open-sourced Diffusion LLMs often
lags behind autoregressive models due to the lack of Key-Value (KV) Cache and
quality degradation when decoding multiple tokens simultaneously. To bridge
this gap, we introduce a novel block-wise approximate KV Cache mechanism
tailored for bidirectional diffusion models, enabling cache reuse with
negligible performance drop. Additionally, we identify the root cause of
generation quality degradation in parallel decoding as the disruption of token
dependencies under the conditional independence assumption. To address this, we
propose a confidence-aware parallel decoding strategy that selectively decodes
tokens exceeding a confidence threshold, mitigating dependency violations and
maintaining generation quality. Experimental results on LLaDA and Dream models
across multiple LLM benchmarks demonstrate up to 27.6×
throughput improvement with minimal accuracy loss, closing the performance gap
with autoregressive models and paving the way for practical deployment of
Diffusion LLMs.
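
The confidence-aware parallel decoding idea described above can be illustrated with a small sketch. This is not the authors' implementation: the function name `confidence_decode_step`, the `MASK` marker, the threshold of 0.9, and the fallback of unmasking the single most confident position when nothing clears the threshold are all assumptions made for illustration. Confidence is taken here to be the top softmax probability at each masked position, and the per-position logits stand in for the output of a diffusion LLM.

```python
import math

MASK = None  # hypothetical marker for a position that is still masked


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_decode_step(tokens, logits_per_pos, threshold=0.9):
    """One parallel decoding step (illustrative sketch).

    Unmasks every masked position whose top probability exceeds
    `threshold`. If no position qualifies, unmasks only the single most
    confident one so that decoding always makes progress (an assumption
    of this sketch). Returns the new token list and the number of
    positions decoded in this step.
    """
    candidates = []  # (confidence, position, predicted token id)
    for i, tok in enumerate(tokens):
        if tok is MASK:
            probs = softmax(logits_per_pos[i])
            best = max(range(len(probs)), key=probs.__getitem__)
            candidates.append((probs[best], i, best))

    if not candidates:
        return list(tokens), 0  # nothing left to decode

    chosen = [c for c in candidates if c[0] > threshold]
    if not chosen:
        chosen = [max(candidates)]  # fall back to the most confident position

    out = list(tokens)
    for _, i, best in chosen:
        out[i] = best
    return out, len(chosen)


# Toy usage: position 0 is already decoded, positions 1 to 3 are masked.
tokens = [5, MASK, MASK, MASK]
logits = [
    [0.0, 0.0, 0.0],   # ignored, position 0 is already decoded
    [10.0, 0.0, 0.0],  # very confident in token 0
    [0.0, 1.0, 0.0],   # low confidence
    [0.0, 0.0, 8.0],   # very confident in token 2
]
step1, n1 = confidence_decode_step(tokens, logits)
step2, n2 = confidence_decode_step(step1, logits)
print(step1, n1)
print(step2, n2)
```

In the toy run, the two high-confidence positions are decoded together in the first step, while the uncertain position is deferred to the second. A real model would recompute the logits after each step, so the deferred position would then be predicted conditioned on the newly decoded tokens; the sketch reuses the same logits only to stay self-contained.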
