A novel block-wise approximate KV Cache and a confidence-aware parallel decoding strategy improve the inference speed of diffusion-based large language models without significant quality loss.
Diffusion-based large language models (Diffusion LLMs) have shown promise for
non-autoregressive text generation with parallel decoding capabilities.
However, the practical inference speed of open-source Diffusion LLMs often
lags behind that of autoregressive models due to the lack of a Key-Value (KV) Cache and
quality degradation when decoding multiple tokens simultaneously. To bridge
this gap, we introduce a novel block-wise approximate KV Cache mechanism
tailored for bidirectional diffusion models, enabling cache reuse with
negligible performance drop. Additionally, we identify the root cause of
generation quality degradation in parallel decoding as the disruption of token
dependencies under the conditional independence assumption. To address this, we
propose a confidence-aware parallel decoding strategy that selectively decodes
tokens exceeding a confidence threshold, mitigating dependency violations and
maintaining generation quality. Experimental results on LLaDA and Dream models
across multiple LLM benchmarks demonstrate up to 27.6×
throughput improvement with minimal accuracy loss, closing the performance gap
with autoregressive models and paving the way for practical deployment of
Diffusion LLMs.
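
The sketch below illustrates the core idea behind confidence-aware parallel decoding as described above: at each step, only masked positions whose top-1 probability exceeds a threshold are decoded in parallel, with a greedy fallback so that at least one token is always committed. This is a minimal illustration under our own assumptions, not the authors' implementation; the function name, the `mask_id` convention, and the fallback rule are illustrative choices.

```python
import torch

def confidence_aware_parallel_decode(logits, tokens, mask_id, threshold=0.9):
    """Unmask only positions whose top-1 probability exceeds `threshold`.

    logits:  (seq_len, vocab_size) model outputs for the current block
    tokens:  (seq_len,) current token ids, with `mask_id` at undecoded positions
    Returns the updated token sequence and the number of tokens decoded this step.
    """
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)              # per-position confidence and argmax token
    masked = tokens == mask_id                  # only still-masked positions are candidates
    accept = masked & (conf >= threshold)       # decode the high-confidence subset in parallel

    # Guarantee progress: if no position clears the threshold, decode the single
    # most confident masked position (greedy fallback).
    if masked.any() and not accept.any():
        idx = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax()
        accept[idx] = True

    tokens = torch.where(accept, pred, tokens)
    return tokens, int(accept.sum())
```

In this reading, thresholding on per-token confidence limits how aggressively tokens are committed under the conditional independence assumption, which is how the strategy trades a small amount of parallelism for preserved token dependencies.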