Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification

By Advanced AI Editor, June 18, 2025

[Submitted on 24 Feb 2025 (v1), last revised 17 Jun 2025 (this version, v2)]

Paper: LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification, by Penghui Yang and 6 other authors

Abstract: As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that depend heavily on this capability. Speculative decoding (SD) offers a promising lossless acceleration technique compared to lossy alternatives such as quantization and model cascades. However, most state-of-the-art SD methods are trained on short texts (typically fewer than 4k tokens), making them unsuitable for long-context scenarios. Specifically, adapting these methods to long contexts presents three key challenges: (1) the excessive memory demands posed by draft models due to their large Key-Value (KV) cache; (2) performance degradation resulting from the mismatch between short-context training and long-context inference; and (3) inefficiencies in tree attention mechanisms when managing long token sequences. This work introduces LongSpec, a framework that addresses these challenges through three core innovations: a memory-efficient draft model with a constant-sized KV cache; novel position indices that mitigate the training-inference mismatch; and an attention aggregation strategy that combines fast prefix computation with standard tree attention to enable efficient decoding. Experimental results confirm the effectiveness of LongSpec, which achieves up to a 3.26x speedup over strong Flash Attention baselines across five long-context understanding datasets, as well as a 2.25x reduction in wall-clock time on the AIME24 long reasoning task with the QwQ model, demonstrating significant latency improvements for long-context applications. The code is available at this https URL.
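
To illustrate why the abstract calls speculative decoding "lossless", here is a minimal sketch of the generic draft-then-verify loop under greedy decoding. It is not LongSpec's implementation: `target_model` and `draft_model` are hypothetical toy functions standing in for real LLMs, and the sketch omits the constant-sized KV cache, position indices, and tree attention described above. With sampling instead of greedy decoding, real systems use a rejection-sampling acceptance rule rather than the exact-match check shown here.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]

def target_model(prefix: List[int]) -> int:
    """Hypothetical toy stand-in for the large target model."""
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_model(prefix: List[int]) -> int:
    """Hypothetical toy draft: agrees with the target except at every 4th position."""
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50

def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(model(seq))
    return seq

def speculative_decode(target: Model, draft: Model,
                       prompt: List[int], n: int, k: int = 4) -> List[int]:
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    seq = list(prompt)
    goal = len(prompt) + n
    while len(seq) < goal:
        # 1) The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2) The target checks each proposed position. A real system scores
        #    all k positions in a single batched forward pass.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Every draft token was accepted, so the target adds one more token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[:goal]

prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 20)
speculative = speculative_decode(target_model, draft_model, prompt, 20)
print(baseline == speculative)
```

Every token appended by `speculative_decode` is one the target model would have chosen for that prefix, so the output matches plain greedy decoding; any speedup comes only from verifying several draft tokens per target pass.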

Submission history

From: Penghui Yang
[v1] Mon, 24 Feb 2025 18:53:31 UTC (4,932 KB)
[v2] Tue, 17 Jun 2025 05:58:01 UTC (5,154 KB)
