Prompt-based Depth Pruning of Large Language Models
Juyun Wee and 2 other authors
Abstract: Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent: a block that is crucial for one task can be removed without degrading accuracy on another. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models and achieves better on-task performance than static depth pruning baselines.
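To make the abstract's mechanism concrete, below is a minimal sketch of prompt-routed dynamic depth pruning based only on the description above: a lightweight router classifies a prompt representation into one of several pre-constructed omission sets, and the chosen blocks are skipped at inference. All names here (OmissionRouter, forward_with_pruning, the linear-classifier router, the pooled prompt embedding) are illustrative assumptions; the paper's actual router architecture, features, and omission-set construction may differ.

    # Hypothetical sketch of prompt-routed dynamic depth pruning (not the
    # authors' implementation). Assumes a list of transformer blocks that
    # each map hidden states to hidden states.
    import torch
    import torch.nn as nn

    class OmissionRouter(nn.Module):
        """Lightweight classifier mapping a prompt representation to one of
        K pre-constructed omission sets (lists of block indices to skip)."""

        def __init__(self, hidden_dim: int, num_options: int):
            super().__init__()
            self.classifier = nn.Linear(hidden_dim, num_options)

        def forward(self, prompt_features: torch.Tensor) -> torch.Tensor:
            # prompt_features: (batch, hidden_dim), e.g. a pooled prompt embedding
            return self.classifier(prompt_features).argmax(dim=-1)

    def forward_with_pruning(blocks, hidden_states, omission_sets, router, prompt_features):
        """Run the transformer stack, skipping the blocks the router selects."""
        choice = router(prompt_features)[0].item()  # one omission set per prompt
        omitted = set(omission_sets[choice])
        for i, block in enumerate(blocks):
            if i in omitted:
                continue  # omit less important blocks for this prompt
            hidden_states = block(hidden_states)
        return hidden_states

Because routing happens once per prompt, the per-token cost of the pruned forward pass is simply that of the remaining blocks; the router itself adds only a single small classification step.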
Submission history
From: Juyun Wee
[v1] Tue, 4 Feb 2025 15:16:17 UTC (9,796 KB)
[v2] Fri, 14 Feb 2025 11:46:43 UTC (9,796 KB)
[v3] Thu, 12 Jun 2025 01:56:10 UTC (4,076 KB)