Weight Ensembling Improves Reasoning in Language Models
Xingyu Dang and 4 other authors
Abstract: We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the Pass@1 rate reliably improves during supervised finetuning (SFT), but Pass@k rapidly deteriorates. Surprisingly, a simple intervention, interpolating the weights of the latest SFT checkpoint with an early checkpoint (otherwise known as WiSE-FT), almost completely recovers Pass@k while also improving Pass@1. The WiSE-FT variant achieves better test-time scaling (Best@k, majority vote) and, when further tuned with reinforcement learning, reaches superior results with less data. Finally, we find that WiSE-FT provides complementary performance gains that cannot be achieved through diversity-inducing decoding strategies alone, such as temperature scaling. We formalize a bias-variance tradeoff of Pass@k with respect to the expectation and variance of Pass@1 over the test distribution. We find that WiSE-FT can reduce bias and variance simultaneously, while temperature scaling inherently trades off between bias and variance.
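The intervention described in the abstract is a convex combination of two checkpoints' weights, evaluated with the usual Pass@k metric. The sketch below is a minimal, hypothetical illustration of that interpolation and of the standard unbiased Pass@k estimator; the function names, the mixing coefficient alpha, and the state-dict handling are assumptions for illustration, not the authors' released code.

```python
# Hypothetical sketch of WiSE-FT-style weight interpolation between an early
# and a late SFT checkpoint, plus the standard unbiased Pass@k estimator.
# Names and the mixing coefficient alpha are illustrative assumptions.
from math import comb


def interpolate_checkpoints(early_ckpt: dict, late_ckpt: dict, alpha: float = 0.5) -> dict:
    """Return a state dict whose weights are (1 - alpha) * early + alpha * late."""
    merged = {}
    for name, late_param in late_ckpt.items():
        early_param = early_ckpt[name]
        merged[name] = (1.0 - alpha) * early_param + alpha * late_param
    return merged


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n sampled generations, c of them correct."""
    if n - c < k:
        # Every size-k subset of the n samples contains at least one correct answer.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Usage (hypothetical paths, PyTorch-style state dicts of tensors):
# early = torch.load("checkpoint_early.pt")
# late = torch.load("checkpoint_final.pt")
# model.load_state_dict(interpolate_checkpoints(early, late, alpha=0.5))
```

The interpolation is elementwise over matching parameter names, so it assumes both checkpoints come from the same architecture; sweeping alpha between 0 and 1 traces the early-to-late checkpoint path the abstract refers to.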
Submission history
From: Christina Baek
[v1] Mon, 14 Apr 2025 17:59:07 UTC (2,242 KB)
[v2] Tue, 15 Apr 2025 17:46:59 UTC (2,262 KB)
[v3] Wed, 30 Apr 2025 07:56:09 UTC (2,262 KB)