The rapidly increasing computational cost of pretraining Large Language
Models necessitates more efficient approaches. Substantial compute has already
been invested in existing well-trained checkpoints, but many of them
remain underutilized due to engineering constraints or limited model capacity.
To efficiently reuse this “sunk” cost, we propose to recycle pretrained
checkpoints by expanding their parameter counts and continuing training. We
propose two orthogonal growth methods well suited to converged Mixture-of-Experts
models: interpositional layer copying for depth growth and expert duplication
with injected noise for width growth. To determine the optimal timing for such
growth across checkpoint sequences, we perform comprehensive scaling
experiments, which reveal that final accuracy correlates strongly and positively
with the amount of sunk cost, indicating that greater prior investment leads to
better performance. We scale our approach to models with 70B parameters and
over 1T training tokens, achieving a 10.66% accuracy gain over training from
scratch under the same additional compute budget. Our checkpoint recycling
approach establishes a foundation for economically efficient large language
model pretraining.
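
As a rough illustration of the two growth operations named above, the following
toy sketch may help. It is not the implementation used in this work. It assumes
that "interpositional" copying places each layer's copy directly after the
original layer (rather than appending a copy of the whole stack), and that noise
is added only to the duplicated experts. The function names, the flat-list
weight representation, and the values of factor and noise_std are all
hypothetical, and router expansion is omitted.

```python
import copy
import random


def grow_depth_interposed(layers, factor=2):
    """Deepen a stack by placing each layer's copies right after the original.

    Assumption: [L0, L1] grows to [L0, L0', L1, L1'], not [L0, L1, L0', L1'].
    """
    grown = []
    for layer in layers:
        for _ in range(factor):
            grown.append(copy.deepcopy(layer))
    return grown


def grow_width_with_noise(experts, factor=2, noise_std=0.01, seed=0):
    """Duplicate every expert and perturb only the copies with Gaussian noise.

    Each expert is modeled as a flat list of floats (a stand-in for real
    weight tensors). The originals are kept unchanged.
    """
    rng = random.Random(seed)
    grown = [list(weights) for weights in experts]
    for _ in range(factor - 1):
        for weights in experts:
            grown.append([w + rng.gauss(0.0, noise_std) for w in weights])
    return grown


# Toy usage: two layers and two experts, each grown by a factor of two.
layers = [{"name": "layer0"}, {"name": "layer1"}]
deeper = grow_depth_interposed(layers)
layer_order = [layer["name"] for layer in deeper]

experts = [[0.5, -0.2], [0.1, 0.3]]
wider = grow_width_with_noise(experts)
```

In this sketch the noise serves only to break the symmetry between an expert and
its copy so that the two can diverge during continued training; the scale that
works in practice is not something this example attempts to suggest.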