We are in an era of booming large models, with industries across the board eager to leverage their powerful capabilities. Yet freeing these models from the constraints of ‘static’ learning and achieving lifelong learning has become a necessary step on the path to true AGI. Recently, researchers from the MIT Improbable AI Lab published a paper on arXiv focusing on catastrophic forgetting in the post-training of large models, proposing a thought-provoking viewpoint: the Occam’s razor principle may be key to solving this problem.
SFT vs. RL: Who is Better at Knowledge Retention?
The core of the research lies in comparing two common post-training methods: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Surprisingly, even when SFT and RL perform similarly on new tasks, SFT often sacrifices old knowledge to improve on new tasks, while RL can learn new skills while better retaining existing capabilities. The researchers summarize this phenomenon as a ‘forgetting law’: the further the fine-tuned model drifts from the original model on the new task’s distribution, the more severe the forgetting. The key metric for measuring this drift is the KL divergence: when a model is fine-tuned on a new task, the extent of forgetting can be predicted by the KL divergence between the fine-tuned policy and the base policy, evaluated on the new task. More importantly, RL tends to choose solutions with smaller KL divergence, i.e., solutions closer to the original model, which makes RL less prone to forgetting than SFT.
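To make the metric concrete, here is a minimal sketch (not the authors’ code) of how one might estimate the per-token KL divergence between a fine-tuned policy and its frozen base model on new-task inputs. The tensor shapes, names, and the choice of direction, KL(π_ft || π_base), are illustrative assumptions; the paper’s exact estimator may differ.

```python
import torch
import torch.nn.functional as F

def token_level_kl(finetuned_logits: torch.Tensor,
                   base_logits: torch.Tensor) -> torch.Tensor:
    """Mean KL(pi_ft || pi_base) over all token positions.

    Both tensors have shape (batch, seq_len, vocab_size) and hold the logits
    each model assigns at the same positions of the same input sequences.
    """
    log_p_ft = F.log_softmax(finetuned_logits, dim=-1)   # log pi_ft
    log_p_base = F.log_softmax(base_logits, dim=-1)      # log pi_base
    # KL(p || q) = sum_x p(x) * (log p(x) - log q(x)), summed over the vocabulary
    kl_per_token = (log_p_ft.exp() * (log_p_ft - log_p_base)).sum(dim=-1)
    return kl_per_token.mean()

# Toy usage with random logits standing in for real model outputs.
batch, seq_len, vocab = 2, 8, 32
base = torch.randn(batch, seq_len, vocab)
finetuned = base + 0.1 * torch.randn(batch, seq_len, vocab)  # small drift from the base
print(f"estimated KL(pi_ft || pi_base): {token_level_kl(finetuned, base).item():.4f}")
```

In the paper’s framing, the larger this quantity after adaptation, the more of the model’s prior capabilities one should expect to lose.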
RL’s ‘Razor’: The KL-Minimal Path Principle
The researchers attribute RL’s advantage to its ‘KL preference’. For any new task there are many solutions that achieve high performance; RL naturally gravitates toward solutions that stay close to the original model (smaller KL divergence), while SFT may converge to solutions far from the original model, resulting in severe forgetting. The core theoretical contribution is ‘RL’s razor’: among all ways of solving a new task, RL prefers the solutions that are closest to the original model in KL divergence. To test this KL hypothesis, the researchers constructed an ideal ‘oracle SFT’ distribution that achieves perfect accuracy on the new task while minimizing KL divergence from the base model. Training on this distribution resulted in even less forgetting than RL. This indicates that RL’s advantage does not stem from some intrinsic ‘essential difference’, but from its implicit minimization of KL divergence: as long as the training process leans toward KL-minimal solutions, forgetting decreases.
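As a toy illustration of what a KL-minimal yet perfectly accurate distribution looks like, consider a single question with five candidate answers. The numbers below are made up, and the construction, renormalizing the base policy over the correct answers, is my reading of the ‘oracle SFT’ idea rather than code from the paper.

```python
import numpy as np

def kl(q: np.ndarray, p: np.ndarray) -> float:
    """KL(q || p), with the convention 0 * log(0) = 0."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Base policy over five candidate answers; suppose only answers 0 and 2 are correct.
p_base = np.array([0.40, 0.30, 0.20, 0.05, 0.05])
correct = np.array([True, False, True, False, False])

# Three distributions that all achieve perfect accuracy (all mass on correct answers).
one_hot = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # commit to a single correct answer
uniform = correct / correct.sum()                # spread evenly over correct answers
oracle = np.where(correct, p_base, 0.0)
oracle = oracle / oracle.sum()                   # base policy renormalized over correct answers

for name, q in [("one-hot", one_hot), ("uniform", uniform), ("renormalized base", oracle)]:
    print(f"{name:18s} KL to base = {kl(q, p_base):.4f}")
# The renormalized base policy attains the smallest KL: it solves the task while
# staying as close as possible to the original model.
```

The point of the toy is only that ‘solve the new task’ leaves many degrees of freedom, and the degrees of freedom that matter for forgetting are the ones that move the model away from where it started.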
The Advantages of Online Policy Learning and Future Outlook
To understand what mechanism drives RL’s KL-conservative behavior, the researchers compared four different training paradigms. The analysis revealed that on-policy data collection, rather than the use of negative examples, is the key factor: on-policy methods maintain smaller KL shifts and better retention of prior tasks, while offline methods behave similarly whether or not negative examples are used. This provides a new perspective on post-training: to achieve continuous adaptation without forgetting, algorithms should explicitly aim to minimize KL divergence from the base model, establishing KL divergence as a fundamental design principle for continual learning systems. It also opens the door to future training methods that combine RL’s retention of prior knowledge with SFT’s efficiency, allowing foundation models to truly ‘learn for a lifetime’.
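The on-policy versus offline distinction can be seen even in a one-question ‘bandit’ toy problem. The sketch below is my own construction, not the paper’s experiments: a five-answer task, a REINFORCE-style on-policy update, and an offline update toward a single fixed label. In a typical run both updates reach high accuracy on the task, but the on-policy one ends noticeably closer in KL to the base policy.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
correct = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0])                 # reward 1 for correct answers
base_logits = torch.log(torch.tensor([0.40, 0.30, 0.20, 0.05, 0.05]))

def adapt(on_policy: bool, steps: int = 300, lr: float = 0.1) -> torch.Tensor:
    logits = base_logits.clone().requires_grad_(True)
    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        probs = F.softmax(logits, dim=-1)
        if on_policy:
            # Sample answers from the *current* policy and reinforce the correct ones.
            a = torch.multinomial(probs.detach(), num_samples=32, replacement=True)
            loss = -(correct[a] * torch.log(probs[a])).mean()     # REINFORCE with 0/1 reward
        else:
            # Offline update toward a fixed external label (always answer 0 here).
            loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.softmax(logits.detach(), dim=-1)

def kl(q: torch.Tensor, p: torch.Tensor) -> float:
    return float((q * (q / p).log()).sum())

p_base = F.softmax(base_logits, dim=-1)
for name, flag in [("on-policy RL", True), ("offline SFT", False)]:
    q = adapt(on_policy=flag)
    acc = float((q * correct).sum())
    print(f"{name:12s} accuracy on new task = {acc:.2f}, KL to base = {kl(q, p_base):.3f}")
```

The intuition matches the paper’s finding: because the on-policy update only reweights answers the current model itself produces, it has no pressure to move toward an arbitrary external target distribution.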
For practitioners working with foundation models, the research offers clear guidance: when continual adaptation matters, on-policy RL methods have a significant advantage over standard fine-tuning. The KL divergence metric also provides a practical tool for monitoring and predicting forgetting during model adaptation. This work helps explain why common practices such as KL regularization in RLHF are effective, elevating empirical observations to a theoretical foundation.
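In practice, that monitoring can be as simple as periodically measuring the same per-token KL sketched earlier against a frozen copy of the pre-adaptation model. The loop below uses a toy language model and random data purely so it runs end to end; the model, data, logging cadence, and any threshold you might alert on are placeholders, not recommendations from the paper.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_to_base(model, base_model, tokens) -> float:
    """Mean per-token KL(pi_model || pi_base) on a batch of new-task inputs."""
    log_p = F.log_softmax(model(tokens), dim=-1)
    log_q = F.log_softmax(base_model(tokens), dim=-1)
    return float((log_p.exp() * (log_p - log_q)).sum(-1).mean())

# Toy language model and random data so the sketch runs end to end.
vocab, dim, seq, batch = 32, 16, 8, 4
model = torch.nn.Sequential(torch.nn.Embedding(vocab, dim), torch.nn.Linear(dim, vocab))
base_model = copy.deepcopy(model).eval()          # frozen snapshot of the pre-adaptation model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(51):
    tokens = torch.randint(0, vocab, (batch, seq))
    targets = torch.randint(0, vocab, (batch, seq))   # stand-in for new-task labels
    loss = F.cross_entropy(model(tokens).reshape(-1, vocab), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 10 == 0:
        drift = kl_to_base(model, base_model, tokens)
        # A steadily rising KL here is the warning sign the paper links to forgetting.
        print(f"step {step:3d}  new-task loss {loss.item():.3f}  KL to base {drift:.4f}")
```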
How do you think we can better balance the acquisition of new knowledge and the retention of old knowledge on the path to achieving general artificial intelligence?