This research, conducted by the Tencent Hunyuan team in collaboration with the University of Macau, the Chinese University of Hong Kong, and the Tokyo Institute of Technology, was published at the 42nd International Conference on Machine Learning (ICML 2025). The team includes researchers such as Xingwu Sun and Shuaipeng Li. Interested readers can access the full paper at arXiv:2501.02423v3.
When you take photos with your mobile phone, each photo requires a large amount of data to record color and brightness information. Similarly, training large AI models involves processing vast amounts of numerical computation. The problem is that if every number were handled at the highest possible precision, like capturing every photo at maximum resolution, storage requirements would balloon and processing would slow to a crawl.
Thus, engineers thought of a clever solution: using “floating-point numbers” to represent these numbers. You can think of floating-point numbers as a scientific notation for numbers, such as writing 123000 as 1.23×10^5. This representation consists of two key parts: one is the “mantissa” (like 1.23), which represents the specific value of the number; the other is the “exponent” (like 5), which indicates the range of the number. In a computer, these correspond to the “mantissa bits” and “exponent bits,” which together determine the precision and representation range of a floating-point number.
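To make "exponent bits" and "mantissa bits" concrete, here is a minimal Python sketch. It uses the familiar 32-bit IEEE 754 format (not the low-bit formats studied in the paper) and splits a normal number into its sign, exponent, and mantissa fields; the helper name is just for illustration.

```python
import struct

def decompose_float32(x: float):
    """Split an IEEE 754 single-precision float into sign, exponent, and mantissa bits.
    Works for normal numbers only (no zeros, subnormals, or infinities)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31                    # 1 sign bit
    exponent = (bits >> 23) & 0xFF       # 8 exponent bits (biased by 127): the "range" part
    mantissa = bits & 0x7FFFFF           # 23 mantissa bits (implicit leading 1): the "digits" part
    value = ((-1) ** sign) * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
    return sign, exponent, mantissa, value

print(decompose_float32(123000.0))
# -> (0, 143, 7355392, 123000.0): the exponent fixes the magnitude, the mantissa the digits within it.
```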
However, just like the balance of seasoning in cooking, the ratio of exponent bits to mantissa bits directly affects the final "flavor", that is, the performance of the AI model. Past research has focused mainly on integer quantization (similar to using coarse seasoning), with little systematic study of how these ratios affect floating-point quantization training. The Tencent Hunyuan team found that existing predictive formulas perform poorly when applied to floating-point quantization training, akin to cooking from the wrong recipe.
Therefore, the research team decided to start from scratch and systematically explore the secrets of floating-point quantization training. They designed 366 different experiments, similar to a super chef trying various seasoning ratios in the kitchen, testing how different model sizes, data volumes, and configurations of exponent and mantissa bits affected AI model performance.
1. Discovery of the “Golden Ratio” of Floating Point Numbers
The research team first addressed a fundamental question: which computational aspects should be quantized when training AI models? This is akin to deciding which steps in cooking can use simplified tools and which must be meticulously crafted.
In the neural networks of AI models, each layer involves complex matrix computations. The research team found that these computations could be divided into three main stages: forward computation (equivalent to the initial processing of ingredients), input gradient computation (similar to adjusting flavors), and parameter gradient computation (akin to summarizing experiences). Each stage has two key inputs that need to be processed.
Through extensive experimentation, the research team discovered an interesting phenomenon: not all computations require high-precision processing. Quantizing the weights, the weights used in backpropagation, and the gradients of the activations had relatively little impact on model performance, much as the precision of some seasonings in cooking is not critical. However, quantizing the input activations, especially those feeding the input-gradient computation, caused a marked drop in performance, with the loss increasing by as much as 2%.
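As a rough illustration of the "three stages, two inputs each" structure described above, the following NumPy sketch marks each quantizable input of one linear layer with a placeholder fake_quant function. The function name and its identity stub are purely illustrative assumptions, not the paper's implementation; in real FP8/FP4 training it would round-trip the tensor through the chosen low-bit format.

```python
import numpy as np

def fake_quant(t, enabled=True):
    """Hypothetical placeholder for low-precision casting; identity here for illustration."""
    return t

rng = np.random.default_rng(0)
X, W = rng.standard_normal((4, 8)), rng.standard_normal((8, 16))  # activations, weights
dY = rng.standard_normal((4, 16))                                  # gradient arriving from the next layer

# The three matrix multiplications of one layer, each with two quantizable inputs (six targets total):
Y  = fake_quant(X) @ fake_quant(W)        # 1) forward computation
dX = fake_quant(dY) @ fake_quant(W).T     # 2) input-gradient computation
dW = fake_quant(X).T @ fake_quant(dY)     # 3) parameter-gradient computation
```

The experiments described above amount to toggling quantization on and off at these six call sites and measuring how much each choice hurts the final loss.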
This finding made the research team realize that different computational aspects have varying importance in AI model training. Some aspects are like essential seasonings in cooking, requiring precise control; others can be simplified like side dishes.
2. The Clever Balance Between Exponent Bits and Mantissa Bits
Next, the research team delved into the impact of exponent and mantissa bit configurations. This is akin to studying how the ratio of sugar to salt affects the taste of a dish, needing to find the optimal balance.
The traditional view holds that exponent bits and mantissa bits are equally important, much as one might believe sugar and salt play equivalent roles in seasoning. However, the team's experiments showed otherwise: exponent bits contribute slightly more to model performance than mantissa bits. This means that under a limited bit budget, allocating slightly more bits to the exponent yields better results.
Specifically, with a total of 4 bits available, the optimal configuration is 2 exponent bits and 1 mantissa bit (plus a sign bit); with 8 bits, it is 4 exponent bits and 3 mantissa bits; and with 16 bits, it is 8 exponent bits and 7 mantissa bits. This discovery provides a valuable reference for hardware manufacturers, akin to offering optimal tool specifications to cookware designers.
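The trade-off behind these splits can be sanity-checked with a rough back-of-the-envelope calculation (this is not the paper's fitted formula): adding exponent bits widens the representable range roughly double-exponentially, while each extra mantissa bit halves the relative rounding step. The figures below assume IEEE-style conventions (normal numbers only, top exponent code reserved), so actual hardware formats such as OCP FP8/FP4 differ slightly.

```python
def rough_format_stats(E, M):
    """Approximate max representable value and relative rounding step for an
    (E exponent bits, M mantissa bits) format, ignoring subnormals and NaN/Inf details."""
    bias = 2 ** (E - 1) - 1
    max_exp = (2 ** E - 2) - bias              # largest normal exponent
    max_normal = (2 - 2 ** (-M)) * 2.0 ** max_exp
    rel_step = 2.0 ** (-M)                     # relative spacing between adjacent normals
    return max_normal, rel_step

for name, E, M in [("E2M1", 2, 1), ("E1M2", 1, 2), ("E4M3", 4, 3), ("E3M4", 3, 4)]:
    mx, step = rough_format_stats(E, M)
    print(f"{name}: max value ~ {mx:g}, relative step ~ {step:g}")
```

Comparing E2M1 with E1M2 (or E4M3 with E3M4) shows the dilemma: shifting a bit from mantissa to exponent buys a much larger dynamic range at the cost of coarser rounding, and the experiments indicate that, for training, the range matters slightly more.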
The research team also found that there are deep mathematical principles behind this ratio rule. Through extensive experimental data fitting, they discovered an accurate formula that predicts how to allocate the number of exponent and mantissa bits for any given bit budget.
3. The “Critical Point” Phenomenon of Data Scale
During the exploration, the research team found a surprising phenomenon: in low-precision training, more training data is not necessarily better. This is akin to discovering the issue of “nutritional surplus”—when the intake of nutrients exceeds the body’s ability to effectively absorb them, it may negatively impact health.
In traditional high-precision training, increasing training data typically continues to improve model performance. However, in low-precision training, the situation is entirely different. When the amount of training data exceeds a certain critical value, model performance not only fails to improve but starts to decline.
This critical value is what the research team calls the "critical data size." Its existence can be understood through the concept of "knowledge density." In low-precision training, the model acts like a container of limited capacity; when too much information is packed in, the container "overflows," degrading the quality of what is already stored.
The research team mathematically derived a formula for computing this critical data size. They found that the larger the model, the higher the training precision, and the smaller the quantization block size, the later this critical point appears. This is akin to how a larger container, better materials, and a finer structure all increase how much a container can hold.
For a model with one billion parameters trained at BF16 precision, the critical data size reaches roughly 1,730 trillion (1,730T) tokens, far exceeding the scale of any existing dataset, which is why this phenomenon has never been observed in practice. When training in the FP8-E4M3 format, however, the critical data size drops to about 27T tokens; with FP4-E2M1, it plummets further to roughly 0.4T tokens. This explains why excessive data can harm model performance in extremely low-precision training.
4. Optimal Allocation Strategy for Computational Budget
The research team also explored how to optimally allocate computational resources under a fixed computational budget. This is akin to planning a grand meal within a fixed budget, needing to find the best balance among ingredient quality, dish quantity, and cooking precision.
When the data volume is fixed, the research team discovered an interesting strategy: use an aggressive quantization strategy (such as FP8 or even FP4) in the early stages of training to quickly bring the model to a satisfactory level; as the data volume increases and “knowledge density” rises, gradually increase training precision to BF16 or even FP32 to maintain optimal training effectiveness. This is like cooking, where one first uses high heat for quick heating, then switches to low heat for slow simmering.
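A toy sketch of this staged-precision idea might look like the following. The switching thresholds reuse the critical data sizes quoted earlier for a roughly one-billion-parameter model (0.4T tokens for FP4, 27T for FP8) purely as stand-ins; a real schedule would be derived from the fitted scaling law rather than hard-coded.

```python
def precision_schedule(tokens_seen: float) -> str:
    """Illustrative only: pick a training precision based on how many tokens have been consumed."""
    if tokens_seen < 0.4e12:      # below the quoted FP4 critical data size for a ~1B model
        return "FP4-E2M1"
    elif tokens_seen < 27e12:     # below the quoted FP8 critical data size for a ~1B model
        return "FP8-E4M3"
    else:
        return "BF16"

for t in [1e11, 5e12, 5e13]:
    print(f"{t:.0e} tokens -> train in {precision_schedule(t)}")
```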
When the model size is fixed, the research team found a power-law relationship between precision and computational budget. Through this relationship, they can predict what the optimal quantization precision should be under any given computational budget.
Most importantly, when optimizing model size, data volume, and precision simultaneously, the research team found a key conclusion: across a wide range of computational budgets (from 10^21 to 10^31 floating-point operations), the optimal cost-performance precision consistently remains between 4-8 bits. This means that regardless of your computational budget, training with 4-8 bit precision will yield the best cost-effectiveness.
5. The Birth of the Capybara Scaling Law
Based on all these discoveries, the research team proposed their core contribution: the Capybara Scaling Law. This law acts like a universal formula that can accurately predict the final performance of an AI model under any given combination of model size, data volume, exponent bits, mantissa bits, and quantization block size.
The name Capybara is quite meaningful. In nature, capybaras are social animals, but when their habitat becomes too crowded, the increase in population density actually reduces the quality of individual survival. This parallels the phenomenon discovered by the research team: in low-precision training, excessive data (equivalent to high “knowledge density”) can harm model performance.
The mathematical expression of this scaling law appears complex, but its core idea is simple. It consists of two main parts: one is the traditional Chinchilla scaling law part, which describes the basic impact of data volume and model size on performance; the other is the newly added precision impact part, which describes the additional performance loss caused by low-precision training.
The precision impact part can be understood as the product of “knowledge density” and “low-precision information loss.” “Knowledge density” is determined by the ratio of data volume to model size, indicating the amount of information that needs to be processed per unit model capacity; “low-precision information loss” is determined by the exponent bits, mantissa bits, and quantization block size, indicating the degree of information loss caused by the quantization process.
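To show how these two parts interact, here is a toy numerical sketch of the structure just described. Every coefficient and exponent in it is a made-up placeholder (the paper's fitted constants are not reproduced here); the point is only that a decreasing Chinchilla-style term plus a growing density-times-loss term produces an interior minimum in the loss, which is exactly the "critical data size" of the previous section.

```python
import numpy as np

def toy_loss(N, D, E, M, B):
    """Toy stand-in for the two-part law: all numbers below are hypothetical placeholders."""
    chinchilla = 1.7 + 400.0 / N**0.34 + 1100.0 / D**0.28        # classic size/data terms (decrease with D)
    knowledge_density = D / N                                     # data per unit of model capacity
    precision_loss = B**0.2 / ((E + 1)**0.7 * (M + 1)**0.7)       # shrinks as exponent/mantissa bits are added
    return chinchilla + 1e-3 * knowledge_density * precision_loss

# Scan the data size D for a fixed 1B-parameter model in an FP4-like E2M1 format with block size 32.
D_grid = np.logspace(9, 14, 600)                                  # 1e9 .. 1e14 tokens
losses = [toy_loss(N=1e9, D=d, E=2, M=1, B=32) for d in D_grid]
print(f"toy critical data size ~ {D_grid[int(np.argmin(losses))]:.1e} tokens")
```

With these arbitrary constants the minimum happens to land around a few hundred billion tokens, the same order of magnitude as the FP4 figure quoted earlier, but that is a consequence of the placeholder numbers, not a prediction of the actual law.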
6. Experimental Validation and Application Value
To validate the accuracy of the Capybara Scaling Law, the research team conducted large-scale experimental verification. They trained various models ranging from 41 million to 679 million parameters, using different data volumes from 10 billion to 100 billion training tokens, and tested 36 different combinations of precision configurations.
The experimental results were exciting: compared to previous predictive methods, the Capybara Scaling Law was able to predict model performance more accurately, especially in low-precision training scenarios. Previous methods exhibited significant prediction biases when facing extremely low precision configurations like FP3, akin to cooking with the wrong recipe, often yielding unsatisfactory results. In contrast, the predictions made by the Capybara Scaling Law closely matched the actual test results.
More importantly, the research team also validated the applicability of this law on larger models. They tested models with 1.2 billion, 7 billion, and 70 billion parameters, and found that the Capybara Scaling Law could still accurately predict performance, demonstrating its reliability in large-scale applications.
This research has enormous practical value. AI model developers can now predict model performance under different configurations before committing to expensive training runs, and select the optimal training strategy accordingly. Hardware manufacturers can use the recommended floating-point format configurations to design more efficient AI training chips. Research institutions and companies can apply the 4-8 bit cost-performance recommendation to achieve the best results within limited budgets.
7. Profound Impact on the Future
The impact of this research extends far beyond the technical level. It reveals a fundamental trade-off in AI training: in the pursuit of efficiency, we need to find the best balance among precision, speed, cost, and performance.
From an industrial development perspective, this research provides significant support for the democratization of AI. By optimizing quantization strategies, more research institutions and small companies can train high-quality AI models with fewer computational resources. This is akin to inventing more efficient cooking methods, allowing more people to create delicious dishes.
From a scientific research perspective, the Capybara Scaling Law provides new insights into understanding the learning mechanisms of AI models. The discovery of the “critical data size” reveals the intrinsic relationship between model capacity and information processing ability, providing theoretical guidance for future model architecture design.
From an environmental protection perspective, more efficient training strategies mean reduced energy consumption. As the world focuses on the carbon emissions of AI training, this research offers a practical solution: through intelligent quantization strategies, significant reductions in training costs can be achieved while maintaining model performance.
Of course, this research has its limitations. The current experiments are based primarily on the Transformer architecture, and applicability to emerging architectures (such as the Mamba family) still needs validation. The study also focuses on classic floating-point quantization strategies; its extension to other novel low-bit quantization methods remains to be explored.
Ultimately, the most important value of this research lies in how it reshapes our understanding of AI training efficiency. In the past, we might have assumed that better AI models require higher precision, more data, and more computing power. The Tencent Hunyuan team's findings show that intelligent strategies are often more effective than brute force, just as an excellent chef is distinguished not by the most expensive ingredients but by mastery of technique.
This research provides the entire AI community with a valuable toolbox, enabling every developer to find the most suitable training strategy based on their specific needs and resource constraints. In today’s rapidly advancing AI technology landscape, such research outcomes are particularly precious, as they not only drive technological progress but also make technology more accessible and sustainable.
Readers interested in the technical details can refer to the full paper at arXiv:2501.02423v3, which includes the complete mathematical derivations, experimental designs, and result analyses.
Q&A
Q1: What is the Capybara Scaling Law? What problem does it help to solve?
A: The Capybara Scaling Law is a mathematical formula proposed by the Tencent Hunyuan team that can accurately predict the performance of AI models under different configurations of model size, data volume, and floating-point precision. It primarily addresses the issue of inaccurate performance predictions in low-precision training, helping developers choose optimal configurations before commencing costly training.
Q2: Why is more training data not necessarily better? What is the critical data size?
A: In low-precision training, there is a phenomenon known as “critical data size”; when training data exceeds this critical value, model performance declines. This is because, in low-precision training, the model’s information processing capability is limited, similar to a container with limited capacity; attempting to fit in too much information leads to “overflow,” affecting the quality of existing information.
Q3: How should exponent and mantissa bits be configured in floating-point quantization training?
A: The research found that exponent bits contribute slightly more to model performance than mantissa bits. The optimal configuration is: for a total precision of 4 bits, use 2 exponent bits and 1 mantissa bit; for 8 bits, use 4 exponent bits and 3 mantissa bits; and for 16 bits, use 8 exponent bits and 7 mantissa bits. The best cost-performance ratio can be achieved within the 4-8 bit precision range.