🥳 Happy to share our new work, Kinetics: Rethinking Test-Time Scaling Laws
🤔 How to effectively build a powerful reasoning agent?
Existing compute-optimal scaling laws suggest a 1.7B model with 64K thinking tokens beats a 32B model.
But that's only half of the picture!
🚨 The O(N²) KV memory access in self-attention dominates the cost of test-time scaling (TTS).
MoEs make the memory bottleneck even worse: they cut compute, but the KV cache stays the same size.
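Why does KV access dominate? Here's a minimal back-of-the-envelope sketch, not the paper's exact cost model: the `decode_cost` helper, the ~1.7B dense config, and the byte counts below are illustrative assumptions. Per decoded token, the model rereads its entire KV cache, so total KV traffic grows quadratically in generation length while FLOPs grow only linearly.

```python
# Back-of-envelope decode cost: FLOPs grow O(N), KV-cache reads grow O(N^2).
# All constants are illustrative assumptions, not the paper's cost model.

PARAMS   = 1.7e9                 # model parameters (dense ~1.7B, assumed)
KV_BYTES = 2 * 28 * 8 * 128 * 2  # K&V * layers * kv_heads * head_dim * fp16 bytes
                                 # ~112 KB of KV cache per token (assumed config)

def decode_cost(n_tokens):
    """Total cost of autoregressively generating n_tokens."""
    flops = 2 * PARAMS * n_tokens                          # ~2P FLOPs/token -> O(N)
    kv_reads = KV_BYTES * n_tokens * (n_tokens + 1) // 2   # cache reread each step -> O(N^2)
    return flops, kv_reads

for n in (1_000, 8_000, 64_000):
    flops, kv = decode_cost(n)
    print(f"N={n:>6}: FLOPs {flops:.2e} | KV bytes read {kv:.2e}")
```

On modern accelerators, bytes of memory bandwidth are roughly two orders of magnitude scarcer than FLOPs, so the quadratic KV term dominates long before the raw numbers cross.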
Our new scaling law, Kinetics, suggests: invest in model size first before spending more on test-time compute.
This insight leads to our next key finding:
✨ Sparse Attention = Scalable TTS
Our Kinetics sparse scaling law says that when doubling resources, we should prioritize increasing test-time tokens over attention density (see the sketch after the results below).
✅ 60+ points improvement under the same compute budget
✅ 10× lower resource usage for equivalent performance
✅ Sparse attention becomes increasingly valuable in high-cost scenarios
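Here's a toy illustration of that trade-off; the cost constants `C_FLOPS`/`C_KV` and the `tokens_affordable` helper are made up for illustration, not the paper's fitted law. With top-k sparse attention, per-token KV cost is fixed by the density k rather than the context length, so lowering density buys many more test-time tokens under the same budget.

```python
# Toy budget allocator: how many test-time tokens a fixed budget buys
# at a given attention density k (top-k KV entries attended per step).
# Constants are hypothetical, chosen only to illustrate the trade-off.

C_FLOPS = 1.0   # per-token compute cost (normalized units, assumed)
C_KV    = 0.25  # cost per attended KV entry per token (assumed)

def tokens_affordable(budget, k):
    """Tokens a budget buys when each token costs C_FLOPS + C_KV * k."""
    return budget / (C_FLOPS + C_KV * k)

budget = 1000.0
for k in (64, 16, 4, 1):
    print(f"density k={k:>3}: ~{tokens_affordable(budget, k):,.0f} tokens")
```

At these toy constants, dropping density from k=64 to k=4 buys roughly 8× more tokens from the same budget, which is why the sparse law favors spending on generation length over attention density.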
💡 Sparsity is key to unlocking the full potential of TTS: unlike pretraining, where scaling shows diminishing returns, TTS continues to benefit from increased token generation and more optimized inference paths.
arXiv: https://arxiv.org/abs/2506.05333
Website: https://infini-ai-lab.github.io/Kinetics/
Twitter: https://x.com/InfiniAILab/status/1931053042876768586