It is commonly believed that scaling language models must incur a
significant space or time cost, by increasing either the parameters (parameter
scaling) or the number of output tokens (inference-time scaling). We introduce
a third and more inference-efficient scaling paradigm: increasing the model’s
parallel computation during both training and inference. We apply P diverse and
learnable transformations to the input, execute forward passes of the model in
parallel, and dynamically aggregate the P outputs. This method, which we call
parallel scaling (ParScale), scales parallel computation by reusing existing
parameters and can be applied to any model structure, optimization procedure,
data, or task. We theoretically propose a new scaling law and validate it
through large-scale pre-training, which shows that a model with P parallel
streams is comparable to scaling the parameters by O(log P) while offering
superior inference efficiency. For example, ParScale can use up to 22× less
memory increase and 6× less latency increase compared to parameter scaling
that achieves the same performance improvement. It can also recycle an
off-the-shelf pre-trained model into a parallelly scaled one by post-training
on a small number of tokens, further reducing the training budget. The new
scaling law we discovered potentially facilitates the deployment of more
powerful models in low-resource scenarios, and provides an alternative
perspective on the role of computation in machine learning.
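
As a rough illustration of the mechanism described above, the sketch below shows the three steps of ParScale in PyTorch: P learnable input transformations, P forward passes through a shared backbone, and a dynamic, learned aggregation of the P outputs. The prefix-style transformations, the linear aggregation head, and all names (ParallelScaledLM, num_streams, prefix_len) are illustrative assumptions under this abstract, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelScaledLM(nn.Module):
    """Minimal sketch: P learnable input transformations, P parallel forward
    passes through a shared backbone, and a learned dynamic aggregation."""

    def __init__(self, backbone, hidden_size, num_streams=4, prefix_len=8):
        super().__init__()
        self.backbone = backbone  # shared backbone; parameters are reused by all streams
        self.P = num_streams
        # Assumption: stream-specific learnable prefixes serve as the
        # "diverse and learnable transformations" of the input.
        self.prefixes = nn.Parameter(
            torch.randn(num_streams, prefix_len, hidden_size) * 0.02)
        # Assumption: a small head produces per-token aggregation weights
        # over the P streams ("dynamic aggregation").
        self.aggregator = nn.Linear(hidden_size, 1)

    def forward(self, input_embeds):
        # input_embeds: (batch, seq, hidden)
        B, T, _ = input_embeds.shape
        stream_outputs = []
        for p in range(self.P):  # could also be run as one batched forward pass
            prefix = self.prefixes[p].unsqueeze(0).expand(B, -1, -1)
            x = torch.cat([prefix, input_embeds], dim=1)
            h = self.backbone(x)[:, -T:, :]  # drop prefix positions: (B, T, H)
            stream_outputs.append(h)
        streams = torch.stack(stream_outputs, dim=1)  # (B, P, T, H)
        # Dynamic token-level weights over the P streams, then weighted sum.
        w = F.softmax(self.aggregator(streams).squeeze(-1), dim=1)  # (B, P, T)
        return (w.unsqueeze(-1) * streams).sum(dim=1)               # (B, T, H)


# Usage with a toy backbone (any module mapping (B, L, H) -> (B, L, H) works):
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2)
model = ParallelScaledLM(backbone, hidden_size=256, num_streams=4)
out = model(torch.randn(2, 16, 256))  # -> (2, 16, 256)
```

Because the P streams share the backbone's parameters, this adds parallel computation rather than parameters, which is the trade-off the abstract contrasts with parameter scaling and inference-time scaling.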