GShard: Scaling Giant Models With Conditional Computation And Automatic Sharding (Paper Explained)

Google builds a 600-billion-parameter transformer for massively multilingual machine translation at massive scale. Interestingly, the larger model scale comes not from making the transformer deeper, but from making the feedforward layers wider, combined with hard routing that parallelizes the computation across up to 2048 TPUs. A very detailed engineering paper!
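To make the "hard routing" idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names (`top2_route`, `moe_layer`), the list-based data layout, and the simple first-come capacity rule are assumptions made for illustration. It also leaves out parts of the GShard algorithm covered in the video, such as the auxiliary load-balancing loss. The point is only that each token is sent to at most two experts, so adding experts widens the layer without every token paying for every expert.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(gate_logits, capacity):
    """Hard-route each token to at most two experts (illustrative only).

    gate_logits: one list of per-expert logits for each token.
    capacity:    maximum number of tokens a single expert accepts.
    Returns, per token, a list of (expert_index, weight) pairs.
    """
    num_experts = len(gate_logits[0])
    load = [0] * num_experts  # tokens accepted so far by each expert
    assignments = []
    for logits in gate_logits:
        probs = softmax(logits)
        # The two experts with the highest gate probability.
        top2 = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:2]
        norm = sum(probs[e] for e in top2)
        chosen = []
        for e in top2:
            if load[e] < capacity:  # a full expert simply skips the token
                load[e] += 1
                chosen.append((e, probs[e] / norm))
        assignments.append(chosen)
    return assignments


def moe_layer(tokens, gate_logits, experts, capacity):
    """Weighted sum of the selected experts' outputs for each token."""
    outputs = []
    for token, chosen in zip(tokens, top2_route(gate_logits, capacity)):
        out = [0.0] * len(token)
        for e, weight in chosen:
            y = experts[e](token)  # only the chosen experts are evaluated
            out = [o + weight * yi for o, yi in zip(out, y)]
        outputs.append(out)
    return outputs
```

Each token here triggers at most two expert evaluations regardless of how many experts exist. A token whose two preferred experts are both full gets a zero output from this layer; in a real transformer the residual connection would still carry it forward.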

OUTLINE:
0:00 – Intro & Overview
4:10 – Main Results
5:10 – Mixture-of-Experts
16:00 – Difference to Scaling Classic Transformers
18:50 – Backpropagation in Mixture-of-Experts
20:05 – MoE Routing Algorithm in GShard
38:20 – GShard Einsum Examples
47:40 – Massively Multilingual Translation
56:00 – Results
1:11:30 – Conclusion & Comments

ERRATA:
I said the computation of MoE scales linearly, but actually, it’s sub(!)-linear.

Abstract:
Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

Authors:
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen
