Paper Page - ZeCO: Zero Communication Overhead Sequence Parallelism For Linear Attention

A new zero communication overhead sequence parallelism method called ZeCO enables efficient training of large language models with ultra-long sequences across multiple devices.

Linear attention mechanisms deliver significant advantages for Large Language
Models (LLMs) by providing linear computational complexity, enabling efficient
processing of ultra-long sequences (e.g., 1M context). However, existing
Sequence Parallelism (SP) methods, essential for distributing these workloads
across devices, become the primary bottleneck due to substantial communication
overhead. In this paper, we introduce ZeCO (Zero Communication Overhead)
sequence parallelism for linear attention models, a new SP method designed to
overcome these limitations and achieve end-to-end near-linear scalability for
long sequence training. For example, training a model with a 1M sequence length
across 64 devices using ZeCO takes roughly the same time as training with a
16k sequence on a single device. At the heart of ZeCO lies All-Scan, a new
collective communication primitive. All-Scan provides each SP rank with
precisely the initial operator state it requires while maintaining a minimal
communication footprint, effectively eliminating communication overhead.
Theoretically, we prove the optimality of ZeCO, showing that it introduces only
negligible time and space overhead. Empirically, we compare the communication
costs of different sequence parallelism strategies and demonstrate that
All-Scan achieves the fastest communication in SP scenarios. Specifically, on
256 GPUs with an 8M sequence length, ZeCO achieves a 60% speedup compared to
the current state-of-the-art (SOTA) SP method. We believe ZeCO establishes a
clear path toward efficiently training next-generation LLMs on previously
intractable sequence lengths.
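
To make concrete why each SP rank needs an "initial operator state", here is a minimal single-process sketch. It is not the paper's All-Scan implementation. It assumes plain causal linear attention with no gating, decay, or normalization, where the state is the running sum of outer products k v^T, and it simulates the ranks with an ordinary loop instead of real collective communication. All function names (linear_attention, sequence_parallel, and the small helpers) are hypothetical and chosen for illustration only. Under these assumptions, the initial state of a rank is the exclusive prefix sum of the local states of all earlier ranks.

```python
import random


def zeros(dk, dv):
    """Return a dk x dv matrix of zeros (the empty operator state)."""
    return [[0] * dv for _ in range(dk)]


def outer(k, v):
    """Outer product k v^T as a len(k) x len(v) matrix."""
    return [[ki * vj for vj in v] for ki in k]


def mat_add(a, b):
    """Element-wise sum of two matrices; returns a new matrix."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def vec_mat(q, s):
    """Row vector q (length dk) times matrix s (dk x dv)."""
    return [sum(q[i] * s[i][j] for i in range(len(q))) for j in range(len(s[0]))]


def linear_attention(qs, ks, vs, init_state):
    """Causal, ungated linear attention over one chunk.

    Starts from init_state and returns (outputs, final_state).
    """
    state = init_state
    outs = []
    for q, k, v in zip(qs, ks, vs):
        state = mat_add(state, outer(k, v))  # S_t = S_{t-1} + k_t v_t^T
        outs.append(vec_mat(q, state))       # o_t = q_t^T S_t
    return outs, state


def sequence_parallel(qs, ks, vs, num_ranks):
    """Simulate sequence parallelism over num_ranks equal chunks."""
    n, dk, dv = len(qs), len(ks[0]), len(vs[0])
    assert n % num_ranks == 0, "sketch assumes equal-sized chunks"
    size = n // num_ranks
    bounds = [(r * size, (r + 1) * size) for r in range(num_ranks)]

    # Step 1: every rank summarizes its own chunk, starting from zero.
    local_states = []
    for lo, hi in bounds:
        _, s = linear_attention(qs[lo:hi], ks[lo:hi], vs[lo:hi], zeros(dk, dv))
        local_states.append(s)

    # Step 2: exclusive prefix sum over ranks. In a real system this is the
    # communication step; here it is simulated with a sequential loop.
    init_states = []
    running = zeros(dk, dv)
    for s in local_states:
        init_states.append(running)
        running = mat_add(running, s)

    # Step 3: every rank computes its outputs from its own initial state.
    outs = []
    for (lo, hi), s0 in zip(bounds, init_states):
        chunk_out, _ = linear_attention(qs[lo:hi], ks[lo:hi], vs[lo:hi], s0)
        outs.extend(chunk_out)
    return outs


def random_inputs(n, dk, dv, seed=0):
    """Small integer-valued inputs so results can be compared exactly."""
    rng = random.Random(seed)
    make = lambda d: [[rng.randint(-3, 3) for _ in range(d)] for _ in range(n)]
    return make(dk), make(dk), make(dv)
```

Calling sequence_parallel(qs, ks, vs, num_ranks) on inputs from random_inputs should give the same outputs as running linear_attention over the whole sequence from a zero state, for any rank count that divides the sequence length. The sketch only illustrates the data dependency between ranks; it says nothing about how All-Scan schedules or overlaps the communication.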
