Microsoft Research

Dion: the distributed orthonormal update revolution is here

By Advanced AI Editor · August 12, 2025 · 6 min read

Training AI models requires choosing an optimizer, and for nearly a decade, Adam(-W) has been the optimizer of choice. Given that durability and success, it was fair to doubt that any further improvement was possible. And yet, last December, a new optimizer called Muon showed serious promise by powering a nanoGPT speedrun. This promise proved out, with multiple AI labs (e.g., Kimi AI and Essential AI) reporting 2x scale improvements and the release of the 1T-parameter Kimi K2 model. Restated: you can train a model to similar performance with half as many GPUs.

There’s one fly in the ointment: Muon requires large matrix multiplications in the optimizer, which demand heavy communication in large models at the scale where FSDP and TP parallelization become desirable. Going back to the inspiration for Muon, the key idea is an orthonormal update, which sparked the search for more scalable linear-algebra alternatives that realize the same goal. That’s exactly what Dion is. We have open-sourced this new optimizer to enable anyone to train large models more efficiently at scale.

What’s an orthonormal update?

Figure 1. Illustration of matrix parameters

At the core of Transformers, a set of input activations is multiplied by a learned weight matrix to produce a new set of output activations. When the weight matrix is updated during training, the resulting change in the output activations generally depends on the direction of the input activations. As a result, the learning rate must be chosen conservatively to accommodate the input direction that induces the largest change. Orthonormalized updates alter this behavior by (approximately) making the change in output activations invariant to the direction of the input. This is achieved by enforcing orthonormality on the update matrix, thereby equalizing its effect across all input directions.
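
One way to see the effect concretely: take the SVD of an update matrix and replace all of its singular values with ones. The sketch below (illustrative only, not Dion's actual procedure) shows that the orthonormalized update changes the output by the same amount for every singular input direction, while the raw update does not.

```python
import torch

# Illustrative sketch: orthonormalizing an update via a full SVD.
# A raw update dW stretches different input directions by different
# singular values; setting them all to 1 equalizes the effect.
dW = torch.randn(256, 128)
U, S, Vh = torch.linalg.svd(dW, full_matrices=False)
dW_orth = U @ Vh  # same singular directions, all singular values = 1

x1, x2 = Vh[0], Vh[1]  # two orthonormal input directions
print(torch.norm(dW @ x1).item(), torch.norm(dW @ x2).item())          # S[0] vs. S[1]: unequal
print(torch.norm(dW_orth @ x1).item(), torch.norm(dW_orth @ x2).item())  # both ~1.0
```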

What is Dion?

While Muon has shown strong empirical results, scaling it to very large models poses challenges. As reported by Essential AI, applying Muon to large architectures like LLaMA-3 becomes compute-bound—and potentially communication-bound—due to the cost of the Newton–Schulz orthonormalization steps.
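
For context on that cost: Muon orthonormalizes each update with a few Newton–Schulz iterations, each of which multiplies full-size matrices. A minimal sketch is below; the quintic coefficients are the commonly cited ones from the Muon speedrun code and should be treated as an assumption here.

```python
import torch

def newton_schulz_sketch(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Sketch of a Newton-Schulz-style orthonormalization as used by Muon.
    Each step multiplies full-size matrices, which is what becomes
    compute- and communication-heavy at large model scale."""
    a, b, c = 3.4445, -4.7750, 2.0315   # commonly cited coefficients (assumed)
    X = G / (G.norm() + 1e-7)           # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T                          # large matmul
        X = a * X + (b * A + c * A @ A) @ X  # more large matmuls
    return X
```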

Figure 2. Pseudocode of the centralized version of Dion

This is where Dion enters. At a high level, Dion introduces a new axis for scalability: the rank. Specifically, for a given rank r, Dion orthonormalizes only the top r singular directions, reducing communication and compute overhead while preserving performance. Empirically, we observe that the rank needed for good performance grows much more slowly than the number of parameters in larger models.

Dion implements orthonormalization using amortized power iteration. Power iteration typically pulls out the largest singular value by repeated matrix multiplication. By amortizing this process over optimization steps—applied to the slowly evolving momentum matrix—we reduce the cost to just two matrix multiplications per step. Incorporating a QR decomposition allows us to extract an approximate orthonormal basis spanning the top singular directions, rather than just the leading one. This amortized power iteration is fully compatible with standard distributed training techniques such as FSDP and tensor parallelism. Here, we show a simple centralized version, but the technique works for more complex forms of parallelization as presented in the paper. In other words, we can orthogonalize a matrix without ever seeing a full row or column of it.

Low-rank approximation would ordinarily introduce error, but Dion overcomes this through an error feedback mechanism: the residual of the low-rank approximation is kept in the momentum matrix, so any systematic gradient structure not captured at first accumulates and is eventually applied in a future update.
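
Putting the two pieces together, a rough Python rendering of the centralized step in Figure 2 might look like the sketch below. The names and exact ordering of operations are assumptions for illustration; the paper and repository give the authoritative algorithm.

```python
import torch

def dion_step_sketch(M, Q, G, mu=0.95):
    """Illustrative centralized Dion-style step (not the released code).
    M: momentum matrix (m x n); Q: persistent rank-r right basis (n x r)
    carried across steps; G: current gradient (m x n)."""
    B = M + G                             # fold the new gradient into momentum
    P = B @ Q                             # matmul 1: project onto current basis
    P, _ = torch.linalg.qr(P)             # orthonormal left basis (m x r)
    R = B.T @ P                           # matmul 2: refined right factor (n x r)
    M = B - (1 - mu) * (P @ R.T)          # error feedback: residual stays in momentum
    Q = R / (R.norm(dim=0, keepdim=True) + 1e-7)  # column-normalized basis for next step
    update = P @ Q.T                      # approximately orthonormal rank-r update
    return update, M, Q
```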


How does it work?

Something very strange happened in our experiments. Usually, adding an extra constraint on the way an algorithm works can be expected to decrease overall performance. And indeed, at the 120M parameter scale of the speedrun, we see Dion’s update taking more time than Muon, while not yielding any significant gains. But at larger scales, we observed a different trend: Dion began to outperform Muon.

Figure 3. Wall-clock time speedup of Dion for 3B model training

Why would adding a constraint improve the update rule? The answer lies in what the constraint enforces. Dion achieves a much closer approximation to true orthonormalization than Muon. This precision, initially subtle, becomes increasingly important as the number of singular vectors grows. Over increasing model scale and training steps, this small advantage accumulates—leading to a measurable improvement in performance.

This edge further grows with batch size—with larger batches the update quality tends to degrade, but notably more slowly with Dion than Muon (and Muon is already a significant improvement over AdamW).

Figure 4. Scaling of Dion across different batch sizes

Here you can see how the number of steps needed to reach a fixed pretraining loss, relative to AdamW, varies as batch size grows, for full-rank and ¼-rank Dion (in orange) and Muon (in blue).

In our experiments, these benefits extend to various post-training regimes as well.

We also experimented with rank, discovering empirically that larger models tolerate smaller rank well.

Figure 5. Low-rank Dion across different model sizes

Projecting this trend out to the scale of the LLaMA-3 405B parameter model suggests that Dion remains fully effective for large dense models even with rank fractions as low as 1/16 or 1/64.

Using hardware timings of the individual update steps suggests a story that looks like this:

Figure 6. Estimated wall-clock time of each optimizer step for Llama 3 405B. Lower is better. Muon is highlighted in orange as our baseline, next to Dion with varying rank fractions. Suggested rank fractions for a 405B parameter model are shown in blue. Using Dion with rank fraction 1/16 or lower offers an order-of-magnitude speedup over Muon.

We’ve open-sourced a PyTorch FSDP2 + Tensor Parallel (TP) implementation of Dion, available via a simple pip install. Our goal is to make faster training with Dion accessible to everyone. As a bonus, the repository also includes a PyTorch FSDP2 implementation of Muon.
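
A typical invocation would follow the standard PyTorch optimizer pattern, along the lines of the hedged sketch below; the package name, import path, and constructor arguments are assumptions, so consult the repository’s README for the actual API.

```python
# pip install dion   # assumed package name; see the repository README

import torch
from dion import Dion  # assumed import path

model = torch.nn.Linear(1024, 1024)
opt = Dion(model.parameters(), lr=1e-2, rank_fraction=0.25)  # assumed arguments

loss = model(torch.randn(8, 1024)).square().mean()
loss.backward()
opt.step()
opt.zero_grad()
```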

Acknowledgements

We thank Riashat Islam and Pratyusha Sharma for their helpful feedback on the writing and presentation.
