Microsoft Research

Dion: the distributed orthonormal update revolution is here

By Advanced AI Editor · August 12, 2025 · 6 min read

Training AI models requires choosing an optimizer, and for nearly a decade, Adam(-W) has been the optimizer of choice. Given that durability and success, it was fair to doubt that any further improvement was possible. And yet, last December, a new optimizer called Muon showed serious promise by powering a nanoGPT speedrun. This promise proved out, with multiple AI labs (e.g., Kimi AI and Essential AI) reporting 2x scale improvements and the release of the 1T-parameter Kimi K2 model. Restated: you can train a model to similar performance with half as many GPUs.

There’s one fly in the ointment: Muon requires large matrix multiplications in the optimizer, which demand heavy communication in large models at the scale where FSDP and TP parallelization become desirable. Going back to the inspiration for Muon, the key idea is an orthonormal update, which sparked the search for more scalable linear-algebra alternatives that realize the same goal. That’s exactly what Dion is. We have open-sourced this new optimizer to enable anyone to train large models more efficiently at scale.

What’s an orthonormal update?

Figure 1. Illustration of matrix parameters

At the core of Transformers, a set of input activations is multiplied by a learned weight matrix to produce a new set of output activations. When the weight matrix is updated during training, the resulting change in the output activations generally depends on the direction of the input activations. As a result, the learning rate must be chosen conservatively to accommodate the input direction that induces the largest change. Orthonormalized updates alter this behavior by (approximately) making the change in output activations invariant to the direction of the input. This is achieved by enforcing orthonormality on the update matrix, thereby equalizing its effect across all input directions.
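
One way to see the effect concretely: take the SVD of an update matrix and replace all of its singular values with ones. The sketch below (illustrative only, not Dion's actual procedure) shows that the orthonormalized update changes the output by the same amount for every singular input direction, while the raw update does not.

```python
import torch

# Illustrative sketch: orthonormalizing an update via a full SVD.
# A raw update dW stretches different input directions by different
# singular values; setting them all to 1 equalizes the effect.
dW = torch.randn(256, 128)
U, S, Vh = torch.linalg.svd(dW, full_matrices=False)
dW_orth = U @ Vh  # same singular directions, all singular values = 1

x1, x2 = Vh[0], Vh[1]  # two orthonormal input directions
print(torch.norm(dW @ x1).item(), torch.norm(dW @ x2).item())          # S[0] vs. S[1]: unequal
print(torch.norm(dW_orth @ x1).item(), torch.norm(dW_orth @ x2).item())  # both ~1.0
```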

What is Dion?

While Muon has shown strong empirical results, scaling it to very large models poses challenges. As reported by Essential AI, applying Muon to large architectures like LLaMA-3 becomes compute-bound—and potentially communication-bound—due to the cost of the Newton–Schulz orthonormalization steps.
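
For context on that cost: Muon orthonormalizes each update with a few Newton–Schulz iterations, each of which multiplies full-size matrices. A minimal sketch is below; the quintic coefficients are the commonly cited ones from the Muon speedrun code and should be treated as an assumption here.

```python
import torch

def newton_schulz_sketch(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Sketch of a Newton-Schulz-style orthonormalization as used by Muon.
    Each step multiplies full-size matrices, which is what becomes
    compute- and communication-heavy at large model scale."""
    a, b, c = 3.4445, -4.7750, 2.0315   # commonly cited coefficients (assumed)
    X = G / (G.norm() + 1e-7)           # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T                          # large matmul
        X = a * X + (b * A + c * A @ A) @ X  # more large matmuls
    return X
```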

Figure 2. Pseudocode of the centralized version of Dion

This is where Dion enters. At a high level, Dion introduces a new axis for scalability: the rank. Specifically, for a given rank r, Dion orthonormalizes only the top r singular directions, reducing communication and compute overhead while preserving performance. Empirically, we observe that the rank needed for good performance grows much more slowly than the number of parameters in larger models.

Dion implements orthonormalization using amortized power iteration. Power iteration typically pulls out the largest singular value by repeated matrix multiplication. By amortizing this process over optimization steps—applied to the slowly evolving momentum matrix—we reduce the cost to just two matrix multiplications per step. Incorporating a QR decomposition allows us to extract an approximate orthonormal basis spanning the top singular directions, rather than just the leading one. This amortized power iteration is fully compatible with standard distributed training techniques such as FSDP and tensor parallelism. Here, we show a simple centralized version, but the technique works for more complex forms of parallelization as presented in the paper. In other words, we can orthogonalize a matrix without ever seeing a full row or column of it.

Low-rank approximation would ordinarily introduce error, but Dion overcomes this through an error feedback mechanism: the residual of the low-rank approximation is kept in the momentum matrix, so any systematic gradient structure not captured at first accumulates and is eventually applied in a future update.
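
Putting the two pieces together, a rough Python rendering of the centralized step in Figure 2 might look like the sketch below. The names and exact ordering of operations are assumptions for illustration; the paper and repository give the authoritative algorithm.

```python
import torch

def dion_step_sketch(M, Q, G, mu=0.95):
    """Illustrative centralized Dion-style step (not the released code).
    M: momentum matrix (m x n); Q: persistent rank-r right basis (n x r)
    carried across steps; G: current gradient (m x n)."""
    B = M + G                             # fold the new gradient into momentum
    P = B @ Q                             # matmul 1: project onto current basis
    P, _ = torch.linalg.qr(P)             # orthonormal left basis (m x r)
    R = B.T @ P                           # matmul 2: refined right factor (n x r)
    M = B - (1 - mu) * (P @ R.T)          # error feedback: residual stays in momentum
    Q = R / (R.norm(dim=0, keepdim=True) + 1e-7)  # column-normalized basis for next step
    update = P @ Q.T                      # approximately orthonormal rank-r update
    return update, M, Q
```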


How does it work?

Something very strange happened in our experiments. Usually, adding an extra constraint on the way an algorithm works can be expected to decrease overall performance. And indeed, at the 120M parameter scale of the speedrun, we see Dion’s update taking more time than Muon, while not yielding any significant gains. But at larger scales, we observed a different trend: Dion began to outperform Muon.

Figure 3. Wall-clock time speedup of Dion for 3B model training

Why would adding a constraint improve the update rule? The answer lies in what the constraint enforces. Dion achieves a much closer approximation to true orthonormalization than Muon. This precision, initially subtle, becomes increasingly important as the number of singular vectors grows. Over increasing model scale and training steps, this small advantage accumulates—leading to a measurable improvement in performance.

This edge further grows with batch size—with larger batches the update quality tends to degrade, but notably more slowly with Dion than Muon (and Muon is already a significant improvement over AdamW).

Figure 4. Scaling of Dion across different batch sizes

Here you can see how the number of steps needed to reach a fixed pretraining loss, relative to AdamW, varies as batch size grows, for full-rank and ¼-rank Dion (in orange) and Muon (in blue).

In our experiments, these benefits extend to various post-training regimes as well.

We also experimented with rank, discovering empirically that larger models tolerate smaller rank well.

Figure 5. Low-rank Dion across different model sizes

Projecting this trend out to the scale of the LLaMA-3 405B parameter model suggests that Dion remains fully effective for large dense models even with rank fractions as low as 1/16 or 1/64.

Using hardware timings of the individual update steps suggests a story that looks like this:

Figure 6. Estimated wall-clock time of each optimizer step for Llama 3 405B. Lower is better. Muon is highlighted in orange as our baseline, next to Dion with varying rank fractions. Suggested rank fractions for a 405B parameter model are shown in blue. Using Dion with rank fraction 1/16 or lower offers an order-of-magnitude speedup over Muon.

We’ve open-sourced a PyTorch FSDP2 + Tensor Parallel (TP) implementation of Dion, available via a simple pip install. Our goal is to make faster training with Dion accessible to everyone. As a bonus, the repository also includes a PyTorch FSDP2 implementation of Muon.
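
A typical invocation would follow the standard PyTorch optimizer pattern, along the lines of the hedged sketch below; the package name, import path, and constructor arguments are assumptions, so consult the repository’s README for the actual API.

```python
# pip install dion   # assumed package name; see the repository README

import torch
from dion import Dion  # assumed import path

model = torch.nn.Linear(1024, 1024)
opt = Dion(model.parameters(), lr=1e-2, rank_fraction=0.25)  # assumed arguments

loss = model(torch.randn(8, 1024)).square().mean()
loss.backward()
opt.step()
opt.zero_grad()
```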

Acknowledgements

We thank Riashat Islam and Pratyusha Sharma for their helpful feedback on the writing and presentation.
