Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks - Takara TLDR

Empirical scaling laws have driven the evolution of large language models
(LLMs), yet their coefficients shift whenever the model architecture or data
pipeline changes. Mixture-of-Experts (MoE) models, now standard in
state-of-the-art systems, introduce a new sparsity dimension that current
dense-model frontiers overlook. We investigate how MoE sparsity influences two
distinct capability regimes: memorization and reasoning. We train families of
MoE Transformers that systematically vary total parameters, active parameters,
and top-$k$ routing while holding the compute budget fixed. For every model we
record pre-training loss, downstream task loss, and task accuracy, allowing us
to separate the train-test generalization gap from the loss-accuracy gap.
Memorization benchmarks improve monotonically with total parameters, mirroring
training loss. By contrast, reasoning performance saturates and can even
regress despite continued gains in both total parameters and training loss.
Altering top-$k$ alone has little effect when active parameters are constant,
and classic hyperparameters such as learning rate and initialization modulate
the generalization gap in the same direction as sparsity. Neither post-training
reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning
deficit of overly sparse models. Our model checkpoints, code and logs are
open-source at https://github.com/rioyokotalab/optimal-sparsity.
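
To make the distinction between total parameters, active parameters, and top-$k$ concrete, here is a minimal Python sketch of the bookkeeping for the expert feed-forward layers only. It is not taken from the authors' repository. The `MoEConfig` class, the helper names, the two-matrix expert (biases, attention, embeddings, and the router ignored), and the definition of sparsity as 1 - top_k / n_experts are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical MoE configuration; field names are illustrative only."""
    d_model: int    # hidden size
    d_ff: int       # per-expert feed-forward width
    n_layers: int   # number of MoE layers
    n_experts: int  # experts per layer
    top_k: int      # experts routed to per token


def expert_params(cfg: MoEConfig) -> int:
    # Assumption: each expert is an up projection plus a down projection,
    # with biases ignored.
    return 2 * cfg.d_model * cfg.d_ff


def total_ffn_params(cfg: MoEConfig) -> int:
    # Every expert in every layer counts toward total parameters.
    return cfg.n_layers * cfg.n_experts * expert_params(cfg)


def active_ffn_params(cfg: MoEConfig) -> int:
    # Only the top_k selected experts per layer are used for a given token.
    return cfg.n_layers * cfg.top_k * expert_params(cfg)


def sparsity(cfg: MoEConfig) -> float:
    # Assumed definition: fraction of experts left idle for each token.
    return 1.0 - cfg.top_k / cfg.n_experts


# Two configurations that differ in top_k (and expert granularity) but match
# in both total and active expert parameters.
coarse = MoEConfig(d_model=512, d_ff=2048, n_layers=8, n_experts=16, top_k=2)
fine = MoEConfig(d_model=512, d_ff=1024, n_layers=8, n_experts=32, top_k=4)

for name, cfg in (("coarse", coarse), ("fine", fine)):
    print(name, total_ffn_params(cfg), active_ffn_params(cfg), sparsity(cfg))
```

Under these assumptions, `coarse` and `fine` have identical total and active expert parameter counts and the same sparsity, so they differ only in top-$k$ and expert width. That is the kind of comparison the abstract refers to when it says that altering top-$k$ alone has little effect when active parameters are constant.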

