Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks - Takara TLDR

Empirical scaling laws have driven the evolution of large language models
(LLMs), yet their coefficients shift whenever the model architecture or data
pipeline changes. Mixture-of-Experts (MoE) models, now standard in
state-of-the-art systems, introduce a new sparsity dimension that current
dense-model frontiers overlook. We investigate how MoE sparsity influences two
distinct capability regimes: memorization and reasoning. We train families of
MoE Transformers that systematically vary total parameters, active parameters,
and top-$k$ routing while holding the compute budget fixed. For every model we
record pre-training loss, downstream task loss, and task accuracy, allowing us
to separate the train-test generalization gap from the loss-accuracy gap.
Memorization benchmarks improve monotonically with total parameters, mirroring
training loss. By contrast, reasoning performance saturates and can even
regress despite continued gains in both total parameters and training loss.
Altering top-$k$ alone has little effect when active parameters are constant,
and classic hyperparameters such as learning rate and initialization modulate
the generalization gap in the same direction as sparsity. Neither post-training
reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning
deficit of overly sparse models. Our model checkpoints, code and logs are
open-source at https://github.com/rioyokotalab/optimal-sparsity.
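
To make the distinction between total parameters, active parameters, and top-$k$ concrete, the following is a minimal sketch and is not taken from the authors' code. The names `MoEConfig`, `total_expert_params`, `active_expert_params`, `sparsity`, and `route_top_k` are hypothetical. The sketch assumes each expert is a two-matrix feed-forward block, ignores attention, embedding, router, and bias parameters, and assumes sparsity is measured as the fraction of experts left inactive per token. It shows how top-$k$ can change while total and active expert parameters stay fixed, which is the kind of controlled comparison the abstract describes.

```python
import math
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical MoE configuration; field names are illustrative only."""
    num_layers: int   # number of MoE layers
    d_model: int      # model (residual stream) width
    d_ff: int         # hidden width of each expert's feed-forward block
    num_experts: int  # experts per MoE layer
    top_k: int        # experts activated per token


def expert_params(cfg: MoEConfig) -> int:
    # Assumption: one up-projection and one down-projection, no biases.
    return 2 * cfg.d_model * cfg.d_ff


def total_expert_params(cfg: MoEConfig) -> int:
    # Every expert in every layer counts toward the total.
    return cfg.num_layers * cfg.num_experts * expert_params(cfg)


def active_expert_params(cfg: MoEConfig) -> int:
    # Only the top_k routed experts per layer are used for a given token.
    return cfg.num_layers * cfg.top_k * expert_params(cfg)


def sparsity(cfg: MoEConfig) -> float:
    # Assumption: sparsity is the fraction of experts NOT used per token.
    return 1.0 - cfg.top_k / cfg.num_experts


def route_top_k(logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


# Two configurations that differ in top_k but match in total and active
# expert parameters: halving d_ff while doubling num_experts and top_k.
coarse = MoEConfig(num_layers=16, d_model=1024, d_ff=2048, num_experts=64, top_k=2)
fine = MoEConfig(num_layers=16, d_model=1024, d_ff=1024, num_experts=128, top_k=4)

if __name__ == "__main__":
    for name, cfg in (("coarse", coarse), ("fine", fine)):
        print(name, total_expert_params(cfg), active_expert_params(cfg), sparsity(cfg))
    print(route_top_k([0.1, 2.0, -1.0, 1.5], k=2))
```

Under these assumptions, `coarse` and `fine` have identical total and active expert parameter counts and the same sparsity, so any difference between them would come from top-$k$ (and expert granularity) alone.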
