In this paper, we mainly address two challenges faced by existing MoE architectures:
Performance compromise caused by imperfect routing, especially the non-differentiability and inflexibility of vanilla routing paradigms;
Unfriendliness to acceleration caused by low chunk-level sparsity (CLS), especially in settings where multiple tokens are processed simultaneously, such as offloading and speculative decoding (the sketch below contrasts CLS with token-level sparsity).
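To make the distinction concrete, below is a minimal PyTorch sketch (illustrative, not taken from the paper) contrasting token-level sparsity with chunk-level sparsity: an expert counts as active for a whole chunk if any token in the chunk routes to it, so per-token sparsity can be high while chunk-level sparsity collapses. The function names, tensor shapes, and the toy mask are assumptions made for illustration.

```python
import torch

def token_level_sparsity(act_mask: torch.Tensor) -> float:
    """act_mask: [num_tokens, num_experts] boolean, True = expert active for that token."""
    return 1.0 - act_mask.float().mean().item()

def chunk_level_sparsity(act_mask: torch.Tensor, chunk_size: int = 8) -> float:
    """An expert counts as active for a chunk if ANY of the chunk's tokens uses it."""
    num_tokens, num_experts = act_mask.shape
    usable = num_tokens - num_tokens % chunk_size        # drop the ragged tail
    chunks = act_mask[:usable].view(-1, chunk_size, num_experts).any(dim=1)
    return 1.0 - chunks.float().mean().item()

# Toy mask: every token activates exactly one of 32 experts (high TLS), but
# neighboring tokens pick different experts, so an 8-token chunk touches
# 8 distinct experts and CLS drops to 75%.
mask = torch.zeros(64, 32, dtype=torch.bool)
mask[torch.arange(64), torch.arange(64) % 32] = True
print(token_level_sparsity(mask))    # ~0.97
print(chunk_level_sparsity(mask))    # 0.75
```

In this toy case, each token is highly sparse on its own, yet any 8-token chunk activates many distinct experts; this is why multi-token workloads such as offloading and speculative decoding benefit little from token-level sparsity alone.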
To address these challenges, we introduce BlockFFN, a novel MoE architecture, together with its training techniques and efficient end-side deployment.
For model architectures, we propose BlockFFN, a novel MoE paradigm whose router, built on ReLU activation and RMSNorm, makes routing differentiable and flexible and thus minimizes the performance compromise. Experiments demonstrate that it outperforms other MoE baselines, including TopK, DeepSeekMoE, GRIN, and ReMoE.
For training techniques, we introduce CLS-aware training objectives that improve both the CLS of BlockFFN and the vanilla token-level sparsity (TLS). In experiments, we obtain average TLS values above 80% and 8-token CLS values above 70%.
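While the exact objectives are specific to the paper, the sketch below illustrates the general idea of a CLS-aware regularizer under our own assumptions: penalizing the chunk-wise maximum routing weight per expert acts as a smooth surrogate for how many experts the whole chunk activates, encouraging neighboring tokens to reuse the same expert blocks. The function name, chunk size, and loss weight are hypothetical.

```python
import torch

def chunk_sparsity_loss(router_weights: torch.Tensor, chunk_size: int = 8) -> torch.Tensor:
    """router_weights: [seq_len, num_experts] non-negative routing scores.
    The max over each chunk approximates the union of active experts, so an
    L1-style penalty on it discourages neighboring tokens from scattering
    their activations across disjoint expert sets."""
    seq_len, num_experts = router_weights.shape
    usable = seq_len - seq_len % chunk_size           # drop the ragged tail
    chunks = router_weights[:usable].view(-1, chunk_size, num_experts)
    chunk_union = chunks.max(dim=1).values            # [num_chunks, num_experts]
    return chunk_union.mean()

# Hypothetical usage: added to the language-modeling loss with a small weight.
# loss = lm_loss + 0.01 * chunk_sparsity_loss(router_weights)
```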
For end-side deployment, we implement efficient acceleration kernels for BlockFFN, combining activation sparsity and speculative decoding for the first time. On an NVIDIA Jetson Orin NX, the kernels achieve a 3.67x speedup over the auto-regressive (AR) decoding baseline.
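To show why high CLS matters for such kernels, here is a minimal PyTorch sketch of the underlying idea (not the actual CUDA kernels): when processing a chunk of speculatively drafted tokens, only the expert blocks in the union of the chunk's activations need to be loaded and multiplied. The tensor layouts and function name are illustrative assumptions.

```python
import torch

def sparse_chunk_ffn(x: torch.Tensor, weights: torch.Tensor,
                     up: torch.Tensor, down: torch.Tensor) -> torch.Tensor:
    """x: [chunk, hidden], weights: [chunk, num_experts] non-negative routing scores,
    up: [num_experts, hidden, expert_dim], down: [num_experts, expert_dim, hidden]."""
    active = (weights > 0).any(dim=0).nonzero(as_tuple=True)[0]  # union over the chunk
    w = weights[:, active]                                       # [chunk, n_active]
    h = torch.relu(torch.einsum("cd,edf->cef", x, up[active]))   # only active blocks
    y = torch.einsum("cef,efd->ced", h, down[active])
    return torch.einsum("ce,ced->cd", w, y)
```

With a high 8-token CLS, this union stays small, so both the expert weights that must be fetched (relevant for offloading) and the matrix-multiplication work shrink roughly in proportion to 1 - CLS.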