In this paper, we mainly address two challenges faced by existing MoE architectures:
Performance compromise caused by imperfect routing, especially the non-differentiability and inflexibility of vanilla routing paradigms;
Unfriendliness to acceleration caused by low chunk-level sparsity (CLS), especially in settings where multiple tokens are processed simultaneously, such as offloading and speculative decoding (the sketch below contrasts CLS with token-level sparsity).
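To make the distinction concrete, below is a minimal PyTorch sketch (illustrative, not taken from the paper) contrasting token-level sparsity with chunk-level sparsity: an expert counts as active for a whole chunk if any token in the chunk routes to it, so per-token sparsity can be high while chunk-level sparsity collapses. The function names, tensor shapes, and the toy mask are assumptions made for illustration.

```python
import torch

def token_level_sparsity(act_mask: torch.Tensor) -> float:
    """act_mask: [num_tokens, num_experts] boolean, True = expert active for that token."""
    return 1.0 - act_mask.float().mean().item()

def chunk_level_sparsity(act_mask: torch.Tensor, chunk_size: int = 8) -> float:
    """An expert counts as active for a chunk if ANY of the chunk's tokens uses it."""
    num_tokens, num_experts = act_mask.shape
    usable = num_tokens - num_tokens % chunk_size        # drop the ragged tail
    chunks = act_mask[:usable].view(-1, chunk_size, num_experts).any(dim=1)
    return 1.0 - chunks.float().mean().item()

# Toy mask: every token activates exactly one of 32 experts (high TLS), but
# neighboring tokens pick different experts, so an 8-token chunk touches
# 8 distinct experts and CLS drops to 75%.
mask = torch.zeros(64, 32, dtype=torch.bool)
mask[torch.arange(64), torch.arange(64) % 32] = True
print(token_level_sparsity(mask))    # ~0.97
print(chunk_level_sparsity(mask))    # 0.75
```

In this toy case, each token is highly sparse on its own, yet any 8-token chunk activates many distinct experts; this is why multi-token workloads such as offloading and speculative decoding benefit little from token-level sparsity alone.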
To address these challenges, we introduce BlockFFN, a novel MoE architecture, together with its training techniques and efficient end-side deployment.
For model architectures, we propose BlockFFN, a novel MoE paradigm whose router, built on ReLU activation and RMSNorm, makes routing differentiable and flexible and thus minimizes the performance compromise. Experiments demonstrate that it outperforms other MoE baselines, including TopK, DeepSeekMoE, GRIN, and ReMoE.
For training techniques, we introduce CLS-aware training objectives that improve both the CLS of BlockFFN and the vanilla token-level sparsity (TLS). In experiments, we obtain average TLS values above 80% and 8-token CLS values above 70%.
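While the exact objectives are specific to the paper, the sketch below illustrates the general idea of a CLS-aware regularizer under our own assumptions: penalizing the chunk-wise maximum routing weight per expert acts as a smooth surrogate for how many experts the whole chunk activates, encouraging neighboring tokens to reuse the same expert blocks. The function name, chunk size, and loss weight are hypothetical.

```python
import torch

def chunk_sparsity_loss(router_weights: torch.Tensor, chunk_size: int = 8) -> torch.Tensor:
    """router_weights: [seq_len, num_experts] non-negative routing scores.
    The max over each chunk approximates the union of active experts, so an
    L1-style penalty on it discourages neighboring tokens from scattering
    their activations across disjoint expert sets."""
    seq_len, num_experts = router_weights.shape
    usable = seq_len - seq_len % chunk_size           # drop the ragged tail
    chunks = router_weights[:usable].view(-1, chunk_size, num_experts)
    chunk_union = chunks.max(dim=1).values            # [num_chunks, num_experts]
    return chunk_union.mean()

# Hypothetical usage: added to the language-modeling loss with a small weight.
# loss = lm_loss + 0.01 * chunk_sparsity_loss(router_weights)
```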
For end-side deployment, we implement efficient acceleration kernels for BlockFFN, combining activation sparsity and speculative decoding for the first time. On an NVIDIA Jetson Orin NX, the kernels achieve a 3.67x speedup over the auto-regressive (AR) decoding baseline.
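To show why high CLS matters for such kernels, here is a minimal PyTorch sketch of the underlying idea (not the actual CUDA kernels): when processing a chunk of speculatively drafted tokens, only the expert blocks in the union of the chunk's activations need to be loaded and multiplied. The tensor layouts and function name are illustrative assumptions.

```python
import torch

def sparse_chunk_ffn(x: torch.Tensor, weights: torch.Tensor,
                     up: torch.Tensor, down: torch.Tensor) -> torch.Tensor:
    """x: [chunk, hidden], weights: [chunk, num_experts] non-negative routing scores,
    up: [num_experts, hidden, expert_dim], down: [num_experts, expert_dim, hidden]."""
    active = (weights > 0).any(dim=0).nonzero(as_tuple=True)[0]  # union over the chunk
    w = weights[:, active]                                       # [chunk, n_active]
    h = torch.relu(torch.einsum("cd,edf->cef", x, up[active]))   # only active blocks
    y = torch.einsum("cef,efd->ced", h, down[active])
    return torch.einsum("ce,ced->cd", w, y)
```

With a high 8-token CLS, this union stays small, so both the expert weights that must be fetched (relevant for offloading) and the matrix-multiplication work shrink roughly in proportion to 1 - CLS.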