QbitAI has learned that the Qwen team has released its next-generation model architecture, Qwen3-Next. The release offers a preview of Qwen3.5 and open-sources the Qwen3-Next-80B-A3B-Base model, which achieves significant performance gains while drastically reducing inference costs, pointing to a new direction in large-model development.
Innovative Hybrid Architecture: Gated DeltaNet and a Hybrid Attention Mechanism
One of the core improvements in Qwen3-Next is its hybrid attention mechanism. To address both the limitations of linear attention on long contexts and the high cost of standard attention, the Qwen team introduced Gated DeltaNet. Gated DeltaNet shows strong in-context learning ability, and the team adopted a 3:1 hybrid strategy (75% of layers use Gated DeltaNet, while 25% retain standard attention) to balance performance and efficiency. Within the standard attention layers, the team further added an output gating mechanism, enlarged the attention head dimensions, and introduced rotary position encoding to strengthen long-sequence extrapolation.
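As a rough, hypothetical sketch (not the Qwen team's actual code), the 3:1 mix can be pictured as a repeating layer schedule in which every fourth block keeps standard attention and the other three use the linear Gated DeltaNet mixer:

```python
# Hypothetical sketch of a 3:1 hybrid layer schedule: three Gated DeltaNet
# (linear attention) blocks followed by one standard attention block,
# repeated across the network depth. Names are illustrative only.

from dataclasses import dataclass

@dataclass
class LayerSpec:
    index: int
    kind: str  # "gated_deltanet" or "standard_attention"

def build_hybrid_schedule(num_layers: int, period: int = 4) -> list:
    """Every `period`-th layer keeps full softmax attention; the rest
    use the linear Gated DeltaNet token mixer (a 75% / 25% split)."""
    return [
        LayerSpec(i, "standard_attention" if (i + 1) % period == 0 else "gated_deltanet")
        for i in range(num_layers)
    ]

if __name__ == "__main__":
    schedule = build_hybrid_schedule(num_layers=12)
    counts = {}
    for spec in schedule:
        counts[spec.kind] = counts.get(spec.kind, 0) + 1
    print(counts)  # {'gated_deltanet': 9, 'standard_attention': 3}
```

The intuition behind such a periodic placement is that a small number of full-attention layers preserves global retrieval over long contexts, while the majority of layers run at linear cost.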
High Sparsity MoE Architecture and Training Optimization
Qwen3-Next adopts a highly sparse MoE architecture with 80 billion total parameters, of which only about 3 billion are activated per inference step. This design squeezes more out of the available compute without sacrificing performance. The team also used Zero-Centered RMSNorm and applied weight decay to the norm weights to improve training stability, and initialized the MoE router so that every expert can be selected without bias early in training, reducing the influence of initialization on experimental results. Together, these optimizations improve both the stability and the efficiency of training.
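To make the sparsity concrete, here is a minimal top-k MoE sketch in PyTorch, with illustrative sizes and a near-zero router initialization (not Qwen3-Next's actual configuration): only the `top_k` routed experts run for each token, so most expert parameters stay idle on any given forward pass.

```python
# Minimal high-sparsity MoE sketch (hypothetical sizes, not Qwen3-Next's config).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, num_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Near-zero router init: no expert is strongly preferred at the start,
        # a stand-in for the unbiased router initialization described above.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        nn.init.normal_(self.router.weight, std=1e-3)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

if __name__ == "__main__":
    moe = SparseMoE()
    total = sum(p.numel() for p in moe.parameters())
    active = moe.router.weight.numel() + moe.top_k * sum(
        p.numel() for p in moe.experts[0].parameters())
    print(moe(torch.randn(8, 256)).shape)          # torch.Size([8, 256])
    print(f"total params: {total:,}  ~active per token: {active:,}")
```

Scaled up, the same principle is what allows a model with 80 billion total parameters to activate only about 3 billion per token.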
Multi-Token Prediction Mechanism and Performance Leap
Qwen3-Next introduces a native Multi-Token Prediction (MTP) mechanism, which not only strengthens the model backbone but is also specifically tuned to raise the acceptance rate of speculative decoding. Thanks to these innovations, Qwen3-Next delivers large gains: with only 15T tokens of pre-training corpus, it requires less than 80% of the GPU hours needed to train Qwen3-30B-A3B. Compared with Qwen3-32B, Qwen3-Next-80B-A3B achieves nearly 7x the throughput in the prefill phase, rising to more than 10x for contexts longer than 32k; in the decode phase, throughput improves by about 4x at 4k context and keeps more than a 10x advantage in long-context scenarios. Building on Qwen3-Next, the team also released Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking, both of which perform strongly across multiple benchmarks and even surpass the closed-source Gemini-2.5-Flash-Thinking.
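How an MTP head helps speculative decoding can be illustrated with a toy draft-and-verify loop; the "models" below are stand-in functions, not Qwen3-Next's actual heads. The cheap drafter proposes a few tokens ahead, the main model verifies them, and the longest agreeing prefix is accepted in one step.

```python
# Toy draft-and-verify loop illustrating speculative decoding with an
# MTP-style drafter. Both "models" are stand-in functions for illustration.
import random

VOCAB = list(range(100))
_noise = random.Random(0)   # separate RNG for the drafter's imperfection

def main_model_next(ctx):
    """Expensive main model: deterministic greedy next token for a context."""
    return random.Random(hash(tuple(ctx)) % (2**32)).choice(VOCAB)

def mtp_draft(ctx, k=3):
    """Cheap MTP-style head: guesses k tokens ahead, agreeing with the
    main model most of the time (80% here, purely illustrative)."""
    out, c = [], list(ctx)
    for _ in range(k):
        tok = main_model_next(c) if _noise.random() < 0.8 else _noise.choice(VOCAB)
        out.append(tok)
        c.append(tok)
    return out

def speculative_step(ctx, k=3):
    """Verify the drafted tokens; keep the agreeing prefix, then correct
    the first mismatch with the main model's own token."""
    accepted, c = [], list(ctx)
    for tok in mtp_draft(ctx, k):
        target = main_model_next(c)      # in practice this check is batched
        if tok == target:
            accepted.append(tok)
            c.append(tok)
        else:
            accepted.append(target)
            break
    return accepted

if __name__ == "__main__":
    ctx = [1, 2, 3]
    for _ in range(5):
        step = speculative_step(ctx)
        ctx += step
        print(f"advanced by {len(step)} token(s) this step")
```

A higher acceptance rate means more drafted tokens survive verification per step, which is exactly the quantity the MTP optimizations target.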
Measured Performance: AIME Competition Problems and Programming Applications
In practical use, Qwen3-Next-80B-A3B shows strong reasoning ability. On the QwenChat webpage, the model solved AIME math competition problems almost instantly, laying out detailed reasoning and answers. In programming, it generated p5js code for a Minesweeper game. These hands-on results illustrate Qwen3-Next's performance across different tasks, and its release adds fresh momentum to a fast-moving AI industry.
Cost-Effectiveness and Future Outlook
While improving performance, Qwen3-Next also cuts training costs significantly: according to official data, training Qwen3-Next-80B-A3B costs only about one-tenth as much as training Qwen3-32B. This improvement in cost-effectiveness is expected to push AI technology into more fields. As the technology keeps advancing, there is good reason to expect large models to make further breakthroughs in performance, efficiency, and cost. Which technical innovations do you think will play the key role in the next stage of large-model development?