Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely
sub-optimal expert pathways: our study reveals that the naive expert selection
learned from pretraining leaves a surprising 10-20% accuracy gap for
improvement. Motivated by this observation, we develop a novel class of
test-time optimization methods to re-weight or “re-mix” the experts in
different layers jointly for each test sample. Since the test sample’s ground
truth is unknown, we propose to optimize a surrogate objective defined by the
sample’s “successful neighbors” from a reference set of samples. We introduce
three surrogates and algorithms based on mode-finding, kernel regression, and
the average loss of similar reference samples/tasks. To reduce the cost of
optimizing whole pathways, we apply our algorithms only to the core experts’
mixing weights in critical layers, which achieves similar performance while
saving significant computation. This leads to “Critical-Layer, Core-Expert,
Collaborative Pathway Optimization (C3PO)”. We apply C3PO to two recent MoE
LLMs and evaluate it on six widely-used benchmarks. It consistently improves the
base model by 7-15% in accuracy and outperforms widely used test-time learning
baselines, e.g., in-context learning and prompt/prefix tuning, by a large
margin. Moreover, C3PO enables MoE LLMs with 1-3B active parameters to
outperform LLMs with 7-9B parameters, thereby strengthening MoE’s efficiency
advantages. Our thorough ablation study further provides novel insights into
achieving test-time improvement for MoE.
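
To make the re-mixing idea concrete, the following is a minimal, self-contained sketch (NumPy only) of one plausible instantiation of the kernel-regression surrogate: the core experts’ mixing weights in a critical layer are optimized by gradient descent on a similarity-weighted average of the losses of the test sample’s “successful neighbors”. The RBF kernel, the finite-difference optimizer, the toy loss, and all function names and shapes are illustrative assumptions for exposition, not the paper’s implementation.

# Illustrative sketch of a neighborhood-based surrogate for test-time re-mixing.
# All names, shapes, and the optimizer below are assumptions, not the paper's code.
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kernel_weights(test_emb, ref_embs, bandwidth=1.0):
    # RBF kernel similarity between the test sample and reference samples.
    d2 = ((ref_embs - test_emb) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def surrogate_loss(mix_logits, neighbor_losses, sim):
    # Kernel-regression surrogate: similarity-weighted average of the losses
    # that the successful neighbors incur under the candidate mixing weights.
    # neighbor_losses: callable mapping mixing weights -> per-neighbor losses.
    w = softmax(mix_logits)              # mixing weights over core experts
    losses = neighbor_losses(w)          # shape: (num_neighbors,)
    return (sim * losses).sum() / sim.sum()

def remix_core_experts(mix_logits, neighbor_losses, sim, steps=50, lr=0.5, eps=1e-3):
    # Optimize core-expert mixing logits with simple finite-difference gradient descent.
    x = mix_logits.copy()
    for _ in range(steps):
        base = surrogate_loss(x, neighbor_losses, sim)
        grad = np.zeros_like(x)
        for i in range(x.size):
            x_eps = x.copy()
            x_eps[i] += eps
            grad[i] = (surrogate_loss(x_eps, neighbor_losses, sim) - base) / eps
        x -= lr * grad
    return softmax(x)

# Toy usage: 4 core experts in one critical layer, 8 successful neighbors.
rng = np.random.default_rng(0)
test_emb, ref_embs = rng.normal(size=16), rng.normal(size=(8, 16))
target = softmax(rng.normal(size=4))     # pretend each neighbor prefers this mixture
neighbor_losses = lambda w: np.full(8, ((w - target) ** 2).sum())
sim = kernel_weights(test_emb, ref_embs)
print(remix_core_experts(softmax(rng.normal(size=4)), neighbor_losses, sim))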