Learning To Route LLMs From Bandit Feedback: One Policy, Many Trade-offs - Takara TLDR

Efficient use of large language models (LLMs) is critical for deployment at
scale: without adaptive routing, systems either overpay for strong models or
risk poor performance from weaker ones. Selecting the right LLM for each query
is fundamentally an online decision problem: models differ in strengths, prices
fluctuate, and users value accuracy and cost differently. Yet most routers are
trained offline with labels for all candidate models, an assumption that breaks
in deployment, where only the outcome of the chosen model is observed. We
bridge this gap with BaRP, a Bandit-feedback Routing with Preferences approach
that trains under the same partial-feedback restriction as deployment, while
supporting preference-tunable inference: operators can dial the
performance/cost trade-off at test time without retraining. Framed as a
contextual bandit over prompt features and a user preference vector, our method
simulates an online feedback setting during training and adapts its routing
decisions to each new prompt, rather than depending on full-information offline
supervision. Comprehensive experiments show that our method consistently
outperforms strong offline routers by at least 12.46% and the largest LLM by at
least 2.45%, and generalizes robustly to unseen tasks.
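
To make the bandit framing concrete, here is a minimal sketch of a preference-conditioned router trained from partial feedback. It is not the BaRP implementation: the linear softmax policy, the REINFORCE-style update, the scalar reward `pref * quality - (1 - pref) * cost`, and the two-model pool with its quality and cost values are all assumptions made for illustration. The only properties taken from the abstract are that the policy sees prompt features plus a preference input, and that only the chosen model's outcome is observed.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_reward(quality, cost, pref):
    """Assumed trade-off: pref=1 values quality only, pref=0 values cost only."""
    return pref * quality - (1.0 - pref) * cost


class PreferenceRouter:
    """Hypothetical linear softmax policy over candidate models."""

    def __init__(self, n_features, n_models, lr=0.1, seed=0):
        self.rng = random.Random(seed)
        self.lr = lr
        # One weight row per model; inputs are prompt features + preference + bias.
        self.weights = [[0.0] * (n_features + 2) for _ in range(n_models)]

    def _inputs(self, features, pref):
        return list(features) + [pref, 1.0]

    def probs(self, features, pref):
        x = self._inputs(features, pref)
        return softmax([sum(w * xi for w, xi in zip(row, x)) for row in self.weights])

    def choose(self, features, pref):
        p = self.probs(features, pref)
        return self.rng.choices(range(len(p)), weights=p)[0]

    def update(self, features, pref, action, reward):
        # REINFORCE step: grad of log pi(action | x) w.r.t. row k is (1[k == action] - p_k) * x.
        x = self._inputs(features, pref)
        p = self.probs(features, pref)
        for k, row in enumerate(self.weights):
            coeff = self.lr * reward * ((1.0 if k == action else 0.0) - p[k])
            for i, xi in enumerate(x):
                row[i] += coeff * xi


# Invented (quality, cost) pairs for a cheap model and a strong model.
MODELS = [(0.4, 0.1), (0.9, 0.6)]


def simulate(steps=5000, seed=0):
    """Simulated online loop: sample a prompt and a preference, route, observe one outcome."""
    rng = random.Random(seed)
    router = PreferenceRouter(n_features=2, n_models=len(MODELS), seed=seed)
    for _ in range(steps):
        features = [rng.random(), rng.random()]  # stand-in for a prompt embedding
        pref = rng.random()                      # preference varies across episodes
        action = router.choose(features, pref)
        quality, cost = MODELS[action]           # feedback for the chosen model only
        router.update(features, pref, action, scalar_reward(quality, cost, pref))
    return router


if __name__ == "__main__":
    router = simulate()
    for pref in (0.0, 0.5, 1.0):
        print(pref, router.probs([0.5, 0.5], pref))
```

Because the preference is an input to the policy rather than a fixed training hyperparameter, the same trained weights can be queried with different `pref` values at inference time, which is the sense in which one policy can serve many trade-offs.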
