(based on a thread on Twitter)
Preferences drive modern LLM research and development: from model alignment to evaluation.
But how well do we understand them?
Excited to share our new preprint:
Multi-domain Explainability of Preferences
We propose a fully automated method for explaining the preferences of three mechanism types:
👥 Human preferences (used to train reward models and for evaluation)
🤖 LLM-as-a-Judge (de facto standard for automatic evaluation)
🏅 Reward models (used in RLHF/RLAIF for alignment)
Our four-stage method (a minimal sketch follows these steps):
1. Use an LLM to discover concepts that distinguish between chosen and rejected responses.
2. Represent responses as concept vectors.
3. Train a logistic regression model to predict preferences.
4. Extract concept importance from the model weights.
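As an illustration, here is a minimal Python sketch of that pipeline. The concept list, the keyword-based llm_score stand-in, and the toy pairs are assumptions for demonstration only; in the actual method an LLM performs the discovery and annotation steps.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stage 1 (illustrative): an LLM is prompted to propose concepts that
# distinguish chosen from rejected responses. Hard-coded here.
concepts = ["clear structure", "cites sources", "polite tone", "concise"]

def llm_score(response: str, concept: str) -> float:
    """Placeholder annotator: in the real pipeline an LLM rates concept presence (e.g. 0-1)."""
    return float(concept.split()[0] in response.lower())  # crude keyword stand-in

def concept_vector(response: str) -> np.ndarray:
    """Stage 2: represent a response as a vector of concept scores."""
    return np.array([llm_score(response, c) for c in concepts])

# Toy preference pairs: (chosen, rejected)
pairs = [
    ("a clear and concise answer that cites sources", "an answer"),
    ("a polite, clear reply", "a rude reply"),
]

# Stage 3: pairwise features = concepts(A) - concepts(B); label 1 means A was chosen.
X, y = [], []
for chosen, rejected in pairs:
    diff = concept_vector(chosen) - concept_vector(rejected)
    X.extend([diff, -diff])   # add both orderings so both classes are present
    y.extend([1, 0])
clf = LogisticRegression(penalty="l1", solver="liblinear").fit(np.array(X), y)

# Stage 4: concept importance is read off the learned weights.
for concept, w in zip(concepts, clf.coef_[0]):
    print(f"{concept}: {w:+.3f}")
```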
Our special focus is on multi-domain learning:
Concepts affect preference decisions differently across domains.
A concept that is important in one domain may be irrelevant in another.
To address this, we introduce a white-box Hierarchical Multi-Domain Regression (HMDR) model:
The HMDR model is optimized to (a rough sketch follows this list):
• Make shared weights strongly predictive → improves OOD generalization.
• Encourage sparsity (L1 regularization) → simpler explanations.
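A rough sketch of the idea behind HMDR, under the assumption that each domain's weights are a shared vector plus a sparse domain-specific offset; the exact objective, penalties, and hyperparameters in the paper may differ.

```python
import torch
import torch.nn as nn

class HMDRSketch(nn.Module):
    """Hierarchical multi-domain logistic regression (illustrative, not the paper's exact model).

    score(x, d) = sigmoid( x . (w_shared + w_domain[d]) + b )
    """
    def __init__(self, n_concepts: int, n_domains: int):
        super().__init__()
        self.w_shared = nn.Parameter(torch.zeros(n_concepts))
        self.w_domain = nn.Parameter(torch.zeros(n_domains, n_concepts))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x, domain):                   # x: (B, n_concepts), domain: (B,) long
        w = self.w_shared + self.w_domain[domain]   # shared weights + per-domain offset
        return torch.sigmoid((x * w).sum(-1) + self.bias)

def hmdr_loss(model, x, domain, y, l1_shared=1e-3, l1_domain=1e-2):
    """BCE on preference labels + L1 sparsity. Penalizing the domain offsets more
    heavily (an assumption here) pushes predictive signal into the shared weights."""
    pred = model(x, domain)
    bce = nn.functional.binary_cross_entropy(pred, y)
    l1 = l1_shared * model.w_shared.abs().sum() + l1_domain * model.w_domain.abs().sum()
    return bce + l1
```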
Finally, a concept's importance is its lift in probability: the % change in the predicted preference probability when the concept's score increases by one unit.
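For intuition, a small sketch of the lift computation, assuming a logistic preference score sigmoid(w·x + b); the paper's exact definition (e.g., how it aggregates over examples) may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lift(w, b, x, c):
    """% change in predicted preference probability when concept c increases by one unit."""
    base = sigmoid(w @ x + b)
    bumped_x = x.copy()
    bumped_x[c] += 1.0            # raise concept c by one unit
    bumped = sigmoid(w @ bumped_x + b)
    return 100.0 * (bumped - base) / base

w = np.array([0.8, 0.3, -0.2])    # illustrative learned weights
x = np.array([0.5, 0.5, 0.5])     # concept scores of a response
print(lift(w, 0.0, x, c=0))       # lift of the first concept
```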
The resulting explanations are quite interesting 🤩
Below is an example of human preferences across five domains 💬🧑‍💻👩‍⚖️🧑‍🍳🧳
How to read it?
◻️ Light bars show the shared contribution to the score,
◼️ while dark bars and arrows indicate domain-specific contributions.
How do we know our explanations are good? 🤔
✅ Human Evaluation: LLM concept annotations closely match human annotations.
✅ Preference Prediction: Our method is comparable to human preference models.
The HMDR model outperforms other white-box models both in-domain & OOD.
We assess explanations in two application-driven settings:
Can we “hack” the judge? 👩⚖️🤖
Using LLM-as-a-judge explanations, we guide another LLM’s responses (by asking it to follow the top concepts).
Result: Judges prefer the explanation-guided outputs over regular prompts.
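Roughly how such guidance can be constructed; the prompt wording and the concept list below are illustrative assumptions, not the paper's exact prompts.

```python
def guided_prompt(question: str, top_concepts: list[str]) -> str:
    """Prepend explanation-derived guidance to a regular prompt (illustrative wording)."""
    guidance = "\n".join(f"- {c}" for c in top_concepts)
    return (
        f"{question}\n\n"
        "When answering, make sure your response exhibits these qualities "
        "(the concepts the judge weighs most heavily):\n"
        f"{guidance}"
    )

# Example with assumed top concepts taken from an LLM-as-a-judge explanation:
print(guided_prompt("Explain overfitting.",
                    ["step-by-step structure", "concrete examples", "balanced caveats"]))
```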
Breaking Ties in LLM-as-Judges 🤝
LLMs often produce inconsistent preferences when the order of responses is flipped (10–30% of the time!).
We guide LLM judges using top human-derived concepts to break ties.
Result: Clear gains in human preference alignment on tied cases.
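A sketch of the tie-detection-and-breaking loop, assuming a judge(prompt, first, second) callable that returns which position it prefers; the real judging prompts are more involved.

```python
def judge_with_flip(judge, prompt: str, resp_a: str, resp_b: str) -> str:
    """Detect positional inconsistency by querying the judge in both orders.
    `judge(prompt, first, second)` is an assumed callable returning "first" or "second"."""
    v1 = judge(prompt, resp_a, resp_b)            # A shown first
    v2 = judge(prompt, resp_b, resp_a)            # B shown first
    if (v1 == "first") == (v2 == "second"):
        return "A" if v1 == "first" else "B"      # consistent verdict
    return "tie"                                  # order-dependent verdict -> tie

def break_tie(judge, prompt, resp_a, resp_b, top_human_concepts):
    """On ties, re-ask the judge while steering it toward human-derived top concepts."""
    guidance = "Judge primarily by: " + ", ".join(top_human_concepts)
    return judge(f"{prompt}\n\n{guidance}", resp_a, resp_b)
```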
Finally, we analyze our explanations by comparing our auto-discovered concepts with manually curated concepts from prior studies.
🔍 We reproduced many!
Humans prioritize clarity, authority, and confidence, while LLMs emphasize accuracy and helpfulness.
Importantly, we found that domain-specific concepts dominate many preference mechanisms.
Our two key contributions:
1️⃣ Automatic concept discovery
2️⃣ Multi-domain modeling
Together, they provide a scalable and generalizable approach to modeling preferences in NLP.
https://arxiv.org/abs/2505.20088