(based on a thread on Twitter)
Preferences drive modern LLM research and development: from model alignment to evaluation.
But how well do we understand them?
Excited to share our new preprint:
Multi-domain Explainability of Preferences
We propose a fully automated method for explaining the preferences of three mechanism types:
👥 Human preferences (used to train reward models and for evaluation)
🤖 LLM-as-a-Judge (de facto standard for automatic evaluation)
🏅 Reward models (used in RLHF/RLAIF for alignment)
Our four-stage method (a minimal sketch follows these steps):
1. Use an LLM to discover concepts that distinguish between chosen and rejected responses.
2. Represent responses as concept vectors.
3. Train a logistic regression model to predict preferences.
4. Extract concept importance from the model weights.
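As an illustration, here is a minimal Python sketch of that pipeline. The concept list, the keyword-based llm_score stand-in, and the toy pairs are assumptions for demonstration only; in the actual method an LLM performs the discovery and annotation steps.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stage 1 (illustrative): an LLM is prompted to propose concepts that
# distinguish chosen from rejected responses. Hard-coded here.
concepts = ["clear structure", "cites sources", "polite tone", "concise"]

def llm_score(response: str, concept: str) -> float:
    """Placeholder annotator: in the real pipeline an LLM rates concept presence (e.g. 0-1)."""
    return float(concept.split()[0] in response.lower())  # crude keyword stand-in

def concept_vector(response: str) -> np.ndarray:
    """Stage 2: represent a response as a vector of concept scores."""
    return np.array([llm_score(response, c) for c in concepts])

# Toy preference pairs: (chosen, rejected)
pairs = [
    ("a clear and concise answer that cites sources", "an answer"),
    ("a polite, clear reply", "a rude reply"),
]

# Stage 3: pairwise features = concepts(A) - concepts(B); label 1 means A was chosen.
X, y = [], []
for chosen, rejected in pairs:
    diff = concept_vector(chosen) - concept_vector(rejected)
    X.extend([diff, -diff])   # add both orderings so both classes are present
    y.extend([1, 0])
clf = LogisticRegression(penalty="l1", solver="liblinear").fit(np.array(X), y)

# Stage 4: concept importance is read off the learned weights.
for concept, w in zip(concepts, clf.coef_[0]):
    print(f"{concept}: {w:+.3f}")
```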
Our special focus is on multi-domain learning:
Concepts affect preference decisions differently across domains.
A concept that is important in one domain may be irrelevant in another.
To address this, we introduce a white-box Hierarchical Multi-Domain Regression (HMDR) model:
The HMDR model is optimized to (a rough sketch follows this list):
• Make shared weights strongly predictive → improves OOD generalization.
• Encourage sparsity (L1 regularization) → simpler explanations.
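A rough sketch of the idea behind HMDR, under the assumption that each domain's weights are a shared vector plus a sparse domain-specific offset; the exact objective, penalties, and hyperparameters in the paper may differ.

```python
import torch
import torch.nn as nn

class HMDRSketch(nn.Module):
    """Hierarchical multi-domain logistic regression (illustrative, not the paper's exact model).

    score(x, d) = sigmoid( x . (w_shared + w_domain[d]) + b )
    """
    def __init__(self, n_concepts: int, n_domains: int):
        super().__init__()
        self.w_shared = nn.Parameter(torch.zeros(n_concepts))
        self.w_domain = nn.Parameter(torch.zeros(n_domains, n_concepts))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x, domain):                   # x: (B, n_concepts), domain: (B,) long
        w = self.w_shared + self.w_domain[domain]   # shared weights + per-domain offset
        return torch.sigmoid((x * w).sum(-1) + self.bias)

def hmdr_loss(model, x, domain, y, l1_shared=1e-3, l1_domain=1e-2):
    """BCE on preference labels + L1 sparsity. Penalizing the domain offsets more
    heavily (an assumption here) pushes predictive signal into the shared weights."""
    pred = model(x, domain)
    bce = nn.functional.binary_cross_entropy(pred, y)
    l1 = l1_shared * model.w_shared.abs().sum() + l1_domain * model.w_domain.abs().sum()
    return bce + l1
```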
Finally, a concept's importance is its lift in probability: the % change in the predicted preference probability when the concept's score increases by one unit.
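For intuition, a small sketch of the lift computation, assuming a logistic preference score sigmoid(w·x + b); the paper's exact definition (e.g., how it aggregates over examples) may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lift(w, b, x, c):
    """% change in predicted preference probability when concept c increases by one unit."""
    base = sigmoid(w @ x + b)
    bumped_x = x.copy()
    bumped_x[c] += 1.0            # raise concept c by one unit
    bumped = sigmoid(w @ bumped_x + b)
    return 100.0 * (bumped - base) / base

w = np.array([0.8, 0.3, -0.2])    # illustrative learned weights
x = np.array([0.5, 0.5, 0.5])     # concept scores of a response
print(lift(w, 0.0, x, c=0))       # lift of the first concept
```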
The resulting explanations are quite interesting 🤩
Below is an example of human preferences across five domains 💬🧑‍💻👩‍⚖️🧑‍🍳🧳
How to read it?
◻️ Light bars show the shared contribution to the score,
◼️ while dark bars and arrows indicate domain-specific contributions.
How do we know our explanations are good? 🤔
✅ Human Evaluation: LLM concept annotations closely match human annotations.
✅ Preference Prediction: Our method is comparable to human preference models.
The HMDR model outperforms other white-box models both in-domain & OOD.
We assess explanations in two application-driven settings:
Can we “hack” the judge? 👩⚖️🤖
Using LLM-as-a-judge explanations, we guide another LLM’s responses (by asking it to follow the top concepts).
Result: Judges prefer the explanation-guided outputs over regular prompts.
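Roughly how such guidance can be constructed; the prompt wording and the concept list below are illustrative assumptions, not the paper's exact prompts.

```python
def guided_prompt(question: str, top_concepts: list[str]) -> str:
    """Prepend explanation-derived guidance to a regular prompt (illustrative wording)."""
    guidance = "\n".join(f"- {c}" for c in top_concepts)
    return (
        f"{question}\n\n"
        "When answering, make sure your response exhibits these qualities "
        "(the concepts the judge weighs most heavily):\n"
        f"{guidance}"
    )

# Example with assumed top concepts taken from an LLM-as-a-judge explanation:
print(guided_prompt("Explain overfitting.",
                    ["step-by-step structure", "concrete examples", "balanced caveats"]))
```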
Breaking Ties in LLM-as-Judges 🤝
LLMs often produce inconsistent preferences when the order of responses is flipped (10–30% of the time!).
We guide LLM judges using top human-derived concepts to break ties.
Result: Clear gains in human preference alignment on tied cases.
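A sketch of the tie-detection-and-breaking loop, assuming a judge(prompt, first, second) callable that returns which position it prefers; the real judging prompts are more involved.

```python
def judge_with_flip(judge, prompt: str, resp_a: str, resp_b: str) -> str:
    """Detect positional inconsistency by querying the judge in both orders.
    `judge(prompt, first, second)` is an assumed callable returning "first" or "second"."""
    v1 = judge(prompt, resp_a, resp_b)            # A shown first
    v2 = judge(prompt, resp_b, resp_a)            # B shown first
    if (v1 == "first") == (v2 == "second"):
        return "A" if v1 == "first" else "B"      # consistent verdict
    return "tie"                                  # order-dependent verdict -> tie

def break_tie(judge, prompt, resp_a, resp_b, top_human_concepts):
    """On ties, re-ask the judge while steering it toward human-derived top concepts."""
    guidance = "Judge primarily by: " + ", ".join(top_human_concepts)
    return judge(f"{prompt}\n\n{guidance}", resp_a, resp_b)
```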
Finally, we analyze our explanations by comparing our auto-discovered concepts with manually curated concepts from prior studies.
🔍 We reproduced many!
Humans prioritize clarity, authority, and confidence, while LLMs emphasize accuracy and helpfulness.
Importantly, we found that domain-specific concepts dominate many preference mechanisms.
Our two key contributions:
1️⃣ Automatic concept discovery
2️⃣ Multi-domain modeling
Together, they provide a scalable and generalizable approach to modeling preferences in NLP.
https://arxiv.org/abs/2505.20088