Did you know that fine-tuning retrievers & re-rankers on large but unclean training datasets can harm their performance? 💡
In our new preprint, we re-examine the quality of popular IR training data by pruning datasets and by identifying and relabeling false negatives!
Preprint: https://arxiv.org/abs/2505.16967
Preliminary
We fine-tune E5 (base) on 16 retrieval datasets from the BGE collection (1.6M training pairs) and conduct a leave-one-out analysis: leave one dataset out and fine-tune on the rest. Surprisingly, removing ELI5 alone improves nDCG@10 on 7/14 BEIR datasets! 🤯
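For intuition, here is a minimal sketch of the leave-one-out loop; `fine_tune`, `evaluate_beir`, and the dataset lists below are hypothetical placeholders for illustration, not our actual training or evaluation code:

```python
# Leave-one-out analysis: fine-tune on all training datasets except one,
# then compare BEIR nDCG@10 against the model trained on everything.
# fine_tune() and evaluate_beir() are placeholders returning dummy values,
# not the actual E5 training / BEIR evaluation pipeline.

TRAIN_SETS = ["eli5", "hotpotqa", "fever", "msmarco"]  # illustrative subset of the training collection
BEIR_SETS = ["nfcorpus", "scifact", "fiqa"]            # illustrative subset of BEIR

def fine_tune(datasets: list[str]) -> str:
    """Placeholder: fine-tune E5 (base) on the given datasets, return a model id."""
    return "e5-base-ft-" + "+".join(sorted(datasets))

def evaluate_beir(model: str) -> dict[str, float]:
    """Placeholder: return nDCG@10 per BEIR dataset for the given model."""
    return {name: 0.0 for name in BEIR_SETS}

baseline = evaluate_beir(fine_tune(TRAIN_SETS))  # trained on all datasets

for held_out in TRAIN_SETS:
    rest = [d for d in TRAIN_SETS if d != held_out]
    scores = evaluate_beir(fine_tune(rest))
    improved = [d for d in BEIR_SETS if scores[d] > baseline[d]]
    print(f"without {held_out}: nDCG@10 improves on {len(improved)}/{len(BEIR_SETS)} BEIR datasets")
```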
Dataset Pruning
1️⃣ We prune 8 of the 15 training datasets, leaving 7 and reducing the training pairs by 2.35x (1.6M -> 680K).
2️⃣ E5 (base) fine-tuned on these 7 datasets outperforms the model fine-tuned on all 15 datasets by 1.0 nDCG@10 on BEIR.
3️⃣ This shows that some training datasets actively hurt model performance.
False Negatives
In the pruned training datasets, we observe a common issue of “false negatives”: hard negatives that are actually relevant to the query but labeled as irrelevant! We propose a cascading LLM-judge framework (RLHN) to identify and relabel these false negatives in training datasets.
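A rough sketch of what such a cascade can look like: a cheap judge screens every (query, hard negative) pair, and only flagged pairs are escalated to a stronger judge. The `cheap_judge`/`strong_judge` functions and the toy heuristic below are illustrative stand-ins, not the actual prompts or models used in RLHN:

```python
# Cascading-judge sketch: screen all (query, hard-negative) pairs with a cheap
# judge, escalate only the flagged ones to a stronger judge, and keep the pairs
# both judges consider relevant (i.e., likely false negatives).
# Both judge functions are toy stand-ins for real LLM calls.

def cheap_judge(query: str, passage: str) -> bool:
    """Toy stand-in for a small/cheap LLM judge: flag pairs with term overlap."""
    overlap = set(query.lower().split()) & set(passage.lower().split())
    return len(overlap) >= 2

def strong_judge(query: str, passage: str) -> bool:
    """Toy stand-in for a stronger LLM judge used only on escalated pairs."""
    return True  # in practice, a careful relevance verdict from a larger model

def find_false_negatives(query: str, hard_negatives: list[str]) -> list[str]:
    """Hard negatives that survive the full cascade are treated as false negatives."""
    flagged = [p for p in hard_negatives if cheap_judge(query, p)]
    return [p for p in flagged if strong_judge(query, p)]

query = "what causes ocean tides"
hard_negatives = [
    "Ocean tides are caused by the gravitational pull of the moon and the sun.",  # actually relevant
    "The stock market rises and falls with investor sentiment.",
]
print(find_false_negatives(query, hard_negatives))
```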
We carefully compare three ways of handling the identified false negatives in training pairs (a minimal sketch follows the list below):
1️⃣ Remove: Discard the entire training pair that contains a false negative.
2️⃣ HN Remove: Discard only the false negatives from the list of hard negatives.
3️⃣ RLHN: Relabel the false negatives as positives, while keeping the rest of the hard-negative list.
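A minimal sketch of the three operations applied to a single training pair; the query/positives/hard_negatives schema is assumed for illustration and is not necessarily the layout of the released data:

```python
# Three ways to handle a training pair whose hard-negative list contains
# identified false negatives. The dict schema below is assumed for illustration.
from __future__ import annotations

def apply_remove(pair: dict, false_negs: set[str]) -> dict | None:
    """Remove: drop the whole training pair if it contains any false negative."""
    if false_negs & set(pair["hard_negatives"]):
        return None
    return pair

def apply_hn_remove(pair: dict, false_negs: set[str]) -> dict:
    """HN Remove: keep the pair, but drop false negatives from the hard-negative list."""
    kept = [p for p in pair["hard_negatives"] if p not in false_negs]
    return {**pair, "hard_negatives": kept}

def apply_rlhn(pair: dict, false_negs: set[str]) -> dict:
    """RLHN: relabel false negatives as positives; keep the remaining hard negatives."""
    relabeled = [p for p in pair["hard_negatives"] if p in false_negs]
    kept = [p for p in pair["hard_negatives"] if p not in false_negs]
    return {**pair,
            "positives": pair["positives"] + relabeled,
            "hard_negatives": kept}

pair = {
    "query": "what causes ocean tides",
    "positives": ["Tides result from the moon's gravity acting on Earth's oceans."],
    "hard_negatives": [
        "Ocean tides are caused by the gravitational pull of the moon and the sun.",  # false negative
        "The stock market rises and falls with investor sentiment.",
    ],
}
false_negs = {"Ocean tides are caused by the gravitational pull of the moon and the sun."}
print(apply_rlhn(pair, false_negs))  # the false negative moves into "positives"
```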
Experimental Results
RLHN yields the best improvements for both retrievers and rerankers compared with the other approaches. RLHN shows consistent gains even when we relabel only a small subset of training pairs: out-of-domain nDCG@10 on BEIR (avg. 7) and AIR-Bench (avg. 5) both improve steadily as more clean data is added.
We also qualitatively analyze the different categories of identified false negatives; e.g., an ambiguous query can make many of its hard negatives actually relevant.
Paper: https://arxiv.org/abs/2505.16967
Code: https://github.com/castorini/rlhn
Data: https://huggingface.co/rlhn