Paper Page - Fixing Data That Hurts Performance: Cascading LLMs To Relabel Hard Negatives For Robust Information Retrieval

Did you know that fine-tuning retrievers & re-rankers on large but unclean training datasets can harm their performance? 😡

In our new preprint, we reexamine the quality of popular IR training data by pruning datasets and identifying and relabeling 𝐟𝐚𝐥𝐬𝐞-𝐧𝐞𝐠𝐚𝐭𝐢𝐯𝐞𝐬!

Preprint: https://arxiv.org/abs/2505.16967

🌟𝐏𝐫𝐞𝐥𝐢𝐦𝐢𝐧𝐚𝐫𝐲
We fine-tune E5 (base) on 16 retrieval datasets from the BGE collection (1.6M training pairs) and conduct a leave-one-out analysis: we leave one dataset out and fine-tune on the rest. Surprisingly, removing ELI5 alone improves nDCG@10 on 7 of 14 BEIR datasets! 🤯
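To make the leave-one-out setup concrete, here is a minimal sketch of how the training configurations could be enumerated. The function name and the placeholder dataset names (other than ELI5) are assumptions for illustration, not taken from the paper's code.

```python
from typing import Dict, List


def leave_one_out_splits(dataset_names: List[str]) -> Dict[str, List[str]]:
    """Map each held-out dataset to the datasets still used for fine-tuning."""
    return {
        held_out: [name for name in dataset_names if name != held_out]
        for held_out in dataset_names
    }


# Placeholder names: only "eli5" is mentioned in the post itself.
datasets = ["eli5", "dataset_b", "dataset_c"]
splits = leave_one_out_splits(datasets)

for held_out, train_on in splits.items():
    # One fine-tuning run per entry; each run is then evaluated on BEIR.
    print(f"leave out {held_out}: fine-tune on {train_on}")
```

Each entry corresponds to one fine-tuning run. Comparing a run's nDCG@10 against the run trained on every dataset indicates whether the held-out dataset helped or hurt.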

🚀 𝐃𝐚𝐭𝐚𝐬𝐞𝐭 𝐏𝐫𝐮𝐧𝐢𝐧𝐠
1️⃣ We prune 8 of the 15 training datasets, leaving 7 and reducing the number of training pairs by 2.35x (1.6M -> 680K pairs).
2️⃣ E5 (base) fine-tuned on the remaining 7 datasets outperforms the same model fine-tuned on all 15 datasets by 1.0 nDCG@10 on BEIR.
3️⃣ This shows that some datasets are harmful to model performance.

📊 𝐅𝐚𝐥𝐬𝐞 𝐍𝐞𝐠𝐚𝐭𝐢𝐯𝐞𝐬
In the pruned training datasets, we observe a common issue of “false negatives”: hard negatives that are labeled as irrelevant even though they are actually relevant to the query! We propose an LLM judge cascading framework (𝐑𝐋𝐇𝐍) to identify and relabel these false negatives in training datasets.
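As a rough illustration of what a judge cascade can look like, the sketch below assumes a two-stage design in which a cheaper judge screens every hard negative and a stronger judge re-checks only the flagged ones. The two-stage split, the function names, and the keyword-based stand-in judges are all assumptions for illustration; a real setup would call LLMs instead.

```python
from typing import Callable, List

# A judge takes (query, passage) and returns True if it deems the passage relevant.
Judge = Callable[[str, str], bool]


def cascade_false_negatives(
    query: str,
    hard_negatives: List[str],
    cheap_judge: Judge,
    strong_judge: Judge,
) -> List[str]:
    """Return hard negatives that both judges consider relevant (false negatives)."""
    # Stage 1: the cheap judge screens every hard negative.
    flagged = [p for p in hard_negatives if cheap_judge(query, p)]
    # Stage 2: the strong judge re-checks only the flagged passages.
    return [p for p in flagged if strong_judge(query, p)]


# Toy stand-ins for LLM judges (hypothetical, keyword-based).
def toy_cheap_judge(query: str, passage: str) -> bool:
    return any(word in passage.lower() for word in query.lower().split())


def toy_strong_judge(query: str, passage: str) -> bool:
    return all(word in passage.lower() for word in query.lower().split())


false_negatives = cascade_false_negatives(
    "capital france",
    ["Paris is the capital of France.", "France borders Spain.", "Bananas are yellow."],
    toy_cheap_judge,
    toy_strong_judge,
)
```

The appeal of this kind of cascade is cost: the stronger (and presumably more expensive) judge only sees the small fraction of passages that the first stage flags.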

We carefully compare three ways of handling the identified false negatives in training pairs:
1️⃣ Remove: Discard any training pair that contains a false negative entirely.
2️⃣ HN Remove: Discard only the false negatives from the list of hard negatives.
3️⃣ 𝐑𝐋𝐇𝐍: Relabel the false negatives as positives, while keeping the remaining list of hard negatives.
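The three operations above can be sketched on a single training pair as follows. The `TrainingPair` structure and the function names are hypothetical, chosen for illustration; the actual data format in the released code and datasets may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class TrainingPair:
    query: str
    positives: List[str]
    hard_negatives: List[str]


def remove(pair: TrainingPair, false_negs: Set[str]) -> Optional[TrainingPair]:
    """Remove: drop the whole pair if any hard negative is a false negative."""
    if any(p in false_negs for p in pair.hard_negatives):
        return None
    return pair


def hn_remove(pair: TrainingPair, false_negs: Set[str]) -> TrainingPair:
    """HN Remove: keep the pair but drop the false negatives."""
    kept = [p for p in pair.hard_negatives if p not in false_negs]
    return TrainingPair(pair.query, list(pair.positives), kept)


def rlhn(pair: TrainingPair, false_negs: Set[str]) -> TrainingPair:
    """RLHN: move false negatives into the positives, keep the other hard negatives."""
    moved = [p for p in pair.hard_negatives if p in false_negs]
    kept = [p for p in pair.hard_negatives if p not in false_negs]
    return TrainingPair(pair.query, list(pair.positives) + moved, kept)


pair = TrainingPair("q", ["p1"], ["n1", "n2", "n3"])
false_negs = {"n2"}  # passage IDs flagged by the judges

removed = remove(pair, false_negs)
pruned = hn_remove(pair, false_negs)
relabeled = rlhn(pair, false_negs)
```

In this toy case, Remove discards the pair, HN Remove keeps `n1` and `n3` as hard negatives, and RLHN additionally promotes `n2` to a positive, so no training signal is thrown away.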

📊 𝐄𝐱𝐩𝐞𝐫𝐢𝐦𝐞𝐧𝐭𝐚𝐥 𝐑𝐞𝐬𝐮𝐥𝐭𝐬
𝐑𝐋𝐇𝐍 yields the largest improvements for both retrievers and rerankers compared with the other two approaches. It shows consistent gains even when we relabel only a small subset of training pairs: OOD nDCG@10 on BEIR (Avg. 7) and AIR-Bench (Avg. 5) both improve steadily as more of the training data is cleaned.
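For readers unfamiliar with the metric, here is a minimal sketch of nDCG@10 for a single query, assuming the linear-gain convention. The function names are illustrative; reported scores are averaged over queries and typically come from standard evaluation tooling rather than hand-written code like this.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query: DCG of the ranking divided by DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: two relevant documents, ranked second and third.
relevance = {"d1": 1, "d2": 1}
score = ndcg_at_k(["d3", "d1", "d2"], relevance)
perfect = ndcg_at_k(["d1", "d2", "d3"], relevance)
```

A false negative in the training data teaches the model to push a relevant passage down the ranking, which is exactly what this metric penalizes.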

We also qualitatively analyze the different categories of identified false negatives. For example, a query can be ambiguous, so that many of its hard negatives are actually relevant to it.

Paper: https://arxiv.org/abs/2505.16967
Code: https://github.com/castorini/rlhn
Data: https://huggingface.co/rlhn
