Did you know that fine-tuning retrievers & re-rankers on large but unclean training datasets can harm their performance? 💡
In our new preprint, we re-examine the quality of popular IR training data by pruning datasets and by identifying and relabeling false negatives!
Preprint: https://arxiv.org/abs/2505.16967
Preliminary
We fine-tune E5 (base) on 16 retrieval datasets from the BGE collection (1.6M training pairs) and conduct a leave-one-out analysis: leave one dataset out and fine-tune on the rest. Surprisingly, removing ELI5 alone improves nDCG@10 on 7/14 BEIR datasets! 🤯
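For intuition, here is a minimal sketch of the leave-one-out loop; `fine_tune`, `evaluate_beir`, and the dataset lists below are hypothetical placeholders for illustration, not our actual training or evaluation code:

```python
# Leave-one-out analysis: fine-tune on all training datasets except one,
# then compare BEIR nDCG@10 against the model trained on everything.
# fine_tune() and evaluate_beir() are placeholders returning dummy values,
# not the actual E5 training / BEIR evaluation pipeline.

TRAIN_SETS = ["eli5", "hotpotqa", "fever", "msmarco"]  # illustrative subset of the training collection
BEIR_SETS = ["nfcorpus", "scifact", "fiqa"]            # illustrative subset of BEIR

def fine_tune(datasets: list[str]) -> str:
    """Placeholder: fine-tune E5 (base) on the given datasets, return a model id."""
    return "e5-base-ft-" + "+".join(sorted(datasets))

def evaluate_beir(model: str) -> dict[str, float]:
    """Placeholder: return nDCG@10 per BEIR dataset for the given model."""
    return {name: 0.0 for name in BEIR_SETS}

baseline = evaluate_beir(fine_tune(TRAIN_SETS))  # trained on all datasets

for held_out in TRAIN_SETS:
    rest = [d for d in TRAIN_SETS if d != held_out]
    scores = evaluate_beir(fine_tune(rest))
    improved = [d for d in BEIR_SETS if scores[d] > baseline[d]]
    print(f"without {held_out}: nDCG@10 improves on {len(improved)}/{len(BEIR_SETS)} BEIR datasets")
```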
Dataset Pruning
1️⃣ We prune 8 of the 15 training datasets, leaving 7 and reducing the training pairs by 2.35x (1.6M -> 680K).
2️⃣ E5 (base) fine-tuned on these 7 datasets outperforms the model fine-tuned on all 15 datasets by 1.0 nDCG@10 on BEIR.
3️⃣ This shows that some training datasets actively hurt model performance.
False Negatives
In the pruned training datasets, we observe a common issue of “false negatives”: hard negatives that are actually relevant to the query but labeled as irrelevant! We propose a cascading LLM-judge framework (RLHN) to identify and relabel these false negatives in training datasets.
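A rough sketch of what such a cascade can look like: a cheap judge screens every (query, hard negative) pair, and only flagged pairs are escalated to a stronger judge. The `cheap_judge`/`strong_judge` functions and the toy heuristic below are illustrative stand-ins, not the actual prompts or models used in RLHN:

```python
# Cascading-judge sketch: screen all (query, hard-negative) pairs with a cheap
# judge, escalate only the flagged ones to a stronger judge, and keep the pairs
# both judges consider relevant (i.e., likely false negatives).
# Both judge functions are toy stand-ins for real LLM calls.

def cheap_judge(query: str, passage: str) -> bool:
    """Toy stand-in for a small/cheap LLM judge: flag pairs with term overlap."""
    overlap = set(query.lower().split()) & set(passage.lower().split())
    return len(overlap) >= 2

def strong_judge(query: str, passage: str) -> bool:
    """Toy stand-in for a stronger LLM judge used only on escalated pairs."""
    return True  # in practice, a careful relevance verdict from a larger model

def find_false_negatives(query: str, hard_negatives: list[str]) -> list[str]:
    """Hard negatives that survive the full cascade are treated as false negatives."""
    flagged = [p for p in hard_negatives if cheap_judge(query, p)]
    return [p for p in flagged if strong_judge(query, p)]

query = "what causes ocean tides"
hard_negatives = [
    "Ocean tides are caused by the gravitational pull of the moon and the sun.",  # actually relevant
    "The stock market rises and falls with investor sentiment.",
]
print(find_false_negatives(query, hard_negatives))
```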
We carefully compare three ways of handling the identified false negatives in training pairs (a minimal sketch follows the list below):
1️⃣ Remove: Discard the entire training pair that contains a false negative.
2️⃣ HN Remove: Discard only the false negatives from the list of hard negatives.
3️⃣ RLHN: Relabel the false negatives as positives, while keeping the rest of the hard-negative list.
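A minimal sketch of the three operations applied to a single training pair; the query/positives/hard_negatives schema is assumed for illustration and is not necessarily the layout of the released data:

```python
# Three ways to handle a training pair whose hard-negative list contains
# identified false negatives. The dict schema below is assumed for illustration.
from __future__ import annotations

def apply_remove(pair: dict, false_negs: set[str]) -> dict | None:
    """Remove: drop the whole training pair if it contains any false negative."""
    if false_negs & set(pair["hard_negatives"]):
        return None
    return pair

def apply_hn_remove(pair: dict, false_negs: set[str]) -> dict:
    """HN Remove: keep the pair, but drop false negatives from the hard-negative list."""
    kept = [p for p in pair["hard_negatives"] if p not in false_negs]
    return {**pair, "hard_negatives": kept}

def apply_rlhn(pair: dict, false_negs: set[str]) -> dict:
    """RLHN: relabel false negatives as positives; keep the remaining hard negatives."""
    relabeled = [p for p in pair["hard_negatives"] if p in false_negs]
    kept = [p for p in pair["hard_negatives"] if p not in false_negs]
    return {**pair,
            "positives": pair["positives"] + relabeled,
            "hard_negatives": kept}

pair = {
    "query": "what causes ocean tides",
    "positives": ["Tides result from the moon's gravity acting on Earth's oceans."],
    "hard_negatives": [
        "Ocean tides are caused by the gravitational pull of the moon and the sun.",  # false negative
        "The stock market rises and falls with investor sentiment.",
    ],
}
false_negs = {"Ocean tides are caused by the gravitational pull of the moon and the sun."}
print(apply_rlhn(pair, false_negs))  # the false negative moves into "positives"
```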
Experimental Results
RLHN yields the best improvements for both retrievers and rerankers compared with the other approaches. RLHN shows consistent gains even when we relabel only a small subset of training pairs: out-of-domain nDCG@10 on BEIR (avg. 7) and AIR-Bench (avg. 5) both improve steadily as more clean data is added.
We also qualitatively analyze the different categories of identified false negatives; e.g., an ambiguous query can make many of its hard negatives actually relevant.
Paper: https://arxiv.org/abs/2505.16967
Code: https://github.com/castorini/rlhn
Data: https://huggingface.co/rlhn