The Impact of Input Order Bias on Large Language Models for Software Fault Localization
Md Nakhla Rafi and 3 other authors
Abstract: Large Language Models (LLMs) have shown significant potential in software engineering tasks such as Fault Localization (FL) and Automatic Program Repair (APR). This study investigates how input order and context size influence LLM performance in FL, a crucial step for many downstream software engineering tasks. We evaluate different method orderings using Kendall Tau distances, including “perfect” (where ground truths appear first) and “worst” (where ground truths appear last), across two benchmarks containing Java and Python projects. Our results reveal a strong order bias: in Java projects, Top-1 FL accuracy drops from 57% to 20% when the order is reversed, while in Python projects it decreases from 38% to approximately 3%. However, segmenting inputs into smaller contexts mitigates this bias, reducing the FL performance gap from 22% and 6% to just 1% across both benchmarks. To determine whether this bias stems from data leakage, we replaced method names with semantically meaningful alternatives; the observed trends remained consistent, suggesting that the bias is not caused by memorization from training data but rather by the inherent effect of input order. Additionally, we explored ordering methods based on traditional FL techniques and metrics, finding that DepGraph’s ranking achieves 48% Top-1 accuracy, outperforming simpler approaches such as CallGraph(DFS). These findings highlight the importance of structuring inputs, managing context effectively, and selecting appropriate ordering strategies to enhance LLM performance in FL and other software engineering applications.
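As a rough illustration of the ordering metric the abstract refers to, the sketch below computes a normalized Kendall tau distance between two method orderings, where the “perfect” ordering places ground-truth faulty methods first and the “worst” ordering reverses it. The method names and the helper function are hypothetical, chosen for illustration; they are not taken from the paper’s artifact.

```python
from itertools import combinations

def kendall_tau_distance(order_a, order_b):
    """Normalized Kendall tau distance between two orderings of the
    same items: the fraction of item pairs ranked in opposite relative
    order (0.0 = identical orderings, 1.0 = exact reversal)."""
    pos_a = {m: i for i, m in enumerate(order_a)}
    pos_b = {m: i for i, m in enumerate(order_b)}
    pairs = list(combinations(order_a, 2))
    # A pair is discordant if the two orderings disagree on which
    # of the two methods comes first.
    discordant = sum(
        (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
        for x, y in pairs
    )
    return discordant / len(pairs)

# Hypothetical example: ground-truth faulty method first vs. last.
methods = ["faulty_method", "helper_a", "helper_b", "helper_c"]
perfect = methods                    # ground truth appears first
worst = list(reversed(methods))      # ground truth appears last

print(kendall_tau_distance(perfect, perfect))  # 0.0
print(kendall_tau_distance(perfect, worst))    # 1.0
```

Under this metric, intermediate orderings fall between 0.0 and 1.0, which is how the study can compare orderings at varying distances from the “perfect” one.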
Submission history
From: Md Nakhla Rafi
[v1] Wed, 25 Dec 2024 02:48:53 UTC (1,962 KB)
[v2] Wed, 19 Mar 2025 16:08:36 UTC (3,200 KB)
[v3] Mon, 23 Jun 2025 15:51:16 UTC (1,073 KB)