This paper proposes and validates a controlled benchmarking framework for detecting and quantifying linguistic shibboleth bias (subtle language cues such as hedging) in LLM-driven hiring evaluations, revealing that certain linguistic styles are systematically penalized even when the underlying content is equivalent.
➡️ 𝐊𝐞𝐲 𝐇𝐢𝐠𝐡𝐥𝐢𝐠𝐡𝐭𝐬 𝐨𝐟 𝐨𝐮𝐫 𝐋𝐢𝐧𝐠𝐮𝐢𝐬𝐭𝐢𝐜 𝐒𝐡𝐢𝐛𝐛𝐨𝐥𝐞𝐭𝐡 𝐁𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤:
🧪 𝑪𝒐𝒏𝒕𝒓𝒐𝒍𝒍𝒆𝒅 𝑳𝒊𝒏𝒈𝒖𝒊𝒔𝒕𝒊𝒄 𝑽𝒂𝒓𝒊𝒂𝒕𝒊𝒐𝒏 𝑭𝒓𝒂𝒎𝒆𝒘𝒐𝒓𝒌: Introduces a systematic methodology for generating semantically equivalent interview responses that differ only in a targeted sociolinguistic feature (e.g., hedging), so that any score difference can be attributed directly to linguistic style rather than content (a rough sketch of this paired protocol follows the highlights).
🧩 𝑯𝒆𝒅𝒈𝒊𝒏𝒈 𝑩𝒊𝒂𝒔 𝑪𝒂𝒔𝒆 𝑺𝒕𝒖𝒅𝒚: Constructs a 100-question hiring dataset with paired confident/hedged responses and evaluates them across 7 LLMs, finding that hedged answers receive ratings 25.6% lower on average and are rejected more often despite identical content.
🧠 𝑭𝒂𝒊𝒓𝒏𝒆𝒔𝒔 𝑨𝒖𝒅𝒊𝒕 𝑬𝒙𝒕𝒆𝒏𝒔𝒊𝒃𝒊𝒍𝒊𝒕𝒚: The framework generalizes to other shibboleths such as accent markers and register variation, providing a reproducible, model-agnostic tool for systematic bias detection and for informing debiasing strategies in high-stakes AI decision systems (a hypothetical extension sketch appears below).
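As a rough illustration of the paired-evaluation idea behind the first two highlights, the sketch below pairs each confident answer with a hedged rewrite, scores both with the same judge, and reports the mean relative rating drop. The `hedge` transform and the `score_response` callable are placeholders I am assuming for illustration, not the paper's implementation.

```python
from statistics import mean

def hedge(answer: str) -> str:
    # Toy transform: prepend a hedge cue while leaving the content untouched.
    # The paper's actual pairs are curated for strict semantic equivalence.
    return "I think " + answer[0].lower() + answer[1:]

def mean_rating_gap(questions, confident_answers, score_response):
    """Score each confident/hedged pair with the same LLM judge and return
    the mean relative drop attributable to hedged style alone."""
    drops = []
    for q, a in zip(questions, confident_answers):
        s_conf = score_response(q, a)            # e.g. 1-10 hiring suitability
        s_hedged = score_response(q, hedge(a))   # identical content, hedged style
        drops.append((s_conf - s_hedged) / s_conf)
    return mean(drops)  # a value near 0.256 would mirror the reported 25.6% gap
```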
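And a hypothetical extension point for the third highlight: other shibboleths could be registered as transforms and audited with the same loop. The feature names and transforms here are illustrative assumptions, not the paper's taxonomy.

```python
from typing import Callable, Dict, List

# Registry of hypothetical shibboleth transforms (illustrative only).
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "hedging": lambda a: "I think " + a[0].lower() + a[1:],
    "register": lambda a: a.replace("utilize", "use"),  # formal -> plain register
    # "accent_markers": add dialect/orthographic markers here
}

def audit(questions: List[str], answers: List[str], score_response, feature: str):
    """Re-run the paired comparison for any registered linguistic feature
    and return the per-question score differences."""
    transform = TRANSFORMS[feature]
    return [
        score_response(q, a) - score_response(q, transform(a))
        for q, a in zip(questions, answers)
    ]
```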