The paper provides the first systematic hallucination evaluation of multilingual conversational LLM outputs (GPT-3.5, GPT-4o, Llama-3.1, Gemma-2, DeepSeek-R1, Qwen-3) across Hindi, Farsi, and Mandarin, revealing high hallucination rates in Hindi and Farsi versus minimal hallucination in Mandarin, and proposes benchmark-style evaluations built on translated dialogue corpora.
➡️ 𝐊𝐞𝐲 𝐇𝐢𝐠𝐡𝐥𝐢𝐠𝐡𝐭𝐬 𝐨𝐟 𝐨𝐮𝐫 𝐋𝐨𝐰-𝐑𝐞𝐬𝐨𝐮𝐫𝐜𝐞 𝐇𝐚𝐥𝐥𝐮𝐜𝐢𝐧𝐚𝐭𝐢𝐨𝐧 𝐁𝐞𝐧𝐜𝐡𝐦𝐚𝐫𝐤:
🧪 𝑴𝒖𝒍𝒕𝒊𝒍𝒊𝒏𝒈𝒖𝒂𝒍 𝑪𝒐𝒏𝒗𝒆𝒓𝒔𝒂𝒕𝒊𝒐𝒏𝒂𝒍 𝑯𝒂𝒍𝒍𝒖𝒄𝒊𝒏𝒂𝒕𝒊𝒐𝒏 𝑬𝒗𝒂𝒍𝒖𝒂𝒕𝒊𝒐𝒏:
Introduces a hallucination benchmark spanning two low-resource languages (Hindi, Farsi) and higher-resource Mandarin, built from LLM-translated versions of the BlendedSkillTalk and DailyDialog datasets; model responses are scored with ROUGE-1 and ROUGE-L and verified by human annotators.
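The automatic scoring step relies on standard ROUGE overlap, presumably as a signal for the human verification pass. As a minimal sketch of that step (not the paper's exact protocol), the snippet below uses Google's rouge-score package; the dialogue strings are invented for illustration, and a real run over Hindi, Farsi, or Mandarin text would need language-appropriate tokenization, which the default scorer does not provide.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# ROUGE measures token overlap; the default tokenizer assumes
# whitespace-separated Latin script, so this is illustrative only.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=False)

reference = "The museum opens at nine and closes at five."  # gold dialogue turn
candidate = "The museum opens at nine in the morning."      # model response

scores = scorer.score(reference, candidate)
for metric, result in scores.items():
    print(f"{metric}: precision={result.precision:.3f} "
          f"recall={result.recall:.3f} f1={result.fmeasure:.3f}")
```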
🧩 𝑪𝒐𝒎𝒑𝒂𝒓𝒂𝒕𝒊𝒗𝒆 𝑨𝒏𝒂𝒍𝒚𝒔𝒊𝒔 𝒂𝒄𝒓𝒐𝒔𝒔 𝑳𝑳𝑴 𝑭𝒂𝒎𝒊𝒍𝒊𝒆𝒔 𝒂𝒏𝒅 𝑳𝒂𝒏𝒈𝒖𝒂𝒈𝒆𝒔:
Finds that GPT-4o and GPT-3.5 hallucinate less than the open-source models (Llama, Gemma, DeepSeek, Qwen), especially in Mandarin; all models, however, hallucinate more in Hindi and Farsi, exposing the limitations of current LLMs in low-resource settings.
🧠 𝑹𝒆𝒔𝒐𝒖𝒓𝒄𝒆-𝑨𝒘𝒂𝒓𝒆 𝑯𝒂𝒍𝒍𝒖𝒄𝒊𝒏𝒂𝒕𝒊𝒐𝒏 𝑷𝒂𝒕𝒕𝒆𝒓𝒏𝒔 𝒂𝒏𝒅 𝑭𝒊𝒙𝒆𝒔:
Attributes the hallucination gap to differences in training-data availability and proposes retrieval-augmented generation (RAG), grounded decoding, and language-specific fine-tuning to improve factuality in low-resource conversational agents; native-speaker evaluation confirms the observed hallucination types (partial vs. complete).
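The paper offers RAG as a recommendation rather than a pipeline, so the sketch below illustrates only the core idea: retrieve the evidence most similar to the user's turn and constrain the reply to it. The multilingual encoder, toy knowledge snippets, and prompt template are all assumptions introduced for illustration, not the paper's setup.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# A multilingual encoder, so Hindi/Farsi/Mandarin queries embed into the
# same space (model choice is an assumption, not from the paper).
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Toy knowledge snippets the agent is allowed to ground its answer in.
knowledge = [
    "The museum is open from 9 a.m. to 5 p.m. on weekdays.",
    "Tickets cost 200 rupees for adults and are free for children.",
    "The museum is closed on national holidays.",
]
knowledge_emb = encoder.encode(knowledge, convert_to_tensor=True)

def build_grounded_prompt(user_turn: str, top_k: int = 2) -> str:
    """Retrieve the top-k most relevant snippets and prepend them,
    instructing the LLM to answer only from the retrieved evidence."""
    query_emb = encoder.encode(user_turn, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, knowledge_emb, top_k=top_k)[0]
    evidence = "\n".join(knowledge[h["corpus_id"]] for h in hits)
    return (
        f"Context:\n{evidence}\n\n"
        f"User: {user_turn}\n"
        "Answer using only the context above; say 'I don't know' otherwise."
    )

print(build_grounded_prompt("When does the museum open?"))
```

Grounding the prompt this way trades some fluency for verifiability, which is the factuality lever the paper points to for low-resource conversational agents.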