HKU Business School today released the “Large Language Model (LLM) Hallucination Control Capability Evaluation Report.” The report evaluates selected LLMs on their ability to control “hallucinations” — outputs that appear plausible but contradict facts or deviate from the given context. LLMs are increasingly used in professional domains such as knowledge services, intelligent navigation, and customer service, yet hallucinations continue to limit their credibility.
This study was carried out by the Artificial Intelligence Evaluation Laboratory (https://www.hkubs.hku.hk/aimodelrankings_en), led by Professor Jack JIANG, Padma and Hari Harilela Professor in Strategic Information Management at HKU Business School. The research team conducted specialised assessments of the hallucination control capabilities of 37 LLMs, including 20 general-purpose models, 15 reasoning models, and 2 unified systems. The study aimed to reveal how effectively different models avoid factual errors and maintain contextual consistency.
The evaluation results show that GPT-5 (Thinking) and GPT-5 (Auto) ranked first and second, respectively, with the Claude 4 Opus series following closely behind. Among Chinese models, ByteDance’s Doubao 1.5 Pro series performed very well but still trailed the leading international LLMs by a clear margin.
Professor JIANG said, “Hallucination control capability, as a core metric for evaluating the truthfulness and reliability of model outputs, directly impacts the credibility of LLMs in professional settings. This research provides clear direction for future model optimisation and advancing AI systems from simply being ‘capable of generating’ outputs to being more reliable.”
Evaluation Methodology
Based on whether the problems in LLM-generated content concern factual accuracy or contextual consistency, the study categorises hallucinations into two types:
Factual Hallucinations: When a model’s output conflicts with real-world information, including incorrect recall of known knowledge (e.g., misattributions and misremembered data) or fabrication of unknown information (e.g., invented, unverified events or data). These were assessed through information retrieval questions, false-fact identification, and contradictory-premise identification tasks.
Faithful Hallucinations: When a model fails to strictly follow user instructions or produces content that contradicts the input context, including omitting key requirements, over-extending the scope, or making formatting errors. These were assessed through instruction consistency and contextual consistency tests, as illustrated in the sketch below.
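To make the two dimensions concrete, the following is a minimal scoring sketch, not the report’s actual pipeline: the EvalItem structure, the generate callable, and the substring-based judge are hypothetical stand-ins (in practice, judging would typically be done by human raters or a stronger LLM acting as judge).

```python
from dataclasses import dataclass

@dataclass
class EvalItem:
    prompt: str      # question or instruction posed to the model
    reference: str   # expected fact or required behaviour
    dimension: str   # "factual" or "faithful"

def judge(response: str, item: EvalItem) -> bool:
    """Hypothetical judge: a crude substring check standing in for a human
    rater or an LLM-as-judge comparing the response to the reference."""
    return item.reference.lower() in response.lower()

def score_model(generate, items: list[EvalItem]) -> dict[str, float]:
    """Score a model (via its `generate` callable) on each hallucination
    dimension, as the percentage of items answered without hallucinating."""
    results: dict[str, list[bool]] = {"factual": [], "faithful": []}
    for item in items:
        results[item.dimension].append(judge(generate(item.prompt), item))
    return {
        dim: round(100.0 * sum(ok) / len(ok), 1) if ok else 0.0
        for dim, ok in results.items()
    }
```

Applied to a balanced item set, this yields one score per dimension in the 0–100 range, analogous to the two score columns in Table 1 below.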
Hallucination Control Performance and Rankings
The study results show that GPT-5 (Thinking) and GPT-5 (Auto) ranked first and second, respectively, with the Claude 4 Opus series closely behind. The Doubao 1.5 Pro series from ByteDance performed best among the Chinese LLMs, showing balanced scores in factual and faithful hallucination control; however, its overall capability still lagged behind top international models such as GPT-5 and the Claude series.
| Rank | Model Name | Factual Hallucination | Faithful Hallucination | Final Score |
|------|------------|-----------------------|------------------------|-------------|
| 1 | GPT-5 (Thinking) | 72 | 100 | 86 |
| 2 | GPT-5 (Auto) | 68 | 100 | 84 |
| 3 | Claude 4 Opus (Thinking) | 73 | 92 | 83 |
| 4 | Claude 4 Opus | 64 | 96 | 80 |
| 5 | Grok 4 | 71 | 80 | 76 |
| 6 | GPT-o3 | 49 | 100 | 75 |
| 7 | Doubao 1.5 Pro | 57 | 88 | 73 |
| 8 | Doubao 1.5 Pro (Thinking) | 60 | 84 | 72 |
| 9 | Gemini 2.5 Pro | 57 | 84 | 71 |
| 10 | GPT-o4 mini | 44 | 96 | 70 |
| 11 | GPT-4.1 | 59 | 80 | 69 |
| 12 | GPT-4o | 53 | 80 | 67 |
| 12 | Gemini 2.5 Flash | 49 | 84 | 67 |
| 14 | ERNIE X1-Turbo | 47 | 84 | 65 |
| 14 | Qwen 3 (Thinking) | 55 | 76 | 65 |
| 14 | DeepSeek-V3 | 49 | 80 | 65 |
| 14 | Hunyuan-T1 | 49 | 80 | 65 |
| 18 | Kimi | 47 | 80 | 63 |
| 18 | Qwen 3 | 51 | 76 | 63 |
| 20 | DeepSeek-R1 | 52 | 68 | 60 |
| 20 | Grok 3 | 36 | 84 | 60 |
| 20 | Hunyuan-TurboS | 44 | 76 | 60 |
| 23 | SenseChat V6 Pro | 41 | 76 | 59 |
| 24 | GLM-4-plus | 35 | 80 | 57 |
| 25 | MiniMax-01 | 31 | 80 | 55 |
| 25 | 360 Zhinao 2-o1 | 49 | 60 | 55 |
| 27 | Yi-Lightning | 28 | 80 | 54 |
| 28 | Grok 3 (Thinking) | 29 | 76 | 53 |
| 29 | Kimi-k1.5 | 36 | 68 | 52 |
| 30 | ERNIE 4.5-Turbo | 31 | 72 | 51 |
| 30 | SenseChat V6 (Thinking) | 37 | 64 | 51 |
| 32 | Step 2 | 32 | 68 | 50 |
| 33 | Step R1-V-Mini | 36 | 60 | 48 |
| 34 | Baichuan4-Turbo | 33 | 60 | 47 |
| 35 | GLM-Z1-Air | 32 | 60 | 46 |
| 36 | Llama 3.3 70B | 33 | 56 | 45 |
| 37 | Spark 4.0 Ultra | 19 | 64 | 41 |
Table 1: Ranking of Hallucination Control Capability
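The report does not state how the two sub-scores are combined, but the published final scores are consistent with an equal-weighted average of the factual and faithful columns — exact for many rows and within one point for the rest, plausibly because the underlying sub-scores carry decimals not shown. The snippet below illustrates that assumption; the function name and the equal weighting are ours, not the report’s.

```python
def final_score(factual: float, faithful: float) -> int:
    """Equal-weighted average of the two sub-scores, rounded to an integer.
    NOTE: equal weighting is an assumption inferred from Table 1, not a
    formula stated in the report."""
    return round((factual + faithful) / 2)

# Rows from Table 1 where the average is exact:
assert final_score(72, 100) == 86   # GPT-5 (Thinking)
assert final_score(64, 96) == 80    # Claude 4 Opus
assert final_score(60, 84) == 72    # Doubao 1.5 Pro (Thinking)
```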
The scores and rankings across the 37 models reveal significant differences, with distinct performance characteristics in controlling factual versus faithful hallucinations. Overall, current large models showed strong control over faithful hallucinations but still struggled with factual inaccuracies: they tend to follow instructions strictly yet remain prone to fabricating facts.
Furthermore, reasoning models such as Qwen 3 (Thinking), ERNIE X1-Turbo and Claude 4 Opus (Thinking) were better at avoiding hallucinations than general-purpose LLMs. Among the Chinese models, the Doubao 1.5 Pro series performed best, with balanced control of both factual and faithful hallucinations, though it still trailed the GPT-5 and Claude series in overall capability. In contrast, the DeepSeek series delivered relatively weak hallucination control and has room for improvement.
Click here to view the complete “Large Language Model Hallucination Control Capability Evaluation Report.”
Moving forward, AI trustworthiness will require balanced enhancement of control over both factual and faithful hallucinations, so that models produce more reliable content.