Position: Theory of Mind Benchmarks are Broken for Large Language Models, by Matthew Riemer and 6 other authors
Abstract: Our paper argues that the majority of theory of mind benchmarks are broken because they cannot directly test how large language models (LLMs) adapt to new partners. This problem stems from the fact that theory of mind benchmarks for LLMs are overwhelmingly inspired by the methods used to test theory of mind in humans, and thus fall victim to the fallacy of attributing human-like qualities to AI agents. We expect humans to engage in a consistent reasoning process across various questions about a situation, but this is known not to be the case for current LLMs. Most theory of mind benchmarks only measure what we call literal theory of mind: the ability to predict the behavior of others. However, this type of metric is only informative when agents exhibit self-consistent reasoning. We therefore introduce the concept of functional theory of mind: the ability to adapt to agents in-context by responding rationally to their behavior. We find that many open-source LLMs are capable of displaying strong literal theory of mind capabilities, yet struggle with functional theory of mind, even when partner policies are exceedingly simple. Simply put, strong literal theory of mind performance does not necessarily imply strong functional theory of mind performance, or vice versa. Achieving functional theory of mind, particularly over long interaction horizons with a partner, is a significant challenge that deserves a prominent role in any meaningful LLM theory of mind evaluation.
Submission history
From: Matthew Riemer
[v1] Fri, 27 Dec 2024 16:30:12 UTC (9,329 KB)
[v2] Wed, 5 Feb 2025 19:27:20 UTC (10,291 KB)
[v3] Fri, 6 Jun 2025 05:06:52 UTC (9,864 KB)
[v4] Thu, 12 Jun 2025 14:58:31 UTC (9,870 KB)