AI critic Gary Marcus is smiling again, thanks to Apple.
In a new paper titled The Illusion of Thinking, researchers from the Cupertino-based company argue that even the most advanced AI models, including the so-called large reasoning models (LRMs), don’t actually think. Instead, they simulate reasoning without truly understanding or solving complex problems.
The paper, released just ahead of Apple’s Worldwide Developer Conference, tested leading AI models, including OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, using specially designed algorithmic puzzle environments rather than standard benchmarks.
The researchers argue that traditional benchmarks, like math and coding tests, are flawed due to “data contamination” and fail to reveal how these models actually “think”.
“We show that state-of-the-art LRMs still fail to develop generalisable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments,” the paper noted.
Interestingly, one of the authors of the paper is Samy Bengio, the brother of Turing Award winner Yoshua Bengio. Yoshua recently launched LawZero, a Canada-based nonprofit AI safety lab working on building systems that prioritise truthfulness, safety, and ethical behaviour over commercial interests.
The lab has secured around $30 million in initial funding from prominent backers, including former Google CEO Eric Schmidt’s philanthropic organisation, Skype co-founder Jaan Tallinn, Open Philanthropy, and the Future of Life Institute.
Backing the paper’s claims, Marcus could not contain his excitement. “AI is not hitting a wall. But LLMs probably are (or at least a point of diminishing returns). We need new approaches, and to diversify which roads are being actively explored.”
“I don’t think LLMs are a good way to get there (AGI). They might be part of the answer, but I don’t think they are the whole answer,” Marcus said in a previous interaction with AIM, stressing that LLMs are not “useless”. He also expressed optimism about AGI, describing it as a machine capable of approaching new problems with the flexibility and resourcefulness of a smart human being. “I think we’ll see it someday,” he further said.
Taking a more balanced view, Ethan Mollick, professor at The Wharton School, said in a post on X, “I think the Apple paper on the limits of reasoning models in particular tests is useful & important, but the ‘LLMs are hitting a wall’ narrative on X around it feels premature at best. Reminds me of the buzz over model collapse—limitations that were overcome quickly in practice.”
He added that the current approach to reasoning likely has real limitations for a variety of reasons. However, the reasoning approaches themselves were made public less than a year ago. “There are just a lot of approaches that might overcome these issues. Or they may not. It’s just very early.”
Hemanth Mohapatra, partner at Lightspeed India, said that the recent Apple paper, which shows reasoning models struggling with complex problems, confirms what many experts, like Yann LeCun, have long sensed. He acknowledged that while a new direction is necessary, current AI capabilities still promise significant productivity gains.
“We do need a different hill to climb, but that doesn’t mean existing capabilities won’t have huge impact on productivity,” he said.
Meanwhile, Subbarao Kambhampati, professor at Arizona State University, who has been pretty vocal about LLMs’ inability to reason and think, quipped that another advantage of being a university researcher in AI is, “You don’t have to deal with either the amplification or the backlash as a surrogate for ‘The Company’. Your research is just your research, fwiw.”
How the Models Were Tested
Instead of relying on familiar benchmarks, Apple’s team used controlled puzzle environments, such as variants of the Tower of Hanoi, to precisely manipulate problem complexity and observe how models generate step-by-step “reasoning traces”. This allowed them to see not just the final answer, but the process the model used to get there.
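Apple has not released its exact evaluation harness, but the idea behind such puzzle environments is straightforward to illustrate. Below is a minimal sketch, assuming a Tower of Hanoi setup where complexity is scaled by the number of disks and a model’s answer is expressed as a list of (from_peg, to_peg) moves; the function names and answer format are hypothetical, not the paper’s.

```python
# Minimal illustrative sketch of a Tower-of-Hanoi-style puzzle check.
# Complexity is controlled by the number of disks; a model's answer is
# assumed to be a list of (from_peg, to_peg) moves. Names and formats
# here are hypothetical, not Apple's actual evaluation harness.

def initial_state(n_disks: int) -> list[list[int]]:
    # Three pegs; disks numbered n..1 from bottom to top, all on peg 0.
    return [list(range(n_disks, 0, -1)), [], []]

def is_valid_solution(n_disks: int, moves: list[tuple[int, int]]) -> bool:
    pegs = initial_state(n_disks)
    for src, dst in moves:
        if not pegs[src]:                        # nothing to move
            return False
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False                         # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    # Solved when all disks sit on the last peg in order.
    return pegs[2] == list(range(n_disks, 0, -1))

if __name__ == "__main__":
    # The optimal 3-disk solution takes 2**3 - 1 = 7 moves.
    moves = [(0, 2), (0, 1), (2, 1), (0, 2), (1, 0), (1, 2), (0, 2)]
    print(is_valid_solution(3, moves))  # True
```

Scaling the number of disks scales difficulty predictably (the optimal solution needs 2**n − 1 moves), which is what makes puzzles like this useful for controlled testing of step-by-step reasoning rather than just final answers.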
The paper found that for simpler problems, non-reasoning models often outperformed more advanced LRMs, which tended to “overthink” and miss the correct answer.
As the difficulty level rose to moderate, the reasoning models showed their strength, successfully following more intricate logical steps. However, when faced with truly complex puzzles, all models, regardless of their architecture, struggled and ultimately failed.
Rather than putting in more effort as the problems got harder, the models produced shorter and less thoughtful responses, as if they were giving up.
While large language models continue to struggle with complex reasoning, that doesn’t make them useless.
Abacus.AI CEO Bindu Reddy pointed out on X that many people are misinterpreting the paper as proof that LLMs don’t work. “All this paper is saying is LLMs can’t solve arbitrarily hard problems yet,” she said, adding that they’re already handling tasks beyond the capabilities of most humans.
Why Does This Happen?
The researchers suggest that what appears to be reasoning is often just the retrieval and adaptation of memorised solution templates from training data, not genuine logical deduction.
When confronted with unfamiliar and highly complex problems, that apparent reasoning collapses almost immediately, revealing it to be an illusion of thought.
The study makes it clear that current large language models are still far from being true general-purpose reasoners. Their ability to handle reasoning tasks does not extend beyond a certain level of complexity, and even supplying them with the correct algorithm yields only minor improvements.
A Cover-Up for Siri’s Failure?
Andrew White, co-founder of FutureHouse, questioned Apple’s approach, saying that its AI researchers seem to have adopted an “anti-LLM cynic ethos” by repeatedly publishing papers that argue reasoning LLMs are fundamentally limited and lack generalisation ability. He pointed out the irony, saying Apple has “the worst AI products” like Siri and Apple Intelligence, and admitted he has no idea what their actual strategy is.
What This Means for the Future
Apple’s research serves as a cautionary message for AI developers and users alike. While today’s chatbots and reasoning models appear impressive, their core abilities remain limited. As the paper puts it, “despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain complexity thresholds.”
“We need models that can represent and manipulate abstract structures, not just predict tokens. Hybrid systems that combine LLMs with symbolic logic, memory modules, or algorithmic planners are showing early promise. These aren’t just add-ons — they reshape how the system thinks,” said Pradeep Sanyal, AI and data leader at a global tech consulting firm, in a LinkedIn post.
He added that combining neural and symbolic parts isn’t without drawbacks, as it introduces added complexity around coordination, latency, and debugging. But the improvements in precision and transparency make it a direction worth exploring.
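The hybrid systems Sanyal describes broadly follow a propose-and-verify pattern. The sketch below is a minimal illustration of that idea, assuming a Tower of Hanoi-style puzzle like the ones in the paper: a neural proposer (stubbed here) suggests moves, and a symbolic rule checker accepts only legal ones. The `propose_move` callable and all names are illustrative, not part of any system cited in this article.

```python
# Illustrative propose-then-verify loop for a neuro-symbolic setup:
# a neural proposer (stubbed here) suggests candidate moves, and a
# symbolic rule checker accepts only the legal ones. `propose_move`
# is a hypothetical stand-in for an LLM call, not a real API.
import random
from typing import Callable, Optional

State = list[list[int]]   # three pegs, each a stack of disk sizes
Move = tuple[int, int]    # (source peg, destination peg)

def legal_moves(pegs: State) -> list[Move]:
    # Symbolic component: enumerate moves permitted by the puzzle rules.
    moves = []
    for src in range(3):
        if not pegs[src]:
            continue
        for dst in range(3):
            if dst != src and (not pegs[dst] or pegs[dst][-1] > pegs[src][-1]):
                moves.append((src, dst))
    return moves

def solve(pegs: State, propose_move: Callable[[State], Move],
          max_steps: int = 1000) -> Optional[list[Move]]:
    trace: list[Move] = []
    for _ in range(max_steps):
        if not pegs[0] and not pegs[1]:        # everything sits on the last peg
            return trace
        move = propose_move(pegs)              # neural proposal (stubbed)
        if move not in legal_moves(pegs):      # symbolic veto of illegal moves
            continue
        src, dst = move
        pegs[dst].append(pegs[src].pop())
        trace.append(move)
    return None

if __name__ == "__main__":
    # A random proposer stands in for the model; the checker still
    # guarantees that every accepted step obeys the rules.
    random_proposer = lambda pegs: random.choice(legal_moves(pegs))
    print(solve([[3, 2, 1], [], []], random_proposer))
```

In this split, the symbolic layer supplies the guarantees and transparency Sanyal refers to, while in a real system the neural layer would contribute search heuristics rather than random choices.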