New Apple Study Challenges Whether AI Models Truly “Reason” Through Problems

In early June, Apple researchers released a study suggesting that simulated reasoning (SR) models, such as OpenAI’s o1 and o3, DeepSeek-R1, and Claude 3.7 Sonnet Thinking, produce outputs consistent with pattern-matching from training data when faced with novel problems requiring systematic thinking. The findings resemble those of an April study that tested these same models on problems from the United States of America Mathematical Olympiad (USAMO), where they achieved low scores on novel mathematical proofs.

The new study, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” comes from a team at Apple led by Parshin Shojaee and Iman Mirzadeh, and it includes contributions from Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar.

The researchers examined what they call “large reasoning models” (LRMs), which attempt to simulate a logical reasoning process by producing a deliberative text output sometimes called “chain-of-thought reasoning” that ostensibly assists with solving problems in a step-by-step fashion.

To do that, they pitted the AI models against four classic puzzles—Tower of Hanoi (moving disks between pegs), checkers jumping (eliminating pieces), river crossing (transporting items with constraints), and blocks world (stacking blocks)—scaling them from trivially easy (like one-disk Hanoi) to extremely complex (20-disk Hanoi requiring over a million moves).
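To give a sense of how quickly that scaling grows, here is a minimal Python sketch of the textbook recursive Tower of Hanoi solution, together with the standard closed-form minimum move count of 2^n - 1. The function names `hanoi_moves` and `min_moves` and the peg labels are illustrative assumptions, not code from the Apple paper.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Yield the optimal (disk, from_peg, to_peg) moves for n disks."""
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way, onto the spare peg.
    yield from hanoi_moves(n - 1, source, spare, target)
    # Move the largest disk directly to the target peg.
    yield (n, source, target)
    # Move the n-1 smaller disks from the spare peg onto the largest disk.
    yield from hanoi_moves(n - 1, spare, target, source)


def min_moves(n):
    """Closed-form minimum number of moves for n disks: 2**n - 1."""
    return 2**n - 1


if __name__ == "__main__":
    for disks in (1, 3, 10, 20):
        print(f"{disks:>2} disks: {min_moves(disks):,} moves")
```

With one disk the puzzle takes a single move, while 20 disks take 1,048,575 moves, which is where the "over a million" figure comes from.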

Figure 1 from Apple’s “The Illusion of Thinking” research paper. Credit: Apple

“Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy,” the researchers write. In other words, today’s tests only care if the model gets the right answer to math or coding problems that may already be in its training data—they don’t examine whether the model actually reasoned its way to that answer or simply pattern-matched from examples it had seen before.
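One way to picture the difference between checking only a final answer and checking the steps is a puzzle validator that replays every move. The Python sketch below is a hypothetical illustration for Tower of Hanoi, assuming moves are given as (from_peg, to_peg) pairs on pegs labeled A, B, and C. It is not the evaluation code used in the study.

```python
def is_valid_solution(n, moves):
    """Replay (from_peg, to_peg) moves; return True only if every move is
    legal and all n disks end up on peg C."""
    # Each peg is a stack; the last element is the top disk (1 = smallest).
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}
    for src, dst in moves:
        if src not in pegs or dst not in pegs:
            return False  # unknown peg label
        if not pegs[src]:
            return False  # no disk to move from the source peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False  # a larger disk would land on a smaller one
        pegs[dst].append(pegs[src].pop())
    # The final state must have the full tower on the target peg.
    return pegs["C"] == list(range(n, 0, -1))


if __name__ == "__main__":
    print(is_valid_solution(2, [("A", "B"), ("A", "C"), ("B", "C")]))
    print(is_valid_solution(2, [("A", "C"), ("A", "C")]))
```

A checker like this rejects a move list that breaks a rule partway through, even when the list has the expected length, which a final-answer comparison alone would not catch.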

Ultimately, the researchers found results consistent with the aforementioned USAMO research, which showed that these same models scored mostly under 5 percent on novel mathematical proofs, with only one model reaching 25 percent and not a single perfect proof among nearly 200 attempts. Both research teams documented severe performance degradation on problems requiring extended systematic reasoning.
