Reports of LLMs mastering math have been greatly exaggerated

By Ernest Davis and Gary Marcus · April 5, 2025


The USA Math Olympiad (USAMO) is an extremely challenging math competition for the top US high school students; the top scorers get prizes and an invitation to the International Math Olympiad. The USAMO was held this year on March 19-20. Hours after it was completed, so that there could be virtually no chance of data leakage, a team of scientists gave the problems to some of the top large language models, whose mathematical and reasoning abilities have been loudly proclaimed: o3-Mini, o1-Pro, DeepSeek R1, QwQ-32B, Gemini-2.0-Flash-Thinking-Exp, and Claude-3.7-Sonnet-Thinking. The proofs output by all of these models were evaluated by experts. The results were dismal: none of the AIs scored higher than 5% overall.
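The design of the study is worth spelling out, because the timing is what rules out data leakage. Here is a minimal sketch of the collection step, assuming a hypothetical `call_model` helper and a hand-transcribed problem file (both placeholders, not the study's actual code); the grading itself, as in the report, is left to human experts:

```python
import json
from pathlib import Path

# Hypothetical stand-in: wrap whichever API each vendor exposes
# (OpenAI, Anthropic, Google, etc.); nothing below depends on the vendor.
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up the vendor's client here")

MODELS = [
    "o3-mini", "o1-pro", "DeepSeek-R1", "QwQ-32B",
    "Gemini-2.0-Flash-Thinking-Exp", "Claude-3.7-Sonnet-Thinking",
]

# Problem statements transcribed by hand right after the contest,
# e.g. {"P1": "...", "P2": "..."} in a local JSON file (hypothetical path).
problems = json.loads(Path("usamo_2025.json").read_text())

# Collect each model's attempted proof. Grading happens later, by human
# experts, since no script can reliably judge a natural-language proof.
for model in MODELS:
    for pid, statement in problems.items():
        proof = call_model(model, f"Give a complete, rigorous proof.\n\n{statement}")
        out = Path("outputs") / model / f"{pid}.txt"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(proof)
```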

To be sure, a poor showing on the USAMO is not in itself a shameful result. These problems are awfully difficult; many professional research mathematicians have to work hard to find the solution. What matters here is the nature of the failure: the AIs were never able to recognize when they had not solved the problem. In every case, rather than give up, they confidently output a proof that had a large gap or an outright error. To quote the report: “The most frequent failure mode among human participants is the inability to find a correct solution. Typically, human participants have a clear sense of whether they solved a problem correctly. In contrast, all evaluated LLMs consistently claimed to have solved the problems.”

The refusal of these kinds of AI to admit ignorance or incapacity, and their obstinate preference for generating incorrect but plausible-looking answers instead, is one of their most dangerous characteristics. It is extremely easy for a user to pose a question to an LLM, get what looks like a valid answer, and then trust it, without doing the careful inspection necessary to check that it is actually right.

If this kind of technology becomes commonly used to answer difficult questions before the problem of generating invalid answers is fixed, we will be in serious trouble. Getting AIs to answer “I don’t know” is one of the most important unsolved challenges facing the field.

Importantly, the neurosymbolic method used by DeepMind’s AlphaProof and AlphaGeometry systems (which we discussed recently), and which (more or less) achieved silver-medal-level performance on the 2024 International Math Olympiad, is immune to this problem. AlphaProof and AlphaGeometry generate a completely detailed symbolic proof that can be fed into a formal proof verifier. They can fail to find a proof, but they cannot generate an incorrect one. That is because they rely in part on powerful, completely hand-written symbolic reasoning systems. LLMs are not similarly immune.
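The guarantee comes from the checker, not from the search. A toy illustration in Lean (the formal language AlphaProof builds on): a complete proof compiles, while an attempt with a gap is flagged as unproved rather than waved through. Under this regime there is simply no way to emit a plausible-looking but invalid proof.

```lean
-- A complete proof: the verifier accepts it.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- An attempt with a gap: `sorry` marks the missing step, and Lean
-- reports the theorem as unproved instead of quietly accepting it.
theorem sum_comm_gap (a b : Nat) : a + b = b + a := by
  sorry
```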

§

On X, there have been a couple of reactions to the abysmal showing of the LLMs, attempting to defend them, that are worth discussing. AI researcher Lucas Nestler went so far as to accuse the authors of having faked their results, claiming that o3 did better.

But it turns out that the o3-generated proof of problem 2 that he posted is itself invalid. (The rather technical details, if you’re interested, are included at the end of this essay.)

So Nestler’s experiment does not contradict the finding of the report; it corroborates it. o3, yet again, produced an invalid answer to this problem. It also confirms how dangerous this kind of failing is. The AI output a “proof” that looked plausible enough to lead Nestler to make a fool of himself in public by outrageously accusing reputable scientists of fakery. (To a trained mathematician, the error in the proof Nestler posted is pretty obvious once pointed out.)

A more serious response was that of Kevin Bryan, an economist at the University of Toronto, who argued in effect that better prompting would have yielded better results.

There is certainly some truth to Bryan’s comments. It has often been found that suitable “prompt engineering” can improve the performance of LLMs on reasoning tasks. But there are a few points to be made about this.

First, the burden of proof here is surely on those who would claim that better prompts will get better results. Maybe yes, maybe no; one really wants the evidence. (As the example of Lucas Nestler illustrates, they will have to be careful in evaluating those results.) And importantly, the AIs used in this experiment were already specifically designed by their creators to deal with complex reasoning tasks. (For example, the systems likely already include “hidden prompts” encouraging the AI to work carefully and so on, to the degree that the creators of the AI found useful; a sketch of that layering follows below.) It is by no means certain that additional user prompts will add much.
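To make the “hidden prompt” point concrete: chat-style APIs generally let the vendor or deployer prepend a system message the end user never sees, so careful-reasoning instructions may already be in place before any user-side prompt engineering starts. A minimal sketch, with a hypothetical `chat` helper standing in for a real client:

```python
# Hypothetical helper standing in for a real chat-completion client.
def chat(messages: list[dict]) -> str:
    raise NotImplementedError("wire up a real API client here")

# The deployer's hidden prompt: always present, invisible to the user.
SYSTEM = (
    "You are a careful mathematician. Reason step by step, verify "
    "every claim, and say 'I don't know' if you are not certain."
)

def ask(problem: str, extra_prompting: str = "") -> str:
    # Whatever the user adds is layered on top of, not instead of,
    # the instructions the system message already contains.
    user = f"{extra_prompting}\n\n{problem}".strip()
    return chat([
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user},
    ])

# Example (once `chat` is wired up):
#   ask("Prove that the square root of 2 is irrational.",
#       extra_prompting="Think very carefully before answering.")
```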

Second, the “You didn’t say ‘Pretty please’!” sort of excuse that LLM enthusiasts always offer for failures in their favored AIs is only sometimes pertinent. It is, maybe, relevant for judging the usefulness of AIs to regular users of the technology who will take the time to optimize their query style and who are experienced in crafting such prompts. Someone who, day in and day out, is using AI to write code or reports or whatever will take the time to learn what kinds of prompts get the best and most reliable results. But what about the hundreds of millions of users who are not specialists in prompting? By now, naïve users expect that, when they ask a question, they will get a reliable answer. Reality is otherwise.

Moreover, however you slice it, this need of the AIs for special cajoling before they will think carefully about the questions presented to them is a serious weakness, and the demand for special prompting tilts the scale. The high school contestants in the USAMO do not have to be encouraged to think hard about the problems; they are altogether aware that that is necessary. If the AIs could indeed solve more of these problems with better prompts, then that is evidence in favor of their mathematical ability, but it is evidence against their ability to judge, on their own, the right thing to do.

Third, and most important: as we said before, the problem here isn’t that the AIs can’t solve the problems. The problem is that they sometimes claim to have solved problems that they have not properly solved; as with hallucinations, this reflects a deeply problematic failure to sanity-check their own work.

The really important challenge is not to get the AIs to solve more USAMO problems; it is to get them to say “I give up” when they can’t. And we have yet to see any evidence that any kind of prompt helps in that regard.

Ernest Davis and Gary Marcus are sorry to have to break bad news, once again.


