
OpenAI has achieved a new milestone in the race to build AI models that can reason their way through complex math problems.
On Saturday, the company announced that one of its models achieved gold medal-level performance on the International Math Olympiad (IMO), widely regarded as the most prestigious and difficult math competition in the world.
“We achieved gold medal-level performance 🥇 on the 2025 International Mathematical Olympiad with a general-purpose reasoning LLM! Our model solved world-class math problems—at the level of top human contestants. A major milestone for AI and mathematics. https://t.co/u2RlFFavyT” — OpenAI (@OpenAI), July 19, 2025
Critically, the model wasn’t designed specifically to solve IMO problems. Earlier systems, such as DeepMind’s AlphaGo, which famously beat one of the world’s top Go players in 2016, were trained on massive datasets within a very narrow, task-specific domain. This model, by contrast, is a general-purpose reasoning model, designed to think through problems methodically in natural language.
“This is an LLM doing math and not a specific formal math system,” OpenAI wrote in its X post. “It’s part of our main push towards general intelligence.”
(Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems. Ziff Davis also owns DownDetector.)
Little is known so far about the model itself. Alexander Wei, the OpenAI researcher who led the IMO work, called it “an experimental reasoning LLM” in an X post that included an illustration of a strawberry wreathed in a gold medal. The strawberry is a nod to the reported “Strawberry” code name behind the company’s o1 family of reasoning models, which debuted in September 2024, suggesting the new system is built on that line.
“To be clear: We’re releasing GPT-5 soon, but the model we used at IMO is a separate experimental model,” OpenAI added on X. “It uses new research techniques that will show up in future models — but we don’t plan to release a model with this level of capability for many months.”
How well did the model perform?
The IMO, which began in 1959, draws national teams of up to six high school contestants from each of more than 100 countries every year.
Contestants must write proof-based solutions to six problems over two days, three problems per 4.5-hour session. Those proofs are assessed by former IMO gold medalists, with unanimous consensus required for each final score. Fewer than 9% of participants earn gold.
According to Wei, OpenAI’s experimental model solved five of the six problems and earned 35 out of a possible 42 points (about 83%), enough for gold. Each proof ran to hundreds of lines of text, tracing the individual steps of the model’s reasoning. In keeping with the competition’s ban on calculators and other external tools, the model had no access to the internet or other tools; it reasoned through each problem step by step on its own.
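To spell out the arithmetic behind that figure, assuming the standard IMO scoring of 7 points per problem (consistent with the 42-point maximum):

\[
5 \times 7 = 35 \text{ points earned out of } 6 \times 7 = 42 \text{ possible}, \qquad \frac{35}{42} \approx 83.3\%.
\]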
The “model thinks for a long time,” Noam Brown, another OpenAI researcher involved in the project, wrote in an X post. “o1 thought for seconds. Deep Research for minutes. This one thinks for hours. Importantly, it’s also more efficient with its thinking.”
Analysts had previously estimated that there was only an 18% chance that an AI system would win gold in the IMO by 2025, according to OpenAI.
The big picture
For all of its impressive abilities, AI has long struggled with simple arithmetic and basic math word problems, tasks one might expect to be relatively straightforward for advanced algorithms. But unlike narrower logical puzzles, math requires a degree of abstract reasoning and conceptual juggling that has been beyond the reach of most AI systems.
That’s been changing, however, at an extraordinarily rapid pace. A little over a year ago, AI models were still being assessed using grade school-level math benchmarks like GSM8K. Reasoning models like o1 and DeepSeek’s R1 quickly excelled, first acing high school competition benchmarks like AIME and then advancing to university-level problems and beyond.
A capacity for high-level mathematics has become the gold standard for reasoning models, because even a small amount of hallucination or corner-cutting quickly and visibly ruins a proof. Sloppiness is easier to get away with in other kinds of output, such as help with a written essay, where the result is far more open to interpretation.
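Part of what makes benchmarks like GSM8K and AIME convenient is that each problem ends in a single checkable number, so a grading script can score a model’s output mechanically, whereas IMO-style proofs have to be read and judged by people. The sketch below is a minimal, hypothetical illustration of that kind of automated check in Python; query_model is a stand-in for whatever API a given lab uses, not a real function.

```python
import re


def query_model(prompt: str) -> str:
    # Hypothetical stand-in for a call to a language model's API; a real
    # harness would send the prompt to a model and return its free-text reply.
    return "Each pack holds 4 pencils, so 3 packs hold 3 * 4 = 12 pencils. Answer: 12"


def extract_final_number(text: str) -> str | None:
    # GSM8K-style grading keys on the last number appearing in the response.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None


def is_correct(problem: str, reference_answer: str) -> bool:
    # Exact-match check: compare the model's final number to the reference.
    response = query_model(problem)
    return extract_final_number(response) == reference_answer


if __name__ == "__main__":
    problem = "A box holds 3 packs of pencils with 4 pencils per pack. How many pencils are in the box?"
    print("correct:", is_correct(problem, "12"))
```

A proof, by contrast, has no single final answer to pattern-match, which is why the IMO-style grading described above still relies on human judges.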
OpenAI’s IMO gold medal shows that a scalable, general-purpose reasoning approach can surpass domain-specific models on tasks long believed to be beyond the reach of current AI systems. As it turns out, you don’t need to build hyperfocused, AlphaGo-like models trained to do nothing but math; a model trained to parse language and reason carefully through its own thought process can, given enough time to think, compete on par with world-class human mathematicians.
According to Brown, the pace of innovation across the AI industry suggests that models’ mathematical and reasoning prowess will only grow from here. “I fully expect the trend to continue,” he wrote on X. “Importantly, I think we’re close to AI substantially contributing to scientific discovery.”