One of the best ways to evaluate an AI model is to put it to the test on problems that stymie skilled or experienced humans.
We see this with math data sets made for Ph.D.-level problem solvers, and with other data sets that show off a model’s reasoning capabilities. But there’s another way to check progress with LLMs as well.
It’s called ARC AGI, or the Abstraction and Reasoning Corpus for Artificial General Intelligence, and it’s been around since 2019, when Francois Chollet introduced it in his paper “On the Measure of Intelligence.” The ARC AGI benchmark measures general (“fluid”) intelligence, the ability to reason, adapt, and solve novel problems efficiently, rather than the recall of memorized or domain-specific knowledge. Since then, it’s been a gold standard for gauging how good machines are at solving abstract problems.
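To make that concrete, here’s a toy sketch in Python of what an ARC-style task looks like. The grids, the mirror rule, and the demonstration pair below are invented for illustration, not taken from the actual corpus; real ARC tasks work the same way, though, showing a few input/output grid pairs and asking the solver to infer the transformation and apply it to a fresh test input.

```python
# Toy illustration of an ARC-style task (hypothetical, not a real ARC puzzle).
# Grids are small arrays of color codes 0-9.

def mirror_horizontal(grid):
    """Candidate rule: flip each row left to right."""
    return [list(reversed(row)) for row in grid]

# Demonstration pairs: the worked examples shown for a single task.
demos = [
    ([[1, 0, 0],
      [0, 2, 0]],
     [[0, 0, 1],
      [0, 2, 0]]),
]

# A solver gets credit only if its inferred rule reproduces every demo pair...
rule = mirror_horizontal
assert all(rule(inp) == out for inp, out in demos)

# ...and then generalizes to the held-out test input.
test_input = [[3, 0], [0, 4]]
print(rule(test_input))  # [[0, 3], [4, 0]]
```

The point of the benchmark is that the rule changes with every task, so a solver can’t memorize its way through; it has to infer each transformation from just a handful of examples.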
Further iterations of ARC AGI are in the works, too: after releasing ARC AGI 2 earlier this year, Chollet and his team are now building a new set of tasks called ARC AGI 3.
The ARC AGI 3 set is different from what came before. It consists of small games, without instructions.
The human or AI agents using this set are supposed to intuit what each game requires just by playing around with it visually. You can try some of these cryptic little games online at the ARC Prize website, and sure enough, with a little work, you should be able to figure out what each game wants you to do.
Metrics with OpenAI Models
Here’s the kicker: OpenAI’s new o3 model has scored very high on the ARC AGI 1 set, something like 85%, a result widely seen as another step toward artificial general intelligence, or AGI. But the picture is less tidy than that headline number suggests. On ARC AGI 2, o3 has so far managed only about 3%, and it hasn’t been tested at all on set 3, which is still in development.
That’s important to know, because when an AI tests high on ARC AGI 3, we’ll be much further along toward the singularity, or AGI.
Chollet Weighs In
In an explainer video running some 35 minutes, Chollet takes the stage to talk about his creation of the original ARC AGI set, what’s happened since then, a lot of the theory behind the test set, and the overall context. There’s far too much in his talk to cover here, but, for example, Chollet goes over the difference between two definitions of artificial intelligence: one attributed to Marvin Minsky (which I talk about a lot) that characterizes AI as the mimicry of human brains, and another espoused by John McCarthy, who argued that AI progress essentially involves machines being able to adapt to new realities and previously unknown tasks.
“There’s a big difference between memorized skills, which are static and task specific, and fluid general intelligence, the ability to understand something you’ve never seen before, on the fly,” he says.
He also references the techniques used by these new models:
“In 2025, we have suddenly moved on from the pre-training scaling paradigm,” Chollet adds. “I mean, we’re now fully in the era of test-time adaptation. Test-time adaptation is all about the ability of the model to modify its own behavior, dynamically, based on the specific data it encounters during inference. That covers techniques like test-time training, program synthesis, chain-of-thought synthesis, where the model tries to reprogram itself for the task at hand. And today, every single AI approach that performs well on ARC is using one of these techniques.”
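To give a rough flavor of what program synthesis at test time means, here’s a deliberately simplified Python sketch. The primitive operations and the brute-force search below are hypothetical illustrations, not how o3 or any other real system works; the point is just the loop Chollet describes, in which the solver assembles a small program at inference time that explains the task’s demonstration pairs.

```python
import itertools

# Simplified sketch of test-time program synthesis over ARC-like grids.
# At inference time, search compositions of primitive grid operations
# for a program that reproduces all of the task's demonstration pairs.

PRIMITIVES = {
    "flip_h": lambda g: [row[::-1] for row in g],       # mirror left-right
    "flip_v": lambda g: g[::-1],                        # mirror top-bottom
    "transpose": lambda g: [list(c) for c in zip(*g)],  # swap rows/columns
}

def synthesize(demos, max_depth=3):
    """Return the first composition of primitives consistent with all demos."""
    for depth in range(1, max_depth + 1):
        for names in itertools.product(PRIMITIVES, repeat=depth):
            def program(grid, names=names):
                for name in names:
                    grid = PRIMITIVES[name](grid)
                return grid
            if all(program(inp) == out for inp, out in demos):
                return names, program
    return None, None

# Example: the hidden rule is a 90-degree clockwise rotation.
demos = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
names, program = synthesize(demos)
print(names)                      # e.g. ('flip_v', 'transpose')
print(program([[5, 6], [7, 8]]))  # [[7, 5], [8, 6]]
```

Real systems replace this brute-force search with learned models that propose candidate programs or reasoning chains, but the underlying loop, propose a solution, check it against the demonstrations, refine, is the test-time adaptation Chollet is pointing at.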
Using a metaphor, Chollet asks us to imagine two things: a network of roads, and a road-building company. If you have a road network, he points out, you can go various places. If you have a road-building company, you can create new roads, new routes from point A to point B, and so on.
“Don’t confuse the roads, and the process that created the roads,” he says.
The Agentic Contest
As with prior benchmarks, developers will be bringing new approaches to the table, hoping to score well on challenges like ARC AGI 3. The associated ARC Prize competition also hands out prize money for top results, which is part of why this corner of the internet gets the attention it does, as those close to the process push to achieve better scores with a particular AI engine.
In that context, I think we should realize what we’re looking at here. For anyone who is concerned about AGI, ARC gives us a way to test, to see how far we are on this journey. Perhaps by 2027, models will be able to score highly on set 3, or perhaps not. Everything is happening very, very quickly. Stay tuned.