
Measuring Intelligence with ARC

In less than a month, ARC-AGI-2 will launch – and some predict that o3’s score will drop below 30% (down from 87.5% on ARC-AGI-1). Before that happens, I wanted to better understand ARC’s origins and its role as a benchmark of general intelligence.


To do this, I went back to François Chollet’s 2019 paper, On the Measure of Intelligence. In it, he argues that intelligence isn’t about excelling at a single task; it’s about generalizing – learning new skills from minimal examples and applying them across unfamiliar situations. The Abstraction and Reasoning Corpus (ARC) is a research tool Chollet proposes to help us focus on the right problems around a system’s ability to generalize and acquire new skills – an approximation of human-like fluid intelligence. Its tasks are generally easy for humans but have historically been very hard for AI.


My goal with this post is to outline how ARC differs from other benchmarks, why it’s considered the best available measure of general intelligence in AI, and the challenges that led to the need for a new version (to be released next month).


Why ARC is More Than Just Another AI Test

Unlike typical AI benchmarks that measure performance on predefined tasks, ARC pushes AI systems to infer abstract rules and apply them to solve problems they’ve never encountered before. It’s not about hardcoding solutions or memorizing patterns; it’s about reasoning. 


This approach mirrors early psychometric intelligence tests, such as the Binet-Simon scale developed in the early 20th century, and Charles Spearman’s theory of the g factor, or general intelligence. Spearman observed that individuals who performed well on one type of cognitive task often excelled at others. Similarly, ARC aims to assess general problem-solving ability rather than prowess at specific tasks.


But ARC goes further than human IQ tests by stripping away language, real-world imagery, and accumulated knowledge. It focuses purely on innate reasoning skills – those that don’t rely on prior exposure to specific symbols or contexts.


The Core Ingredients of ARC’s Intelligence Test

ARC’s tasks are built around four fundamental cognitive priors – basic assumptions that humans are born with, which guide how we interpret the world:

  1. Objects and Their Behavior. Humans naturally perceive the world as being made up of distinct objects that behave in predictable ways. ARC incorporates this by requiring systems to:

    • Recognize object cohesion – grouping elements based on proximity or shared characteristics (Figure 1).

      Figure 1: Left, objects defined by spatial contiguity. Right, objects defined by color continuity.
    • Assume object persistence – understanding that objects don’t just vanish or appear out of thin air (Figure 2).

      Figure 2: A denoising task.
    • Model object interactions – such as how objects influence each other when they come into contact (Figures 3 and 4).


      Figure 3: The red object "moves" toward the blue object until "contact".
      Figure 4: A task where the implicit goal is to extrapolate a diagonal that "rebounds" upon contact with a red obstacle.
  2. Goal-Directed Actions. While ARC doesn’t explicitly include time, many tasks can be interpreted as representing the start and end states of an intentional process. This mirrors how humans often infer purpose or intent behind observed changes (Figure 5).

    Figure 5: A task that combines the concepts of “line extrapolation”, “turning on obstacle”, and “efficiently reaching a goal” (the actual task has more demonstration pairs than these three).
  3. Numbers and Counting. From an early age, humans have an innate sense of numbers and quantities. ARC reflects this by including tasks that involve counting, comparing, and performing simple arithmetic operations on small numbers (Figure 6).

    Figure 6: A task where the implicit goal is to count unique objects and select the object that appears the most times (the actual task has more demonstration pairs than these three).
  4. Basic Geometry and Topology. Spatial reasoning is another core human ability. ARC tasks require systems to understand geometric concepts like shapes, symmetry, scaling, containment, and connectivity (Figure 7).

    Figure 7: Drawing the symmetrized version of a shape around a marker. Many tasks involve some form of symmetry.

These priors emulate the intuitive knowledge humans bring to abstract problems, ensuring that ARC evaluates genuine reasoning rather than rote memorization. The benchmark explicitly lists the priors it assumes and avoids relying on any information outside them (e.g., acquired knowledge such as language – which is why ARC uses only visual puzzles).
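To make the first of these priors concrete, here is a minimal sketch of how “object cohesion” can be operationalized in code. Nothing below is part of ARC itself – the function name and helper logic are my own – but the grid encoding (a 2D array of integers 0–9, with 0 conventionally treated as background) matches the public ARC dataset:

```python
from collections import deque

def find_objects(grid):
    """Group same-colored, spatially contiguous cells into objects.

    `grid` is a list of lists of ints (ARC encodes colors as 0-9;
    0 is conventionally treated as background). Returns a list of
    objects, each a set of (row, col) coordinates.
    """
    rows, cols = len(grid), len(grid[0])
    seen = set()
    objects = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0 or (r, c) in seen:
                continue
            # Breadth-first flood fill over 4-connected neighbors
            # sharing the starting cell's color.
            color, cells = grid[r][c], set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                cr, cc = queue.popleft()
                cells.add((cr, cc))
                for nr, nc in ((cr - 1, cc), (cr + 1, cc),
                               (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and (nr, nc) not in seen
                            and grid[nr][nc] == color):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            objects.append(cells)
    return objects
```

A solver with a primitive like this can reason about objects – count them, move them, compare them – rather than raw pixels, which is essentially what the priors above describe.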


How ARC Works

ARC includes 1,000 unique tasks, divided into a training set (400 tasks) and an evaluation set (600 tasks). The evaluation set is further split into public and private subsets to prevent developers from pre-training AI systems on the exact tasks being tested.


Each task consists of a few demonstration examples – pairs of input and output grids that illustrate a transformation rule. The goal is for the AI system to infer the underlying rule and apply it to new input grids to generate the correct output.
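For reference, tasks in the public ARC repository (github.com/fchollet/ARC) are stored as JSON files with a “train” list of demonstration pairs and a “test” list of held-out pairs, each grid being a 2D array of the integer color codes mentioned above. Here’s a toy task in that shape – the transformation rule (“recolor 1s to 2s”) is invented for illustration:

```python
# A toy ARC-style task. Real tasks are JSON files with this exact
# shape; the "recolor 1s to 2s" rule here is invented for illustration.
task = {
    "train": [  # demonstration pairs the solver infers the rule from
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[1, 1], [0, 0]], "output": [[2, 2], [0, 0]]},
    ],
    "test": [  # held-out pair(s) the solver must complete
        {"input": [[0, 0], [0, 1]], "output": [[0, 0], [0, 2]]},
    ],
}
```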

Scoring is binary: either the AI system solves a task correctly or it doesn’t. This simplicity ensures clarity but also introduces limitations, such as a lack of granularity in measuring partial understanding.
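In code, that scoring rule is as simple as it sounds. A sketch (the three-attempt limit matches the original 2020 Kaggle competition; later competitions have used fewer):

```python
def task_solved(attempts, target):
    # All-or-nothing: a task counts as solved only if one of the
    # allowed attempts reproduces the target grid exactly.
    return any(attempt == target for attempt in attempts[:3])

def benchmark_score(solved_flags):
    # The overall score is just the fraction of tasks solved.
    return sum(solved_flags) / len(solved_flags)
```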


Challenges and Opportunities

ARC’s approach to testing intelligence isn’t without its hurdles:


  • Generalization vs. Specialization: While ARC aims to measure broad generalization, it doesn’t yet offer a quantitative way to gauge how well systems generalize across varying levels of task complexity.

  • Validity: Can ARC performance reliably predict real-world problem-solving ability? This remains an open question, and large-scale studies involving both humans and AI are needed to validate its effectiveness.

  • Dataset Size: With only 1,000 tasks, ARC’s diversity is limited. There’s a risk that AI systems might exploit unintended patterns or shortcuts rather than genuinely reasoning through problems.


Despite these challenges, ARC represents a significant step forward as a benchmark of general (artificial) intelligence. By focusing on reasoning and adaptability, it offers a fresh perspective on what it means for a machine to be intelligent.


Toward a Broader Understanding of Intelligence

One of the key takeaways from ARC is that intelligence isn’t binary – it exists on a spectrum. A truly intelligent system isn’t one that excels at a specific task but one that can learn, adapt, and generalize across a wide range of tasks. Chollet calls this “information efficiency” – how well a system uses prior knowledge and experience to acquire new skills.
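Chollet formalizes this in the paper using Algorithmic Information Theory; the exact definition is involved, but schematically (my simplification, not the paper’s formula) it reads:

```latex
% Schematic simplification (mine), not the paper's exact definition,
% which is stated in Algorithmic Information Theory terms:
\mathrm{intelligence} \;\propto\;
  \frac{\mathrm{skill\ attained} \times \mathrm{generalization\ difficulty}}
       {\mathrm{priors} + \mathrm{experience}}
```

In other words, the less prior knowledge and experience a system needs to reach a given skill level on sufficiently novel tasks, the more intelligent it is.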


Looking ahead, future iterations of ARC and similar benchmarks may incorporate additional considerations beyond information efficiency. Chollet calls out a few in particular:


  • Computation Efficiency of Skill Programs, which would focus on minimizing computational resource usage during inference (when using the learned skill program). This would be relevant for situations where training data is abundant, but computation during real-world application is costly.

  • Computation Efficiency of the Intelligent System, which would focus on reducing computational costs during training (when creating the skill program). It would be relevant for situations where training resources are limited or expensive.

  • Time Efficiency, which would focus on reducing the time delay in generating skill programs. This would be relevant for time-sensitive tasks where quick responses are critical.

  • Energy Efficiency, which would focus on minimizing energy consumption during training, execution of skill programs, or the learning process. It could be relevant for biological systems and energy-constrained environments.

  • Risk Efficiency, which would focus on ensuring safety when collecting experience, even if it reduces learning speed or resource efficiency. This is becoming increasingly important as AI advances towards AGI. 


o3’s Leap Forward

After releasing his paper in 2019, Chollet speculated that ARC would be difficult for AI systems to beat. To put this to the test, he launched the first ARC-AGI competition on Kaggle in 2020. The top-performing team managed to solve 21% of the tasks in the test set. Fast forward four years, and the best score among frontier LLMs, achieved by o1 (high compute), had only climbed to 32%. This limited progress on ARC-AGI highlights the overall slow pace of advancement toward AGI. From 2020 to early 2024, much of AI research focused on scaling deep learning models, improving performance on specific tasks but making little headway on solving entirely new problems without prior exposure to training data – a capability essential for general intelligence.

 

That changed with o3, which achieved an impressive 87.5% – a significant leap beyond anything any previous model had accomplished. While it’s not true AGI yet (o3 still struggles with some surprisingly simple tasks), it dramatically advances our quest to understand and develop general intelligence in AI.



How did it do this? While we can’t say for certain, here’s what we do know: before o3, LLMs excelled at tasks they had been trained on but struggled with tasks requiring novelty – problems they hadn’t encountered before, where a solution had to be constructed on the fly rather than retrieved from pre-trained knowledge.


o3 appears to overcome this limitation by performing program synthesis at test time. In other words, instead of generating a single solution in one attempt, it searches for the correct sequence of steps (a "program") needed to solve the problem. It tries different "chains of thought" (CoTs) and evaluates them to identify the right approach. If a path fails, it backtracks and explores a new one, repeating this process until it finds a successful solution.


The model likely relies on a base LLM to propose initial reasoning paths and an evaluator model to score those paths – guiding the search by determining whether a path is promising. This iterative cycle of searching, evaluating, and backtracking allows o3 to solve tasks it has never encountered before.
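To be explicit that this is informed speculation rather than a published design, here is a schematic sketch of such a propose-evaluate-backtrack search. The callables (propose_steps, evaluate_path, is_complete) are hypothetical stand-ins, not real OpenAI APIs:

```python
import heapq
import itertools

def search_solution(task, propose_steps, evaluate_path, is_complete,
                    max_expansions=1000):
    """Schematic best-first search over chains of thought (CoTs).

    Hypothetical interfaces, not real APIs:
      propose_steps(task, path)  - base LLM proposing next reasoning steps
      evaluate_path(task, path)  - evaluator model scoring a partial path
      is_complete(task, path)    - whether a path is a full solution
    """
    tie = itertools.count()  # tiebreaker so the heap never compares paths
    frontier = [(0.0, next(tie), [])]  # (negated score, tiebreak, path)
    for _ in range(max_expansions):
        if not frontier:
            return None  # every path was a dead end
        _, _, path = heapq.heappop(frontier)  # most promising path so far
        if is_complete(task, path):
            return path
        # Expanding the best path and shelving alternatives on the heap
        # is what "backtracking" amounts to here: if this branch stalls,
        # a previously shelved path bubbles back to the top.
        for step in propose_steps(task, path):
            new_path = path + [step]
            score = evaluate_path(task, new_path)
            heapq.heappush(frontier, (-score, next(tie), new_path))
    return None  # search budget exhausted
```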


ARC-AGI-2 & AGI Scaling

Despite ARC-AGI-1’s success as a benchmark, it is becoming saturated – besides o3’s new score, a large number of low-compute Kaggle solutions can now score above 80% on the private eval when ensembled. As such, a new version – ARC-AGI-2, in the works since 2022 – is set to be released in late February 2025. It promises a major reset of the state of the art, pushing the boundaries of AGI research with hard, high-signal evals that highlight current AI limitations – potentially reducing o3’s score to under 30% even at high compute (while a smart human could still score over 95% with no training).


Until its release, here are some other resources I recommend taking a look at:

  • This interview with François Chollet from the other week

  • ARC Prize 2024 Technical Report, which includes competition results, solution approaches, and technical analysis, is here

  • Mikel Bober-Irizar’s thread on o3’s weaknesses vis-à-vis its performance on ARC. (Hint: token size plays a big role in limiting its capabilities.)


What does this mean for AGI scaling? The challenge isn’t just about training massive models – it’s about making them information-efficient, adaptable, and energy-aware.


If you want to talk more about this topic, please reach out via X or LinkedIn.


