Artificial Intellisense
The hype around AI has underscored the need for a genuine measure of machine intelligence, one that could settle the AI-versus-human debate.

AI hype or true intelligence? Researcher rolls out tool to measure real capabilities

Posted on April 11, 2025

As artificial intelligence systems write symphonies and robots learn to mimic human expressions, one fundamental question persists: Are these machines truly intelligent, or are they just elaborate mimics?

Google researcher François Chollet believes he has created a way to find out. His Abstraction and Reasoning Corpus, known as ARC, aims to measure real intelligence rather than trained responses, challenging the bold claims from leading AI labs.

Now, as companies like OpenAI and Google DeepMind race toward artificial general intelligence (AGI), Chollet’s test has become the center of an industry-wide debate about what genuine machine intelligence would actually look like — and whether today’s most advanced systems are anywhere close to achieving it.

A puzzle for the mind, not the memory

The ARC test features a deceptively simple design but proves extraordinarily difficult for machines to solve. Each challenge presents a series of colored grid patterns showing a before-and-after transformation. The AI must figure out the underlying rule from just a few examples, then apply that rule to a new pattern.

There are no instructions, no text prompts, and no hints. Unlike language models that can draw on vast amounts of training data, ARC provides only minimal examples, typically two or three.
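To make the task format concrete, here is a minimal sketch of an ARC-style problem, assuming a simple JSON-like layout of "train" and "test" pairs of integer grids (the grid values and the candidate rule below are illustrative, not taken from the actual corpus):

```python
# A toy ARC-style task: each training pair shows an input grid and the
# grid it transforms into; the solver must induce the rule and apply it
# to a held-out test input.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [{"input": [[3, 3], [0, 3]]}],
}

def hypothesized_rule(grid):
    """Candidate rule induced from the examples: mirror each row."""
    return [row[::-1] for row in grid]

# The rule must reproduce every training pair...
assert all(hypothesized_rule(p["input"]) == p["output"] for p in task["train"])

# ...and is then applied to the unseen test input.
prediction = hypothesized_rule(task["test"][0]["input"])
print(prediction)  # [[3, 3], [3, 0]]
```

The point of the format is that nothing identifies which rule applies; the solver must infer it from two or three demonstrations alone.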

“It’s like giving the AI a few math problems and asking it to invent the formula,” Chollet explains. “There’s no room for memorization or pattern-matching based on training data. You either reason it out, or you fail.”

This approach differs dramatically from standard AI benchmarks, such as MMLU (Massive Multitask Language Understanding) or Big-Bench Hard, which test how well AI systems can answer complex questions or complete difficult tasks. Chollet argues that these conventional tests often reward systems that have memorized patterns rather than genuinely understood concepts.

“Most benchmarks test the *output*, not the *process*,” he says. “ARC is about process. It’s about whether a system can learn to learn.”

In essence, ARC evaluates fluid intelligence — the ability to solve novel problems without relying on previous knowledge — rather than the crystallized intelligence that comes from accumulated information.


Chollet vs the AGI hype machine

The ARC test represents more than just another benchmark in the crowded AI evaluation landscape. It stands as a direct challenge to claims made by leading AI developers that their systems are approaching human-like general intelligence.

Chollet has emerged as one of the industry’s most respected skeptics. In numerous publications and presentations, he argues that popular AI systems like GPT, Claude, and Gemini fundamentally lack the qualities essential for genuine intelligence.

“Language models are not agents,” he maintains. “They are not thinking in any meaningful way. They are completing a pattern. AGI will require something qualitatively different.”

He describes current AI models as “vast look-up tables” that approximate intelligence through statistical analysis of massive datasets rather than through actual understanding. In his view, true intelligence requires causal reasoning, theory development, and the ability to connect concepts across domains — capabilities that today’s most advanced systems lack.

His position has drawn both support and criticism from across the tech industry. Supporters praise his realistic assessment amid what they consider overblown claims, while critics suggest he sets an unnecessarily high bar for what constitutes “real” intelligence.

Regardless of where they stand, few AI researchers dismiss Chollet’s perspective entirely. As the creator of Keras, one of the world’s most widely used deep learning frameworks, his technical credentials give significant weight to his theoretical positions.

Who’s taking the test—and who’s winning?

Chollet released the ARC benchmark in 2019 and updated it in 2023. The collection contains 1,000 reasoning problems, divided between a public training set and a private test set designed to prevent AI systems from being specifically optimized for the challenges.

The benchmark sets a high bar. Chollet estimates typical human performance at around 85% and suggests that 80% accuracy would represent human-level general intelligence on the test. So far, no AI system has managed to solve even 30% of the problems correctly, with the most advanced language models achieving less than 20%.
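The scores above follow from ARC's strict grading: a task counts as solved only when the predicted output grid matches the target exactly, cell for cell. A minimal sketch of that scoring, assuming one attempt per task (the official harness allows a small number of attempts):

```python
def arc_score(predictions, solutions):
    """Fraction of tasks solved under exact-match grading: a prediction
    earns credit only if every cell of the grid is correct."""
    correct = sum(pred == sol for pred, sol in zip(predictions, solutions))
    return correct / len(solutions)

# Two tasks: the first grid matches exactly, the second differs in one cell.
preds = [[[1, 0], [0, 1]], [[2, 2], [2, 2]]]
sols  = [[[1, 0], [0, 1]], [[2, 2], [0, 2]]]
print(arc_score(preds, sols))  # 0.5
```

There is no partial credit for a nearly-right grid, which is part of why scores stay low: approximate pattern-matching is not rewarded.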

However, recent developments have shown promising approaches. In 2024, French AI laboratory Kyutai developed a system called Sweeper that uses structured reasoning techniques and self-refinement to tackle ARC problems. While still far from human-level performance, Sweeper represents a new category of AI that combines search algorithms, abstraction capabilities, and planning, precisely the cognitive skills ARC was designed to evaluate.


Major AI laboratories, including Anthropic, Google DeepMind, and OpenAI, have explored similar “modular” strategies. Rather than relying on a single model, these approaches employ multiple specialized components — such as image analyzers, rule identifiers, and logic processors — working together on each problem.
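The modular strategy described above can be sketched in miniature: hypothetical stand-ins for an image analyzer, a rule identifier, and a logic processor, chained together by hand rather than learned end to end. All function names here are illustrative, not drawn from any lab's actual system:

```python
def analyze(grid):
    """Image-analyzer stand-in: extract the set of colors present."""
    return {cell for row in grid for cell in row}

def propose_rules():
    """Rule-identifier stand-in: enumerate simple candidate transforms."""
    return [
        ("flip_rows", lambda g: g[::-1]),
        ("flip_cols", lambda g: [row[::-1] for row in g]),
        ("identity",  lambda g: g),
    ]

def solve(train_pairs, test_input):
    """Logic-processor stand-in: keep the first candidate rule that is
    consistent with every training pair, then apply it to the test input."""
    for name, rule in propose_rules():
        if all(rule(p["input"]) == p["output"] for p in train_pairs):
            return name, rule(test_input)
    return None, None

pairs = [{"input": [[1, 2]], "output": [[2, 1]]}]
print(solve(pairs, [[5, 6]]))  # ('flip_cols', [[6, 5]])
```

This is the engineering pattern Chollet objects to in the passage that follows: the search space of rules is hand-coded by the engineer, not discovered by the system itself.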

Chollet remains skeptical of these efforts.

“When you have 10 different specialized systems and a human engineer stringing them together, it’s not really AGI,” he points out. “It’s intelligence by committee. True general intelligence will look more like a child learning from scratch, not a team of graduate students hard-coding tricks.”

This distinction highlights the central question in AGI research: Should we measure intelligence by outcomes alone, or must we also consider how those outcomes are achieved?

Why ARC matters in 2025

As governments worldwide develop AI regulations and companies invest billions in AGI research, the ability to accurately measure progress toward general intelligence has never been more important. Chollet’s test offers a reality check against the optimistic timelines often presented by tech companies and investors.

“We need to distinguish between systems that *perform* well and those that *understand* what they’re doing,” Chollet emphasizes. “ARC is one way to do that.”

The benchmark has quietly gained traction within the industry. Several major AI laboratories now include ARC in their internal testing procedures, though few publicly share their results. OpenAI has acknowledged that “ARC-style tasks are increasingly relevant for understanding reasoning capabilities,” while Meta AI uses ARC problems to evaluate its multi-modal systems combining vision and language.

Industry analysts suggest ARC’s influence extends beyond technical evaluation to shape how companies communicate about their AI systems. The test’s widespread recognition has made some organizations more cautious about making broad intelligence claims that their systems cannot substantiate through reasoning-based benchmarks.

“ARC has become the inconvenient yardstick that keeps the industry honest,” says Dr. Elena Martinez, an AI ethics researcher at Stanford University. “When a company claims their system shows ‘general intelligence,’ the first question many experts now ask is: ‘What’s its ARC score?’”

Not just for machines

Perhaps the most intriguing aspect of ARC is that it wasn’t designed exclusively for evaluating machines. The test challenges human problem-solving abilities as well, with some problems stumping even experienced puzzle enthusiasts while others are solved quickly by children.

This dual-purpose nature reveals something fundamental about intelligence itself. According to Chollet, genuine intelligence isn’t about data storage or processing speed, but rather the ability to identify patterns and relationships from minimal information.

“Intelligence,” he says, “isn’t just knowing things. It’s about figuring out new things.”

This perspective aligns with cognitive science research suggesting that humans excel at “one-shot learning” — the ability to grasp concepts from just one or two examples. By contrast, even the most advanced AI systems typically require thousands or millions of examples to learn similar concepts.


Psychologists who have studied human performance on ARC problems note that people often solve them through analogical reasoning — recognizing similarities between the current problem and previously encountered situations — a cognitive skill that machines have struggled to replicate.

“What makes ARC fascinating is how it highlights the gap between human and machine reasoning,” explains Dr. James Wilson, cognitive scientist at MIT. “Humans can look at these patterns and make intuitive leaps that no AI can currently match, despite machines having vastly more computational power.”

The road ahead: ARC 2.0 and beyond

As of 2025, Chollet is developing an enhanced version of the ARC benchmark. This update will incorporate dynamic visual reasoning challenges, three-dimensional environments, and multi-step planning problems to more comprehensively test intelligence across different domains.

He is also collaborating with cognitive scientists to validate how well ARC correlates with human problem-solving strategies. This research aims to ensure the benchmark remains grounded in actual human cognition rather than arbitrary computational challenges.

“The goal isn’t to create an impossible test,” Chollet clarifies. “It’s to create a meaningful one — something that measures intelligence as we understand it, not just computational performance.”

Some researchers suggest that ARC’s importance may grow as AI development continues. As language models approach performance ceilings on conventional benchmarks, tests of abstract reasoning could become the new frontier for measuring progress.

For now, ARC remains the most rigorous intelligence assessment that no artificial system has mastered. But as AI architectures evolve and research teams develop new approaches to machine reasoning, the test may eventually serve as the definitive measure for determining when machines have truly achieved human-like intelligence.

Until that happens, Chollet’s benchmark stands as both a challenge to AI developers and a reminder of what makes human cognition unique — the ability to see beyond the surface, make connections across domains, and understand the world through remarkably little information.

What do you think defines true intelligence in AI systems? Have you tried solving ARC-style reasoning puzzles yourself? Share your thoughts on how we should measure artificial intelligence and whether current AI systems deserve the “intelligent” label.

Please comment below.
