Measuring artificial intelligence. The easy metric that replaced the hard question

A model that solves nearly a hundred per cent of the problems in SWE-bench Verified, breaking in twelve months a barrier that had sat for years at around sixty per cent, can't read a wall clock more than half the time. Both figures come from the same report, Stanford HAI's AI Index Report 2026. That isn't a paradox; it's a clue. The old question, what intelligence is, has been replaced by a more comfortable operation: publish rankings and avoid definitions.

The clock clue

Whoever reads those two numbers calmly soon realises that something doesn't add up in how the industry measures what we call artificial intelligence. Whoever reads them in a hurry will keep repeating headlines with shiny percentages while the question those percentages claim to answer stays unformulated.

The question is an old one. It's had no settled answer since it was first raised. What's new is the contemporary trick. We've replaced the question with the metric. Instead of arguing about what we measure, we publish rankings. Instead of defining, we compare.

The handy thing about the operation is that it produces numbers. The empty thing is that the numbers don't mean what they appear to mean.

The instruments on the table

It's worth looking one by one at the instruments on the table, because they're the ones that appear in the official reports, in the investor presentations and in the headlines when a new model is announced.

MMLU (Massive Multitask Language Understanding), introduced by Hendrycks and others in 2020, is a battery of multiple-choice questions spanning fifty-seven academic subjects, from law to medicine. Frontier models already surpass the average human on it and are closing in on human experts. GPQA (Graduate-Level Google-Proof Q&A), by Rein and others, is a set of questions designed so that searching Google doesn't solve them. It raises the level to doctoral work in the hard sciences, and there too the models beat the human experts in most subdomains.

SWE-bench Verified, building on the SWE-bench benchmark proposed by Jimenez and others (2024), evaluates the ability to close real GitHub tickets, that is, requests someone actually filed to get a real program fixed. HumanEval measures self-contained, function-level programming: closed exercises checked with automatic tests. OSWorld goes further. It sits the model in front of an operating system and has it open applications, move files and complete the kind of tasks any office worker dispatches before coffee. GAIA, proposed by Mialon and others, tries to combine reasoning, search and tool use on long problems.

Each one measures something different. Each one is sold as if it measured intelligence.

Why the industry prefers them

It isn't hard to see why the industry prefers benchmarks (standardised tests for comparing models) to anything else. They produce a number between zero and a hundred. They allow quarter-on-quarter comparison. They generate headlines. They fit on the board meeting's closing slide. And, above all, they move.

Every six months the figure rises and there's a new press conference.

As a marketing product the metric is so efficient that it would be naive to expect it to disappear just because someone points out that the emperor has no clothes. The problem isn't that the metric exists. The problem is that it has been put in the conceptual place where the question used to be.

The clock nobody cites in the headlines

Here comes the clock. The same AI Index 2026 carries the uncomfortable figure. Frontier models read an analogue clock correctly 50.1% of the time.

Tossing a coin gives slightly worse results and is a fair bit cheaper.

A seven-year-old with a bit of patience beats these systems at a task taught in the first year of primary school. How does this picture hold up next to the other one, of the model that solves maths olympiads or passes the medical-school exam? It holds with no contradiction because the two measure different things, even though both are sold under the same word.

The asymmetry that isn't a failure

What the model does well is what it has seen millions of times during training, in similar formats, with question-and-answer structures the web reproduces to exhaustion. What the model does badly is what requires something training doesn't deliver: spatial perception, a continuous correspondence between two physical elements, and a kind of composition that photographic datasets (collections of labelled images used for training) don't annotate, because it occurs to nobody to label how a clock is read.

The asymmetry isn't a failure of capacity. It's an exact outline of what the industry trains for and what it doesn't. The benchmark rewards the first. The clock reveals the second.

Both are in the same report. The second doesn't make the press.

Legg and Hutter, 2007, the warning that was overlooked

Shane Legg and Marcus Hutter published in 2007, in the journal Minds and Machines, an article titled Universal Intelligence: A Definition of Machine Intelligence. Their thesis is simple and painful.

Any operational definition of intelligence is already a choice of tasks.

If you define intelligence as the ability to solve the set of tasks T, you've made two decisions in one. You've decided what counts as a task, and you've decided how much weight to give each in the final sum. The choice isn't neutral, can't be neutral, and there's no privileged set of tasks that's "intelligence in general".
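A toy calculation makes the two decisions visible. The sketch below is only an illustration: the two models, the three tasks, every score and both weightings are invented for the example and come from no real benchmark. It shows that the same per-task scores produce opposite rankings depending on how the tasks are weighted.

```python
# Toy illustration: every score and weight here is invented for this sketch.
scores = {
    "model_a": {"exam_questions": 0.90, "github_tickets": 0.80, "clock_reading": 0.50},
    "model_b": {"exam_questions": 0.75, "github_tickets": 0.70, "clock_reading": 0.95},
}

# Two hypothetical answers to "how much does each task count?"
weightings = {
    "academic_and_code": {"exam_questions": 0.45, "github_tickets": 0.45, "clock_reading": 0.10},
    "everyday_perception": {"exam_questions": 0.20, "github_tickets": 0.20, "clock_reading": 0.60},
}


def aggregate(task_scores, weights):
    """Weighted sum of per-task scores; the weights are assumed to sum to 1."""
    return sum(weights[task] * task_scores[task] for task in weights)


def ranking(weights):
    """Model names ordered from highest to lowest aggregate score."""
    return sorted(scores, key=lambda m: aggregate(scores[m], weights), reverse=True)


for name, weights in weightings.items():
    print(name, ranking(weights))
```

Under the first weighting model_a comes out on top; under the second, model_b does. Nothing about the models changed between the two lines of output, only the decision about which tasks matter.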

What we call intelligence is the result of a cultural negotiation over which problems deserve solving. When that negotiation crystallises into MMLU or SWE-bench, what the benchmark measures is the overlap between the model's capabilities and the priorities of whoever built the benchmark. Calling it intelligence is a rhetorical shortcut that saves debate and delivers product.

The mirror of human IQ

There's a parallel that isn't drawn often enough. Human IQ psychometrics (the discipline that designs tests to measure mental capacities) has been at work for more than a century. It has produced scales across five generations and propped up entire school systems, and it is still the object of open academic debate about what exactly it measures: how much of it is cultural, how much is genetic, how much replicates, and how much is the test's own institutional history reinforcing what the test selects.

Gould and the warning that isn't being read

Stephen Jay Gould published The Mismeasure of Man in 1981, and although the book has drawn legitimate technical criticism, its underlying argument hasn't been knocked down: IQ served, above all, to classify and school people en masse, not to measure the mind.

That a construct like this remains unsettled after a century of discussion should be an obvious warning to anyone aiming to export the same measurement model to machines within five years.

The warning isn't being read. The procedure is being repeated with speed and enthusiasm.

Where human psychometrics at least drags along the discomfort of its debate, artificial psychometrics is born without that trace, presented to the public as though the question were settled.

The reasonable objection

A reasonable objection is that the analogy doesn't quite hold, because measuring a model on a specific task at least tells you something about that task, and that's already valuable information. The objection is correct and that's why it's interesting.

The problem isn't saying "the model solves eighty per cent of the GitHub tickets in SWE-bench Verified". That sentence is exact, checkable and useful.

The problem is the shorthand that replaces it, the one that makes the headlines: "The model matches a human programmer." There's a world of difference between the two sentences. The first describes performance on a sample. The second attributes a general capacity.

The rhetorical move joining the two is made by the industry and amplified by the press, not because there's bad faith in each link but because the first sentence doesn't sell and the second does. The easy metric replaces the hard question because there's a whole market aligned to make the swap pay.
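What "performance on a sample" means can be put in numbers. The sketch below computes a Wilson score interval for a pass rate. The counts are assumptions chosen for illustration (400 tickets solved out of 500), and treating the tickets as independent draws is a simplification, so read it as a rough picture of sampling uncertainty rather than a statement about any published result.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Assumed for illustration: 400 tickets solved out of a sample of 500.
low, high = wilson_interval(400, 500)
print(f"pass rate 0.800, interval {low:.3f} to {high:.3f}")
```

With these assumed counts the interval runs from roughly 0.76 to 0.83. Even that range speaks only about tickets resembling the ones in the sample; it says nothing about programming in general, which is precisely the step the headline skips.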

The contamination nobody names

There's also a technical detail that's rarely mentioned and worth putting in black and white. When a model is evaluated on MMLU, on SWE-bench or on HumanEval, what's being measured isn't exactly its ability to reason about that subject. What's being measured is its performance on a test of which there exist versions, commentaries, forums, papers (academic articles), repositories and discussions that almost certainly form part of the model's training corpus.

The border between what the model "knows" and what the model "has seen" is porous by construction.

The efforts to clean it up — that is, dataset decontamination (filtering from training anything resembling the test), the so-called verified versions, the hold-outs (chunks of data set aside that the model mustn't have seen) — arrive late and with uneven results. The final figure is honest within its definition and misleading outside it.

Whoever reads it as a measure of general capacity is extrapolating into territory the benchmark doesn't cover.
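The porous border can be shown with a very small overlap check. The sketch below is a simplified n-gram comparison written for this article, not the decontamination procedure of any particular lab or benchmark; the function names, the tokenisation and the example texts are all assumptions. It flags a verbatim copy of a test question and misses a paraphrase of the same question completely.

```python
import re


def ngrams(text, n):
    """Set of word n-grams in text, after lowercasing and dropping punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item, training_doc, n=8):
    """Fraction of the benchmark item's n-grams that also occur in the training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)


# Invented example texts; n=3 only because the strings are so short.
item = "What is the capital of France"
verbatim = "Quiz thread: what is the capital of France? Answer: Paris"
paraphrase = "Forum post: which city is France's capital? Paris, obviously"

print(overlap_ratio(item, verbatim, n=3))
print(overlap_ratio(item, paraphrase, n=3))
```

The verbatim copy scores 1.0 and the paraphrase scores 0.0, although both hand the model the answer. A filter of this kind catches what was copied, not what was discussed, which is one reason clean-up efforts give uneven results.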

The field and its broken promises

In Why AI Is Harder Than We Think (arXiv 2104.12871), Melanie Mitchell gives an academic version of what is said here in a more uncomfortable tone. Her argument is that the history of artificial intelligence is paved with promises broken by a misreading of the metrics.

Marcus and Davis had cleared the ground a couple of years earlier in Rebooting AI (2019), describing the same conceptual step repeated in each generation, with systems that win benchmarks and lose the moment the world steps outside the mould. Gary Marcus keeps documenting each new version of the phenomenon in his newsletter Marcus on AI, where week after week he points out where a presentation's shiny figure doesn't survive a question from a journalist who knows the subject.

Stuart Russell, in Human Compatible (2019), demanded for his part that any meaningful evaluation take into account the system's goals and not just its performance on given tasks, a point the public conversation still hasn't taken on board.

Every time a system equals or surpasses humans on a specific task, the field assumes it's close to solving the general problem, and every time it's wrong for the same reason: the specific task wasn't representative of the general problem.

The misunderstanding outside the seminar

The difference is that this time the misunderstanding doesn't happen only among academics.

It happens in parliaments that legislate on the basis of numbers they don't understand, in boardrooms that sign off investments from extrapolations, in newsrooms that headline by the flashiest figure in the press kit.

The political consequence is no small thing. The debate on regulation, on risks, on where to draw the line, operates with numbers that seem to have one meaning and have another. People argue over whether the models are already "more intelligent than humans" with the same seriousness with which they'd argue about a GDP figure, and the analogy is faulty, because GDP at least has an agreed definition and a calculation protocol that can be audited.

AI rankings don't have that consensus. They have versions, patches, controversies over leaked test sets, numbers self-reported by companies with no third party to validate them, and a market that feeds on the figure of the quarter.

What should be measured then

Whoever asks what should be measured runs into an answer no one wants to write on a slide. We don't know. The question is still open. Intelligence, in the strong sense the word drags along, is something not even well defined for humans, and exporting the incomplete definition to machines doesn't settle it but muddies it.

The honest thing would be to say, every time a number is published, that the number measures exactly model X's ability to beat benchmark Y, conditioned on model X having seen during training an indeterminate amount of material related to Y. The sentence is long and not very saleable. That's why it doesn't appear.

What appears is the percentage.

If what we measure isn't intelligence, and if the question of what intelligence is isn't going to be settled soon, the figure is left hanging on the wall, watching, with both hands in a spot the model won't be able to read.

Definitions

Benchmark. A standardised test applied to artificial intelligence models to obtain a number comparable across versions and across companies. It serves to rank, not necessarily to understand.

MMLU. Acronym for Massive Multitask Language Understanding. A battery of multiple-choice questions covering fifty-seven academic subjects, from law to medicine. It measures correct answers on exam-style questions, not comprehension.

GPQA. Graduate-Level Google-Proof Q&A. A set of doctoral-level questions in the hard sciences, designed not to be solved by a direct internet search.

SWE-bench Verified. A refined version of the SWE-bench benchmark, which evaluates a model's ability to solve real tickets drawn from GitHub projects. The verified label indicates the problems have been manually reviewed.

HumanEval. A set of short, self-contained programming problems, each evaluated with automatic tests. It measures isolated, function-level programming, not real development.

OSWorld. A test bench that places the model in front of an operating system and asks it to perform real tasks: opening applications, moving files, completing procedures.

GAIA. A test bench proposed for general assistants. It combines reasoning, the use of external tools and search on long problems that aren't solved with a single operation.

Psychometrics. The discipline that designs and validates tests to measure human mental capacities. It's been in debate for more than a century over what exactly it measures and how much culture each test drags along.

IQ (intelligence quotient). A summary number produced by certain psychometric tests. Its relationship to intelligence, understood in the strong sense, is still a matter of open academic discussion.

Dataset contamination. A situation in which the material used to evaluate a model has slipped, partly or wholly, into the data used to train it. It inflates the results without the model having acquired the capacity it appears to have.

Hold-out. A portion of data set aside before training and reserved to evaluate the model afterwards, with the intention that the model hasn't seen it. In practice, hold-outs frequently leak.

Frontier model. An artificial intelligence system at the upper limit of published capacity at a given moment. The label is commercial as much as technical.

References

Stanford HAI, AI Index Report 2026. The source for the figures on performance in SWE-bench Verified, MMLU and the analogue-clock reading (50.1%). Available at https://hai.stanford.edu/ai-index/2026-ai-index-report.

Legg, S. and Hutter, M. (2007). Universal Intelligence: A Definition of Machine Intelligence. Published in Minds and Machines, volume 17, pages 391-444. Referenced for the argument that any operational definition of intelligence is already a choice of tasks. Preprint available at arXiv:0712.3329.

Hendrycks, D. and others (2021). Measuring Massive Multitask Language Understanding. Presented at ICLR 2021. The original work on the MMLU benchmark. Available at arXiv:2009.03300.

Jimenez, C. E. and others (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Presented at ICLR 2024. The original work on the SWE-bench benchmark. Available at arXiv:2310.06770.

Rein, D. and others (2023). GPQA: A Graduate-Level Google-Proof Q&A Benchmark. Referenced for the doctoral-level evaluation in the hard sciences. Available at arXiv:2311.12022.

Mialon, G. and others (2023). GAIA: A Benchmark for General AI Assistants. Referenced among the test benches combining reasoning, search and tools. Available at arXiv:2311.12983.

Mitchell, M. (2021). Why AI Is Harder Than We Think. Cited for its argument about the historical misreading of metrics in artificial intelligence. Available at arXiv:2104.12871.

Gould, S. J. (1981). The Mismeasure of Man. New York, W. W. Norton. Referenced as a classic critique of the use of human IQ, applicable by analogy to the emerging construct of "artificial IQ".

Russell, S. (2019). Human Compatible. New York, Viking. Referenced in the context of the discussion on the meaningful evaluation of artificial intelligence systems.

Marcus, G. and Davis, E. (2019). Rebooting AI. New York, Pantheon. Referenced as background to the debate on the limitations of current approaches.

Marcus, G., newsletter Marcus on AI. Available at https://garymarcus.substack.com. Referenced as a continued critical voice on the public interpretation of metrics.
