You know two people. One earned top marks in high school and finished a degree without failing a single subject. The other never finished middle school and has spent forty years in a remote village up north. One February night, with snow cutting off the road, a cow has a difficult calving. The vet is two hours away. Guess which of the two gets the calf out alive. Now guess which of the two more closely resembles, in how it works, a state-of-the-art artificial intelligence model.
The farmer reaches in, feels around, turns, waits, turns again. The calf comes out alive. And the cow.
The honors student, in the same situation, would call someone. With no phone coverage, they'd cry. If on top of that they knew the theory of bovine calving from having read it somewhere, they'd try to apply a mental scheme that would fall apart the moment the animal did something unforeseen. It's not a cruel thought experiment. It's a case anyone with family in the countryside has seen. And it raises an old question that now stings more: what do we mean by intelligence when two people produce such different results the moment the world leaves the script?
The technical problem, without the metaphor
That same question, framed in machine-learning terms, has had a name since 2021. A Stanford team published WILDS: A Benchmark of in-the-Wild Distribution Shifts. The idea was simple and fairly humbling. Take state-of-the-art models, trained on canonical datasets, and feed them data of the same kind but collected under slightly shifted conditions. Photos taken in different hospitals. Satellite images of regions not in the training set. Clinical texts from new centers. Plankton identification with cameras from another manufacturer. Animal recognition in camera traps from unsampled jungles.
The models suffered substantial performance drops, often of tens of accuracy points, with gaps that varied a great deal across datasets.
They didn't fail at new tasks. They failed at the same task with slightly shifted data. The distribution a model is tested on and the distribution it's trained on almost never coincide in the real world. When they do coincide, benchmarks (standardized tests to measure performance) measure something close to intelligence. When they don't, they measure the luck of the exam resembling the syllabus.
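To see why a single headline number hides this, it helps to score a model separately for each source of data. What follows is a minimal sketch in Python, not the WILDS evaluation code. The function name `accuracy_by_domain`, the hospital labels and the eight predictions are all invented for illustration.

```python
from collections import defaultdict


def accuracy_by_domain(records):
    """Return {domain: accuracy} for (domain, predicted, actual) records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for domain, predicted, actual in records:
        totals[domain] += 1
        hits[domain] += int(predicted == actual)
    return {domain: hits[domain] / totals[domain] for domain in totals}


# Hypothetical toy predictions, made up for this example only.
# hospital_A stands in for the training conditions, hospital_B for shifted ones.
records = [
    ("hospital_A", 1, 1), ("hospital_A", 0, 0),
    ("hospital_A", 1, 1), ("hospital_A", 0, 0),
    ("hospital_B", 1, 0), ("hospital_B", 0, 0),
    ("hospital_B", 1, 1), ("hospital_B", 0, 1),
]

per_domain = accuracy_by_domain(records)
overall = sum(p == a for _, p, a in records) / len(records)
worst = min(per_domain.values())

print("overall accuracy:", overall)
print("per-domain accuracy:", per_domain)
print("worst-domain accuracy:", worst)
```

Pooled, the toy model looks acceptable. Split by hospital, one of the two sits at coin-flip level. Reporting the worst domain alongside the average is one common way to keep the exam from flattering the student.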
The technical name of the collapse
The problem is known as out-of-distribution generalization (that is, how a model behaves when the data it sees in operation is no longer of the same kind as the data it was trained on). The field has spent years charting it. DomainBed, Hendrycks, OpenOOD, Geirhos. Each paper says much the same thing from a different angle. The models don't generalize, they take shortcuts. They learn the cheapest statistical route between input and label, and that route is almost never the causal rule of the phenomenon. If a cow always appears photographed in a meadow, the model learns "green meadow equals cow." Put a cow on a beach and the model no longer recognizes it. The beach is the limit condition. The meadow is the comfort zone.
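The cow-in-the-meadow story can be reproduced in a few lines. The sketch below is a toy under stated assumptions, not any paper's experiment: the 80% reliability of the shape cue, the sample sizes, and the names `make_data` and `fit_stump` are all made up. The "model" is the simplest possible learner, one that keeps whichever single feature best predicts the label in training.

```python
import random


def make_data(n, background_reliability, rng):
    """Build n rows of ((shape, background), label); label 1 means "cow".

    shape agrees with the label 80% of the time (a noisy but real cue).
    background agrees with the label with probability background_reliability.
    """
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shape = label if rng.random() < 0.8 else 1 - label
        background = label if rng.random() < background_reliability else 1 - label
        rows.append(((shape, background), label))
    return rows


def accuracy(feature_index, rows):
    """Accuracy of predicting the label directly from one feature."""
    hits = sum(1 for features, label in rows if features[feature_index] == label)
    return hits / len(rows)


def fit_stump(rows):
    """Keep the single feature with the highest training accuracy."""
    return max(range(2), key=lambda i: accuracy(i, rows))


rng = random.Random(0)
train = make_data(1000, 1.0, rng)       # cows always photographed in meadows
test_in = make_data(1000, 1.0, rng)     # same conditions as training
test_shift = make_data(1000, 0.5, rng)  # background no longer tracks the label

chosen = fit_stump(train)
in_acc = accuracy(chosen, test_in)
shift_acc = accuracy(chosen, test_shift)

print("feature chosen:", ["shape", "background"][chosen])
print("in-distribution accuracy:", in_acc)
print("shifted accuracy:", shift_acc)
```

In training the background is never wrong, so the learner prefers it to the noisier but real shape cue. Once the background stops tracking the label, accuracy falls to roughly chance, even though a learner that had relied on shape would still be right about four times out of five.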
Geirhos and colleagues called it shortcut learning in 2020. The definition is brutal because it leaves half the field exposed. A model that scores ninety-seven percent inside the syllabus and thirty percent outside hasn't learned the task. It has learned to pass the exam. The education system has spent centuries producing the same result in humans and nobody is surprised. When it happens in a neural network, academic debate erupts. Marcus and Davis warned back in Rebooting AI (2019) that performing well on well-chosen benchmarks wasn't the same as understanding the task, and that confusing the two would prove costly. The warning has aged well.
The asymmetry worth looking at head-on
Now go back to the farmer and the honors student. The temptation is to read the scene as an anti-intellectual plea, one of those vindications of folk knowledge against the ivory tower. That's not where it goes.
The farmer isn't superior. The farmer is robust in a narrow, heavily trained distribution. Ask that same farmer to program an Excel macro and he's as stuck as the honors student facing the calving. It's not that he has a different kind of intelligence. It's that his training included thousands of hours of handling animal bodies, mud, cold, objects that don't respond as expected. The generalization came with the hours.
The honors student is also trained. Their training was fifteen years of exams with well-written prompts, problems that have a solution in the expected format and teachers who graded by criteria announced in advance. Inside the syllabus, they perform marvelously. Outside, they cave. Just like a model. The difference between the farmer and the honors student isn't one of intelligence, it's one of training distribution. Academic intelligence and situated intelligence aren't two distinct faculties. They're two ways of having spent one's hours and gathered one's experience.
If this feels comfortable, it's because you haven't yet reached the bottom. The bottom is that many modern professional jobs are exams stretched over time. Spending twenty years in an office running defined processes, with meetings whose format repeats and clients drawn from a bounded repertoire, isn't training for limit conditions. It's training for a distribution. The day that distribution shifts, and it shifts sooner every time, the professional with an impeccable résumé finds that their robustness was an illusion sustained by the stability of the environment. Many people discovered this in March 2020 and most have already forgotten it.
The uncomfortable part
Time to ask what the blog promises not to dodge. If humans generalize better than machines, it's worth specifying where, exactly.
The comfortable answer says a child learns to recognize a cat with three examples and a model needs a million. The less comfortable answer requires looking at the data.
Stanford's 2026 AI Index Report carries a figure worth keeping in mind. Robots that hover around ninety percent success in simulated environments succeed at around twelve percent of real domestic tasks. It sounds devastating for the machine until you remember that many urban humans would also fail if asked to slaughter a pig, shear a sheep or repair the engine of an '84 Lada with what's in the shed. What we do well and what we do badly depends, in either case, on the training we had. Robots fail at physical tasks because their training distribution is poor in real physics. Office workers fail at physical tasks for exactly the same reason.
Where we do still generalize better than a model is in transfer from very few examples. A child can see two foxes in cartoons and recognize a real fox in the woods, even though the cartoon was garish orange and the fox is reddish brown. Models do this worse. But the gap has been closing. Recent multimodal models do few-shot learning (the capacity to generalize from a handful of cases) on visual concepts that five years ago would have been out of reach. The asymmetry exists. What no longer exists is the confidence that it will still be as pronounced three years from now.
Cases where the distribution breaks
Autonomous driving. A car meets a circumstance that wasn't in the data. A plastic bag blowing across the road. An electric scooter lying in the middle of the lane. A truck reversing down a highway. Current commercial systems handle common situations reasonably well and do worse the rarer the situation. So do humans, with the difference that their capacity to improvise with the wheel in hand rests on a causal model of the physical world that the car hasn't yet fully integrated.
Medical diagnosis trained on populations of one country, applied to populations of another. A dermatology model trained on light skin fails on dark skin in a known and published way. A cardiology model trained on one age range fails in another. It's not a moral failing of the model. It's that the model learned the distribution of its training, not the disease. The human doctor, trained on the same distribution, fails similarly if they've never seen the other. The difference is that the doctor, when failing, usually knows they're failing. The model has no such internal warning. The patient pays the difference.
Machine translation of local slang. A system trained on a formal Castilian corpus founders on the speech of a Galician valley or a Caracas neighborhood. Again the same structure. Inside the syllabus it performs, outside it caves. A human translator native to the valley performs much better in their own slang because they're inside their distribution, and worse than the model in slang they don't know. There is no single scale of general intelligence on which to compare them. There are different coverages of the world.
What's left when the difference narrows
The comforting idea until recently was that AI could perform very well inside its syllabus but that robustness outside it was a human preserve. That idea holds up less and less. The latest-generation models generalize better than those of three years ago. And humans, viewed dispassionately, generalize worse than they like to believe. The difference has narrowed from both sides at once. They rise, we come down off the pedestal.
If what distinguishes us is no longer robustness outside the distribution, we have to go looking for the distinctive trait somewhere else.
And the serious candidates are older than current debates tend to admit. The body. Hunger. The fear of dying. The motivation that comes from having something to lose. A farmer reaches into the cow at three in the morning because if he doesn't he loses two animals and won't make it to March. A model loses nothing if it fails. It has no March. It has no cow. It has no village.
That asymmetry isn't secondary. It's probably the real source of human generalization in extreme conditions. We generalize better where we stake the body because evolution, over several million years, selected brains that generalized well when the distribution shifted abruptly. A mammoth where before there were bison. A winter where before there was autumn. A hostile tribe where before there were allies. The selective pressure wasn't in passing the syllabus exam, it was in surviving the unannounced one.
Borrowed robustness and one's own
What this suggests is something the discourse on AI hasn't yet digested. Robust intelligence may not be exactly a cognitive phenomenon. It may be a byproduct of having a finite body that dies if it doesn't generalize well. The models don't have that body. Until they do, their robustness will remain borrowed, dependent on the data we feed them. Ours, while we have it, will be our own. While we have it.
The border has moved. Where the line used to be drawn between the human who generalizes and the machine that memorizes, it's now drawn between two training systems, one biological and one digital, with different robustness properties and ever less distant. The honors student and the farmer remain in their roles. But the model approaches the farmer faster than the honors student approaches the model.
Look at your own training. Count how many hours you've spent inside the comfortable distribution and how many outside. Reckon without cheating what would happen if tomorrow the road were cut off by snow and you had to reach in. The answer doesn't fully describe you. It describes the version of you your training has produced. If you don't like it, you know what's missing. It's not more intelligence. It's more distribution. And the question returns, now with the alibi withdrawn. When your limit condition arrives, what of what you think you know survives the shift, and what falls the way a model's accuracy points fall when the test set is changed?
Definitions
Out-of-distribution generalization. The capacity of a system, biological or artificial, to maintain its performance when the data it encounters in operation comes from a statistical distribution distinct from the one it was trained on. It's the central problem of robustness in machine learning and, by analogy, also of human intelligence when the context changes.
Benchmark. A standardized set of tests used to compare the performance of different models on the same task. Good performance on a benchmark doesn't guarantee good performance in the real world if the benchmark conditions don't reflect the real variety of the data.
Shortcut learning. The pattern by which a model, instead of learning the causal rule of a task, learns surface correlations present in the training data that predict the correct answer more cheaply. It works perfectly inside the syllabus and sinks outside.
Few-shot learning. The capacity to generalize to a new concept from a small number of examples, rather than needing thousands. Traditionally a strength of human cognition against classic deep-learning models, today a frontier the multimodal models are eating into.
References
Koh, P. W. et al., WILDS: A Benchmark of in-the-Wild Distribution Shifts, ICML 2021. The article's main source of the technical framework and origin of the finding that models suffer substantial performance drops when the data domain changes, with gaps that vary by the evaluated set. arXiv:2012.07421.
Hendrycks, D. et al., The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization, ICCV 2021. A reference cited in the body when enumerating the line of critical work on out-of-distribution robustness. arXiv:2006.16241.
Geirhos, R. et al., Shortcut Learning in Deep Neural Networks, Nature Machine Intelligence 2 (2020), 665–673. Source of the notion of shortcut learning used in the first block of argument. arXiv:2004.07780.
Gulrajani, I. and Lopez-Paz, D., In Search of Lost Domain Generalization, ICLR 2021. Origin of the DomainBed benchmark mentioned when enumerating the line of research on cross-domain generalization. arXiv:2007.01434.
Stanford HAI, AI Index Report 2026. Source of the cited figure on robots' success rate at real domestic tasks against their performance on digital tests. hai.stanford.edu/ai-index/2026-ai-index-report.
Marcus, G. and Davis, E., Rebooting AI, Pantheon, 2019. The underlying critique of the confusion between in-distribution performance and genuine understanding that appears as a backdrop to the argument on shortcuts.
Further reading
Russell, S. (2019). Human Compatible. Artificial Intelligence and the Problem of Control. Viking. A framework on robustness and alignment, useful for framing the problem of shortcuts and out-of-distribution generalization.
Marr, D. (1982). Vision. A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman. A classic study on the architecture of the human visual system and on why it generalizes where artificial models fail.