Recognition as the basis of everything. What happens before you know you've thought

Before you think, you recognize. The sentence sounds obvious until you put it to work. Most of what we call intelligence happens on top of a silent layer that neither public discourse nor AI marketing wants to name with precision. And that silent layer is, in essence, the same operation in your head as in a neural network. What changes is what's on top.

When you walk into a room, you don't reason that the figure at the back is your mother. You recognize her before you can formulate a single word about it. When you see a half-black pear on the table, you don't calculate the degree of oxidation. You recognize it as a pear, and recognize it as a pear-no-longer-to-be-eaten. When a crack sounds behind you, you don't deliberate. The body turns before language has had time to assemble a hypothesis.

All later thought is built on top of that layer. Without prior recognition there's nothing to reason about. The cognitive substrate isn't language or logic or working memory. It's that instantaneous, probabilistic identification that happens before you believe you've registered anything.

What Marr made clear in 1982

David Marr died in 1980, at thirty-five, leaving a half-finished book that was published posthumously in 1982 under the title Vision. He turned the study of visual perception into a serious computational discipline. Marr proposed that any system that recognizes something has to be analyzed on three distinct levels, and that confusing them is the most expensive error you can make if you want to understand how it works.

The computational level asks what problem the system is solving and why. Recognizing a face, for instance, is a problem with hard constraints: the face may be backlit, turned, aged, partly covered, and even so the system must say "yes, it's that person." The algorithmic level asks how it's solved, which intermediate representations are used, which steps are followed. The physical level, which Marr called the level of hardware implementation, asks on what support the algorithm runs. Neurons, transistors, whatever.

What was important about Marr wasn't inventing the three levels. It was forcing those who worked on perception not to answer the wrong question. Neurobiologists were answering computational questions with physical descriptions, and psychologists were answering physical questions with algorithmic ones. Marr cut the knot. If you want to understand recognition, separate the levels.

That discipline is still useful today. The general press mixes the three levels every time it talks about artificial intelligence. To say a neural network "works like a brain" is to confuse the physical level with the algorithmic level. To say it "understands" because it gets it right is to confuse the algorithm with the problem. The error Marr warned against is alive, only now it's committed with deep networks instead of visual cortex.

Biederman and the geons

Five years after Vision, Irving Biederman published in Psychological Review an article titled "Recognition-by-Components." The thesis was unsettling in its simplicity. We recognize objects by decomposing them into a few elementary volumetric shapes, called geons (from "geometric ions": basic pieces with which any recognizable object is built), and reassembling them in our heads.

A geon is a cylinder, a wedge, a block, a cone, a torus. Biederman proposed an alphabet of about thirty-six, derived from invariant properties that projective geometry preserves when an object changes angle or lighting. If an edge is straight, it stays straight from almost any viewpoint. If a surface is flat, it stays flat. If two volumes are in contact at a joint, that joint is informative. With that alphabet and a few assembly rules, by Biederman's account, you could recognize a plane, a cup, a dog or a crane.

The theory doesn't fully hold up. Forty years of later research have qualified it to the point of leaving it incomplete for many tasks, above all for face recognition, which is clearly something else. Tarr and Vuong, in a classic review of the state of the art after Marr and Biederman, laid out the cracks in the theory in an orderly way: recognition depends on viewpoint more than the geon alphabet allowed, and faces live in a separate module. But the model left two ideas that haven't budged. The first, that recognizing is decomposing and reassembling, not comparing whole images. The second, that the features a recognition system uses are edges, angles, contacts, groupings, spatial frequencies. Not concepts. Not intentions. Features.

Anyone who has looked inside the early layers of a convolutional network (a type of artificial neural network that processes images by applying local filters across the visual field) recognizes the landscape. Detectors of edges, of orientations, of color blobs, of local contrasts. Higher up, combinations that look suspiciously like rudimentary geons. The network didn't copy Biederman. It reached the same point because the computational problem has a geometry that pushes almost any solution toward the same primitives. LeCun, Bengio and Hinton documented in Nature in 2015 that emergent hierarchy of features, by now a commonplace: each deeper layer combines the previous one, and at the end of the journey prototypes resembling objects appear without anyone having programmed them by hand. Hawkins, in On Intelligence, had anticipated the principle a decade earlier from neuroscience: a cortex that predicts hierarchically, layer by layer, without distinguishing too much between perception and prediction. The idea was opened from the side of neuroscience, not from engineering. The engineering came after.
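To make "detectors of edges" concrete, here is a minimal sketch in plain Python. It assumes a tiny grayscale image stored as a list of lists and a hand-written Sobel-style filter; the names `SOBEL_X` and `correlate2d_valid` are illustrative, not taken from any library. In a trained network the filters are learned from data, not written by hand, and the sliding operation shown is the cross-correlation that deep learning frameworks usually call convolution.

```python
# Hypothetical illustration: a fixed 3x3 filter that responds to vertical edges.
# Real convolutional networks learn filters like this one during training.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 4x6 image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

response = correlate2d_valid(image, SOBEL_X)
for row in response:
    print(row)  # each row is [0, 4, 4, 0]
```

The response is zero over the flat regions and positive only where the window straddles the dark-to-bright boundary. That is all an "edge detector" is at this level: a local weighted sum that stays silent on uniform input.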

Mechanism the same, substrate distinct

That's the uncomfortable moment. If the features that count are edges, angles and groupings both in the primary visual cortex and in the early layers of a network, the difference between human recognition and machine recognition isn't one of mechanism. It's one of substrate. The flesh processes at low speed, massively in parallel, with thermal and chemical noise. The silicon processes at high speed, also massively in parallel, with a different, quantized kind of noise. Both systems extract statistics from the input (whatever arrives through the senses or the sensors) and compress it into representations that admit classification.

We've spent decades avoiding saying it that way. The metaphor of the mind as a computer became popular in the sixties, was discarded in the nineties as too poor, and came back in disguise in the 2000s. The problem is that, badly used, it hid two things at once. It hid that the basis of recognition is genuinely the same in mind and machine, because saying so seemed to demote the human. And it hid that what differentiates the human isn't in the recognition but in what's mounted on top, because saying so required spelling out exactly what that is.

Forty years of public discourse have preferred to keep the upper zone vague. The human "understands," "has sense," "lives." It looks good. It works for inaugural speeches. It explains nothing.

The probabilistic part you don't want to recognize

If you believe your recognition is deterministic, you haven't paid attention to it. You cross the street, see someone from behind, wave thinking it's your cousin. They turn and it's a stranger. You apologize. The scene is banal and that's why it's perfect. Your system has just worked exactly like a badly calibrated neural network. Back, haircut, gait, brown jacket. Your brain estimated a high probability and turned it into subjective certainty. It erred. Not because it's defective, but because that's how the system works.

The same happens with hearing. You believe you hear a threat and it turns out it was the neighbor's air conditioning. The same with language. You read irony where there was none, or miss it where there was. The same with faces. You confuse two office colleagues the first three days, until the system accumulates enough examples to discriminate. Your recognition is statistical to the marrow, and it operates by prior, evidence and likelihood, exactly like a Bayesian classifier (an algorithm that combines what it knew before with new evidence to decide what's most probable). The difference is that the brain corrects the error with more fluency and at greater affective cost. The nature of the operation is the same.
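The cousin-on-the-street scene can be put into numbers with Bayes' rule. What follows is a sketch under stated assumptions: every probability is invented for illustration, and the function name `posterior` is hypothetical. It is not a claim about how the brain actually computes, only about what combining a prior with evidence looks like.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule for a yes/no hypothesis: P(hypothesis | cues)."""
    # Total probability of seeing these cues at all (the "evidence" term).
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence


# All numbers below are invented for illustration.
prior_cousin = 0.30         # you're on the street where your cousin lives
cues_given_cousin = 0.90    # back, haircut, gait, brown jacket all match
cues_given_stranger = 0.05  # a stranger would rarely match all four

p = posterior(prior_cousin, cues_given_cousin, cues_given_stranger)
print(round(p, 3))  # 0.885
```

Under these made-up numbers the estimate is high but not certain: a little more than one wave in ten goes to a stranger. The error in the scene isn't in the arithmetic. It's in rounding 0.885 up to 1.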

There's a phenomenon that illustrates this without need of a laboratory. Pareidolia. The face in the toast. The Virgin in the damp stain. Your face-detection system has a deliberately low threshold, because evolutionarily it's cheaper to detect faces where there are none than to miss them where there are. The price is false positives. A neural network trained with the same cost asymmetry does the same. Current generative models produce false positives all the time and the press calls them "hallucinations," an unfortunate word that hides what's happening. The system is recognizing according to the best available candidate in its prior, with or without sufficient evidence. The same thing your brain does at four in the morning when you believe you see a silhouette at the bedroom door.
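The "deliberately low threshold" follows from simple expected-cost arithmetic. Here is a minimal sketch, assuming a detector that chooses whichever answer has the lower expected cost; the costs, the probability and the function names are all invented for illustration.

```python
def detection_threshold(cost_false_alarm, cost_miss):
    """Probability above which saying "face" has the lower expected cost."""
    # Say "face" when p * cost_miss > (1 - p) * cost_false_alarm,
    # which rearranges to p > cost_false_alarm / (cost_false_alarm + cost_miss).
    return cost_false_alarm / (cost_false_alarm + cost_miss)


def says_face(p_face, cost_false_alarm, cost_miss):
    """Decide "face" whenever the estimated probability clears the threshold."""
    return p_face > detection_threshold(cost_false_alarm, cost_miss)


p_toast = 0.10  # invented: weak evidence of a face in the burn marks

print(says_face(p_toast, cost_false_alarm=1, cost_miss=1))   # False: threshold is 0.5
print(says_face(p_toast, cost_false_alarm=1, cost_miss=19))  # True: threshold is 0.05
```

With symmetric costs the toast stays toast. Make a missed face nineteen times more expensive than a false alarm and the same weak evidence is enough to see a face. Nothing in the detector changed except what errors cost it.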

What's left on top of the substrate

If the base is the same, what distinguishes the human must be on top. But "on top" is a lazy word. We have to be more concrete. The layer current AI doesn't have, or has very badly, isn't a single layer. It's at least three different things, and it's worth not mixing them.

First, embodied context. Human recognition never happens in the abstract. It happens in a body that's hungry, hot, afraid, needing to pee, with a mortgage, a recent disappointment and a grandmother who's dying. Those states affect the prior. You recognize faster what it suits you to recognize and filter what it suits you to filter. Damasio pressed this point hard in Descartes' Error. The body isn't noise on top of cognition. It's the machinery that decides what counts as relevant. A neural network with no body recognizes the face but has no reason for the face to matter to it more than the background.

Second, intention. When you recognize something, that recognition enters a flow where you're doing or wanting to do something. You see the pear and you want to eat it, or throw it out, or paint it in oils. The recognition couples to a plan. Current AI recognizes without coupling to anything, except the next label a human has asked it to produce. Intention isn't a metaphysical whim. It's the difference between a system that classifies and one that acts with a sense of its own.

Third, causal understanding. When you recognize a cow, you implicitly know it has a digestive apparatus inside, that it can kick, that it was raised, that it's going to die, that it gives milk that goes into a carton sold in a supermarket where you shop. That whole causal network travels with the recognition even if you don't think it in the moment. Hofstadter, and several philosophers before him, pointed it out. Recognizing isn't labeling. It's entering a network of implications that updates with each perceptual act. Current networks have pieces of that causal network in their weights, but not stably, not accessible to reasoning and, above all, not anchored in real consequences for the system. Marcus and Davis have spent years repeating it in Rebooting AI without anyone quite listening: classifying well isn't understanding, and confusing the one with the other is the expensive error the industry keeps charging to the end user.

What's left when the three layers are named

Those three things, embodied context, intention and causal understanding, are what's missing on top. That's why a current AI seems intelligent when it recognizes well and fails spectacularly when it has to decide what to do with what's recognized. The lower layer is fine. The upper one is empty or decorated with simulations of the lower one, which is what language models (statistical systems that predict the next word of a text based on patterns learned from large amounts of previous text) do when they produce convincing explanations without having understood anything.

Why understanding this changes how you read both

When you admit that human recognition and machine recognition are operations of the same kind, two cheap surprises fall to the floor. You stop being surprised that AI recognizes your face in a blurry photo, because that's what its mechanism is made to do and it does it with less noise than you in many cases. And you stop being surprised that it fails at inferring what the other is thinking, at adjusting its answer to your mood, at detecting that it's talking with someone drunk at three in the morning. Those are the upper layers, and it doesn't have them.

The people who sell AI insist on talking about the base as if it were the hard part. The hard part of the base is already done. It has been reasonably well solved for decades, and better every year. What's left to do, embodied context, intention of its own, anchored causal understanding, is exactly the block public discourse has been avoiding naming, precisely because it doesn't sell well in a three-minute presentation. It sells better to say the next version will reason better than to admit that the next version still recognizes well but doesn't know what to do with what it recognizes.

And the people who defend human exceptionalism insist on talking about the base as if it were the place where our advantage lies. It isn't. Our advantage, while it lasts, is elsewhere. You recognize like a probabilistic machine, with your false positives and your pareidolia. What makes you different is that that recognition reaches a body with something at stake. When you stop having something at stake, or when other systems acquire an equivalent body with equivalent consequences, the border will move again and you'll have to ask yourself once more where exactly you were.

Look around you now. Recognize the first object you see. You recognized it before reading the end of this sentence. Ask yourself what happened in that instant. If the honest answer is "I don't know," you're closer to understanding the problem than most of those who write about intelligence.

Definitions

Geon. An elementary volumetric shape proposed by Biederman as a basic piece of object recognition. Cylinders, wedges, blocks, cones and the like. The idea is that any recognizable object decomposes into a small combination of geons and the relations among them.

Cognitive substrate. The layer of perceptual processing prior to articulated thought, where stimuli are identified before language or explicit reasoning intervenes. In the article it's used to name the base common to humans and to artificial recognition systems.

Convolutional network. A type of artificial neural network designed to process images. It applies repeated local filters across the visual field and builds increasingly abstract representations in successive layers. It's the architecture that has dominated artificial visual recognition since 2012.

Bayesian classifier. An algorithm that estimates the probability of a hypothesis by combining a prior belief (the prior) with the available evidence. It serves here as a formal model of what the brain does when it recognizes with incomplete information.

Causal understanding. The capacity of a system to represent not only what an object is but what it does, what's done to it, what chains of consequences follow from its presence. It's one of the three components that, by the article's account, current AI doesn't have stably.

Language model. A statistical system that predicts the next word of a text from patterns learned in large amounts of previous text. Current language models produce plausible responses without any need for real understanding of the content.

References

Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman. The foundational framework of the three levels of cognitive analysis discussed in the article's first block.

Biederman, I. (1987). Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review 94, 115-147. PDF available at https://people.csail.mit.edu/torralba/courses/6.870/papers/Biederman_RBC_1987.pdf. Origin of the geon theory, the axis of the second block.

Tarr, M. J. and Vuong, Q. C. (2002). Visual Object Recognition. PDF at https://www.staff.ncl.ac.uk/q.c.vuong/pdfs/TarrVuong2002.pdf. A gathering of the state of the art after Marr and Biederman, used to point out the nuances and limitations of the original theory.

LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep Learning. Nature 521, 436-444. A reference for the continuity between features extracted by deep networks and the perceptual primitives described by Biederman.

Hofstadter, D. (1979). Gödel, Escher, Bach. Basic Books. The backdrop to the argument that recognizing isn't labeling but entering a network of implications.

Hawkins, J. (2004). On Intelligence. Times Books. A model of hierarchical predictive recognition consistent with the idea of a common perceptual substrate.

Damasio, A. (1994). Descartes' Error. Putnam. Cited directly in the block on embodied context to hold that the body decides what counts as relevant in recognition.

Marcus, G. and Davis, E. (2019). Rebooting AI. Pantheon. A critique of the overlap between recognition and understanding that appears as a backdrop in the article's last block.
