States of the mind. No test measures the same person twice

Take an intelligence test tomorrow at nine, after eight hours of sleep and a calm breakfast. Take the same test on Thursday at six, hungry, after a row with your partner, a bad night's sleep and two cups of coffee. Both scores would carry your name. Neither describes you. They describe a person who at that instant happened to share your ID card.

Cognitive psychology has known this for decades. Public opinion has ignored it for decades. People talk about IQ as if it were a number carved in bone, about decision-making capacity as if it were a trait, about emotional maturity as if it were measured on a scale. What gets measured is a temporal slice of a system that never stops moving. Different slices, different people, same ID card.

The body decides before you do

In 1994, Antoine Bechara, Antonio Damasio and colleagues published an experiment that should have changed the public conversation about intelligence and didn't. In the Iowa Gambling Task (a laboratory protocol in which the subject draws cards from four decks to win or lose fictional money), two decks paid out high rewards but even higher penalties in the long run. The other two gave modest rewards and small penalties, and in the long run produced profit. Healthy subjects learned within a few dozen plays to avoid the bad decks. Even before they could explain why, they were already choosing well. The body got there first: the skin sweated as the hand reached toward the losing deck, before consciousness formulated the warning.
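The arithmetic behind "good" and "bad" decks can be sketched in a few lines. The figures below follow the payoff schedule commonly reported for the original task (100 per card against 1,250 in penalties per ten cards for the bad decks, 50 per card against 250 for the good ones); the article itself gives no numbers, so treat them as an assumption for illustration.

```python
# Payoff schedule per block of 10 cards. These figures follow the commonly
# reported design of the task; they are an assumption, not taken from the article.
DECKS = {
    "A": {"reward_per_card": 100, "penalties_per_10": 1250},
    "B": {"reward_per_card": 100, "penalties_per_10": 1250},
    "C": {"reward_per_card": 50, "penalties_per_10": 250},
    "D": {"reward_per_card": 50, "penalties_per_10": 250},
}


def net_per_10_cards(deck: dict) -> int:
    """Net result of drawing 10 cards from a single deck."""
    return 10 * deck["reward_per_card"] - deck["penalties_per_10"]


nets = {name: net_per_10_cards(deck) for name, deck in DECKS.items()}

for name, net in nets.items():
    print(f"deck {name}: {net:+d} per 10 cards")
```

The tempting decks pay double on every card and still lose over any ten draws, which is why the task separates knowing the rule from feeling it.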

Patients with damage to the ventromedial prefrontal cortex (the region of the frontal lobe that sits over the eye sockets and connects emotion with decision) didn't learn. They knew they were losing money. If asked, they said so without hesitation. But they kept choosing the bad decks. They had the calculation intact and the lever that turns calculation into visceral aversion broken.

Damasio called it the somatic marker (the bodily signal, such as a pinch in the stomach, a sweat or a racing heart, that tags each option with an emotional weight before reasoning finishes articulating itself). Without that signal, the reasoning keeps running like an engine spinning in neutral. You decide, but you don't decide well. A decade later, in 2005, Bechara and Damasio gave the hypothesis its mature formulation, applied to the economics of decision: what breaks in those patients isn't only a card game, it's the capacity to weigh contracts, offers, purchases. Visceral intuition isn't an emotional luxury; it's part of the machinery we call rationality.

There are serious critiques of the hypothesis. Maia and McClelland showed in 2004 that the subjects were more aware of the winning strategy than Damasio had assumed, and that part of the effect can be explained without recourse to the body. The technical discussion is still open. What isn't open is the clinical fact that the work laid bare: there are brains that know what's good for them and still don't choose it. Abstract intelligence and sensible conduct can dissociate. Something more is needed, something that comes from the body, and that something changes every hour of the day.

Sapolsky goes through it with obsessive patience in Behave: every human decision, he says, has to be read on three scales at once. What happened a second ago in the nervous system, what happened this week in the hormones, what happened twenty years ago in development. The subject in front of you isn't a fixed point. It's the instantaneous intersection of three biological clocks that don't share a rhythm.

Judges, glucose, sentences

In 2011, Danziger, Levav and Avnaim-Pesso published in PNAS a study on Israeli parole boards. One thousand one hundred and twelve rulings, eight judges, ten months. The finding was brutal: the probability that a prisoner obtained parole fell from about sixty-five percent to near zero as the session went on, and returned to about sixty-five percent right after each meal break. Same judge, same type of file, same offense. Different glucose, different fatigue, different sentence.

The interpretation has had rebuttals. Some attribute part of the effect to the order in which cases are called, and there are debatable methodological nuances. The core still stands: decision fatigue (the cognitive wear that appears after making consecutive decisions and pushes the decider toward the default option) takes a judge out of the file and leads him to automatism, which in a parole board means denying.

This is what gets measured when one believes one is measuring a magistrate's legal judgment, a voter's moral discernment at seven, a child's reading ability at a quarter to two, a candidate's professional aptitude after three hours of interviews. What gets measured is someone who at that instant is a specific version of themselves, crossed by concrete hormones, concrete sugar levels, a concrete pending conversation. No test measures the same person twice because that person doesn't exist twice.

Psychometrics knows this and says it quietly

The manuals record it as test-retest reliability (the correlation between the results a single subject obtains on taking the same test twice in close succession). If you give the same test to the same subject in two nearby sessions, how alike are the results? The answer, for many tests considered rigorous, hovers around coefficients between 0.7 and 0.9. It sounds high. It means a non-negligible proportion of the result depends on when you happened to take it. In personality, creativity or attitude tests, the coefficients drop further. In unstructured interviews, they collapse.
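To put a number on "non-negligible", here is a minimal sketch under classical test theory, where the reliability coefficient is read as the share of score variance that is stable and the standard error of measurement is SD × sqrt(1 − r). The scale with a standard deviation of 15 points is an assumption chosen because it is the usual IQ convention; the function names are illustrative.

```python
import math


def unstable_share(reliability: float) -> float:
    """Share of score variance not attributable to the stable trait."""
    return 1.0 - reliability


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Typical distance between an observed score and the stable score."""
    return sd * math.sqrt(1.0 - reliability)


SCALE_SD = 15.0  # assumption: a scale with SD 15, as in most IQ scoring

for r in (0.7, 0.8, 0.9):
    sem = standard_error_of_measurement(SCALE_SD, r)
    print(
        f"r = {r:.1f}: {unstable_share(r):.0%} of variance is not stable, "
        f"typical error about {sem:.1f} points"
    )
```

On these assumptions, even a coefficient of 0.9 leaves a typical error of nearly five points, and 0.7 leaves more than eight.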

Kahneman, whose Thinking, Fast and Slow is a portrait of all this drift, put it in a useful way: the fast system (the one that decides almost everything all day) doesn't operate with information about the problem, it operates with the most easily available information. If the most easily available is what the body feels at that moment, that's what goes into the answer. What comes fastest and closest to hand is dictated not by the problem but by the state you're in when you're asked. That's why a test doesn't return a trait, it returns an intersection between the trait and the state.

The industry that lives off evaluating people knows these numbers. The psychological practices that issue reports for courts know them. Almost nobody tells the client. A report that asserts sells better than one that adds a qualifier: the same assessment, repeated next month with the subject in another state, could shift by fifteen percent. The market of cognitive evaluation is built on faking a stability the evaluated object doesn't have.

In education the same happens. A final exam decides a yearly grade with the data of one morning. A university entrance exam decides four years with the data of a few hours. It's known that a bad breakfast shifts a few points, that test anxiety shifts more, that the time of day shifts too. People keep evaluating this way because the alternative, evaluating at many moments and averaging, is expensive and nobody funds it. The cheap wins over the truthful, and then the result is cited as if it measured the subject.

The machine that always performs the same

And then artificial intelligence arrives and everyone starts comparing. The model always gives an answer. The model answers at three in the morning the same as at eleven. It doesn't get tired. It isn't hungry. Its partner hasn't just left it. The journalistic conclusion is obvious: the machine is more reliable. The machine, they say, doesn't fluctuate. There are even those who present it as the great advantage: at last an evaluator not conditioned by glucose.

Here we have to stop. The claim is false, and the falseness is interesting because it reveals what's meant by fluctuation.

A language model fluctuates. The sampling temperature (the parameter that controls how much randomness is introduced when choosing each next word: at zero the model always picks the most probable option, while raising it allows less probable alternatives) changes the output. The prompt (the input instruction given to the model) changes the output. The order of the messages changes the output. The version of the model changes the output. Two identical requests, minutes apart, can return different answers, sometimes contradictory ones. The industry knows this and manages it with seeds (fixed values that serve to reproduce the same random result twice) and repeated evaluation. The fluctuation exists and is important. What it doesn't do is resemble the human kind.
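How temperature and seeds interact can be shown with a toy sampler. This is a minimal sketch, not any real model's API: the three-word vocabulary, its scores and the function names are invented for illustration, and the only mechanism shown is temperature-scaled softmax sampling driven by a seeded random generator.

```python
import math
import random

# Hypothetical scores a model might assign to three candidate next words.
LOGITS = {"yes": 2.0, "maybe": 1.0, "no": 0.5}


def sample_word(logits: dict, temperature: float, rng: random.Random) -> str:
    """Pick one word; temperature 0 means always the top-scoring word."""
    if temperature == 0:
        return max(logits, key=logits.get)
    # Divide scores by the temperature, then apply a numerically stable softmax.
    scaled = {word: score / temperature for word, score in logits.items()}
    top = max(scaled.values())
    weights = [math.exp(value - top) for value in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


def run(temperature: float, seed: int, n: int = 20) -> list:
    """Draw n words using a generator initialised with the given seed."""
    rng = random.Random(seed)
    return [sample_word(LOGITS, temperature, rng) for _ in range(n)]


print(run(temperature=0, seed=1))     # the same word every time
print(run(temperature=1.0, seed=42))  # varied, but repeatable with seed 42
print(run(temperature=1.0, seed=42) == run(temperature=1.0, seed=42))
```

In this toy setting the same seed always reproduces the same sequence. In deployed systems, other sources of variation (hardware, batching, model updates) can still get in the way, which is part of why repeated evaluation is used alongside seeds.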

Two fluctuations with the same name

When a human fluctuates, they fluctuate from hunger, fear, shame, falling in love, exhaustion. Each state carries information about the world and about oneself. Hunger says you've been concentrating for six hours. Fear says something has moved. Shame says you've crossed a line in front of someone. They aren't noise in the calculation: they're part of the calculation. They're the somatic markers in action, tagging each decision with a weight that comes from the body and not from the concept.

When a model fluctuates, it fluctuates because someone touched a parameter or because a random sampling fell one way or another. That variation encodes nothing about the world. It isn't information, it's deviation. Comparing the two fluctuations is like comparing a hand tremor from coffee with a tremor from Parkinson's: superficially they look the same, but they're distinct phenomena.

Can an AI get drunk?

The question isn't a joke. It's a diagnostic test of the confusion between human fluctuation and machine fluctuation. Drunk doesn't mean fluctuating. It means having the prefrontal cortex partly sedated, inhibitory control lowered, risk valuation distorted, motor coordination impaired, working memory shrunk, the somatic marker malfunctioning. It's an integral state of the organism. It modifies, at once, decision, perception, motor control, language and emotion because all those systems share a chemical substrate.

An AI has no chemical substrate. It has no prefrontal cortex to sedate. It has no risk of its own to weigh. You can raise its sampling temperature and it will produce more erratic sentences, but that erraticness touches no valuation of danger because there's no danger of its own to value. You can instruct it to imitate a drunk and it will imitate one, but the imitation is theater, not an internal state. There's no chemistry. There are vectors, weights and an inference process that happens or doesn't happen, with no biological gradations.

Here's the difference that matters, and not in the sense you'd expect. It's not that the human is superior because they can get drunk. It's that human fluctuation, including drunkenness, including tiredness, including fear, is what enables rectification. The one who decided badly while tired remembers it the next day rested and revises. The one who said too much while drunk feels shame in the morning and modulates. The one who judged while hungry can, if they have honesty, reread their judgment well-fed and qualify it. Those states produce reflexivity. They produce backtracking. They produce the possibility of saying I was wrong because back then I was someone else.

An AI has no such lever. It gives each answer in the same state, which is no state. There's no afterward in which it returns to an answer with another mood. You can feed it a new prompt telling it it was wrong and it will rewrite, but it isn't the same. Human rectification comes from within, pushed by an internal state that wasn't there when the error was made. An AI's rectification comes from outside, because inside there's nothing pushing.

What's gained when the unstable is lost

There are tasks for which the machine's always-the-same is a clear advantage. Adding long numbers. Finding patterns in databases. Summarizing documents to a template. Tasks where human fluctuation is pure noise, where there's no information in the operator's tiredness, where you want the same procedure applied a million times. For that, machines.

There are other tasks for which the always-the-same is a defect, and here's the trap. Judging a person. Accompanying the sick. Deciding whether a relationship deserves another chance. Educating a child. Coming to terms with an adversary. They don't ask for a procedure applied identically each time. They ask for someone who can doubt, backtrack, tire, feel shame, change their mind without being asked. They ask, exactly, for what human fluctuation makes possible.

Where a human suits best is where their fluctuation isn't noise but method. Where the system needs the evaluator to be able to return to their evaluation with another face, another glucose, another level of fury or calm. The traceability of human doubt is what the machine's always-the-same can't offer. Not because it doesn't want to. Because it can't get drunk.

Before you believe it

When someone tells you an automated system is better than a human because it always performs the same, ask yourself what's being measured. If it's calculation speed and procedural consistency, they're probably right. If it's the capacity for judgment in a matter where tomorrow information may appear that changes the sense of the decision, no. They have a cheaper system, not a better one. The difference has turned invisible, and that's where you have to stop and look.

They're going to evaluate you again next week. You're going to be a slightly different person. The score will carry your name. It will describe someone else.

Definitions

Iowa Gambling Task. An experimental task designed by Bechara, Damasio and collaborators in 1994 to study decision-making under uncertainty. The subject draws cards from four decks with different gain-and-penalty profiles, and the researchers measure both their choices and the physiological responses preceding each choice.

Somatic marker. A bodily signal —a change in sweating, in heart rate, a gut feeling— that emotionally tags the available options before conscious deliberation concludes. By Damasio's hypothesis, without that signal abstract reasoning doesn't produce sensible decisions.

Ventromedial prefrontal cortex. A region of the front part of the brain located over the eye sockets, which integrates emotional information and puts it at the service of decision. Its lesion leaves verbal calculation intact but ruins practical judgment.

Decision fatigue. Cognitive wear that appears after a succession of decisions and pushes the decider toward default responses or toward avoiding deciding.

Test-retest reliability. A statistical measure of how much the results of a single test administered to the same subject at two nearby moments coincide. If reliability is low, the test measures the moment more than the subject.

Sampling temperature. A technical parameter in language models that regulates the degree of randomness in the choice of each next word. At temperature zero the model always takes the most probable option; raising it allows less expected outputs.

Prompt. The input instruction given to a language model. The same question phrased with different prompts can produce different answers.

Seed. An initial value that fixes the sequence of random numbers a model uses, so that two runs with the same seed, the same prompt and the same settings are expected to produce the same output.

References

Bechara, A., Damasio, A. R., Damasio, H. and Anderson, S. W. (1994). Insensitivity to future consequences following damage to human prefrontal cortex. Cognition, 50, 7–15. Origin of the Iowa Gambling Task and of the experiment with ventromedial-prefrontal-damaged patients cited in the section on the somatic marker.

Damasio, A. R. (1994). Descartes' Error. Emotion, Reason, and the Human Brain. Putnam. The original exposition of the somatic marker hypothesis, basis of the section "The body decides before you do."

Bechara, A. and Damasio, A. R. (2005). The Somatic Marker Hypothesis. A neural theory of economic decision. Games and Economic Behavior, 52, 336–372. The mature version of the hypothesis. Available at https://web.stanford.edu/~jlmcc/papers/BecharaEtAl05_TiCS.pdf.

Maia, T. V. and McClelland, J. L. (2004). A reexamination of the evidence for the somatic marker hypothesis. PNAS, 101, 16075–16080. A methodological critique of the original Bechara and Damasio work, cited when qualifying the scope of the hypothesis.

Danziger, S., Levav, J. and Avnaim-Pesso, L. (2011). Extraneous Factors in Judicial Decisions. PNAS, 108, 6889–6892. The study on Israeli judges, source of the data cited in "Judges, glucose, sentences."

Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux. The general framework on the dependence of judgment on the decider's state, present as a backdrop in the sections on psychometrics and decision fatigue.

Sapolsky, R. (2017). Behave. The Biology of Humans at Our Best and Worst. Penguin Press. A reference for the idea that biological state determines conduct, running across the article.

Further reading

LeDoux, J. (1996). The Emotional Brain. The Mysterious Underpinnings of Emotional Life. Simon & Schuster. The neural substrate of emotion and its effect on cognition, useful for extending the argument on internal fluctuation.

#intelligence #benchmarks #anthropomorphism #reasoning