Why it makes things up: hallucination

In E001 I left a point half made: a chat can sound just as confident when it's right as when it's making things up. Here I stay with that failure, give it a name and, above all, draw from it a habit that has saved me a lot of grief. Once I understood where it comes from, I stopped trusting the tone and started trusting the check.

This step assumes what you already saw: that the model generates the most likely next piece of text and that its parameters are patterns, not a database of truths. Everything that follows is built on that.

A name for a specific failure

Imagine you ask your chat about Napoleon and it answers, without batting an eye, that he was a centurion in the service of Julius Caesar. It says so with total poise, in polished sentences, as if it had read it in a book. The example isn't absurd; it's exactly the kind of thing that can happen depending on how you set up the conversation. That phenomenon is called hallucination: the model confidently asserts something false.

A quote that was never written, a book that doesn't exist, a changed date, an invented law. The form is impeccable; the content, a lie. And one idea is worth stating right away: this isn't a rare breakdown that only happens to a faulty model. It happens to all of them, the best ones included, because it comes from the same mechanism that lets them get things right. It isn't a side effect you can clean away entirely; it's the other face of how the thing works.

Where it comes from: neither magic nor breakdown

Recall E001. The model doesn't open a drawer with the correct answer: it picks the most likely chunk of text that comes next, one after another. When you ask it about something that appeared very often and very clearly in the text it was trained on, the pattern is extremely strong and it gets it right. But when you ask it about something it barely saw, or never saw, the pattern isn't there, and even so it has to keep putting down words.

So it does the only thing it knows how to do: fill the gap with whatever looks like it belongs there. If the most plausible continuation sounds good but isn't true, it writes it anyway, because it isn't measuring truth, it's measuring plausibility. This is the idea I found hardest to swallow, so I'll say it bluntly: the model doesn't distinguish what it knows from what merely fits. For the model they're the same operation. Inside there isn't a drawer of truths on one side and a drawer of inventions on the other; there's a single probability calculation that sometimes lands on a real fact and other times on a fabricated one.
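A toy sketch of that single operation may help. The two tables below are invented numbers, not taken from any real model: one stands for a pattern seen very often, the other for a gap where nothing stands out. The names `pick_next` and `answer` are hypothetical. Real systems usually sample from the probabilities instead of always taking the top candidate, but the point survives: the same step runs in both cases, and the number that would reveal the difference never reaches the sentence.

```python
# Toy illustration only: the probabilities are made up, not from a real model.

# A prompt the training text covered often and clearly: one continuation dominates.
strong_pattern = {"Paris": 0.97, "Lyon": 0.02, "Rome": 0.01}

# A prompt the training text barely covered: nothing dominates,
# yet something still has to be written.
weak_pattern = {"1987": 0.28, "1992": 0.26, "1979": 0.24, "2001": 0.22}


def pick_next(candidates: dict[str, float]) -> str:
    """Return the most probable continuation; the probability itself is discarded."""
    return max(candidates, key=candidates.get)


def answer(prefix: str, candidates: dict[str, float]) -> str:
    """Build the sentence the reader sees. Nothing in it shows how strong the pattern was."""
    return f"{prefix} {pick_next(candidates)}."


print(answer("The capital of France is", strong_pattern))
print(answer("The book was first published in", weak_pattern))
```

Both printed sentences end in a full stop and read equally firm, even though the second one rests on a 28% candidate that barely edges out three others.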

Why it prefers to gamble over staying quiet

Here comes a nuance that cleared things up for me, from a recent OpenAI paper titled Why Language Models Hallucinate (Kalai, Nachum, Vempala and others, 2025). The question they ask is a good one: if the model isn't sure, why doesn't it simply say "I don't know"? The answer points to how it's trained and tested.

During its development, the model is measured with tests that work like a multiple-choice exam. In a test like that, leaving an answer blank earns you a sure zero, while gambling on an answer gives you at least the chance of getting it right by luck. If the scoring system gives silence the same zero as an error, gambling costs nothing and can only add points, so the most rewarding move is always to gamble. The model, which learns to maximise that score, ends up drawing the logical lesson: when in doubt, fire. That's why it sounds so confident even when it's making things up: admitting doubt never earned it points, and blurting out a plausible answer sometimes did.
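The arithmetic behind that lesson fits in a few lines. This is a minimal sketch, assuming a four-option question and a grader that gives 1 point for a correct answer and 0 for both a wrong answer and a blank. The function names and the numbers are mine, chosen for illustration, not taken from the paper.

```python
def expected_score_guess(p_correct: float, right: float = 1.0, wrong: float = 0.0) -> float:
    """Average score from answering when the answer is right with probability p_correct."""
    return p_correct * right + (1.0 - p_correct) * wrong


def expected_score_blank(blank: float = 0.0) -> float:
    """Score from leaving the question unanswered."""
    return blank


# Pure luck on a four-option question: one chance in four.
p = 1 / 4
print(expected_score_guess(p))  # 0.25
print(expected_score_blank())   # 0.0
```

Under this grading a guess never scores below a blank, and it scores above it whenever there is any chance of being right, so a learner that maximises the score has no reason to stay quiet.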

Put another way, AI hallucination isn't only an accident of the mechanism; it's also a habit we've rewarded without meaning to. As a fix, the authors themselves propose penalising confidently stated errors more heavily than admissions of doubt. But as long as the exams models are measured with keep rewarding lucky guesses, models will keep guessing.
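To see how a penalty of that kind changes the incentive, here is a sketch under an assumed scoring rule: 1 point for a correct answer, 0 for "I don't know", and a negative score for a wrong answer. The penalty values and function names are illustrative assumptions, not the paper's exact scheme.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Average score from answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float) -> bool:
    """Answer only when that beats the 0 points earned by admitting doubt."""
    return expected_score(p_correct, wrong_penalty) > 0.0


def break_even_confidence(wrong_penalty: float) -> float:
    """Confidence at which answering and abstaining score the same.

    Solves p - (1 - p) * wrong_penalty = 0 for p.
    """
    return wrong_penalty / (1.0 + wrong_penalty)


# The same 60% hunch under three different graders.
for penalty in (0.0, 1.0, 3.0):
    print(penalty, break_even_confidence(penalty), should_answer(0.6, penalty))
```

With no penalty, any nonzero chance makes answering worthwhile. With a penalty of 3, answering only pays above 75% confidence, so the 60% hunch is better left as an admitted doubt.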

Poise is proof of nothing

And so I reach the misunderstanding that cost me most dearly: believing that if it answers confidently, it's because it knows. It's tempting, because with people it more or less holds that whoever speaks firmly has a command of the subject. With a language model that clue is no good.

You already have the reason assembled. The model writes in the same firm tone when it recomposes a solid pattern as when it fills a gap with the first plausible thing that fits, because in both cases it does exactly the same thing: it picks the most likely token. It carries no certainty meter inside that modulates the tone, and no warning that puts "I'm making this up" in italics. The confidence with which it talks to you is a feature of its style, not a sign that the fact is good. Inside the model, poise and accuracy don't go hand in hand.

The rule I take from this step

From all this comes a single practical consequence, and it's the most useful one on the staircase so far: never take a quote, a specific fact or a source as good just because the chat asserts it with conviction. A name, a date, a number, a link, a book title, a clause of a law: all of that gets checked somewhere else before you use it. The firmness of the tone doesn't count as proof.

I don't say this so you'll live distrusting the tool or so you'll throw it away. I say it so you'll use it where it shines and put a net under it where it fails. For drafting, rephrasing, summarising, ordering ideas or exploring a topic, a chat is great. For asserting a fact you're going to take as certain in front of others, you need to verify it. The idea isn't "never trust it," it's "verify the verifiable," which is very different and much more bearable.

That verification has its own stretch further on, with its tricks for doing it fast and without fuss. For now healthy suspicion is enough. And a question arises on its own: what happens when you ask it about something that occurred after it finished learning? There the gap isn't that the pattern is weak, it's that it doesn't exist, and that opens the next step.

Definitions

- Hallucination: when an AI chat confidently asserts something false (a quote that doesn't exist, a wrong date, an invented fact). It isn't a rare breakdown, it's born from the same mechanism with which it also gets things right.
- Plausibility: what the model actually measures. It doesn't check whether something is true, but whether it sounds like a believable continuation of the text. That's why a well-formed false fact can slip through as if it were good.
- Guessing versus admitting doubt: the dilemma that training resolves in favour of guessing. Since the tests the model is measured with reward gambling on an answer and don't reward saying "I don't know," the model learns to gamble.
- Verify: check a fact in a reliable source other than the chat itself before taking it as good. The basic habit that comes out of this step.
