What an LLM is and its parameters

In the previous step I left the chat building its answer chunk by chunk, picking the most likely word each time. I called that engine "the model" and left it at that. Now I'll give it a name and look a little inside, just enough to understand what its knowledge is made of. Not a single formula needed.

The engine has a name

That thing predicting the next word is an LLM, short for large language model. And all three words count. It's language because it works with text; it's large for a reason we'll see in a moment; and it's a model in the sense that it isn't a database or a search engine but a program that has read enormous amounts of text and, from reading it so much, has learned to continue it.

When I realised that behind any chat, whatever the brand, there's always one of these things, several mysteries dissolved at once. I'm not talking to a website or to a brain: I'm talking to an LLM. What changes from one chat to another is, above all, which LLM is fitted inside and how it's been tuned.

Millions of tiny knobs

And where is what the model "knows"? In its parameters. The word is intimidating, but the idea is simple: picture billions of minuscule knobs, like the tuning pegs on a guitar. During training, the model reads text, tries to predict what comes next, and often gets it wrong; each time it fails, the knobs are nudged a little so it gets closer next time. Repeated a staggering number of times, those knobs end up in a particular position. That position, once set, is everything the model knows.

There's nothing else. There's no folder of data alongside it: the knowledge is the position of the knobs. That's why we say it knows things without having any of them written down, just as we saw in the previous step.
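If you're curious what "nudging a knob" looks like in practice, here is a minimal sketch with a single knob instead of billions. Everything in it is my own toy assumption, not how any real LLM is built: the rule to learn (multiply by 3), the `train` function, and the size of each nudge are made up purely for illustration.

```python
# A toy "model" with one knob: prediction = knob * x.
# The (hypothetical) rule hidden in the examples is y = 3 * x,
# so the ideal knob position is 3.0.

def train(examples, steps=200, nudge_size=0.01):
    knob = 0.0  # arbitrary starting position
    for _ in range(steps):
        for x, y in examples:
            prediction = knob * x
            error = prediction - y          # how wrong was the guess?
            knob -= nudge_size * error * x  # nudge the knob to shrink the error
    return knob

examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
knob = train(examples)
print(round(knob, 3))  # ends up very close to 3.0
```

Notice that the examples themselves are never stored: after training, the only thing left is the position of the knob.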

If you peek at how those knobs are stored inside, you bump into the word tensor. Another intimidating name for something modest: a tensor is a table of numbers with several dimensions. A list of numbers is one dimension; a spreadsheet, with its rows and columns, is two; stack several sheets and you have three, and so on. The parameters live arranged in those tables, and all the model's work boils down to multiplying and adding tensors at great speed. You don't need to know how to do those sums; it's enough to know that inside there's no magic, there's arithmetic repeated a great many times.
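For the curious, the list, the spreadsheet, and the stack of sheets can be written down in a few lines. This is only a sketch using plain Python lists; real systems use specialised libraries, and the helper names `shape` and `matvec` are mine, invented for this example.

```python
one_d = [1, 2, 3]                # a list: one dimension
two_d = [[1, 2, 3],
         [4, 5, 6]]              # a spreadsheet: rows and columns
three_d = [two_d, two_d]         # stacked sheets: three dimensions

def shape(t):
    """Return the size of each dimension of a nested list."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return dims

def matvec(matrix, vector):
    """Multiply a 2-D table by a 1-D list: multiply pairs, then add them up."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

print(shape(one_d), shape(two_d), shape(three_d))  # [3] [2, 3] [2, 2, 3]
print(matvec(two_d, one_d))                        # [14, 32]
```

The second function is the whole trick in miniature: multiply, add, repeat.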

2017, the year almost everything hangs from

These models didn't come out of nowhere. The leap arrived in 2017, when a team at Google published a paper with a title that was already a statement: Attention Is All You Need. In it they presented a new way of organising the model, the transformer architecture, which is the one almost every LLM still uses today, including the one in whatever chat you have open.

What was new about it? Earlier models read text sequentially, word by word and in order, which made them slow and forgetful with long sentences. The transformer gave the model the ability to look at all the words in a text at once and decide which ones to pay attention to in order to understand each one, hence the name attention. That allowed training on far more text and capturing relationships between words that sit far apart from each other. Almost everything that came after, including the boom in the chats you know, hangs from that 2017 idea.

What "70B" means

With this you understand a label that shows up everywhere. When you read that a model is "7B" or "70B," that B stands for billions, and what's being counted is parameters: 70B means seventy billion knobs. It's simply a measure of the model's size. Now you know why they're called large.
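To get a feel for those numbers, here is a small sketch that turns a label into a count and a rough storage size. The function names are hypothetical, and the figure of 2 bytes per knob is an assumption of mine (it corresponds to storing each number in 16 bits); real models are stored in different ways, so treat the gigabytes as an order of magnitude, nothing more.

```python
def parse_param_label(label):
    """Turn a label such as "7B" or "70B" into a parameter count."""
    return int(float(label.rstrip("Bb")) * 1_000_000_000)

def approx_size_gb(n_params, bytes_per_param=2):
    """Rough storage size. ASSUMPTION: a fixed number of bytes per parameter."""
    return n_params * bytes_per_param / 1e9

for label in ("7B", "70B"):
    n = parse_param_label(label)
    print(label, "=", n, "parameters, roughly", approx_size_gb(n), "GB")
```

Under that assumption, a 70B model comes out at about 140 GB of plain numbers, ten times the 7B one.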

And here it's worth dismantling the most widespread misunderstanding: more parameters doesn't automatically mean "smarter." Having more knobs does help capture finer patterns, yes, but size is only one piece. The quality of the text it was trained on and how that training was done weigh as much or more. A smaller, well-cooked model can outperform a huge, careless one. The number on the label isn't an intelligence grade; it's a size.

The knobs stay still

There's a detail left that has more consequences than it seems. Those knobs are set once, during training, and then they stay still. When you chat, the model uses the position they ended up in, but it doesn't change it: your conversation doesn't re-tune a single screw.

From this come two things we'll see soon. One, that the model knows about the world up to the day its training ended and not a minute more: its cutoff date. And two, that it doesn't store your chats inside its parameters, so by default it doesn't remember you from one conversation to the next. Both facts spring from the same thing: knobs that, once set, no longer move.
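The idea of knobs that are read but never rewritten can be sketched in a few lines. The knob names and values here are made up, and the `reply` function is a stand-in of my own for "the model answering"; the only point is that using the knobs leaves them exactly where training left them.

```python
from types import MappingProxyType

# Hypothetical knob positions left behind by training (made-up numbers).
# MappingProxyType gives a read-only view: it can be read, not modified.
trained_knobs = MappingProxyType({"knob_a": 0.8, "knob_b": -1.5})

def reply(knobs, number):
    """Use the knobs to compute an answer; reading them never changes them."""
    return knobs["knob_a"] * number + knobs["knob_b"]

before = dict(trained_knobs)
answers = [reply(trained_knobs, n) for n in (1.0, 2.0, 3.0)]
after = dict(trained_knobs)

print(before == after)  # True: three "conversations" later, nothing moved
```

Trying to assign a new value to `trained_knobs["knob_a"]` would raise an error, which is a fair picture of what chatting can and can't do to the model.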

Definitions

- LLM (large language model): the program behind an AI chat. It has read enormous amounts of text and, with it, learned to predict which word comes next. "Large" refers to its number of parameters.
- Parameter: each of the model's billions of internal "knobs." They're adjusted during training and, once set, are what the model knows.
- Tensor: a multi-dimensional table of numbers where the parameters are stored. The model works by multiplying and adding tensors.
- Transformer: the architecture, presented by Google in 2017, that lets the model look at all the words in a text at once and decide which to attend to. It's the basis of today's LLMs.
- Parameters (the "B"): the figure like "7B" or "70B" indicates the billions of parameters in the model, that is, its size. More size doesn't by itself mean more accuracy.
