"Attention Is All You Need", the paper that changed AI

Today we're talking about a paper I'd have loved to have a hand in. June 2017. Eight Google researchers publish fifteen pages at NeurIPS under a title that's a little cocky, "Attention Is All You Need". Zero commercial promises. They propose a new architecture — the Transformer — that changes how machines process sequences. That paper is the source of 99% of the artificial intelligences you use today. ChatGPT, Claude, Gemini, DeepSeek — all of them come from there. And the explanation, no maths, fits in five minutes. Fancy reconstructing the hinge moment of the last twenty years in tech?

What came before (and why it matters to me)

In 2005 I started working with natural language at a company. What you did back then were lexical engines. They computed the probability that the next word would be one type or another. Just identifying the language of a text was already a job of several weeks. Splitting sentences, recognising entities, understanding context. We worked on this intensively between 2007 and 2009. I can vouch for how exhausting it was for the machines — the computation structures wouldn't fit in the RAM you had. No power, no algorithm, no architecture.

Meanwhile, in the United States, something different was happening. Recurrent neural networks — RNNs — and their LSTM variant had become the standard for processing sequences. They worked like this: a word goes into the network, the network updates its "internal state", out comes the next word. One by one. In strict order. Like a human reading.

Three structural problems with that design.

It was slow. Each word depended on the state the previous one produced. You couldn't parallelise. If your sentence had fifty words, that was fifty sequential steps.

It was forgetful. By twenty or thirty words out, the "internal state" had forgotten what mattered about the first ones. Long contexts got lost.

It was expensive to scale. To improve you had to stack more layers or use more complex variants (LSTM, GRU, partial attention), and each improvement paid less.

Machine translation was stuck. Text comprehension, stuck. The field had spent years pushing against a ceiling and it wouldn't break.

The idea that broke the ceiling

Eight people from the Google Brain team thought the opposite.

What if, instead of processing the words in order, we look at all of them at once and let each word decide which other words matter for understanding itself?

That's "attention".

Picture the sentence "the bank was packed because they were handing out benefits". The word "bank" is ambiguous — a bank you sit on or a financial bank. To resolve it, a person looks at the words "handing out" and "benefits", ignores "packed". The sentence makes sense because each word pays selective attention to the ones that disambiguate it.

The Transformer does exactly that. Every word in the text is compared with all the others. For each pair, it computes how much one "attends" to the other. The result is a new representation of each word that folds in the relevant context of the others.

No strict order. No step by step. All at once.

That's why it's called a Transformer — it transforms the input words into contextualised representations by means of attention.

What changed in practice

This is where the paper detonates.

Parallelisable. Since you no longer need to process in order, all the words are computed at once. On a GPU, that means training much faster. What took weeks in an RNN takes days or hours in a Transformer.

It captures long dependencies. Since each word can attend to any other, it doesn't matter whether they're side by side or ten thousand words apart. The model's "memory" stops decaying exponentially.

Scalable. If you want more capacity, you stack more attention layers. Results improve with scale. More data, more compute, better model. It's a predictable recipe. Not infinite — the cost climbs like a missile — but predictable.

The original paper showed these results on machine translation. But the architecture was so general that over the following months it got applied to almost everything. BERT (Google, 2018) used it for text comprehension. GPT (OpenAI, 2018) used it for generation. GPT-2 (2019), GPT-3 (2020), GPT-4 (2023), Claude, Gemini, DeepSeek — all direct descendants of the original Transformer.

ChatGPT doesn't exist without that paper. The whole public conversation about generative AI you hear today rests on fifteen pages published in June 2017.

The eight authors and where they are now

The paper carries eight signatures. All contributing in equal measure. The order they appear in was randomised — an unusual detail that already said something about the group's culture. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin.

The diaspora that followed is, by itself, a history of the field.

Ashish Vaswani leads Essential AI. The company closed a $175 million Series B in May 2026 at a billion-dollar valuation, and released the Rnj-1 (Ramanujan) family in open weights.

Noam Shazeer returned to Google in August 2024 to co-lead the Gemini project, alongside Jeff Dean and Oriol Vinyals. He had previously founded Character.AI — one of the uncomfortable stories of AI companions — and Google structured the deal as a licence of Character.AI's technology valued, per the press, at around $2.7 billion, an arrangement that ended up drawing the attention of antitrust authorities.

Niki Parmar is at Anthropic.

Jakob Uszkoreit founded Inceptive — AI applied to designing molecules, RNA, vaccines.

Llion Jones co-founded Sakana AI in Tokyo, focused on biology-inspired architectures.

Aidan Gomez co-founded Cohere — models for enterprises, with a particular focus outside the US.

Łukasz Kaiser is at OpenAI, where he's the only one of the eight who hasn't founded a company and has stayed in research, tied to the line of reasoning models.

Illia Polosukhin founded NEAR Protocol — blockchain with an AI tropism.

Eight people, eight routes, eight companies. All on the frontier of the field. It's as if the eight co-authors had split up the entire sector and each one kept a piece.

The irony of the one who sowed and didn't reap

Google published the paper. Open science. No usage restrictions. Anyone could implement the architecture.

OpenAI implemented the architecture and scaled it with a different funding model. GPT-1 came out in 2018, GPT-2 in 2019, GPT-3 in 2020 — all based on the Transformer from Google's paper. In 2022 ChatGPT blew up commercially. By 2023 OpenAI was already seen globally as the company that "invented" generative AI.

Google, meanwhile, kept significant advances in-house — LaMDA, PaLM — but the defensive reflex of not shipping products until they were "polished" cost it the narrative lead. When it launched Bard in March 2023, it was already late against ChatGPT. The race for Gemini sped up through 2023-2024. The acqui-hire of Noam Shazeer in August 2024 was part of that acceleration.

Today Google competes against OpenAI with models descended from a paper its own researchers published as open science nine years ago. The company that invented the architecture is running behind the company that used it first. Each reader can draw the moral.

Why it's worth reading

If you work in tech and never read the original paper, give it an afternoon. You won't get the maths without prior grounding. You will get the core idea, the motivation, the design decisions and the questions they left open. That gives you context for everything that's happened since.

There's a feeling that only shows up when you read the original paper: brilliant ideas are surprisingly simple. The Transformer architecture isn't mathematically spectacular. It's elegant. It's minimal. It fits in fifteen pages. The greatness is in having thrown out what was surplus — convolution, recurrence, accumulated complications — and kept the minimum mechanism that was actually needed.

"Attention is all you need" is title and thesis at once. Attention is the only thing you need. The rest can go in the bin.

Whoever reads the paper today through that lens understands better why the field moves the way it moves. And why the alternative architectures that turn up every year — Mamba, RWKV, state spaces — fight so hard to dislodge the Transformer and find any dislodging so hard.

Quick definitions

Transformer: a neural network architecture based on attention mechanisms, with no recurrence or convolution. Proposed in Attention Is All You Need (2017).
Attention: a mechanism that lets each element of a sequence weigh the relative importance of the other elements in building its own representation.
RNN / LSTM: recurrent neural networks and their long-short-term-memory variant. The state of the art before the Transformer.
NeurIPS: Neural Information Processing Systems. One of the field's three main conferences (along with ICML and ICLR).
BERT: a Transformer-based text-comprehension model, Google 2018.
GPT: Generative Pre-trained Transformer. OpenAI's model family, started in 2018.
Open weights: model parameters made publicly available, which lets you download and run the model locally.

References

Vaswani, A. et al. — Attention Is All You Need (NeurIPS 2017, arXiv:1706.03762).
Devlin et al. — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018).
Brown et al. — Language Models are Few-Shot Learners (GPT-3) (2020).
Wikipedia — entry on Attention Is All You Need (includes updated biographies of the eight authors).
Official pages of Essential AI, Cohere, Sakana AI, NEAR Protocol, Inceptive, Anthropic.
CNBC (2 August 2024) — Ex-Google engineers from Character.AI re-join company with AI partnership (cnbc.com/2024/08/02/ex-google-engineers-from-characterai-re-join-company-with-ai-partnership-.html): Noam Shazeer and around 30 team members rejoining Google and the non-exclusive licence of Character.AI's technology.
Calcalist — Google's $2.7B AI deal with Noam Shazeer's Character.AI draws DOJ attention (calcalistech.com/ctechnews/article/sy06wllflg): the deal figure at around $2.7 billion and the Department of Justice's scrutiny.
Calcalist — Noam Shazeer returns to Google to co-lead Gemini AI project (calcalistech.com/ctechnews/article/rksxmxsj0): the August 2024 return to co-lead Gemini alongside Jeff Dean and Oriol Vinyals.

Elsewhere

#transformers #papers #intelligence #openai