Your brother-in-law walks into the living room, looks at the crooked Christmas tree that took you two hours to put up and says, deadpan, "very professional". You get it. So does your six-year-old. The dog senses something from the tone. There's only one agent in the room taking it literally: the smart speaker, which seizes on the word to suggest holiday decoration services.
That silly little scene sums up the problem. And the problem is enormous.
The figure nobody wants to look at
In 1994 Raymond W. Gibbs published a book that, in its day, made a good part of computational linguistics uncomfortable: The Poetics of Mind. The uncomfortable thesis was this. Everyday speech between adults is saturated with non-literal language, to the point that the figurative stops being the exception and becomes the default mode. Metaphor, irony, hyperbole, insinuation, sarcasm, double meaning; deixis that shifts with the situation (words like "here", "yesterday", "you", whose meaning depends on who says them and where); cultural allusion understood only by someone who shares the repertoire.
A specific figure has become popular off the back of that work: that between sixty and eighty per cent of everyday speech operates at non-literal levels. It's worth saying this honestly. Gibbs argues for the omnipresence of figurative language; he doesn't establish that exact percentage as robust data. The figure is an estimate that has circulated, not a settled measurement. But the order of magnitude, that intuition that the literal is the minority, is hard to argue with for anyone who listens to how real people actually talk.
We don't speak by saying what we say. We speak by saying something else, taking for granted that the other person will reconstruct what we mean from the context.
The figure stings, approximate as it is, because it knocks down the assumption on which most automatic language processing is built. Meaning, it's assumed, lives in the words. It doesn't live in the words. It lives between the words, in the situation, in who's speaking to whom, in what we've been saying for the last twenty minutes, in what shared culture takes for granted. Words are the visible tip.
Thirty years after Gibbs, the intuition still stands, and the pragmatics research that came after has only reinforced it. Yet almost all the benchmarks of language comprehension (standardised tests for evaluating models) still behave as though it were false.
What an LLM does capture
It's worth being fair. An LLM (large language model, a system trained on massive amounts of text to predict sequences of words) does detect part of non-literal language. Specifically, the easy part. The part that comes signed.
When someone writes "I love having the flu, seriously /s", any decent model recognises the sarcasm. When there's a winking emoji at the end, too. When the irony is built with an explicit marker — mocking caps, parodic exclamation marks, a tag along the lines of "note the irony" — the system gets it right at respectable rates. Recent evaluations confirm it.
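To see how much of that success can rest on the marker alone, here is a minimal sketch of a detector that looks only for the signature. The marker list is a hypothetical one made up for illustration; it isn't taken from any of the evaluations cited here, and no real model works this crudely.

```python
import re

# Hypothetical list of explicit irony markers, invented for this sketch.
EXPLICIT_MARKERS = [
    re.compile(r"(?<!\S)/s(?!\w)"),                    # the "/s" sarcasm tag
    re.compile("[\U0001F609\U0001F643]"),              # winking and upside-down emoji
    re.compile(r"\bnote the irony\b", re.IGNORECASE),  # a spelled-out tag
]

def has_explicit_marker(message: str) -> bool:
    """Return True if the message carries one of the listed surface markers."""
    return any(pattern.search(message) for pattern in EXPLICIT_MARKERS)

signed = "I love having the flu, seriously /s"
unsigned = "Very professional."

print(has_explicit_marker(signed))    # True: the tag is there to be found
print(has_explicit_marker(unsigned))  # False: nothing on the surface to find
```

The brother-in-law's two words pass straight through. Nothing in the string says they're ironic.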
The trouble starts when you take away the scaffolding.
Zhang and colleagues published SarcasmBench in 2024, a systematic evaluation of eleven LLMs and several pretrained models against six sarcasm-comprehension datasets. The result is uncomfortable for the optimists. The large language models perform below task-specific supervised classifiers on all six tests. GPT-4 is, by some distance, the best of the generalists (the authors credit it with an average improvement of fourteen per cent over the rest), but "the best of the generalists" still falls short against a modest model trained on purpose for the task. And one telling detail: explicit chain-of-thought reasoning, the technique that boosts performance on logic or maths problems, barely helps with sarcasm. Spotting a dig isn't a step-by-step process. Either you catch it in one go or you don't catch it.
Yi, Xia and Long, in January 2025, work the same frontier with a different methodology. Theirs is, in fact, a hopeful attempt: they design an instruction scheme so that an LLM detects irony zero-shot (with no prior training on the specific task, just the instruction) and, on top of that, explains its reasoning. The result they report is nuanced, and that's why it's interesting. With that scaffolding, the generalist model gets close to the performance of supervised models trained on purpose, and even beats them on one of the datasets, but a gap remains. After all the apparatus, all the scale, all the instruction engineering, the most you achieve is roughly to draw level with a modest classifier that did see examples of the task. Generality opens no decisive advantage. It draws level, at best.
The benchmark trap
There's something very serious to say here, and it's worth saying without dressing it up.
Irony- and sarcasm-comprehension benchmarks are built, in the vast majority of cases, from datasets labelled by human annotators. The annotators read a message and decide whether it's ironic. For there to be inter-annotator agreement — a basic methodological requirement — the chosen texts tend to be the ones where the irony is identifiable without additional context. That is: marked irony, explicit irony, decontextualised irony that's still recognisable.
The irony that doesn't fit on the spreadsheet
Everyday irony, the kind that operates between adults who share history, culture and immediate situation, is by definition opaque to the external observer. If you put it in a dataset, the annotators would disagree on whether it's ironic, because to understand it you'd have had to be in the room.
That irony — which is the vast majority — doesn't make it into the benchmark.
What gets measured is the tameable subset. And when models score high on that subset, someone announces that they now understand natural language. They don't understand it. They've passed an exam written by the only kind of teacher who could grade them, one who only asks what can be asked in writing without being there. The unmarked part, the kind that requires having been present, is invisible to the metric.
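The selection effect can be sketched in a few lines. Everything below is invented for illustration: the messages, the three annotators' labels, and the unanimity threshold are assumptions, not data from any real benchmark. The "marked" field is there only so the outcome can be inspected; the filter never reads it.

```python
from collections import Counter

# Hypothetical toy annotations: three annotators per message, text only (1 = ironic).
annotations = [
    {"text": "I love having the flu, seriously /s", "marked": True,  "labels": [1, 1, 1]},
    {"text": "Oh GREAT, another meeting!!!",         "marked": True,  "labels": [1, 1, 1]},
    {"text": "Very professional.",                   "marked": False, "labels": [1, 0, 0]},
    {"text": "Nice of you to drop by.",              "marked": False, "labels": [0, 1, 0]},
    {"text": "The train leaves at nine.",            "marked": False, "labels": [0, 0, 0]},
]

def agreement(labels):
    """Fraction of annotators who chose the most common label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def build_benchmark(items, min_agreement=1.0):
    """Keep only the items whose annotators agree at least min_agreement."""
    return [item for item in items if agreement(item["labels"]) >= min_agreement]

benchmark = build_benchmark(annotations)
ironic_kept = [item for item in benchmark if 1 in item["labels"]]

print([item["text"] for item in benchmark])
print(all(item["marked"] for item in ironic_kept))
```

In this toy set the two messages that needed the room split the annotators and drop out. Every ironic item that survives the filter is a marked one, which is the tameable subset described above.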
Bender and Koller put it in 2020 with an elegance that still hurts. A system that learns only from form, with no access to meaning anchored in world and experience, can't aspire to natural language understanding. It can aspire, at most, to a statistically convincing imitation. What they call climbing towards NLU (natural language understanding) is precisely the confusion the metrics produce. Each step looks like progress towards understanding, and is really just a better fit to the form of the surfaces we'd already labelled.
The social consequence, which is the part that matters
Let's do the cold sums. If seventy per cent of everyday speech operates at non-literal levels, and the automated language-mediation systems (translation, predictive messaging, customer service, automatic summaries, content moderation, transcription) only reliably handle the thirty per cent that's literal, what happens to the remaining seventy per cent when these systems step in?
Three things happen, all bad.
Three bad ones, in order
The first. The system translates or summarises by flattening everything into the literal. The irony disappears, the double meaning is left with only one of its two readings (always the literal one), and the original message reaches the recipient as an impoverished version that says exactly the opposite of what the sender meant. Anyone who has ever seen a sarcastic comment machine-translated knows what I mean. The joke dies. Criticism turns into praise. Reproach turns into approval. And the receiver has no way of noticing.
The second. The system moderates or filters. Here the problem flips sign, but it's the same problem. When a platform uses automatic classification to detect harassment, threats or hate speech, the false positives and false negatives don't fall at random. They concentrate exactly in the zone of language the machine doesn't understand. The threat wrapped in irony gets through, because it literally isn't a threat. The affectionate comment between friends using a shared code — insulting each other fondly, say — gets flagged as toxic, because it literally is. The system doesn't tell the insult that's a compliment apart from the insult that's an insult. The difference isn't in the words.
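A deliberately crude sketch makes the two failure modes concrete. The word list and the two messages are hypothetical, and a real moderation system is far more elaborate than a lexicon lookup; the point is only what happens when the decision is taken from the words alone.

```python
import re

# Hypothetical word list standing in for a purely literal toxicity classifier.
FLAGGED_WORDS = {"idiot", "kill", "hate"}

def literal_filter(message: str) -> bool:
    """Flag a message if any listed word appears, regardless of context."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    return bool(words & FLAGGED_WORDS)

# Invented examples; "intended" is what someone who shares the context would read.
messages = [
    ("You absolute idiot, I've missed you so much", "affectionate"),
    ("Lovely house. Shame if something happened to it.", "threat"),
]

for text, intended in messages:
    print(f"flagged={literal_filter(text)} intended={intended}")
```

The fond insult is flagged and the veiled threat goes through clean. Both errors land exactly where the meaning isn't in the words.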
The third, the most perverse.
When an automated system acts as an interface between humans on a massive scale, it doesn't just misread what's already there. It starts shaping what's coming. People who know their message is going to pass through an automatic filter write for the filter. They flatten the irony. They strip the double meaning. They draft to be understood by the machine, not by the person. And language, forced through the bottleneck over and over, is worn down to the part that fits through. Public conversation migrates towards the subset the machine can process. The other seventy per cent drops off the register. It isn't that the AI doesn't understand language. It's that language is being modified so the AI understands it.
That's an enormous cultural loss, and no one is reporting it because there's no benchmark to measure it.
What's missing isn't more data
Here comes the part that makes the industry uncomfortable.
The reflex answer to all of the above is: fine, so we need to improve irony detection, train on more contextual examples, fine-tune the models with pragmatic data. The industry is convinced that any limitation of the model is fixed with more data of the right kind.
That isn't the problem. Or it isn't only that.
You had to have been there
Everyday irony isn't opaque for lack of examples. It's opaque because to decipher it you need to know what's at stake between two people in a situation the model didn't share. You need to have been in a kitchen, to have been burnt by the coffee, to have been the butt of a joke, to have felt the social weight of staying quiet when everyone else is laughing. You need a body, a biography, a position in a group. Lakoff and Johnson wrote it in 1980, in a book artificial intelligence ought to pay more attention to than it does. The metaphors that structure our thought are rooted in what it is to be an animal of flesh, with a weight, a verticality, an inside and an outside, an advancing and a retreating. Take away the body and the ground goes out from under the metaphor's feet.
Grice formulated the conversational maxims (the implicit rules speakers assume in order to understand each other, proposed by the philosopher Paul Grice in 1975) starting from an assumption computational linguistics still prefers not to look at. When someone says something that apparently violates a maxim — says something obvious, or irrelevant, or false — the listener assumes they're really saying something else, and reconstructs what. That reconstruction isn't statistical. It's a situated inference that requires modelling the other person. Knowing what they know, what they don't, what matters to them, what they're willing to pretend.
A system that doesn't model the other as a subject with their own experience can't make that inference. It can make an imitation. It can learn that a certain syntactic pattern is followed by a certain interpretation at a certain frequency. But it can't do what your brother-in-law does when he walks into the living room. Read the scene, read your face, read the history of the two of you, and pick the two words that'll deflate your soul without him having to say anything nasty.
When you realise they don't understand
There's a specific moment, almost always, when any prolonged conversation with an automated system lays the hole bare. It's not when it gets a fact wrong. Facts get corrected. It's when you answer with an irony, or an allusion, or a sidelong comment that takes a shared context for granted, and the reply comes back flat, head-on, read at face value. And you understand the interlocutor isn't just unperceptive. It's on another plane. It works with a different material than the one you work with.
You're conversing. The system is processing text.
And text, let's remember, is on the estimate used above about thirty per cent of what's going on when two people talk.
Definitions
LLM (large language model). An artificial intelligence system trained on large volumes of text to predict sequences of words. Today's conversational models are LLMs fine-tuned with instructions and human feedback.
Benchmark. A standardised set of tests used to compare model performance. In language processing, a benchmark usually consists of human-labelled datasets against which the model's accuracy is measured.
Zero-shot. An evaluation mode in which a model is asked to solve a task for which it has received no specific training examples. It's given only the instruction and asked to respond.
Deixis. The phenomenon by which certain words — "here", "yesterday", "you", "this" — only have a concrete referent when the speaker's situation is known. Without situational context, they're empty shells.
Conversational maxims. Implicit rules formulated by the philosopher Paul Grice in 1975 that speakers mutually assume to make conversation possible. Their apparent violation is precisely what triggers the pragmatic inference that lets us understand irony and insinuation.
NLU (natural language understanding). A subfield of artificial intelligence devoted to the real comprehension of meaning, as opposed to the mere processing of textual form.
References
Gibbs, R. W. (1994). The Poetics of Mind. Figurative Thought, Language, and Understanding. Cambridge University Press. The argument for the omnipresence of figurative language in everyday speech, cited at the article's opening. The sixty-to-eighty-per-cent figure that has circulated off the back of this work is a popularised estimate, not a settled percentage Gibbs himself establishes as a measurement; it's presented as such in the article.
Zhang, Y., Zou, C., Lian, Z., Tiwari, P. and Qin, J. (2024). SarcasmBench: Towards Evaluating Large Language Models on Sarcasm Understanding. arXiv:2408.11319. An evaluation of eleven LLMs and several pretrained models over six sarcasm datasets; it concludes that LLMs perform below specific supervised classifiers, that GPT-4 is the best of the generalists with an average improvement of fourteen per cent, and that explicit chain-of-thought reasoning doesn't help on this task. Cited in the section on what an LLM captures.
Yi, P., Xia, Y. and Long, Y. (2025). Irony Detection, Reasoning and Understanding in Zero-shot Learning. arXiv:2501.16884. Proposes an instruction scheme (IDADP) with which a generalist LLM in zero-shot reaches performance comparable to that of supervised models fine-tuned to the task, beating them on one of the datasets, but without opening a general advantage over them. Cited when discussing whether the model's generality helps in irony detection.
Bender, E. M. and Koller, A. (2020). Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pp. 5185-5198. The central argument on the impossibility of natural language understanding from form alone, cited in the section on the benchmark trap.
Grice, H. P. (1975). Logic and Conversation. In P. Cole and J. L. Morgan (eds.), Syntax and Semantics, Vol. 3: Speech Acts. Academic Press. The origin of the conversational maxims and the notion of pragmatic implicature referred to at the close of the article's body.
Lakoff, G. and Johnson, M. (1980). Metaphors We Live By. University of Chicago Press. The bodily anchoring of cognitive metaphor, cited when discussing the limits of a system with no body or biography.
To go deeper
Sperber, D. and Wilson, D. (1986). Relevance. Communication and Cognition. Blackwell. Relevance theory, a natural extension of the Gricean programme and a useful framework for understanding why statistical systems trip over pragmatic inference.