"Which is better, Claude, ChatGPT or Gemini?" I've heard it so many times the tone no longer irritates me. What irritates me is the verb. "Better" presupposes a single axis where the three line up from worst to best, and a table somewhere that settles the matter. That table doesn't exist. What there is are three bets that have decided to sacrifice different things, and almost every number used to compare them is contaminated, saturated, or both at once.
I've spent months bouncing that question back with another that nearly always makes people uncomfortable: better for what? Whoever asks usually wants a verdict: a brand and a number that spare them the thinking. And the honest answer starts by dismantling the number. Measuring is fine, of course it is; science measures. But measuring the wrong magnitude to three decimal places isn't rigor. It's set dressing.
The Number They Show You Doesn't Measure What You Think
When someone tosses out that such-and-such model "scores a 90 on MMLU," what they're saying, translated, is that the model got 90% of a test covering fifty-seven academic areas right. It's impressive. Until you trace where the figure comes from.
MMLU was assembled in 2020 with questions taken from public exams. The problem came afterward. If you train a model on practically all of the internet, the internet already contains those questions: with their answers, their worked solutions and the forum thread where three strangers argue about them. The model doesn't need to reason to get them right. Having read them is enough. The phenomenon has a name, data contamination, and it's documented with method.
The clearest case was left by Time Travel in LLMs (Golchin and Surdeanu, 2023). Its authors showed that GPT-4 had ingested, during pre-training, datasets like AG News and WNLI, plus the test split of XSum. Nobody copied answers by hand. The dataset was in the corpus and the model absorbed it just as it absorbed everything else, without telling the exam apart from the rest of the noise.
The second case quantifies the cheat, and that's why I like it more. Scale AI built GSM1k, a twin of the grade-school math exam GSM8k but with brand-new, unpublished questions that no model could have read before, and measured the models on both. Models in the Mistral and Phi families dropped by up to thirteen points when moving from the old exam to the new one. The frontier ones (Gemini, GPT, Claude) barely flinched. What the study put in writing was as uncomfortable as it was precise: part of what we'd been calling capability was memory of the exam.
Hence my distrust of the round percentage. The figures from the classic benchmarks —MMLU, HumanEval, GSM8k, ARC— inflate real capability on tasks the model hasn't seen before. And your work, almost by definition, is one of those tasks: your report wasn't on the internet with the solution underneath. There's also a second effect, less discussed and a fair bit more terminal: many of these exams no longer distinguish anything. On MMLU-Pro, the hardened version, the frontier models bunch up in a narrow band around 85-90%, and within that band the differences blur into the test's noise.
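To make "blur into the test's noise" concrete, here is a minimal sketch of the sampling error alone. It assumes a benchmark of `n` independent questions and treats the two models' scores as independent, which is a simplification; the scores and the question count are hypothetical, not taken from any leaderboard, and real noise also includes prompt format and grading quirks that this ignores.

```python
import math


def accuracy_stderr(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n questions."""
    return math.sqrt(p * (1.0 - p) / n)


def distinguishable(p_a: float, p_b: float, n: int, z: float = 1.96) -> bool:
    """True if the gap between two accuracies exceeds z standard errors.

    Assumption: both models answered n independent questions and their
    scores are independent of each other (a simplification).
    """
    se_gap = math.sqrt(accuracy_stderr(p_a, n) ** 2 + accuracy_stderr(p_b, n) ** 2)
    return abs(p_a - p_b) > z * se_gap


# Hypothetical scores on a hypothetical 1000-question exam.
print(distinguishable(0.86, 0.88, n=1000))  # False: a 2-point gap is inside the noise
print(distinguishable(0.60, 0.88, n=1000))  # True: a 28-point gap is not
```

Under these assumptions, two models at 86% and 88% cannot be told apart on a thousand questions. Separating them would take a much larger exam or a much larger gap.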
The industry knows this and has counterattacked with benchmarks designed so they can't be memorized: LiveBench, which keeps refreshing its questions, or FrontierMath, which keeps its problems unpublished. There's also Chatbot Arena, now renamed Arena, where users blind-vote on which answer they prefer for their own real query. It's the least manipulable system, because the human poses the question and the model can't prepare for it in advance. In its tables from the start of 2026, Claude Opus 4.6 topped the general text ranking, with Gemini 3.1 Pro and OpenAI's models a handful of Elo points behind, inside overlapping intervals. That is best read as a tight leading band, not as a clean podium.
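A quick sketch shows how little "a handful of Elo points" means in practice. It uses the textbook Elo expected-score formula; Arena's own fitting procedure may differ, and the ratings and intervals below are hypothetical, chosen only to illustrate the arithmetic.

```python
def expected_win(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two (low, high) confidence intervals share any point."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return a_lo <= b_hi and b_lo <= a_hi


# Hypothetical ratings: a 10-point lead.
print(round(expected_win(1510, 1500), 3))             # 0.514
# Hypothetical intervals around those two ratings.
print(intervals_overlap((1503, 1517), (1493, 1507)))  # True
```

A ten-point lead means the leader is preferred in roughly 51 of every 100 head-to-head votes, and when the intervals overlap even that edge is not established.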
And even so, Arena doesn't answer your question either. It tells you which model is most liked, on average, by a crowd asking disparate things. It doesn't tell you which will serve you, tomorrow, for what you do.
What Each One Has Given Up
Here's the part no comparison names, and it's the one that really decides. Each model shines where its incentives force it to shine; what's invisible, and decisive, is the price it has paid in exchange.
Claude, from Anthropic, comes from a declared doctrine of alignment and caution. Hence its reliability with long texts, its stamina for sustaining an argument without crumbling by the third page, its careful hand with thorny topics. The price is obvious to anyone who uses it daily: it's cautious to the point of exasperation. Ask it for something on the edge and it'll qualify rather than risk; ask it for unbridled creativity and at times it'll bridle you. It answers more slowly than Gemini, and its catalog of connections to external tools is shorter than ChatGPT's.
ChatGPT, from OpenAI, bet on the opposite, and the bet was early and loud: massive rollout, fast iteration, an ecosystem of extensions, deep integration with the Microsoft universe. What you take in exchange is breadth. Fast answers, image generation, audio transcription, code execution, web search, a swarm of third-party tools orbiting the model. The bill arrives as worse-calibrated caution. It's more prone to hallucinations —invented facts that sound plausible— precisely because it's tuned to answer as soon as possible and not to stop and verify. In very long texts it loses some thread next to Claude, and sometimes it trades depth for speed without warning.
Gemini, from Google DeepMind, bet almost everything on a single card: living inside Google's ecosystem. Workspace, Drive, Gmail, Calendar, Search. What you take is immediate context. It pulls from Search so it doesn't stay anchored in the past, handles image, audio and video natively, digests the long documents in your Drive and tidies your mail without you having to take anything out of its place. The bill here comes in two line items. The first is dependence: outside Google's garden —if your life runs in Microsoft 365, in Apple, in a bare Linux— much of its charm evaporates. The second is reputation, harder to quantify and stickier; the public stumbles of its early versions left it with a distrust that its current models, frankly good, still drag around without quite deserving it.
None of the three renunciations is an accident. They're engineering decisions each company optimizes on purpose, day after day. Each model turns out excellent at certain tasks for the exact same reasons that make it mediocre at others. There's no top of the class waiting to be crowned. There's, at most, the one that fits what you need to do this week.
Three Jobs, Three Tools
I'll bring it down to earth with a rule of thumb, with no pretense of law or table.
If your thing is writing a lot —a long report, a proposal, an essay, a piece of journalism, technical documentation— Claude usually performs better: it holds the register from the first page to the last, weaves citations without contradicting itself, and doesn't come undone over the long distances. If your thing is dispatching a thousand small tasks at full speed, with web search, image generation, format conversion and quick data analysis, the one that performs is ChatGPT, for the cloud of utilities surrounding it. I've summed it up more than once with a silly image that works: ChatGPT is a Swiss Army knife and Claude is a good pen. Both serve. They don't serve for the same thing.
And if your whole day runs inside Workspace, with mail in Gmail, documents in Drive, meetings in Meet and the agenda in Calendar, the sensible choice is Gemini. It summarizes mile-long threads, prepares agendas, reviews documents without your having to copy them anywhere. That zero friction, added up over a month, weighs much more than it seemed the first day.
I'm not describing a ranking but a fit, which is a different geometry. And if your work pushes you to do all three things, the honest exit is to use all three, each where it performs. What a lot of people do instead —sticking with the one they already had open and clinging to it out of pure inertia— costs real productivity. Only it's a bill nobody bothers to measure.
The Cost of Switching, Which Is the One No One Mentions
There's a silent charge the comparisons don't talk about. Each model, once embedded in your workflow, raises an exit cost that has nothing to do with the monthly fee, which is precisely the least of it.
The way you write to it adapts to the one you use most. You learn which formulas work with Claude, what tone, what level of detail, how much scaffolding you need before it understands you. And likewise with the others. That skill is paid in hours, it's worth money, and it stays stuck to the model the moment you abandon it. Switching means re-educating your own hand almost from scratch.
The same goes for memory. A months-long conversation with Claude about a project in progress lives in Claude; moving it to ChatGPT forces you to copy, reformat and lose nuances along the way. Your fine-tuned instructions and your custom assistants live in the house where you built them; they don't travel. And the connections you went around plugging in (this service hanging off that model, that automation depending on the other) break when you move and have to be redone one by one. It's configuration work disguised as a single click.
There's a reason the big providers give away access and hand out credits by the fistful. They don't do it out of generosity, obviously. They know that exit cost will be collected later, transformed into a half-forced loyalty, and that every new user is a long-term asset. Economic jargon christened those structural trenches a moat, the castle's ditch, and they have the virtue of appearing on no price tag.
The defense isn't heroic or comfortable: it consists of not building your whole workflow on a single provider, doing cross-tests even when they're a chore, and keeping a second tool half-running, alive, just so you don't lose the habit of comparing. The day a fourth contender appears —DeepSeek, Mistral, Qwen, whatever comes— with something clearly better for what you do, you'll be grateful not to be tied down.
The Elephant in the Room. Open Weights
To talk about Claude, ChatGPT and Gemini in 2026 without naming the open-weights models is to write with an expired map. The arrival of DeepSeek at the start of 2025 moved the sector's plates, and since then there are open-weights models that, on specific tasks, play in the league of the closed commercial ones. Llama, from Meta. The more capable versions of Mistral. Qwen, from Alibaba. The DeepSeek family. And, since 2025, even OpenAI itself with its open-weights line, which is almost a confession.
What they change is fundamental. You can run them on your own hardware without sending a single byte of data to someone else's server. You can fine-tune them with yours for a very specific task. You're not tied to anyone's commercial policy: if they raise the price, if they retire the old model, if they rewrite the terms overnight, your local install keeps working like yesterday. And in the sectors where confidentiality isn't a wish but a legal obligation —health, defense, finance, law— they're often the only door left open.
What do you lose along the way? For the ordinary user, almost everything. They demand your own infrastructure or intermediate hosting services. The experience is rougher, with no polished app, no out-of-the-box integrations, no support to write to when something breaks at two in the morning. Raw capability can sit a notch below the frontier when you truly need the best of the best. And they lag on the calendar: what a commercial lab launches this week, the open community takes a while to absorb.
The reasonable outcome is for the ecosystem to drift toward the hybrid. Big companies and professionals with sensitive data will pull from a mix: closed models for the general stuff, local open models for the delicate stuff, and some orchestrator dividing the work case by case. The casual user will stay with Claude, ChatGPT or Gemini, oblivious to all this, and rightly so. The serious user will use everything. What open weights contest isn't the top spot among the big three, but dependence itself, which is a very different adversary.
The Question That Really Bites
After all this, "which is better?" has an answer that matches none of the three names. It's this: which fits best in this task of yours, this specific one, tomorrow morning's. And the only way to find out is to put them to work on your real job, a week or two each, instead of reading comparisons like this one.
Behind it a sharper one peeks out. What do I lose if I marry just one? You lose perspective, without noticing. Each model was raised on slightly different data, returns slightly different biases and has its own blind spots; always asking the same one is accepting a single point of view on the world without even knowing you've accepted it, and giving up on noticing where it deviates from the others. The people in the trade I know have all three open in tabs, and not out of collector's whim but to triangulate: when an answer matters, they cross-check; when it doesn't, they pull the fastest one. It's within reach of anyone with three subscriptions, or even with the free versions. The friction is slight and the return tangible.
And there's a third question, the one almost nobody asks and the one that costs me the most sleep. What am I giving the model every time I write to it? You give it data, sure, but not just any data: your real problems, your doubts, your way of reasoning a matter, your raw work. In the enterprise versions there are reasonable privacy guarantees; in the consumer ones they're murkier, and unless you have an explicit opt-out agreement, what you write may end up feeding, in aggregate, future versions of the model. Multiply that by millions of people over years and the model ends up knowing an entire trade from the inside. Whoever controls the model knows the trade. Choosing which chatbot you open in the morning isn't only choosing a tool: it's choosing who's going to know, five years from now, how you work and how your whole profession works.
That no longer gets fixed with a comparison listicle. It's a decision with long consequences, and the most uncomfortable thing about it is that there's no technical patch here that solves it for you. There's only awareness and informed choice, which are exactly the two things no company has any incentive to make easy for you, because nobody makes money giving them away.
Definitions
Data contamination. The situation in which a model was trained on data that already included an exam's questions and answers, which systematically inflates its apparent capability. Documented in GPT-4 with AG News, WNLI and the XSum test split (Golchin and Surdeanu, 2023), and measured in the Mistral and Phi families via the twin exam GSM1k (Scale AI, 2024).
Benchmark saturation. The state in which the frontier models score so high and so close together that the differences between them blur into the test's own noise. It's what happens with MMLU and is starting to happen with MMLU-Pro.
Chatbot Arena (now Arena). A platform where users blind-vote on which answer they prefer for their own question, and an Elo score comes out of those votes. It's the least manipulable system because the model can't anticipate the question it'll get.
Switching cost. What it actually costs to move from one model to another: re-educating the way you write to it, losing the accumulated contextual memory and redoing the integrations. It's the main defensive moat of current providers.
Open weights. A publishing mode in which the model's parameters are distributed freely, which lets you run it locally, fine-tune it with your own data and drop dependence on the provider. Llama, Mistral, Qwen, DeepSeek and OpenAI's open-weights line are relevant examples in 2026.
References
Golchin, S. and Surdeanu, M. (2023). Time Travel in LLMs: Tracing Data Contamination in Large Language Models. arXiv:2308.08493. Documents the contamination in GPT-4 with AG News, WNLI and the XSum test split. https://arxiv.org/abs/2308.08493
Zhang, H. et al. / Scale AI (2024). A Careful Examination of Large Language Model Performance on Grade School Arithmetic. arXiv:2405.00332. The origin of the GSM1k exam and of the measurement of the up-to-thirteen-point overfit in the Mistral and Phi families, against the near-zero drop of the frontier models (Gemini, GPT, Claude). https://arxiv.org/abs/2405.00332
Arena / LMArena. Public ranking table by blind voting (Elo). The basis for the relative positions of the frontier models at the start of 2026, with Claude Opus 4.6 leading the text ranking and Gemini 3.1 Pro and OpenAI within overlapping intervals. https://lmarena.ai/
Artificial Analysis. MMLU-Pro Benchmark Leaderboard. The source for the score band (around 85-90%) of the frontier models on MMLU-Pro and for their clustering by saturation. https://artificialanalysis.ai/evaluations/mmlu-pro
Center for Responsible Decentralized Intelligence (RDI), UC Berkeley (2026). Research on agent-benchmark exploitation showing near-perfect scores obtained without solving any task on SWE-bench, WebArena, OSWorld, GAIA and Terminal-Bench, among others. https://rdi.berkeley.edu/
Carr, N. (2010). The Shallows. W. W. Norton. Conceptual reference on the effect of digital tools on the way we think.