For a while now I've been watching new models get introduced by their leaderboard position before anyone says what they're good for. MarkTechPost ran a headline saying Alibaba's Qwen3.6-27B beat a 397-billion-parameter model on agentic coding tests; other coverage crowned it number one on six code benchmarks at once. The selling point is no longer "it does this" but "it scores higher than that one."
Alibaba presented Qwen 3.6 with open weights and the claim that it tops six code and agentic-task benchmarks, beating even much larger rivals. The fine print comes from the researchers themselves: teams like Berkeley's have been warning for a while about data contamination in these tests (models that have seen parts of the exam during training), and OpenAI stopped reporting some results for that very reason. The leaderboard rules, even if what it measures is up for debate.
I think the question underneath is what a model is trained for: to be useful, or to pass the exam? It's the same difference as between studying to master a subject and studying to pass the quiz. The two aren't the same, even if they look alike, because in both cases it's the grade that defines you. A model leading six benchmarks can be an excellent tool or a swot who'd already seen the questions. And as long as the headline is the ranking, we'll be rewarding the second.
Sources: MarkTechPost · Qwen (official repository) · CodeAnt AI (benchmark contamination)