AI Benchmarks Are Broken — This Book Explains Why
A new book by Moritz Hardt argues that benchmark rankings — not scores — are what actually matter. We tested his thesis against every major 2026 AI benchmark.

Here's a question worth sitting with: if every AI company cherry-picks the benchmarks where their model wins, do machine learning benchmarks even mean anything anymore?
Moritz Hardt thinks they do — but not for the reasons you'd expect. His new book, The Emerging Science of Machine Learning Benchmarks (Princeton University Press, forthcoming 2026), makes a case that's both surprising and overdue. The real value of machine learning benchmarks isn't the scores themselves. It's the rankings. That distinction matters way more than it sounds. And the current benchmark data for 2026's top models proves his point perfectly.
Hardt draws on empirical patterns from the ImageNet era that still hold today: when researchers built fresh test sets, absolute accuracies dropped across the board, but the relative ordering of models barely moved.
Think of it like restaurant rankings. A restaurant rated 4.7 on one platform might be 4.2 on another. But the best restaurant in a city tends to be near the top everywhere. The absolute number is noise. The ordering is signal.

This is the book's core insight, and it reframes how we should read every leaderboard in AI. You shouldn't care that Claude Opus 4.6 scores 91% on MMLU. You should care that it consistently ranks at or near the top across multiple independent benchmarks.
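The score-versus-ranking distinction is easy to make concrete. Here's a toy sketch in Python (the model names and numbers are invented for illustration): two benchmarks can disagree on every absolute score yet agree perfectly on the ordering, which Spearman rank correlation captures.

```python
# Hypothetical scores: two benchmarks with very different absolute numbers.
bench_a = {"A": 91.0, "B": 88.7, "C": 84.0, "D": 79.5}
bench_b = {"A": 72.3, "B": 69.9, "C": 61.2, "D": 55.0}

def ranks(scores):
    """Map each model to its rank (1 = best) by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}

def spearman(x, y):
    """Spearman rank correlation between two score dicts (no ties)."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    d2 = sum((rx[m] - ry[m]) ** 2 for m in rx)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman(bench_a, bench_b))  # 1.0: identical rankings despite different scores
```

The absolute scores differ by 15-25 points everywhere, yet the correlation of the rankings is a perfect 1.0. That's the "noise vs. signal" split in one number.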
The full book covers 15 chapters ranging from the holdout method and test set reuse through generative models and frontier evaluation. It's the kind of rigorous, academic treatment that benchmark discourse has badly needed — a proper scientific framework instead of Twitter arguments about whose leaderboard screenshot is more cherry-picked.
Rankings are the real scientific contribution of benchmarks — not the scores.
So let's apply Hardt's framework to where things actually stand. As of March 2026, here's a snapshot of the major machine learning benchmarks and how today's top models perform on them.

| Model | MMLU | GPQA Diamond | GSM8K |
|---|---|---|---|
| Claude Opus 4.6 | 91% | 91.3% | 97.8% |
| DeepSeek R1 | 90.8% | 71.5% | N/A |
| Claude Sonnet 4.6 | 89.5% | N/A | N/A |
| GPT-4o | 88.7% | N/A | 95.8% |
| o3 | N/A | 87.7% | 96.7% (AIME) |
| Gemini 2.5 Pro | N/A | 84.0% | N/A |

| Model | HumanEval | SWE-bench Verified |
|---|---|---|
| Claude Opus 4.6 | 97.8% | ~81% (w/ scaffold) |
| GPT-4o | 90.2% | N/A |
| DeepSeek V3 | 89.8% | N/A |
| o3 | N/A | 71.7% (w/ scaffold) |
| GPT-4.1 | N/A | 54.6% |

| Model | MATH |
|---|---|
| o3 | 99.2% (MATH-500) |
| o1 | 94.8% |
| Claude Opus 4.6 | 85.1% |
| DeepSeek R1 | 83.5% |

Look at those tables and something jumps out immediately. The absolute scores vary wildly depending on the benchmark. But the ranking story is remarkably consistent at the top.
This is exactly what Hardt predicted. Let's trace Claude Opus 4.6 across benchmarks: it ranks #1 in MMLU, #1 in HumanEval, #1 in SWE-bench Verified, and consistently in the top tier across everything else. The absolute gaps? Pretty small in most cases. The difference between 91% and 90.8% on MMLU is barely 0.2 percentage points — you'd be hard-pressed to notice that in everyday use.
But the ranking consistency — showing up near the top across six or seven different evaluations — tells you something real about general capability.
A 2% score difference is noise. Showing up in the top 3 across eight benchmarks is signal.
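Ranking consistency can be made mechanical: drop the absolute scores from the tables above, keep only each model's rank per benchmark, and average. A rough sketch using those numbers (N/A entries skipped; note that a mean-rank aggregate is sensitive to which benchmarks you include, so treat it as a summary, not a verdict):

```python
from collections import defaultdict

# Scores transcribed from the tables above (N/A entries omitted).
scores = {
    "MMLU": {"Claude Opus 4.6": 91.0, "DeepSeek R1": 90.8,
             "Claude Sonnet 4.6": 89.5, "GPT-4o": 88.7},
    "GPQA Diamond": {"Claude Opus 4.6": 91.3, "o3": 87.7,
                     "Gemini 2.5 Pro": 84.0, "DeepSeek R1": 71.5},
    "HumanEval": {"Claude Opus 4.6": 97.8, "GPT-4o": 90.2, "DeepSeek V3": 89.8},
    "SWE-bench Verified": {"Claude Opus 4.6": 81.0, "o3": 71.7, "GPT-4.1": 54.6},
}

rank_sums, counts = defaultdict(int), defaultdict(int)
for bench, results in scores.items():
    # Rank models within each benchmark: 1 = best.
    for rank, model in enumerate(sorted(results, key=results.get, reverse=True), 1):
        rank_sums[model] += rank
        counts[model] += 1

avg_rank = {m: rank_sums[m] / counts[m] for m in rank_sums}
for model, r in sorted(avg_rank.items(), key=lambda kv: kv[1]):
    print(f"{model}: mean rank {r:.2f} over {counts[model]} benchmarks")
```

On these numbers, Claude Opus 4.6 averages rank 1.0 across all four benchmarks and o3 averages 2.0 across the two it appears in — exactly the consistency Hardt's framework says to look for.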
OpenAI's o3 crushes advanced math (99.2% on MATH-500) and reasoning puzzles (87.5% on ARC-AGI-1 at high compute, according to ARC Prize data). But it doesn't dominate general knowledge or coding tasks the same way. That's not a flaw in the benchmarks — it's actually benchmarks doing their job, revealing that o3 is a specialized reasoning model, not a general-purpose one.
Hardt's book calls this "external validity" — when rankings on one benchmark predict performance on others. The models that rank highly everywhere (not just on their pet benchmark) are genuinely stronger general systems.
Not all machine learning benchmarks are created equal, and Hardt spends significant time on why some evaluations are more informative than others. The book discusses the "adaptivity problem" — when benchmarks become targets, they stop being useful measures. Recent research has shown that some models may even fake their reasoning process, making honest benchmarking even more critical.

MMLU is a good example of a benchmark approaching saturation. As of March 2026, the top five models all score within about 4 percentage points of each other. When everyone gets above 88%, the benchmark can't differentiate anymore. It's like grading a spelling test where the whole class gets an A.
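The saturation point can be illustrated with a quick simulation. Assuming each reported score carries some measurement noise (the 1-point standard deviation here is an arbitrary illustration, not an estimate of any real benchmark's variance), MMLU-like clustered scores flip their #1 spot constantly, while SWE-bench-like gaps almost never do:

```python
import random

def rank_flip_rate(scores, noise_sd, trials=10_000, seed=0):
    """Fraction of trials in which Gaussian measurement noise changes
    which model comes out on top."""
    rng = random.Random(seed)
    true_best = max(scores, key=scores.get)
    flips = 0
    for _ in range(trials):
        noisy = {m: s + rng.gauss(0, noise_sd) for m, s in scores.items()}
        if max(noisy, key=noisy.get) != true_best:
            flips += 1
    return flips / trials

saturated = {"A": 91.0, "B": 90.8, "C": 89.5, "D": 88.7}   # clustered, MMLU-like
spread    = {"A": 81.0, "B": 71.7, "C": 54.6, "D": 49.2}   # SWE-bench-like gaps

print(rank_flip_rate(saturated, noise_sd=1.0))  # flips often: ordering is fragile
print(rank_flip_rate(spread, noise_sd=1.0))     # essentially never flips
```

With clustered scores, a point of noise reshuffles the leaderboard a large fraction of the time; with wide gaps, the ordering is robust. That's what "room to differentiate" means in practice.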
The more revealing benchmarks right now are:
SWE-bench Verified — This asks models to solve real GitHub issues from actual open-source projects. The spread here is massive: Claude Opus 4.6 hits ~81% with scaffolding while the original DeepSeek R1 manages 49.2%. That 32-point gap tells you something meaningful about practical coding ability that a 2-point MMLU gap simply can't. For a concrete example of how models stack up on real coding tasks, see our Nous Coder-14B vs Claude Code benchmark showdown.
ARC-AGI — Designed to test genuine abstraction and reasoning. The results are humbling. o3's preview reached 87.5% at high compute on ARC-AGI-1, and most models cluster below 55%. This benchmark still has room to breathe.
GPQA Diamond — Graduate-level science questions that even PhD experts struggle with. The gap between o3 at 87.7% and the next tier (mid-70s to mid-80s) suggests real differences in deep reasoning capability.
Interesting wrinkle: Hardt's book also covers the failure modes, and they're visible in today's data. The Chatbot Arena is fascinating because it measures something completely different: human preference in blind comparisons.

| Model | Arena Rating |
|---|---|
| Claude Opus 4.6 | ~1501 |
| Claude Sonnet 4.6 | ~1465 |
| Gemini 2.5 Pro | ~1448 |
| Grok 3 | ~1412 |
| GPT-4o | ~1345 |

The spread across the top five is roughly 150 Elo points. For context, in chess a 150-point Elo difference means the higher-rated player wins about 70% of the time — a meaningful edge, though not dominant. And yet these same models show huge differences on SWE-bench and ARC-AGI.
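That chess figure falls straight out of the standard Elo expected-score formula, which rating systems like the Arena's are built on:

```python
def elo_win_prob(diff):
    """Expected score for the higher-rated player, given the Elo gap.
    Standard logistic Elo formula: 1 / (1 + 10^(-diff/400))."""
    return 1 / (1 + 10 ** (-diff / 400))

print(round(elo_win_prob(150), 3))  # ≈ 0.703: about a 70% expected score
print(round(elo_win_prob(50), 3))   # a 50-point gap is a much smaller edge
```

A 150-point gap gives the stronger side an expected score of roughly 0.70 — a real edge, but far from a blowout.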

This is what Hardt calls the "performativity" problem — benchmarks that rely on human preference are measuring something, but it's not purely model capability. It's a blend of style, formatting, verbosity, and actual helpfulness that's almost impossible to untangle.
The Chatbot Arena tells you which model people like. SWE-bench tells you which model actually works.
The book also tackles data contamination — the real possibility that models have seen benchmark questions during training. This is a serious concern for older benchmarks like MMLU and GSM8K, where questions have been public for years. It's yet another reason why the newer, harder benchmarks are more trustworthy indicators as of March 2026.
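Contamination audits in practice often start with simple n-gram overlap between benchmark items and training text. A minimal sketch follows — a crude proxy, not a proof of contamination; real audits normalize text more carefully and scan far larger corpora:

```python
def ngrams(text, n=8):
    """Set of word-level n-grams, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(benchmark_items, training_text, n=8):
    """Fraction of benchmark items sharing at least one n-gram with
    the training corpus. A flagged item is suspicious, not convicted."""
    train = ngrams(training_text, n)
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & train)
    return hits / len(benchmark_items)
```

An item whose exact phrasing appears verbatim in the training data will light up immediately; paraphrased leakage is much harder to catch, which is part of why older public benchmarks are hard to fully trust.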
The book is strongest on theoretical foundations — Hardt sets up a clear intellectual agenda across 15 chapters, from the holdout method through frontier evaluation. But it's focused on the academic side. A few things I'd want to see expanded:
Cost-performance tradeoffs. A model that scores 2% higher but costs significantly more per token isn't automatically "better" for most use cases. As of March 2026, Anthropic's Opus 4.6 runs $5/M input tokens while Mistral Large 2 charges $2/M — and LLM benchmarks rarely account for this pricing gap.
Latency and throughput. Real-world applications care about speed. A model that aces HumanEval but takes 30 seconds per response is useless for autocomplete in tools like Cursor or GitHub Copilot.
Agentic evaluation. SWE-bench Verified gets closest here, but we need benchmarks that test multi-step planning, tool use, and error recovery — the stuff that matters for AI agents in production.
Here's the practical takeaway, informed by both Hardt's framework and the current data:
Stop fixating on any single benchmark. A model that tops one leaderboard but drops off on others probably got optimized (or lucky) for that specific test.
Look at ranking consistency. As of March 2026, Claude Opus 4.6 and o3 are the two models that most consistently rank near the top — but in different domains. Opus dominates general knowledge and coding. o3 dominates math and abstract reasoning.
Weight the harder benchmarks more. SWE-bench, ARC-AGI, and GPQA Diamond have more room to differentiate models than MMLU or GSM8K, which are hitting ceiling effects.
Factor in what you're actually building. If you're building a coding agent, SWE-bench Verified matters far more than MMLU. If you need a research assistant for scientific papers, GPQA Diamond is your best proxy. Match the benchmark to your use case — not the other way around.
And honestly? Try the models yourself. Benchmarks are maps, not territory. Hardt's book makes a convincing case that they're useful maps — but maps with known distortions, outdated legends, and regions marked "here be dragons."
The Emerging Science of Machine Learning Benchmarks is the kind of book the AI field has needed for a while. It takes something everyone argues about and builds actual scientific reasoning around it. The central insight — that rankings matter more than scores — is simple enough to explain over coffee but rigorous enough to hold up under scrutiny.
The 2026 benchmark data backs this up. Stop chasing percentage points. Start reading the rankings.
FAQ
Is the book available to read now?
Yes. Moritz Hardt has published the full manuscript at mlbenchmarks.org, and it's free to read in your browser. A hardcover edition from Princeton University Press is expected later in 2026. The online version covers all 15 chapters plus a preface and prologue.
Which benchmarks are hardest to game?
ARC-AGI is currently the most resistant to gaming because it tests novel abstract reasoning rather than memorizable facts. Each puzzle requires genuine pattern recognition that can't be solved by memorizing training data. SWE-bench Verified is also difficult to game since it uses real open-source GitHub issues that require multi-step code changes across actual repositories.
How often do benchmark rankings change?
Major ranking shifts happen roughly every 3-6 months when new frontier models launch. However, the relative ordering at the top tier tends to be more stable than the scores suggest — a pattern Hardt's book specifically predicts. Mid-tier rankings shift more frequently as smaller models iterate faster.
How should you use benchmarks when choosing a model?
Use benchmarks as a first filter, not a final decision. Match the benchmark to your use case: SWE-bench Verified for coding tools, GPQA Diamond for research applications, and Chatbot Arena Elo for consumer-facing products. Then run your own evaluation on 50-100 real examples from your domain. Also factor in pricing, latency, and context window size — none of which benchmarks measure.
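That "run your own evaluation" step doesn't need heavy tooling. A minimal sketch, where `call_model` is a hypothetical stand-in for whatever client or SDK you actually use:

```python
def run_eval(call_model, examples):
    """Score a model on (prompt, checker) pairs from your own domain.
    Each checker inspects the model's answer and returns True/False."""
    passed = sum(1 for prompt, checker in examples if checker(call_model(prompt)))
    return passed / len(examples)

# Usage: in practice, 50-100 real examples with simple programmatic checks.
examples = [
    ("What is 12 * 12?", lambda ans: "144" in ans),
    ("Name the capital of France.", lambda ans: "paris" in ans.lower()),
]

# Dummy model for illustration only; swap in a real API call.
print(run_eval(lambda prompt: "144 ... Paris", examples))  # → 1.0
```

Fifty examples with checkers this simple will tell you more about your use case than any leaderboard screenshot.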
What's the best benchmark for coding agents?
SWE-bench Verified is the gold standard for coding agent evaluation as of March 2026. It tests real-world software engineering on actual GitHub issues, not just isolated function generation like HumanEval. The gap between top models (~81%) and mid-tier (~49%) is much larger here than on HumanEval, making it far more useful for distinguishing coding capability. Also watch for LiveCodeBench, which uses fresh competitive programming problems to avoid data contamination.