Evaluation and Benchmarks

PM: Read in full — 20 min

The Hardest Problem in AI

Evaluating language models is harder than evaluating almost any other software system. The output is open-ended text. There is often no single correct answer. And because models train on internet data, any publicly released benchmark can leak into the training set, which inflates its scores and makes them hard to trust.

Despite this, benchmarks are everywhere. Understanding what they measure, what they don't, and how to read leaderboards critically is a practical skill for anyone shipping AI products.

Static Academic Benchmarks

These consist of fixed test sets with multiple-choice or short-answer questions, scored automatically.

MMLU (Massive Multitask Language Understanding) — 57 subject areas from elementary math to professional law, 14,000+ questions. The most cited general-knowledge benchmark. All frontier models now score above 90%, and the benchmark is widely considered saturated. Approximately 5–10% of MMLU questions appear in standard pre-training datasets, which inflates scores further — a model that "improves" on MMLU may simply have seen more of the test.

HumanEval / MBPP — coding benchmarks that measure whether generated code passes test cases. More objective than language tasks (the code either passes the tests or it doesn't). Frontier models now score 90–99%, with the top models approaching the ceiling. Harder replacements such as BigCodeBench and SWE-bench Verified are now the meaningful signal for coding capability.
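HumanEval-style results are commonly reported as pass@k: the probability that at least one of k sampled completions passes the tests. The sketch below is a minimal version of the standard unbiased estimator. It is not taken from this article, and the function name and sample counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total completions sampled for the problem
    c: how many of those completions passed every test case
    k: how many samples the user is assumed to draw
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, if 3 of 10 samples pass, `pass_at_k(10, 3, 1)` comes out to about 0.3, the plain per-sample pass rate. A benchmark score is then the average of this value over all problems.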

GSM8K / MATH — grade-school word problems (GSM8K) and competition math (MATH). Both test multi-step reasoning. Only the hardest problems still differentiate frontier models.

BIG-Bench Hard — a 23-task subset of the 200+ task BIG-Bench suite, made up of tasks on which earlier models fell short of average human raters. Chain-of-thought prompting makes a measurable difference on these tasks, which were selected because they require reasoning beyond surface-level pattern matching.

Harder replacements (Humanity's Last Exam, FrontierMath): as frontier models saturated MMLU and HumanEval, harder evaluations emerged. Humanity's Last Exam draws expert-level questions from 100+ disciplines; at launch (2024–25) frontier models scored below 10%, but by mid-2026 the best models reach the low-50s with tool access (e.g., Claude Opus 4.6 ≈ 53%), while scores without tools remain far lower.¹ FrontierMath presents novel competition-level math problems constructed to prevent memorization — current model scores are in the single digits. These benchmarks resist contamination by design: many problems are novel or algorithmically generated.

The four benchmarks that actually differentiate frontier models in 2026: with MMLU and HumanEval saturated, the practitioner consensus is that GPQA Diamond, Humanity's Last Exam, SWE-bench Verified, and LiveCodeBench are the four evaluations worth tracking. They resist data contamination, reward genuine reasoning, and still separate the best models from the rest.

Limitations:

Scores inflate as models train on data that includes the benchmark questions (data contamination)
Performance on a benchmark ≠ performance on your actual use case — a model scoring 97% on HumanEval still routinely generates code with hallucinated API calls or incorrect function signatures in real codebases
They measure specific capability slices, not general quality
Real-world failures don't appear in leaderboard tables. In 2023, lawyers in two separate federal court cases filed briefs citing AI-hallucinated cases that did not exist, and the courts sanctioned them. The lawyers had asked the model to recall specific legal citations from memory, which is exactly the task type where LLMs are unreliable. Hallucination is task-type-dependent: open-ended recall from training data is high-risk, while structured extraction from provided text, format compliance, and pattern-based generation are far more reliable
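One rough way to probe the data-contamination limitation is to check how many of a benchmark question's word n-grams also appear verbatim in a training document. The sketch below assumes whitespace tokenization and an 8-word window. Both choices, and the function names, are illustrative assumptions rather than an established standard.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in corpus_doc.

    Returns 0.0 when the question is shorter than n words.
    """
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    shared = question_grams & ngrams(corpus_doc, n)
    return len(shared) / len(question_grams)
```

A value near 1.0 suggests the question appears almost verbatim in the document. A low value does not rule out contamination, since paraphrased or translated copies share few exact n-grams.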

Human Preference Evaluation

LMArena (Arena AI, formerly LMSYS Chatbot Arena) — real users conduct blind side-by-side conversations with two anonymous models and vote for the better response. Winners accumulate Elo scores. Currently the most trusted signal for real-world quality because:

Real users, real queries, real preferences
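Queries are live and unpredictable, and model identities stay hidden until after the vote, which makes gaming harder
Elo ratings aggregate many independent votes, which makes them robust to selective reporting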

Tradeoff: Arena rankings lag — new models need weeks to accumulate stable scores.
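The rating mechanics can be sketched with the classic Elo update. This is a minimal illustration, not LMArena's actual pipeline. The K-factor of 32 and the function names are assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_ratings(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k: float = 32.0,  # assumed K-factor; larger values react faster to each vote
) -> tuple[float, float]:
    """Apply one vote. score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta
```

With two models at 1000, a single win for A gives `update_ratings(1000, 1000, 1.0)` = `(1016.0, 984.0)`. Because one vote moves a rating by at most K points, a new model needs many votes before its rating settles, which is consistent with the lag noted above.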

MT-Bench — multi-turn conversations across 8 categories (writing, reasoning, math, coding, roleplay, extraction, STEM, humanities), rated by LLM-as-judge. Faster than Arena for initial quality signals.

LLM-as-Judge

A pattern where you use a capable LLM (GPT, Claude, Gemini) to evaluate another model's outputs against a rubric. Scales better than human evaluation but inherits biases:

Judges prefer longer responses (length bias)
Judges prefer responses matching their own style (self-enhancement bias)
Judges are inconsistent on subjective criteria

Best used for ranking relative quality, checking format adherence, and flagging obvious failures. Not reliable as a sole quality signal for nuanced evaluation.
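Length bias can be checked cheaply before trusting a judge. The sketch below correlates word count with the judge's score over a batch of responses. The function names are hypothetical, and the judge scores are assumed to have been collected already.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def length_score_correlation(responses: list[str], judge_scores: list[float]) -> float:
    """Correlation between response word count and the judge's score."""
    lengths = [float(len(r.split())) for r in responses]
    return pearson(lengths, judge_scores)
```

A strongly positive value is a prompt to investigate, not proof of bias, since longer answers can be better. Comparing the same correlation against human ratings on the same responses helps separate the two.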

Task-Specific Evaluation

For production applications, industry benchmarks are almost never the right measure. What matters is: does the model reliably do the specific thing your product needs?

Build your own eval:

Collect 100–500 representative real examples from your use case
Define a rubric: what does "good" look like for each example type?
Have humans rate a sample of outputs to calibrate the rubric
Use that rubric (human-rated or LLM-as-judge) to score candidates
Run the eval on every model change, prompt change, or provider update

This is more work than reading a leaderboard. It is also the only approach that actually predicts performance in your product.
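The steps above can be sketched as a small harness. Everything here is an illustrative assumption: the `Example` fields, the `run_eval` signature, and the substring-based scorer, which stands in for a real rubric applied by humans or an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    prompt: str
    reference: str  # what the rubric says a good answer must contain


def contains_reference(example: Example, output: str) -> float:
    """Toy scorer: 1.0 if the reference text appears in the output, else 0.0."""
    return 1.0 if example.reference.lower() in output.lower() else 0.0


def run_eval(
    examples: list[Example],
    generate: Callable[[str], str],          # wraps the model or prompt under test
    score: Callable[[Example, str], float],  # rubric scorer returning 0.0 to 1.0
    threshold: float = 0.5,
) -> dict[str, float]:
    """Score every example and summarize the results."""
    if not examples:
        raise ValueError("eval set is empty")
    scores = [score(ex, generate(ex.prompt)) for ex in examples]
    passed = sum(s >= threshold for s in scores)
    return {
        "n": len(scores),
        "mean_score": sum(scores) / len(scores),
        "pass_rate": passed / len(scores),
    }
```

Passing a different `generate` callable for each candidate model or prompt version, while holding the examples and scorer fixed, gives comparable numbers across changes.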

PM Takeaway

Never select a model based solely on benchmark scores. Always run a task-specific eval on your actual use case before committing. Benchmark scores tell you how a model performs in academic settings — not in yours.
