Evaluation and Benchmarks

PM: Read in full (20 min)

The Hardest Problem in AI

Evaluating language models is harder than evaluating almost any other software system. The output is open-ended text, and there is often no single correct answer. And because models train on internet data, any benchmark that is published can leak into the training set, which inflates its score and makes that score hard to trust.

Despite this, benchmarks are everywhere. Understanding what they measure, what they don't, and how to read leaderboards critically is a practical skill for anyone shipping AI products.

Static Academic Benchmarks

These consist of fixed test sets with multiple-choice or short-answer questions, scored automatically.

MMLU (Massive Multitask Language Understanding): 57 subject areas from elementary math to professional law, 14,000+ questions. The most cited general-knowledge benchmark. Frontier models now score above 85%, and the benchmark is widely considered saturated.

HumanEval / MBPP: coding benchmarks that measure whether generated code passes test cases. More objective than language tasks (the code either passes the tests or it doesn't). Frontier models score 85–95%.
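To make "passes test cases" concrete, here is a minimal sketch of functional-correctness scoring. It is not the HumanEval or MBPP harness: the toy problem, the candidate strings, and the function names `passes_tests` and `pass_rate` are all hypothetical, and a real harness would run generated code in a sandbox with timeouts rather than calling `exec` directly.

```python
def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Return True if the candidate code loads and every assertion in test_source holds.

    WARNING: exec() on model-generated code is unsafe outside a sandbox.
    This is an illustration only.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function(s)
        exec(test_source, namespace)       # run the assertions against them
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


def pass_rate(candidates: list[str], test_source: str) -> float:
    """Fraction of generated candidates that pass all tests."""
    if not candidates:
        return 0.0
    return sum(passes_tests(c, test_source) for c in candidates) / len(candidates)


# Toy problem (hypothetical, not taken from any benchmark).
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
CANDIDATES = [
    "def add(a, b):\n    return a + b",  # correct
    "def add(a, b):\n    return a - b",  # runs, but gives the wrong answer
    "def add(a, b)\n    return a + b",   # syntax error
]

if __name__ == "__main__":
    print(f"pass rate: {pass_rate(CANDIDATES, TESTS):.2f}")
```

Note that a candidate that runs without error but returns the wrong value still fails, which is why "passes the tests" is a stricter and more useful criterion than "runs".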

GSM8K / MATH: grade-school and competition math. Tests multi-step reasoning. Still differentiating for the hardest problems.

BIG-Bench Hard: a subset of the 200+ task BIG-Bench suite, focused on tasks where chain-of-thought prompting makes a measurable difference. Designed to resist saturation by including tasks that require reasoning beyond surface-level pattern matching.

Limitations:

  • Scores inflate as models train on data that includes the benchmark questions (data contamination)
  • Performance on a benchmark ≠ performance on your actual use case
  • They measure specific capability slices, not general quality
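One rough way to probe for data contamination is to check how many word n-grams of a benchmark question appear verbatim in a text sample you suspect was used for training. The sketch below assumes you have such a sample, which is often not the case for closed models. The function names, the default `n`, and the example strings are illustrative assumptions, and the check only catches verbatim copies, not paraphrases.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased, whitespace-tokenized word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, corpus_text: str, n: int = 8) -> float:
    """Share of the question's n-grams that also occur in corpus_text.

    The default n=8 is an arbitrary illustrative choice. Returns 0.0 when the
    question is shorter than n tokens.
    """
    question_ngrams = ngrams(question, n)
    if not question_ngrams:
        return 0.0
    return len(question_ngrams & ngrams(corpus_text, n)) / len(question_ngrams)


# Hypothetical strings, made up for this example.
QUESTION = "Which river runs through the capital city of France"
LEAKED = "quiz dump: which river runs through the capital city of france answer the seine"
CLEAN = "the seine is a river in northern france"

if __name__ == "__main__":
    print(overlap_fraction(QUESTION, LEAKED, n=5))  # high overlap: likely seen verbatim
    print(overlap_fraction(QUESTION, CLEAN, n=5))   # no verbatim overlap
```

A high overlap is a reason to distrust the score on that question. A low overlap does not prove the question is clean.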

Human Preference Evaluation

Chatbot Arena (LMSYS): real users conduct blind side-by-side conversations with two anonymous models and vote for the better response. The votes are converted into Elo ratings. Currently the most trusted signal for real-world quality because:

  • Real users, real queries, real preferences
  • Models don't know they're being evaluated, preventing gaming
  • Elo ratings are robust to selective reporting

Tradeoff: Arena rankings lag. New models need weeks to accumulate stable scores.
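For intuition about how pairwise votes become a ranking, here is a sketch of the classic Elo update for a single vote. The scale of 400 and K-factor of 32 are conventional chess values used here as assumptions. The live leaderboard's exact statistical method may differ, so treat this as an illustration of the idea rather than a description of Arena's implementation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (scale of 400 assumed)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Apply one vote. score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated models; A wins the vote.
    print(update(1000.0, 1000.0, 1.0))  # -> (1016.0, 984.0)
    # An upset moves ratings more than an expected win does.
    print(update(1400.0, 1000.0, 0.0))
```

Because each vote moves a rating by at most `k` points, a new model needs many votes before its rating settles, which is one way to see why rankings lag.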

MT-Bench: multi-turn conversations across 8 categories (writing, reasoning, math, coding, roleplay, extraction, STEM, humanities), rated by an LLM-as-judge. Faster than Arena for initial quality signals.

LLM-as-Judge

A pattern where you use a capable LLM (GPT-4, Claude) to evaluate another model's outputs against a rubric. Scales better than human evaluation but inherits biases:

  • Judges prefer longer responses (length bias)
  • Judges prefer responses matching their own style (self-enhancement bias)
  • Judges are inconsistent on subjective criteria

Best used for ranking relative quality, checking format adherence, and flagging obvious failures. Not reliable as a sole quality signal for nuanced evaluation.
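Length bias can be spot-checked on your own judge data. The sketch below assumes you have logged, for each pairwise comparison, the two response lengths and the judge's pick. The `Verdict` record and `longer_win_rate` function are hypothetical names, and the sample verdicts are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    len_a: int   # length of response A (characters or tokens, your choice)
    len_b: int   # length of response B
    winner: str  # "a" or "b", as chosen by the judge


def longer_win_rate(verdicts: list[Verdict]) -> Optional[float]:
    """Fraction of comparisons where the judge picked the longer response.

    Pairs with equal lengths are skipped. Returns None if nothing is left.
    """
    decided = [v for v in verdicts if v.len_a != v.len_b]
    if not decided:
        return None
    longer_wins = sum((v.winner == "a") == (v.len_a > v.len_b) for v in decided)
    return longer_wins / len(decided)


# Made-up judge log for illustration.
SAMPLE = [
    Verdict(100, 300, "b"),  # longer response won
    Verdict(250, 120, "a"),  # longer response won
    Verdict(80, 200, "a"),   # shorter response won
    Verdict(400, 150, "a"),  # longer response won
    Verdict(90, 90, "b"),    # equal length, skipped
]

if __name__ == "__main__":
    print(longer_win_rate(SAMPLE))  # -> 0.75
```

A rate well above 0.5 is suggestive rather than conclusive, since longer answers are sometimes genuinely better. Comparing it with the same rate computed from human ratings on the same pairs gives a fairer picture.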

Task-Specific Evaluation

For production applications, industry benchmarks are almost never the right measure. What matters is: does the model reliably do the specific thing your product needs?

Build your own eval:

  1. Collect 100–500 representative real examples from your use case
  2. Define a rubric: what does "good" look like for each example type?
  3. Have humans rate a sample of outputs to calibrate the rubric
  4. Use that rubric (human-rated or LLM-as-judge) to score candidates
  5. Run the eval on every model change, prompt change, or provider update

This is more work than reading a leaderboard. It is also the only approach that actually predicts performance in your product.
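The steps above can be sketched as a small harness: score every example with a rubric function, average the scores, and compare a candidate against the current baseline. Everything here is a placeholder under stated assumptions. The `Example` record, the substring rubric, the stub models, and the 0.02 tolerance are hypothetical; in practice the model would be an API call and the rubric a human-calibrated scorer or an LLM-as-judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    prompt: str
    expected: str  # the key fact a good answer must contain (toy rubric)


def contains_expected(output: str, example: Example) -> float:
    """Toy rubric: 1.0 if the expected fact appears in the output, else 0.0."""
    return 1.0 if example.expected.lower() in output.lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    examples: list[Example],
    score: Callable[[str, Example], float],
) -> float:
    """Mean rubric score of a model over the eval set."""
    if not examples:
        raise ValueError("eval set is empty")
    return sum(score(model(ex.prompt), ex) for ex in examples) / len(examples)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Flag a candidate that scores meaningfully below the baseline."""
    return candidate < baseline - tolerance


# Hypothetical eval set and stub models standing in for real API calls.
EXAMPLES = [
    Example("What is the refund window?", "30 days"),
    Example("When does support open?", "9am"),
]


def baseline_model(prompt: str) -> str:
    return "Refunds are accepted within 30 days. Support opens at 9am."


def candidate_model(prompt: str) -> str:
    return "Refunds are accepted within 30 days."


if __name__ == "__main__":
    base = run_eval(baseline_model, EXAMPLES, contains_expected)
    cand = run_eval(candidate_model, EXAMPLES, contains_expected)
    print(base, cand, is_regression(base, cand))
```

Wiring `is_regression` into whatever runs on every model, prompt, or provider change turns the eval from a one-off study into a guardrail.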

PM Takeaway

Never select a model based solely on benchmark scores. Always run a task-specific eval on your actual use case before committing. Benchmark scores tell you how a model performs in academic settings, not in yours.

Further Reading

Hendrycks, D. et al., ICLR 2021 (2020)
Read: Abstract. Skip: Everything else.
MMLU: 57-subject multiple-choice test from elementary math to professional law, the most cited benchmark for general LLM knowledge. Now largely saturated by frontier models.

Zheng, L. et al., NeurIPS 2023 (2023)
Read: Abstract, Section 1. Skip: Technical sections.
Chatbot Arena (LMSYS) uses human blind A/B comparisons to rank models by Elo, currently the most trusted real-world quality signal for generative AI.