Model Evaluation & Benchmarks
Before reading: you should understand loss functions, train/val/test splits, and basic ML training — all covered in Machine Learning.
“Why does my model look great in the playground but fail in production?” You ran a few prompts, the outputs looked reasonable, and you shipped it. Then users started reporting nonsense answers, biased completions, and confident hallucinations.
Evaluation is the difference between “looks good to me” and knowing your model works. Without it, every change — a new fine-tune, a different prompt format, a bigger model — is a coin flip.
Perplexity
Perplexity measures how “surprised” a model is by text it hasn’t seen. It’s the exponential of the average negative log-likelihood per token, where N is the number of tokens in the evaluated text:
Perplexity = exp(-1/N * Σ log P(token_i | token_1...token_{i-1}))
Lower perplexity = the model assigns higher probability to the correct next token = it better predicts the test data. A perplexity of 10 means the model is as uncertain as if choosing uniformly among 10 equally likely options at each step.
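The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming you already have the natural-log probability the model assigned to each correct token; the function name `perplexity` and the sample values are hypothetical, not part of any library.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities log P(token_i | previous tokens)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Hypothetical case: the model gives every correct token probability 1/10,
# i.e. it is as uncertain as a uniform choice among 10 options.
uniform_over_ten = [math.log(1 / 10)] * 5
print(perplexity(uniform_over_ten))  # approximately 10

# A more confident model (probability 0.5 per token) has lower perplexity.
print(perplexity([math.log(0.5)] * 5))  # approximately 2
```

The first call reproduces the "choosing uniformly among 10 options" reading of a perplexity of 10.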
Perplexity is fast, automatic, and reproducible — no human needed. It’s the standard metric during pre-training and fine-tuning for tracking whether loss is still decreasing.
But it’s insufficient alone. Perplexity rewards a model for being good at next-token prediction on its training distribution. It does not capture factual accuracy, reasoning ability, helpfulness, safety, or instruction following. A model can have low perplexity and still generate confident nonsense. Use perplexity to monitor training, not to evaluate quality.
Generation Metrics
For tasks with a reference output (translation, summarization, question answering), these metrics compare generated text against a human-written reference:
| Metric | What It Measures | Good For | Blind Spots |
|---|---|---|---|
| BLEU | n-gram overlap with reference(s) | Machine translation | Penalizes valid synonyms, ignores semantics |
| ROUGE | Recall of n-grams from reference | Summarization (did we cover the key points?) | Recall-oriented, so longer outputs tend to score higher regardless of quality |
| METEOR | Unigram precision + recall + synonym matching | Translation, better correlation with human judgment than BLEU | Slower, language-dependent synonym sets |
| BERTScore | Cosine similarity of BERT embeddings between generated and reference | Any generation task | Requires a strong embedding model, computationally heavier |
All n-gram metrics share a fundamental problem: they compare surface form, not meaning. “The cat sat on the mat” and “A feline rested upon the rug” share only the word “the” and no bigram or longer n-gram, yet they mean nearly the same thing. BERTScore partially addresses this by operating in embedding space, where synonyms are close.
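To make the surface-form problem concrete, here is a rough sketch of the clipped n-gram precision that BLEU-style metrics build on, applied to the two example sentences. It is a simplification under stated assumptions (lowercased whitespace tokenization, a single reference, no brevity penalty), and `clipped_precision` is a hypothetical helper, not the full BLEU score.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as often as it appears
    in the reference (the "clipping" used by BLEU-style metrics).
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

reference = "The cat sat on the mat"
candidate = "A feline rested upon the rug"

print(clipped_precision(candidate, reference, 1))  # 1 of 6 unigrams match ("the")
print(clipped_precision(candidate, reference, 2))  # 0.0, no shared bigrams
```

The only overlap is the function word "the", so a paraphrase with the same meaning scores close to zero.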
LLM-as-Judge
Instead of n-gram overlap, use a strong LLM to score outputs. The judge model rates each response on dimensions like helpfulness, accuracy, relevance, and safety.
MT-Bench — A multi-turn benchmark where GPT-4 scores model responses on a 1–10 scale across 80 questions in 8 categories (writing, reasoning, math, coding, extraction, STEM, humanities, roleplay). GPT-4 judgments correlate well with human preference rankings.
Chatbot Arena (LMSYS) — Users submit a prompt, two anonymous models respond, the user votes for the better response. Over 1 million human preference votes collected. Models are ranked using Elo scores — the same system used in chess.
The key insight: strong LLMs are decent evaluators, but they have biases. They prefer longer responses, responses from their own model family, and responses that appear confident. Always validate judge-model evaluations against human judgments on a subset of your data.
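One way to run the validation described above is to collect a small set of pairwise comparisons labeled by both the judge model and a human, then check how often they agree and how often each picks the longer response. The sketch below assumes that setup; the `Comparison` record, the field names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    response_a: str
    response_b: str
    judge_choice: str  # "a" or "b", picked by the judge model
    human_choice: str  # "a" or "b", picked by a human annotator

def judge_human_agreement(comparisons):
    """Fraction of comparisons where judge and human picked the same response."""
    agree = sum(c.judge_choice == c.human_choice for c in comparisons)
    return agree / len(comparisons)

def longer_response_win_rate(comparisons, who="judge"):
    """Fraction of decisions by `who` ("judge" or "human") that favor the
    longer response, measured in whitespace tokens. Equal lengths are skipped."""
    picked_longer = 0
    counted = 0
    for c in comparisons:
        len_a = len(c.response_a.split())
        len_b = len(c.response_b.split())
        if len_a == len_b:
            continue
        longer = "a" if len_a > len_b else "b"
        counted += 1
        picked_longer += getattr(c, f"{who}_choice") == longer
    return picked_longer / counted if counted else 0.0

# Hypothetical labeled subset.
sample = [
    Comparison("Paris.", "The capital of France is Paris, a large city.", "b", "a"),
    Comparison("Use a set for O(1) lookups.", "Use a list.", "a", "a"),
    Comparison("It depends on many different factors overall.", "No.", "a", "b"),
    Comparison("42", "The answer is 42.", "b", "b"),
]

print(judge_human_agreement(sample))              # 0.5
print(longer_response_win_rate(sample, "judge"))  # 1.0
print(longer_response_win_rate(sample, "human"))  # 0.5
```

A judge that picks the longer response far more often than humans do on the same pairs is a sign of length bias worth correcting before trusting its scores.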
Elo Ratings & Leaderboards
Elo ratings convert pairwise preference data (A beats B) into a global ranking:
- Every model starts with the same rating (e.g., 1500).
- When model A beats model B, A gains points from B proportional to how surprising the outcome was.
- A model expected to win (higher Elo) gains few points for winning and loses many for losing.
- Over thousands of comparisons, scores stabilize and reflect relative strength.
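The update rule in the list above can be sketched as follows. This uses the conventional chess formulation, which is an assumption here: a logistic expected score with a 400-point scale and a K-factor of 32. Neither constant comes from the text, and a production leaderboard may fit ratings differently.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, a_won, k=32.0):
    """Return the new (rating_a, rating_b) after one comparison.

    The points A gains are exactly the points B loses, and the size of the
    transfer grows with how surprising the outcome was.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta

# Equal ratings: the winner takes half of K.
print(update_elo(1500, 1500, a_won=True))  # (1516.0, 1484.0)

# The favorite wins: small gain. The underdog wins: large gain.
print(update_elo(1600, 1400, a_won=True))
print(update_elo(1400, 1600, a_won=True))
```

Because each update only moves points between the two models involved, the total rating mass stays constant across the pool.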
Chatbot Arena maintains the most widely used LLM Elo leaderboard. It’s not perfect, since different user populations (developers vs. the general public) produce different rankings, but it’s the closest thing to a ground-truth leaderboard we have.
Benchmark Suite
| Benchmark | Task | Format | Metric | Why It Matters |
|---|---|---|---|---|
| MMLU | 57 subjects (law, medicine, math, history) | Multiple choice | Accuracy | Broad knowledge — the SAT for LLMs |
| HumanEval | Python function completion from docstring | Code generation | pass@k | Measures coding ability, not recall |
| SWE-bench | Real GitHub issue → fix + PR | Software engineering | % resolved | Closest to real-world SWE work |
| GSM8K | Grade-school math word problems | Step-by-step reasoning | Final answer accuracy | Multi-step reasoning, easy to verify |
| HellaSwag | Pick the most plausible sentence ending | Multiple choice | Accuracy | Commonsense reasoning; hard for models at release, now largely saturated |
| MATH | Competition-level math (AMC/AIME) | Step-by-step reasoning | Final answer accuracy | Frontier reasoning ability |
| ARC-Challenge | Grade-school science questions | Multiple choice | Accuracy | Tests reasoning, not retrieval |
| TruthfulQA | Questions designed to elicit common misconceptions | Free-text generation | Truthfulness (judge-model) | Measures whether the model repeats popular falsehoods |
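The pass@k metric listed for HumanEval deserves a short illustration, since it is less obvious than accuracy. The sketch below follows the unbiased estimator described in the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k randomly drawn samples passes. The function name and the example numbers are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k for one problem.

    n: total samples generated
    c: samples that passed the unit tests
    k: number of samples the user is allowed to draw
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a passing one.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples generated, 3 passed.
print(pass_at_k(10, 3, 1))  # approximately 0.3
print(pass_at_k(10, 3, 5))  # higher, since more draws give more chances
```

The benchmark score is the mean of this value over all problems.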
Benchmarks measure specific capabilities, not overall quality. A model can ace HumanEval and still generate terrible code review feedback. Aggregate scores hide weakness in domains you care about. Pick benchmarks that match your use case — don’t chase leaderboard position.
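A small sketch of why aggregate scores can mislead: the same set of results is summarized once as a single accuracy number and once per category. The category names and outcomes below are made up for illustration, and `accuracy_by_category` is a hypothetical helper.

```python
from collections import defaultdict

def overall_accuracy(results):
    """results: list of (category, is_correct) pairs."""
    return sum(is_correct for _, is_correct in results) / len(results)

def accuracy_by_category(results):
    """Accuracy computed separately for each category."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}

# Hypothetical eval run: strong on history and law, weak on medicine.
results = (
    [("history", True)] * 10
    + [("law", True)] * 9 + [("law", False)] * 1
    + [("medicine", True)] * 2 + [("medicine", False)] * 8
)

print(overall_accuracy(results))      # 0.7 overall looks acceptable
print(accuracy_by_category(results))  # medicine is only 0.2
```

If medicine is your use case, the 0.7 headline number tells you very little.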
Human Evaluation
When benchmarks and judge-models aren’t enough, you need humans:
Inter-annotator agreement — If two humans disagree on whether a response is good, the evaluation rubric is underspecified. Measure agreement with Cohen’s kappa or Krippendorff’s alpha. Values below 0.6 mean your evaluation criteria need work, not your model.
A/B preference — Show two responses side by side. “Which is more helpful?” The gold standard for comparing models. Cheaper and more reliable than absolute ratings because humans are better at relative judgment than absolute scoring.
Likert scales — Rate on a 1–5 scale: “How coherent is this response?” Problematic because raters cluster differently (one person’s 3 is another’s 4) and disagree on what “coherent” means. Prefer A/B testing over Likert when possible.
Human evaluation is expensive, slow, and noisy. It does not scale. Use it to validate automated metrics and judge-models, then let those automated systems carry the evaluation load.
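For the inter-annotator agreement check mentioned above, Cohen’s kappa for two raters can be computed directly from their labels. The sketch below is a plain-Python version under the usual definition (observed agreement corrected for the agreement expected by chance); the labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)

# Hypothetical ratings of 10 responses by two annotators.
rater_1 = ["good"] * 6 + ["bad"] * 4
rater_2 = ["good"] * 4 + ["bad"] * 5 + ["good"]

print(cohens_kappa(rater_1, rater_2))  # approximately 0.4
```

The raters agree on 7 of 10 items, but kappa is only about 0.4 once chance agreement is removed, which by the 0.6 threshold above points to an underspecified rubric.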
Key Takeaways
- Perplexity is for training monitoring, not quality evaluation. Low perplexity ≠ good model. It means the model predicts text from its training distribution well, not that it reasons or follows instructions.
- N-gram metrics (BLEU, ROUGE) are legacy. They penalize valid paraphrases and ignore semantics. BERTScore or LLM-as-judge are better defaults for generation evaluation.
- Judge-models have biases. They prefer longer answers and answers from related model families. Calibrate against human judgment on your data.
- Benchmarks measure capabilities, not overall quality. A high MMLU score doesn’t mean the model is safe, helpful, or good at your specific task. Run your own eval.
- Goodhart’s law applies. When a metric becomes a target, it ceases to be a good metric. Training on benchmark data or optimizing prompts for a leaderboard inflates scores without improving real performance.
- Human eval validates automated eval. Not the other way around. If you can’t afford human eval, at minimum run LLM-as-judge and check a random sample of judgments manually.
References
- BLEU: Papineni et al., 2002. BLEU: a Method for Automatic Evaluation of Machine Translation
- ROUGE: Lin, 2004. ROUGE: A Package for Automatic Evaluation of Summaries
- BERTScore: Zhang et al., 2020. BERTScore: Evaluating Text Generation with BERT (arXiv)
- MT-Bench: Zheng et al., 2024. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (arXiv)
- Chatbot Arena: Chiang et al., 2024. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (arXiv)
- MMLU: Hendrycks et al., 2021. Measuring Massive Multitask Language Understanding (arXiv)
- HumanEval: Chen et al., 2021. Evaluating Large Language Models Trained on Code (arXiv)
- SWE-bench: Jimenez et al., 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arXiv)
- GSM8K: Cobbe et al., 2021. Training Verifiers to Solve Math Word Problems (arXiv)
- TruthfulQA: Lin et al., 2022. TruthfulQA: Measuring How Models Mimic Human Falsehoods (arXiv)