LLM Evaluation Overview
Purpose
Evaluation is the engineering discipline of measuring whether a generative AI system is doing what you want it to do. Without systematic evaluation, model changes — prompt updates, fine-tuning, retrieval changes, model swaps — cannot be validated, and regressions are invisible.
Evaluation for LLMs is fundamentally harder than for classical ML:
- Outputs are open-ended: no single ground truth exists for most generation tasks.
- Correctness is multi-dimensional: a response can be factually accurate and still be incoherent, verbose, or subtly harmful.
- Distribution shift is fast: model behaviour changes with prompt wording, temperature, or underlying model version.
An evaluation strategy should answer: is this system meeting user needs? Is it safe? Is it getting better or worse across versions? This requires layering multiple evaluation methods.
Architecture
Evaluation Taxonomy
Automated Metrics (reference-based)
Compare model output to a reference output:
- Exact match: output must equal reference string exactly; appropriate for constrained generation (SQL, structured JSON, classification labels).
- BLEU: n-gram precision with brevity penalty; standard for machine translation; known to correlate poorly with human judgement for long-form generation.
- ROUGE (ROUGE-1, ROUGE-L): recall-oriented overlap with the reference (unigrams for ROUGE-1, longest common subsequence for ROUGE-L); used for summarisation; same correlation limitations as BLEU.
- BERTScore: uses contextual BERT embeddings to compute precision/recall/F1 between generated and reference tokens; better semantic alignment than surface-level n-gram metrics.
- Perplexity: measures how well the model predicts a held-out text corpus; useful for comparing base model quality but not directly meaningful for task performance.
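To make the surface-overlap metrics above concrete, here is a minimal sketch of exact match, ROUGE-1 recall, and ROUGE-L recall. It assumes lowercased whitespace tokenisation and omits stemming and the F-measure variants that full ROUGE implementations provide; the function names are illustrative, not taken from any library.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    """Strict string equality, as used for labels, SQL, or structured output."""
    return prediction == reference


def rouge1_recall(prediction: str, reference: str) -> float:
    """Fraction of reference unigrams also present in the prediction (clipped counts)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    if not ref_counts:
        return 0.0
    overlap = sum(min(count, pred_counts[token]) for token, count in ref_counts.items())
    return overlap / sum(ref_counts.values())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, the quantity ROUGE-L is built on."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(prediction: str, reference: str) -> float:
    """LCS length divided by the number of reference tokens."""
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    return lcs_length(prediction.lower().split(), ref_tokens) / len(ref_tokens)
```

A rephrased but correct output shows why these metrics can mislead: for the reference "the cat sat on the mat" and the prediction "the mat was sat on by the cat", ROUGE-1 recall stays at 1.0 while ROUGE-L recall drops to 4/6 and exact match fails.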
Human Evaluation
- Crowdworker evaluation: scalable; cost ~1.00 per rating; high variance; requires careful rubric design and annotator calibration.
- Expert evaluation: slow and expensive; necessary for high-stakes domains (medical, legal).
- Preference judgements (pairwise): show annotators two responses (A/B) for the same prompt; ask which is better overall. More reliable than absolute Likert scales; used to produce Elo-ranked leaderboards (Chatbot Arena / LMSYS).
- Win rate vs a baseline model is the most reliable aggregate human eval signal.
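As a rough illustration of how pairwise preferences turn into the aggregate signals mentioned above, the sketch below computes a win rate and applies a standard Elo update. The K-factor of 32, the starting rating of 1000, and all names are assumptions chosen for the example; production leaderboards may fit ratings differently.

```python
def win_rate(scores: list[float]) -> float:
    """Mean outcome for the candidate: 1.0 = win, 0.5 = tie, 0.0 = loss."""
    if not scores:
        raise ValueError("no comparisons provided")
    return sum(scores) / len(scores)


def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo's predicted probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings: dict[str, float], model_a: str, model_b: str,
               score_a: float, k: float = 32.0) -> None:
    """Update both ratings in place after one comparison (score_a is A's outcome)."""
    exp_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (score_a - exp_a)
    ratings[model_b] += k * ((1.0 - score_a) - (1.0 - exp_a))


# Usage: two hypothetical models start level, then annotators prefer model_x once.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
update_elo(ratings, "model_x", "model_y", score_a=1.0)
```

Elo ratings depend on the order in which comparisons are processed, so a win rate against a fixed baseline is often easier to reproduce across studies.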
AI-as-a-Judge
Use a strong LLM (GPT-4, Claude 3.5, Llama 3 70B) to evaluate model outputs:
- Absolute scoring (G-Eval): prompt judge with a rubric; score on 1–5 scale per criterion.
- Pairwise comparison (MT-Bench / Chatbot Arena style): show judge two responses A and B; ask which is better; aggregate win rates. Pairwise judging is prone to position bias, which is reduced by evaluating both orders and averaging the verdicts.
- LLM-as-critic with reference: provide the judge with the correct answer or context; ask it to evaluate factuality.
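The order-swapping mitigation for pairwise judging can be sketched as follows. The `judge` callable and its "first" / "second" / "tie" verdict strings are hypothetical stand-ins for whatever wraps the judge model call; only the averaging logic is the point here.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, first_response, second_response) -> verdict
Judge = Callable[[str, str, str], str]
VALID_VERDICTS = {"first", "second", "tie"}


def debiased_preference(judge: Judge, prompt: str,
                        response_a: str, response_b: str) -> float:
    """Return A's score in [0, 1], averaged over both presentation orders."""

    def score_for_a(verdict: str, a_is_first: bool) -> float:
        if verdict not in VALID_VERDICTS:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        if verdict == "tie":
            return 0.5
        # A wins when the judge picked the slot that A occupied.
        return 1.0 if (verdict == "first") == a_is_first else 0.0

    forward = score_for_a(judge(prompt, response_a, response_b), a_is_first=True)
    backward = score_for_a(judge(prompt, response_b, response_a), a_is_first=False)
    return (forward + backward) / 2
```

A judge that always prefers the first slot yields 0.5 for every pair under this scheme, so pure position bias washes out instead of inflating one model's win rate.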
Task-Specific Automated Evaluation
- Code execution: generate code; run it against unit tests in a sandbox; pass/fail signal (pass@k metric).
- Tool call success rate: in agentic systems, measure whether the correct tool was called with correct parameters.
- Factuality checking: extract claims from the output; verify each claim against a knowledge source.
- Retrieval metrics (RAG): RAGAS framework — context precision, context recall, faithfulness, answer relevancy.
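For the code-execution item above, pass@k is usually estimated from n sampled completions per problem, of which c pass the tests. A minimal sketch of the unbiased estimator commonly used with HumanEval-style benchmarks, 1 - C(n-c, k) / C(n, k), is shown below; the function name and argument checks are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, passes the tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of `pass_at_k` over all problems. Generating more samples than k (n > k) reduces the variance of the estimate.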
Evaluation Levels
Component-level evaluation assesses individual system parts in isolation:
- Retriever: recall@k, MRR, NDCG on held-out query-document pairs.
- Generator: faithfulness and relevance given retrieved context.
- Reranker: NDCG vs gold relevance labels.
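The retriever and reranker metrics listed above can be sketched in a few lines. This version assumes document IDs are strings, uses linear gain for NDCG (some implementations use 2^rel - 1 instead), and the function names are illustrative.

```python
import math


def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result; MRR is the mean over queries."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""

    def dcg(values: list[float]) -> float:
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(doc_id, 0.0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

Recall@k ignores ordering within the top k, while MRR and NDCG reward placing relevant documents earlier, which is why NDCG is the usual choice for scoring a reranker.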
End-to-end evaluation measures the full pipeline on user-representative queries. It is critical for identifying cross-component failure modes: a retriever and a generator that each score well in isolation can still produce bad answers together, for example when the retrieved passages are on-topic but lack the specific fact the question asks for.
Evaluation Dataset Design
- Golden dataset: curated (prompt, ideal response or label) pairs, human-verified; 100–1000 examples per capability area.
- Adversarial set: jailbreak attempts, edge cases, ambiguous inputs; verifies safety and robustness.
- Regression set: cases where previous model versions failed; ensures fixes don’t regress.
- Distribution-representative set: sampled from actual production traffic (with PII redacted); reflects real user behaviour.
Implementation Notes
AI-as-Judge Methodology
Building a reliable AI-as-judge pipeline:
- Choose a strong judge: GPT-4o or Claude 3.5 Sonnet for general tasks; Llama 3 70B-Instruct if you need a self-hosted judge.
- Define a structured rubric: avoid vague instructions. Break down evaluation into specific criteria:
  - Faithfulness: does the response contain only claims supported by the provided context?
  - Relevance: does the response directly address the user’s question?
  - Completeness: are all required aspects of the question addressed?
  - Coherence: is the response well-structured and logically consistent?
- Pairwise over absolute scoring: absolute scores (1–10) have poor inter-rater reliability. Pairwise (A vs B) is more reliable and produces win-rate statistics.
- Control for position bias: the judge tends to prefer whichever response appears first. Mitigate by evaluating both orderings (A,B) and (B,A); average results.
- Calibrate against human judgements: measure Pearson/Spearman correlation between judge scores and human scores on a calibration set; target r > 0.7.
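The calibration step in the last item can be sketched with a standard-library Spearman correlation (Pearson correlation computed on average ranks). The function names and the sample scores in the usage lines are made up for illustration.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation: Pearson on the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Usage with invented scores for the same five responses.
judge_scores = [4, 2, 5, 3, 1]
human_scores = [5, 2, 4, 3, 1]
judge_is_calibrated = spearman(judge_scores, human_scores) > 0.7
```

Spearman is the safer default for ordinal 1-5 ratings because it only assumes the two raters order responses similarly, not that their scales are linearly related.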
# Example judge prompt structure (G-Eval style)
JUDGE_PROMPT = """
You are an expert evaluator. Given the following question, context, and response,
score the response on Faithfulness (1-5) and Relevance (1-5).
Question: {question}
Context: {context}
Response: {response}
Output JSON: {{"faithfulness": <int>, "relevance": <int>, "reasoning": "<str>"}}
"""

LangSmith provides a managed eval pipeline: dataset management, experiment tracking, LLM-as-judge integration, and A/B comparison dashboards. For self-hosted pipelines, combine HuggingFace datasets for golden sets with custom evaluation scripts tracked in MLflow or W&B.
Evaluation Cadence
- Run lightweight automated eval (fast metrics, AI judge on small sample) on every PR/deployment.
- Run full human eval before major model version changes.
- Monitor production metrics continuously: user ratings, task completion rates, safety filter activations.
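One way to wire the per-PR step into CI is a simple regression gate that compares the candidate's scores with a stored baseline. The sketch below assumes every metric is higher-is-better; the 0.02 tolerance and the metric names are arbitrary placeholders.

```python
def regression_gate(baseline: dict[str, float], candidate: dict[str, float],
                    max_drop: float = 0.02) -> list[str]:
    """Return the metrics that are missing from the candidate run or that
    fell more than `max_drop` below the baseline score."""
    failures = []
    for metric, base_score in baseline.items():
        cand_score = candidate.get(metric)
        if cand_score is None or base_score - cand_score > max_drop:
            failures.append(metric)
    return sorted(failures)


# Usage with invented scores: relevance dropped by 0.05, so the gate flags it.
failed = regression_gate(
    baseline={"faithfulness": 0.90, "relevance": 0.85},
    candidate={"faithfulness": 0.91, "relevance": 0.80},
)
```

A CI job would fail the build when the returned list is non-empty. On small samples the tolerance needs to be wide enough to absorb run-to-run noise from sampling temperature and judge variance.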
Trade-offs
Human evaluation: highest signal quality; captures nuance and user preference; required for ground-truth calibration. Expensive (5,000 per study), slow (days to weeks), and hard to reproduce.
Automated reference metrics (BLEU/ROUGE): fast and reproducible; well understood. Correlate poorly with human preference for open-ended generation; mislead on rephrased-but-correct outputs.
AI-as-judge: scales to thousands of examples cheaply; often correlates well with human judgement (r ≈ 0.7–0.9 with GPT-4 judge). Has systematic biases: prefers longer responses (verbosity bias), prefers its own outputs (self-preference bias), sensitive to prompt wording. Do not use as a sole evaluation signal.
Execution-based (code, tool calls): objective, reproducible, no human or judge required. Only applicable to constrained output spaces; requires maintained test suites.
References
- Liu et al. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. EMNLP 2023.
- Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023 Datasets and Benchmarks Track.
- Es et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv.
- Zhang et al. (2019). BERTScore: Evaluating Text Generation with BERT. ICLR 2020.
- Chiang et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.