LLM Evaluation Overview

Purpose

Evaluation is the engineering discipline of measuring whether a generative AI system is doing what you want it to do. Without systematic evaluation, model changes — prompt updates, fine-tuning, retrieval changes, model swaps — cannot be validated, and regressions are invisible.

Evaluation for LLMs is fundamentally harder than for classical ML:

  • Outputs are open-ended: no single ground truth exists for most generation tasks.
  • Correctness is multi-dimensional: a response can be factually accurate, incoherent, verbose, and subtly harmful simultaneously.
  • Behaviour is unstable: outputs shift with prompt wording, temperature, or the underlying model version.

An evaluation strategy should answer: is this system meeting user needs? Is it safe? Is it getting better or worse across versions? This requires layering multiple evaluation methods.

Architecture

Evaluation Taxonomy

Automated Metrics (reference-based)

Compare model output to a reference output (a computation sketch follows the list):

  • Exact match: output must equal reference string exactly; appropriate for constrained generation (SQL, structured JSON, classification labels).
  • BLEU: n-gram precision with brevity penalty; standard for machine translation; known to correlate poorly with human judgement for long-form generation.
  • ROUGE (ROUGE-1, ROUGE-L): recall-oriented n-gram overlap; used for summarisation; same correlation limitations.
  • BERTScore: uses contextual BERT embeddings to compute precision/recall/F1 between generated and reference tokens; better semantic alignment than surface-level n-gram metrics.
  • Perplexity: measures how well the model predicts a held-out text corpus; useful for comparing base model quality but not directly meaningful for task performance.
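
These metrics are all available through the HuggingFace evaluate library. A minimal sketch, assuming the evaluate package plus the sacrebleu, rouge_score, and bert_score backends are installed:

# Reference-based metrics via the HuggingFace `evaluate` library.
import evaluate

predictions = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Exact match: 1.0 only when prediction and reference are identical strings.
exact = float(predictions[0] == references[0])

# BLEU expects one list of references per prediction.
bleu = evaluate.load("bleu").compute(
    predictions=predictions, references=[[r] for r in references]
)

# ROUGE returns rouge1 / rougeL F-measures.
rouge = evaluate.load("rouge").compute(
    predictions=predictions, references=references
)

# BERTScore returns per-example precision / recall / F1 lists.
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)

print(exact, bleu["bleu"], rouge["rougeL"], bertscore["f1"][0])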

Human Evaluation

  • Crowdworker evaluation: scalable; cost ~$1.00 per rating; high variance; requires careful rubric design and annotator calibration.
  • Expert evaluation: slow and expensive; necessary for high-stakes domains (medical, legal).
  • Preference judgements (pairwise): show annotators two responses (A/B) for the same prompt; ask which is better overall. More reliable than absolute Likert scales; used to produce Elo-ranked leaderboards (Chatbot Arena / LMSYS); see the Elo sketch after this list.
  • Win rate against a fixed baseline model is often the most reliable aggregate human-eval signal.
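
Pairwise preferences aggregate into Elo ratings. A minimal sketch of the standard Elo update (the K-factor of 32 is an illustrative choice, not the parameterisation of any particular leaderboard):

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update from a single pairwise comparison.
    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    new_r_a = r_a + k * (score_a - expected_a)
    new_r_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_r_a, new_r_b

# Two models start at 1000; model A wins one comparison.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)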

AI-as-a-Judge

Use a strong LLM (GPT-4, Claude 3.5, Llama 3 70B) to evaluate model outputs:

  • Absolute scoring (G-Eval): prompt the judge with a rubric; score on a 1–5 scale per criterion.
  • Pairwise comparison (MT-Bench / Chatbot Arena style): show judge two responses A and B; ask which is better; aggregate win rates. Reduces position bias by swapping order and averaging.
  • LLM-as-critic with reference: provide the judge with the correct answer or context; ask it to evaluate factuality.

Task-Specific Automated Evaluation

  • Code execution: generate code; run it against unit tests in a sandbox; pass/fail signal (pass@k metric; an estimator sketch follows this list).
  • Tool call success rate: in agentic systems, measure whether the correct tool was called with correct parameters.
  • Factuality checking: extract claims from the output; verify each claim against a knowledge source.
  • Retrieval metrics (RAG): RAGAS framework — context precision, context recall, faithfulness, answer relevancy.
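
pass@k is conventionally computed with an unbiased estimator rather than naive resampling: from n generations per problem, of which c pass, pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch:

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c pass,
    passes the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# 200 generations per problem, 42 pass the tests.
print(pass_at_k(n=200, c=42, k=1))   # 0.21 (equals c/n for k=1)
print(pass_at_k(n=200, c=42, k=10))  # ~0.91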

Evaluation Levels

Component-level evaluation assesses individual system parts in isolation (a metric sketch follows the list):

  • Retriever: recall@k, MRR, NDCG on held-out query-document pairs.
  • Generator: faithfulness and relevance given retrieved context.
  • Reranker: NDCG vs gold relevance labels.
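
A minimal sketch of recall@k and MRR over one ranked result list (gold labels assumed to be a set of relevant document ids; NDCG follows the same pattern with graded relevance):

def recall_at_k(ranked_ids: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents retrieved in the top k results."""
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(ranked_ids: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none found)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example query: gold-relevant docs are d2 and d9.
ranked = ["d4", "d2", "d7", "d9"]
print(recall_at_k(ranked, {"d2", "d9"}, k=3))  # 0.5
print(mrr(ranked, {"d2", "d9"}))               # 0.5 (first hit at rank 2)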

End-to-end evaluation measures the full pipeline on user-representative queries. It is critical for catching cross-component failure modes: components that each score well in isolation can still compose badly, e.g. a retriever with strong recall@k may surface context that the generator nevertheless fails to ground its answer in.

Evaluation Dataset Design

  • Golden dataset: curated (prompt, ideal response or label) pairs, human-verified; 100–1000 examples per capability area (a possible record layout is sketched after this list).
  • Adversarial set: jailbreak attempts, edge cases, ambiguous inputs; verifies safety and robustness.
  • Regression set: cases where previous model versions failed; ensures fixes don’t regress.
  • Distribution-representative set: sampled from actual production traffic (with PII redacted); reflects real user behaviour.
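
All four sets can share one record layout. A sketch of a possible golden-set schema stored as JSONL; the field names are illustrative, not a standard:

import json
from dataclasses import dataclass, asdict

@dataclass
class GoldenExample:
    # Illustrative schema for one golden-set record.
    id: str
    capability: str        # e.g. "summarisation", "sql-generation"
    prompt: str
    ideal_response: str
    source: str            # "curated" | "adversarial" | "regression" | "production"

example = GoldenExample(
    id="sum-0001",
    capability="summarisation",
    prompt="Summarise the attached incident report in three sentences.",
    ideal_response="...",
    source="curated",
)
print(json.dumps(asdict(example)))  # one JSONL line per example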

Implementation Notes

AI-as-Judge Methodology

Building a reliable AI-as-judge pipeline:

  1. Choose a strong judge: GPT-4o or Claude 3.5 Sonnet for general tasks; Llama 3 70B-Instruct if you need a self-hosted judge.
  2. Define a structured rubric: avoid vague instructions. Break down evaluation into specific criteria:
    • Faithfulness: does the response contain only claims supported by the provided context?
    • Relevance: does the response directly address the user’s question?
    • Completeness: are all required aspects of the question addressed?
    • Coherence: is the response well-structured and logically consistent?
  3. Pairwise over absolute scoring: absolute scores (1–10) have poor inter-rater reliability. Pairwise (A vs B) is more reliable and produces win-rate statistics.
  4. Control for position bias: the judge tends to prefer whichever response appears first. Mitigate by evaluating both orderings (A,B) and (B,A) and averaging the results.
  5. Calibrate against human judgements: measure Pearson/Spearman correlation between judge scores and human scores on a calibration set; target r > 0.7. Sketches of steps 4 and 5 follow the example prompt below.

# Example judge prompt structure (G-Eval style).
# Double braces {{ }} survive str.format() as literal braces, so the
# judge is instructed to emit JSON of exactly this shape.
JUDGE_PROMPT = """
You are an expert evaluator. Given the following question and response,
score the response on Faithfulness (1-5) and Relevance (1-5).

Question: {question}
Context: {context}
Response: {response}

Output JSON: {{"faithfulness": <int>, "relevance": <int>, "reasoning": "<str>"}}
"""

LangSmith provides a managed eval pipeline: dataset management, experiment tracking, LLM-as-judge integration, and A/B comparison dashboards. For self-hosted pipelines, combine HuggingFace datasets for golden sets with custom evaluation scripts tracked in MLflow or W&B.

Evaluation Cadence

  • Run lightweight automated eval (fast metrics, AI judge on a small sample) on every PR/deployment; a minimal CI gate sketch follows this list.
  • Run full human eval before major model version changes.
  • Monitor production metrics continuously: user ratings, task completion rates, safety filter activations.
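
One way to wire the per-PR eval into CI is a gate script that fails the build when the candidate model underperforms the production baseline. Everything here (file name, field name, threshold) is illustrative:

import json
import sys

# Illustrative gate: fail CI if the candidate's judge win rate against
# the production baseline drops below a threshold on the eval sample.
WIN_RATE_THRESHOLD = 0.5  # assumed threshold; tune per product

with open("eval_results.json") as f:        # assumed output of the eval run
    results = json.load(f)

win_rate = results["candidate_win_rate"]    # assumed field name
if win_rate < WIN_RATE_THRESHOLD:
    print(f"FAIL: win rate {win_rate:.2f} < {WIN_RATE_THRESHOLD}")
    sys.exit(1)
print(f"PASS: win rate {win_rate:.2f}")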

Trade-offs

Human evaluation: highest signal quality; captures nuance and user preference; required for ground-truth calibration. Expensive (~$5,000 per study), slow (days to weeks), and hard to reproduce.

Automated reference metrics (BLEU/ROUGE): fast and reproducible; well understood. Correlate poorly with human preference for open-ended generation; mislead on rephrased-but-correct outputs.

AI-as-judge: scales to thousands of examples cheaply; often correlates well with human judgement (r ≈ 0.7–0.9 with GPT-4 judge). Has systematic biases: prefers longer responses (verbosity bias), prefers its own outputs (self-preference bias), sensitive to prompt wording. Do not use as a sole evaluation signal.

Execution-based (code, tool calls): objective, reproducible, no human or judge required. Only applicable to constrained output spaces; requires maintained test suites.

References

  • Liu et al. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. EMNLP 2023.
  • Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023 Datasets and Benchmarks.
  • Es et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint.
  • Zhang et al. (2020). BERTScore: Evaluating Text Generation with BERT. ICLR 2020.
  • Chiang et al. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. ICML 2024.