
Evaluation Index

Systematically evaluating generative AI systems.

Notes

  • LLM Evaluation Overview — Metrics, benchmark types, LLM-as-judge pipelines, and evaluation anti-patterns.
  • Benchmarks and Harness — Standard benchmarks (MMLU, HumanEval, GSM8K) and the lm-evaluation-harness framework.
  • Evaluating Code Models — pass@k metrics, HumanEval, MBPP, and code-specific evaluation approaches (see the pass@k estimator sketched after this list).
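
Since pass@k comes up in the code-model note above, here is the standard unbiased estimator from the HumanEval paper as a quick reference, with the usual notation assumed: $n$ samples drawn per problem, of which $c$ pass all unit tests, and $k \le n$:

$$
\text{pass@}k = \mathbb{E}_{\text{problems}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]
$$

Averaging the per-problem term $1 - \binom{n-c}{k}/\binom{n}{k}$ gives an unbiased estimate of the probability that at least one of $k$ samples passes, rather than naively checking only the first $k$ generations.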

Navigation

← Prev: Foundation Models | Next: Prompt Engineering →

Links

  • AI Engineering
