Evaluation Index
Systematically evaluating generative AI systems.
Notes
- LLM Evaluation Overview — Metrics, benchmark types, LLM-as-judge pipelines, and evaluation anti-patterns.
- Benchmarks and Harness — Standard benchmarks (MMLU, HumanEval, GSM8K) and the lm-evaluation-harness framework (a minimal harness invocation is sketched after this list).
- Evaluating Code Models — the pass@k metric, HumanEval, MBPP, and code-specific evaluation approaches (a pass@k estimator is sketched after this list).
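The lm-evaluation-harness entry above can also be driven programmatically rather than only from the command line. The following is a minimal sketch assuming a recent (0.4.x) release of EleutherAI's lm-evaluation-harness, where simple_evaluate is exposed at the package level; exact argument names and task identifiers may differ between versions, and gpt2 is only a stand-in model id.

```python
# Minimal sketch: score one model on one benchmark task with
# EleutherAI's lm-evaluation-harness (assumes a 0.4.x release).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # any HF model id; gpt2 is a stand-in
    tasks=["gsm8k"],               # task names come from the harness registry
    num_fewshot=5,
    batch_size=8,
)

# Per-task metrics (e.g. exact-match accuracy) live under results["results"].
print(results["results"]["gsm8k"])
```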
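For the pass@k metric referenced in Evaluating Code Models, the standard unbiased estimator from the Codex paper (Chen et al., 2021) is 1 - C(n-c, k)/C(n, k), averaged over problems, where n is the number of completions sampled per problem and c is the number that pass the unit tests. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for a single problem.

    n: total completions sampled for the problem
    c: completions that passed the unit tests
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a pass.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), expanded as a numerically stable product.
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 7 of 200 sampled completions pass the tests.
print(pass_at_k(n=200, c=7, k=1), pass_at_k(n=200, c=7, k=10))
```

The per-problem estimates are then averaged across the benchmark to report pass@k.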
Navigation
← Prev: Foundation Models | Next: Prompt Engineering →