
Evaluation Index

Systematically evaluating generative AI systems.

Notes

  • LLM Evaluation Overview — Metrics, benchmark types, LLM-as-judge pipelines, and evaluation anti-patterns.
  • Benchmarks and Harness — Standard benchmarks (MMLU, HumanEval, GSM8K) and the lm-evaluation-harness framework.
  • Evaluating Code Models — pass@k metrics, HumanEval, MBPP, and code-specific evaluation approaches (see the pass@k estimator sketched after this list).
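
Since pass@k comes up in the code-model note above, here is the standard unbiased estimator from the HumanEval paper as a quick reference, with the usual notation assumed: $n$ samples drawn per problem, of which $c$ pass all unit tests, and $k \le n$:

$$
\text{pass@}k = \mathbb{E}_{\text{problems}}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]
$$

Averaging the per-problem term $1 - \binom{n-c}{k}/\binom{n}{k}$ gives an unbiased estimate of the probability that at least one of $k$ samples passes, rather than naively checking only the first $k$ generations.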

Navigation

← Prev: Foundation Models | Next: Prompt Engineering →

Links

  • AI Engineering
