Evaluation Metrics
Definition
Scalar performance measures used to guide model selection and compare systems; a single well-chosen metric enables fast iteration.
Intuition
If you cannot rank two systems with a single number, iteration slows down. The key insight is to separate what you maximize (the optimizing metric) from what must merely be acceptable (the satisficing metrics).
Formal Description
Single-number metric: any scalar that ranks candidate models — accuracy, F1, AUC, BLEU, WER, etc.; choose one primary metric per project and don’t average incomparable metrics without care.
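A minimal sketch of ranking candidates by one primary metric, assuming scikit-learn is available; the model names, labels, and predictions are invented toy data:

```python
# Rank candidate models by a single primary metric (F1 here).
# y_true and both prediction lists are hypothetical toy data.
from sklearn.metrics import f1_score

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
candidates = {
    "model_a": [0, 1, 0, 0, 1, 0, 1, 1],
    "model_b": [0, 1, 1, 1, 1, 0, 0, 1],
}

# One scalar per model gives an unambiguous ranking.
scores = {name: f1_score(y_true, preds) for name, preds in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```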
Satisficing vs optimizing:
- Pick one metric to maximize (optimizing)
- Set threshold constraints on others (satisficing)
- Example: maximize accuracy s.t. inference latency < 100 ms, model size < 50 MB
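The constrained selection in the example above reduces to a filter plus a max; a sketch with invented candidate stats:

```python
# Satisficing + optimizing selection: filter on constraints, then maximize.
# The candidate records are illustrative, not real benchmark results.
candidates = [
    {"name": "small",  "accuracy": 0.91, "latency_ms": 35,  "size_mb": 20},
    {"name": "medium", "accuracy": 0.94, "latency_ms": 80,  "size_mb": 45},
    {"name": "large",  "accuracy": 0.96, "latency_ms": 140, "size_mb": 220},
]

feasible = [c for c in candidates
            if c["latency_ms"] < 100 and c["size_mb"] < 50]  # satisficing
best = max(feasible, key=lambda c: c["accuracy"])            # optimizing
print(best["name"])  # "medium": "large" is more accurate but violates latency
```

Note that the most accurate model loses here, which is the point: the constraints do real work instead of being buried in a weighted sum.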
Trade-off surfaces: when two metrics are genuinely in tension (precision–recall, accuracy–latency), the satisficing framework makes the trade-off explicit rather than hiding it in a weighted sum.
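To make such a surface concrete, one sketch (synthetic scores, assuming NumPy and scikit-learn) sweeps the decision threshold and maximizes precision subject to a recall floor:

```python
# Explicit precision-recall trade-off: sweep thresholds, then apply a
# satisficing constraint (recall >= 0.9) rather than a hidden weighted sum.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = y_true * 0.6 + rng.random(500) * 0.7  # noisy synthetic scores

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ok = recall[:-1] >= 0.9  # the final (precision, recall) point has no threshold
best_idx = int(np.argmax(np.where(ok, precision[:-1], -np.inf)))
print(f"threshold={thresholds[best_idx]:.3f}, "
      f"precision={precision[best_idx]:.3f}, recall={recall[best_idx]:.3f}")
```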
When to combine metrics: a weighted combination is reasonable if relative importance is stable across the project lifetime; otherwise prefer satisficing constraints.
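For contrast, a weighted combination is only a few lines, but the 0.7/0.3 weights below are arbitrary placeholder assumptions, which is exactly the fragility the satisficing framing avoids:

```python
# A weighted single-number metric. Sensible only while relative importance
# is stable; the 0.7 / 0.3 split is an illustrative assumption.
def combined_score(accuracy: float, latency_ms: float,
                   max_latency_ms: float = 200.0) -> float:
    norm_latency = min(latency_ms / max_latency_ms, 1.0)  # scale to [0, 1]
    return 0.7 * accuracy - 0.3 * norm_latency  # penalize higher latency

print(combined_score(accuracy=0.94, latency_ms=80))  # 0.538
```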
Applications
- NLP: BLEU, ROUGE, perplexity
- Vision: top-1 accuracy, mAP
- Speech: WER (see the sketch after this list)
- Production systems: latency + accuracy jointly constrained
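As one concrete domain example, WER can be computed with the third-party jiwer package (an assumption; any edit-distance implementation would do):

```python
# Word error rate: (substitutions + deletions + insertions) / reference words.
# Assumes `pip install jiwer`; the sentences are toy examples.
import jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat"

print(jiwer.wer(reference, hypothesis))  # 1 substitution / 6 words ~= 0.167
```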
Trade-offs
- Any single metric is a proxy and can be gamed
- Metrics should be re-evaluated as project requirements mature
- Business constraints (latency, cost, fairness) often dominate raw accuracy in deployment