Evaluation Metrics

Definition

Scalar performance measures used to guide model selection and compare systems; properly structured metrics enable fast iteration.

Intuition

If you cannot rank two systems with a single number, iteration slows down. The key insight is to separate what you maximize (the optimizing metric) from what must merely be acceptable (the satisficing metrics).

Formal Description

Single-number metric: any scalar that ranks candidate models (accuracy, F1, AUC, BLEU, WER, etc.). Choose one primary metric per project, and don't average incomparable metrics without care.
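A minimal sketch of what "a scalar that ranks candidate models" means in practice, using F1 as the primary metric. The model names and precision/recall values are illustrative, not from the text:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: one number to rank by."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical candidates; each collapses to a single comparable score.
candidates = {
    "model_a": f1(precision=0.90, recall=0.70),
    "model_b": f1(precision=0.80, recall=0.85),
}

best = max(candidates, key=candidates.get)
```

Note how F1 itself already averages two metrics, but precision and recall are commensurable (both are rates on the same predictions), which is what makes the combination defensible.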

Satisficing vs optimizing:

  • Pick one metric to maximize (optimizing)
  • Set threshold constraints on others (satisficing)
  • Example: maximize accuracy s.t. inference latency < 100 ms, model size < 50 MB
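The example above can be sketched as a two-step selection: filter by the satisficing constraints, then maximize the optimizing metric over what remains. The candidate models and their numbers are made up for illustration:

```python
# Hypothetical candidates with accuracy, latency, and size measurements.
candidates = [
    {"name": "big",   "accuracy": 0.95, "latency_ms": 180, "size_mb": 120},
    {"name": "mid",   "accuracy": 0.92, "latency_ms": 80,  "size_mb": 45},
    {"name": "small", "accuracy": 0.88, "latency_ms": 30,  "size_mb": 10},
]

# Satisficing: constraints must hold, margins earn no credit.
feasible = [m for m in candidates
            if m["latency_ms"] < 100 and m["size_mb"] < 50]

# Optimizing: maximize the one primary metric among feasible models.
best = max(feasible, key=lambda m: m["accuracy"])
```

The "big" model wins on raw accuracy but is eliminated by the constraints, which is exactly the behavior a weighted sum would obscure.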

Trade-off surfaces: when two metrics are genuinely in tension (precision–recall, accuracy–latency), the satisficing framework makes the trade-off explicit rather than hiding it in a weighted sum.
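One way to make such a trade-off surface explicit is to compute the Pareto frontier: the models not dominated on both metrics at once. A sketch with illustrative accuracy–latency data:

```python
# Hypothetical (name, accuracy, latency_ms) tuples;
# higher accuracy is better, lower latency is better.
models = [("a", 0.95, 180), ("b", 0.92, 80), ("c", 0.88, 30), ("d", 0.85, 90)]

def dominated(m, others):
    """m is dominated if some other model is at least as good on both
    metrics and strictly better on at least one."""
    return any(o[1] >= m[1] and o[2] <= m[2] and (o[1] > m[1] or o[2] < m[2])
               for o in others if o is not m)

frontier = [m for m in models if not dominated(m, models)]
```

Model "d" drops out because "b" is both more accurate and faster; every model left on the frontier represents a genuine trade-off a satisficing threshold can then resolve.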

When to combine metrics: a weighted combination is reasonable if relative importance is stable across the project lifetime; otherwise prefer satisficing constraints.
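When a weighted combination is appropriate, it can be as simple as the sketch below. The weights here are illustrative assumptions; the point is that they encode a fixed exchange rate between metrics, which is only safe if that rate is stable:

```python
def combined_score(accuracy: float, latency_ms: float,
                   w_acc: float = 1.0, w_lat: float = 0.001) -> float:
    """Higher is better: reward accuracy, penalize latency.
    The weights fix how much accuracy one millisecond is worth."""
    return w_acc * accuracy - w_lat * latency_ms
```

If requirements shift (say latency suddenly matters far more), every historical score computed with the old weights becomes incomparable, which is why satisficing constraints are usually the safer default.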

Applications

  • NLP: BLEU, ROUGE, perplexity
  • Vision: top-1 accuracy, mAP
  • Speech: WER
  • Production systems: latency + accuracy jointly constrained
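Of the metrics listed above, WER has a compact standard definition worth sketching: word-level edit distance (substitutions, insertions, deletions) divided by reference length, computed with the classic Levenshtein dynamic program:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis is much longer than the reference, a common surprise when first interpreting it as a percentage.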

Trade-offs

  • Any single metric is a proxy and can be gamed
  • Metrics should be re-evaluated as project requirements mature
  • Business constraints (latency, cost, fairness) often dominate raw accuracy in deployment