Loss Functions

Definition

A loss function (objective function) measures the discrepancy between model predictions and ground truth labels for a single example. The training objective is to minimize the average loss over the dataset, possibly with regularization terms.

Intuition

The choice of loss function encodes your assumptions about the error structure and what kinds of mistakes are costly. Squared error penalizes large residuals heavily; absolute error is robust to outliers; cross-entropy penalizes confident wrong predictions more than timid ones; hinge loss only cares whether the correct class wins by a margin.

Formal Description

Regression Losses

Mean Squared Error (MSE / L2 loss)

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)². Differentiable everywhere; penalizes large residuals (and hence outliers) heavily; minimizing it is equivalent to MLE under a Gaussian noise assumption.

Mean Absolute Error (MAE / L1 loss)

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|. Robust to outliers; non-differentiable at zero (a subgradient is used in practice); minimizing it is equivalent to MLE under a Laplace noise assumption.

Huber loss interpolates between the two: it uses L2 for small residuals (|r| ≤ δ) and L1 for large ones (|r| > δ), combining differentiability with outlier robustness.
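The three regression losses above can be sketched in NumPy (function names are illustrative):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: (1/n) * sum((y - y_hat)^2)."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error: (1/n) * sum(|y - y_hat|)."""
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    """Huber loss: quadratic for |residual| <= delta, linear beyond it."""
    r = y - y_hat
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.mean(np.where(np.abs(r) <= delta, quad, lin))
```

Note how Huber's two branches meet with matching value and slope at |r| = δ, which is what makes it differentiable everywhere despite the L1 tails.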


Classification Losses

Binary Cross-Entropy (log loss)

BCE = −[y log p + (1 − y) log(1 − p)]. Paired with a sigmoid output. Minimizing it is equivalent to maximum likelihood under a Bernoulli distribution.
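A minimal sketch of BCE computed directly from logits, using the algebraically folded-in sigmoid so large |z| cannot overflow (labels assumed in {0, 1}):

```python
import numpy as np

def bce_from_logits(y, z):
    """Binary cross-entropy from logits z, numerically stable:
    max(z, 0) - y*z + log(1 + exp(-|z|))
    equals -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]."""
    return np.mean(np.maximum(z, 0) - y * z + np.log1p(np.exp(-np.abs(z))))
```

Computing sigmoid first and then taking its log would underflow to log(0) for strongly negative logits; the folded form never exponentiates a positive number.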

Multi-class Cross-Entropy

The softmax + CE gradient has a clean form: ∂L/∂zᵢ = pᵢ − yᵢ, i.e. predicted probabilities minus the one-hot target. Cross-entropy equals the KL divergence from the empirical distribution plus an entropy constant, so minimizing CE is equivalent to minimizing KL(q ‖ p), where q is the empirical distribution and p the model's.
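The p − y gradient can be checked against finite differences; a minimal sketch:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ce_grad(z, y_onehot):
    """Analytic gradient of softmax + CE w.r.t. the logits: p - y."""
    return softmax(z) - y_onehot
```

The intermediate Jacobians of the softmax and the log cancel, which is why frameworks fuse the two operations into one op with this simple backward pass.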

Hinge loss (SVM)

L = max(0, 1 − y·f(x)). Only penalizes predictions within the margin of the decision boundary or on the wrong side of it; correct predictions beyond the margin incur zero loss. Multiclass extension (Weston-Watkins): L = Σⱼ≠ᵧ max(0, sⱼ − sᵧ + 1).
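The Weston-Watkins sum can be sketched as:

```python
import numpy as np

def multiclass_hinge(scores, y):
    """Weston-Watkins hinge: sum over j != y of max(0, s_j - s_y + 1)."""
    margins = np.maximum(0.0, scores - scores[y] + 1.0)
    margins[y] = 0.0                # the true class contributes no loss
    return margins.sum()
```

Once every competing score trails the true class by more than the margin of 1, the loss (and its gradient) is exactly zero, which is the "only cares whether the correct class wins by a margin" behavior noted above.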

Focal loss

Extension of binary CE that down-weights easy, well-classified examples (small loss) and focuses learning on hard examples: FL(pₜ) = −αₜ (1 − pₜ)^γ log pₜ. Parameters: γ (focusing), α (class balancing). Originally proposed for object detection with severe class imbalance (RetinaNet). At γ = 0 it reduces to weighted CE.
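A sketch of the binary focal loss under these definitions (the eps term is an assumption added here for log stability):

```python
import numpy as np

def focal_loss(y, p, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t),
    where p_t is the probability assigned to the true class."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t + eps))
```

The (1 − pₜ)^γ factor is what does the focusing: an easy example with pₜ = 0.9 and γ = 2 is scaled by 0.01, while a hard example with pₜ = 0.1 keeps 0.81 of its CE weight.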


Practical Notes

  • Class imbalance: use focal loss, per-class weighting, or oversampling with CE
  • Numerical stability: compute softmax + CE together via the log-sum-exp trick; never compute log(softmax(z)) naively
  • Ordinal targets: consider ranked loss or distance-aware objectives rather than plain CE
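The numerical-stability point above can be sketched: CE from logits is logsumexp(z) − z[y], and shifting by max(z) keeps every exponent non-positive.

```python
import numpy as np

def stable_ce(z, y):
    """Cross-entropy from logits via log-sum-exp: logsumexp(z) - z[y].
    Subtracting max(z) first keeps every exponent <= 0, so exp never overflows."""
    m = z.max()
    return (m + np.log(np.exp(z - m).sum())) - z[y]
```

A naive exp of a logit near 1000 overflows to inf and the subsequent log produces nan; the shifted form returns the same mathematical value without leaving the representable range.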

Applications

Loss             Typical use
MSE              Regression, autoencoders (output layer)
MAE              Robust regression, median estimation
Binary CE        Binary classification, sigmoid outputs
Multi-class CE   Multi-class classification, language modeling
Hinge            SVMs, margin-based classifiers
Focal            Object detection, severe class imbalance

Trade-offs

  • MSE is differentiable and analytically convenient but can dominate training when large outliers exist
  • MAE and Huber improve robustness, but Huber's threshold δ requires careful tuning
  • CE pushes toward confident probability estimates and can overfit to label noise
  • Focal loss adds two hyperparameters (γ, α) that require tuning