Loss Functions

Definition

A loss function (objective function) measures the discrepancy between model predictions and ground truth labels for a single example. The training objective is to minimize the average loss over the dataset, possibly with regularization terms.

Intuition

The choice of loss function encodes your assumptions about the error structure and what kinds of mistakes are costly. Squared error penalizes large residuals heavily; absolute error is robust to outliers; cross-entropy penalizes confident wrong predictions more than timid ones; hinge loss only cares whether the correct class wins by a margin.

Formal Description

Regression Losses

Mean Squared Error (MSE / L2 loss)

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)². Differentiable everywhere; penalizes large residuals (and hence outliers) heavily; minimizing it is equivalent to MLE under a Gaussian noise assumption.

Mean Absolute Error (MAE / L1 loss)

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|. Robust to outliers; non-differentiable at zero (a subgradient is used in practice); minimizing it is equivalent to MLE under a Laplace noise assumption.

Huber loss interpolates between the two: it uses L2 for small residuals (|r| ≤ δ) and L1 for large ones (|r| > δ), combining differentiability with outlier robustness.
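The three regression losses above can be sketched in NumPy (function names are illustrative):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: (1/n) * sum((y - y_hat)^2)."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error: (1/n) * sum(|y - y_hat|)."""
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    """Huber loss: quadratic for |residual| <= delta, linear beyond it."""
    r = y - y_hat
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.mean(np.where(np.abs(r) <= delta, quad, lin))
```

Note how Huber's two branches meet with matching value and slope at |r| = δ, which is what makes it differentiable everywhere despite the L1 tails.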


Classification Losses

Binary Cross-Entropy (log loss)

BCE = −[y log p + (1 − y) log(1 − p)]. Paired with a sigmoid output. Minimizing it is equivalent to maximum likelihood under a Bernoulli distribution.
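A minimal sketch of BCE computed directly from logits, using the algebraically folded-in sigmoid so large |z| cannot overflow (labels assumed in {0, 1}):

```python
import numpy as np

def bce_from_logits(y, z):
    """Binary cross-entropy from logits z, numerically stable:
    max(z, 0) - y*z + log(1 + exp(-|z|))
    equals -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]."""
    return np.mean(np.maximum(z, 0) - y * z + np.log1p(np.exp(-np.abs(z))))
```

Computing sigmoid first and then taking its log would underflow to log(0) for strongly negative logits; the folded form never exponentiates a positive number.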

Multi-class Cross-Entropy

The softmax + CE gradient has a clean form: ∂L/∂zᵢ = pᵢ − yᵢ, i.e. predicted probabilities minus the one-hot target. Cross-entropy equals the KL divergence from the empirical distribution plus an entropy constant, so minimizing CE is equivalent to minimizing KL(q ‖ p), where q is the empirical distribution and p the model's.
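The p − y gradient can be checked against finite differences; a minimal sketch:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ce_grad(z, y_onehot):
    """Analytic gradient of softmax + CE w.r.t. the logits: p - y."""
    return softmax(z) - y_onehot
```

The intermediate Jacobians of the softmax and the log cancel, which is why frameworks fuse the two operations into one op with this simple backward pass.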

Hinge loss (SVM)

L = max(0, 1 − y·f(x)). Only penalizes predictions within the margin of the decision boundary or on the wrong side of it; correct predictions beyond the margin incur zero loss. Multiclass extension (Weston-Watkins): L = Σⱼ≠ᵧ max(0, sⱼ − sᵧ + 1).
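The Weston-Watkins sum can be sketched as:

```python
import numpy as np

def multiclass_hinge(scores, y):
    """Weston-Watkins hinge: sum over j != y of max(0, s_j - s_y + 1)."""
    margins = np.maximum(0.0, scores - scores[y] + 1.0)
    margins[y] = 0.0                # the true class contributes no loss
    return margins.sum()
```

Once every competing score trails the true class by more than the margin of 1, the loss (and its gradient) is exactly zero, which is the "only cares whether the correct class wins by a margin" behavior noted above.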

Focal loss

Extension of binary CE that down-weights easy, well-classified examples (small loss) and focuses learning on hard examples: FL(pₜ) = −αₜ (1 − pₜ)^γ log pₜ. Parameters: γ (focusing), α (class balancing). Originally proposed for object detection with severe class imbalance (RetinaNet). At γ = 0 it reduces to weighted CE.
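A sketch of the binary focal loss under these definitions (the eps term is an assumption added here for log stability):

```python
import numpy as np

def focal_loss(y, p, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t),
    where p_t is the probability assigned to the true class."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t + eps))
```

The (1 − pₜ)^γ factor is what does the focusing: an easy example with pₜ = 0.9 and γ = 2 is scaled by 0.01, while a hard example with pₜ = 0.1 keeps 0.81 of its CE weight.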


Practical Notes

  • Class imbalance: use focal loss, per-class weighting, or oversampling with CE
  • Numerical stability: compute softmax + CE together via the log-sum-exp trick; never compute log(softmax(z)) naively
  • Ordinal targets: consider ranked loss or distance-aware objectives rather than plain CE
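The numerical-stability point above can be sketched: CE from logits is logsumexp(z) − z[y], and shifting by max(z) keeps every exponent non-positive.

```python
import numpy as np

def stable_ce(z, y):
    """Cross-entropy from logits via log-sum-exp: logsumexp(z) - z[y].
    Subtracting max(z) first keeps every exponent <= 0, so exp never overflows."""
    m = z.max()
    return (m + np.log(np.exp(z - m).sum())) - z[y]
```

A naive exp of a logit near 1000 overflows to inf and the subsequent log produces nan; the shifted form returns the same mathematical value without leaving the representable range.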

Applications

Loss             Typical use
MSE              Regression, autoencoders (output layer)
MAE              Robust regression, median estimation
Binary CE        Binary classification, sigmoid outputs
Multi-class CE   Multi-class classification, language modeling
Hinge            SVMs, margin-based classifiers
Focal            Object detection, severe class imbalance

Trade-offs

  • MSE is differentiable and analytically convenient but can dominate training when large outliers exist
  • MAE and Huber improve robustness, but Huber's threshold δ requires careful tuning
  • CE pushes toward confident probability estimates and can overfit to label noise
  • Focal loss adds two hyperparameters (γ, α) that require tuning