Cross-Entropy Loss

Definition

The negative log-likelihood under a categorical distribution; measures the expected surprise of predictions relative to true labels.

Intuition

The log penalizes confident wrong predictions heavily — if the model assigns near-zero probability to the true class, the loss explodes. Minimizing cross-entropy is equivalent to maximum likelihood estimation for categorical models.
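
To make the "exploding loss" concrete, here is a toy computation of the per-example loss $-\log \hat{y}_{\text{true}}$, using nothing beyond the definition above (the helper name `nll` is just for illustration):

```python
import math

def nll(p_true_class):
    """Negative log-likelihood assigned to the true class."""
    return -math.log(p_true_class)

# A confident correct prediction costs almost nothing...
print(round(nll(0.99), 4))   # → 0.0101
# ...while near-zero probability on the true class blows up the loss.
print(round(nll(0.001), 4))  # → 6.9078
```

Each factor-of-10 drop in the probability assigned to the true class adds a constant $\log 10 \approx 2.3$ to the loss, so confidence in the wrong answer is penalized without bound.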

Formal Description

Binary CE (paired with a sigmoid output $\hat{y} = \sigma(z)$, label $y \in \{0, 1\}$):

$$L = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]$$

Multi-class CE (paired with softmax, one-hot label $y$ over $K$ classes):

$$L = -\sum_{k=1}^{K} y_k \log \hat{y}_k, \qquad \hat{y} = \mathrm{softmax}(z)$$

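Both forms can be sketched directly from the definitions, assuming the inputs are already valid probabilities (the function names here are illustrative):

```python
import math

def binary_ce(y, y_hat):
    """Binary cross-entropy for one example; y in {0, 1}, y_hat in (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def multiclass_ce(y_onehot, y_hat):
    """Multi-class cross-entropy; y_onehot is one-hot, y_hat a softmax output."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, y_hat))

# With a one-hot label, only the true class's log-probability contributes.
print(multiclass_ce([0, 1, 0], [0.1, 0.7, 0.2]))  # ≈ 0.3567 (= -log 0.7)
```
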
Combined gradient (softmax + CE): unusually clean, because the gradient with respect to the logits is simply the residual:

$$\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k$$

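The residual form of the gradient is easy to verify with finite differences. A minimal sketch (plain Python, no autodiff framework assumed):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ce_from_logits(z, y):
    """Cross-entropy of one-hot label y against softmax(z)."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, softmax(z)))

z, y = [2.0, -1.0, 0.5], [0.0, 1.0, 0.0]
analytic = [p - yi for p, yi in zip(softmax(z), y)]  # the residual y_hat - y

# Central differences should match the analytic gradient component-wise.
eps = 1e-6
for k in range(len(z)):
    z_hi = z[:]; z_hi[k] += eps
    z_lo = z[:]; z_lo[k] -= eps
    numeric = (ce_from_logits(z_hi, y) - ce_from_logits(z_lo, y)) / (2 * eps)
    assert abs(numeric - analytic[k]) < 1e-5
```

Note that the residual is the gradient with respect to the logits $z$, not the softmax outputs; differentiating through the softmax is what cancels everything else.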
Dataset loss (mean over $N$ examples):

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} L^{(i)}$$

Connection to KL divergence:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$

Since the entropy $H(p)$ is fixed given the data, minimizing cross-entropy is equivalent to minimizing $D_{\mathrm{KL}}(p \,\|\, q)$, i.e., making the model distribution $q$ close to the true distribution $p$.
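
The decomposition $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$ can be checked numerically on any pair of distributions (the distributions below are arbitrary examples):

```python
import math

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.3, 0.1]    # "true" distribution (fixed by the data)
q = [0.5, 0.25, 0.25]  # model distribution
# H(p, q) = H(p) + KL(p || q)
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
```

Because $H(p)$ does not depend on the model, gradient descent on cross-entropy and on the KL term move the parameters identically.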

Applications

  • Binary and multi-class classification
  • Language modeling (next-token prediction)
  • Any task with categorical outputs

Trade-offs

  • Sensitive to class imbalance — minority-class errors contribute little to the average; consider focal loss or per-class reweighting
  • Numerically unstable if $\hat{y}_k \to 0$ (the log diverges); use the log-sum-exp trick when computing softmax + CE together
  • The clean softmax-CE gradient makes implementation straightforward, but it hides the numerical care needed inside the fused softmax + log step
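
The log-sum-exp point is worth seeing in code: computing `log(softmax(z))` naively underflows for large logits, while fusing the two operations stays finite. A minimal sketch of the fused form (the function name is illustrative):

```python
import math

def stable_ce_from_logits(z, true_idx):
    """Cross-entropy from raw logits via log-sum-exp, avoiding log(0) underflow."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    # log softmax_k = z_k - logsumexp(z), so CE = logsumexp(z) - z_true
    return log_sum_exp - z[true_idx]

# softmax([1000, 0, -1000]) underflows to [1, 0, 0] in float64, so the naive
# path would compute log(0) for the other classes; the fused path is finite.
z = [1000.0, 0.0, -1000.0]
print(stable_ce_from_logits(z, 0))  # → 0.0
```

This is the same fusion that deep-learning frameworks apply when a loss function accepts raw logits rather than probabilities.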