Cross-Entropy Loss
Definition
The negative log-likelihood under a categorical distribution; measures the expected surprise of predictions relative to true labels.
Intuition
The log penalizes confident wrong predictions heavily — if the model assigns near-zero probability to the true class, the loss explodes. Minimizing cross-entropy is equivalent to maximum likelihood estimation for categorical models.
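To make the "explosion" concrete, here is a minimal sketch (the helper name `nll` is ours, not from the text) showing how the per-example loss grows as the probability assigned to the true class shrinks:

```python
import math

def nll(p_true):
    """Cross-entropy for a single example: negative log of the
    probability the model assigns to the true class."""
    return -math.log(p_true)

# A confident correct prediction is cheap...
print(round(nll(0.9), 3))    # 0.105
# ...while near-zero mass on the true class blows the loss up.
print(round(nll(0.001), 3))  # 6.908
```

Each additional factor of 10 taken away from the true class adds a constant $\log 10 \approx 2.303$ to the loss, so the penalty is unbounded as $p \to 0$.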
Formal Description
Binary CE (paired with sigmoid output): $L = -\left[y \log p + (1 - y)\log(1 - p)\right]$, where $p = \sigma(z)$ is the predicted probability of the positive class and $y \in \{0, 1\}$.
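A direct transcription of the binary formula, assuming a raw logit input (function names here are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_ce(y, z):
    """L = -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z), y in {0, 1}."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A logit of 0 gives p = 0.5, so the loss is log 2 for either label.
print(round(binary_ce(1, 0.0), 4))  # 0.6931
```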
Multi-class CE (paired with softmax): $L = -\sum_k y_k \log p_k = -\log p_{k^*}$, where $p = \mathrm{softmax}(z)$, $y$ is the one-hot label, and $k^*$ is the true class index.
Combined gradient (softmax + CE) is unusually clean: the upstream gradient is simply the residual, $\partial L / \partial z_j = p_j - y_j$, i.e. $\nabla_z L = p - y$.
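A small sketch verifying the residual form of the gradient against finite differences (all names here are ours; pure stdlib):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def loss(z, k):
    """Softmax cross-entropy for true class index k."""
    return -math.log(softmax(z)[k])

# Analytic gradient: dL/dz_j = p_j - y_j, with y one-hot at k.
z, k = [0.5, -1.2, 0.3], 0
p = softmax(z)
analytic = [p_j - (1.0 if j == k else 0.0) for j, p_j in enumerate(p)]

# Central finite-difference check of each component.
eps = 1e-6
for j in range(len(z)):
    zp = z[:]; zp[j] += eps
    zm = z[:]; zm[j] -= eps
    numeric = (loss(zp, k) - loss(zm, k)) / (2 * eps)
    assert abs(numeric - analytic[j]) < 1e-5
print("gradient check passed")
```

Note that the components of $p - y$ sum to zero, since both $p$ and $y$ sum to one.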
Dataset loss: the average of per-example losses, $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} L^{(i)} = -\frac{1}{N}\sum_{i=1}^{N} \log p^{(i)}_{y_i}$.
Connection to KL divergence: $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$, where $p$ is the true distribution and $q$ the model's.
Since $H(p)$ is fixed given the data, minimizing cross-entropy is equivalent to minimizing $D_{\mathrm{KL}}(p \,\|\, q)$; that is, making the model distribution close to the true distribution.
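The decomposition can be checked numerically on any pair of distributions (the values below are arbitrary examples):

```python
import math

p = [0.7, 0.2, 0.1]  # "true" distribution
q = [0.5, 0.3, 0.2]  # model distribution

ce      = -sum(pi * math.log(qi) for pi, qi in zip(p, q))  # H(p, q)
entropy = -sum(pi * math.log(pi) for pi in p)              # H(p)
kl      =  sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))  # KL(p || q)

# H(p, q) = H(p) + KL(p || q): the two sides agree to float precision.
assert abs(ce - (entropy + kl)) < 1e-12
print("decomposition holds")
```

Because $H(p)$ does not depend on the model, gradients of the cross-entropy and of the KL divergence with respect to model parameters are identical.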
Applications
- Binary and multi-class classification
- Language modeling (next-token prediction)
- Any task with categorical outputs
Trade-offs
- Sensitive to class imbalance — minority-class errors contribute little to the average; consider focal loss or per-class reweighting
- Numerically unstable if the predicted probability underflows to $0$ (then $\log 0 = -\infty$) or the logits are large enough that $\exp$ overflows; use the log-sum-exp trick when computing softmax + CE together
- The clean softmax-CE gradient makes implementation straightforward but obscures what happens numerically
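The log-sum-exp trick mentioned above can be sketched as a fused softmax + CE, assuming raw logits and a true class index (function name is ours). Shifting by $\max(z)$ guarantees the largest exponent is $0$, so $\exp$ never overflows:

```python
import math

def stable_ce(z, k):
    """Fused softmax + CE via log-sum-exp:
    L = -z_k + log(sum_j exp(z_j)), shifted by max(z) for stability."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return lse - z[k]

# Naive softmax-then-log overflows for logits this large;
# the fused form returns the correct (tiny) loss.
print(stable_ce([1000.0, 0.0, -1000.0], 0))  # 0.0
```

On moderate logits it agrees with the two-step computation; on extreme ones the two-step version produces `inf` or `nan` while the fused form stays finite.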