Cross-Entropy Loss

Definition

The negative log-likelihood under a categorical distribution; measures the expected surprise of predictions relative to true labels.

Intuition

The log penalizes confident wrong predictions heavily — if the model assigns near-zero probability to the true class, the loss explodes. Minimizing cross-entropy is equivalent to maximum likelihood estimation for categorical models.
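
To make the "exploding loss" concrete, here is a toy computation of the per-example loss $-\log \hat{y}_{\text{true}}$, using nothing beyond the definition above (the helper name `nll` is just for illustration):

```python
import math

def nll(p_true_class):
    """Negative log-likelihood assigned to the true class."""
    return -math.log(p_true_class)

# A confident correct prediction costs almost nothing...
print(round(nll(0.99), 4))   # → 0.0101
# ...while near-zero probability on the true class blows up the loss.
print(round(nll(0.001), 4))  # → 6.9078
```

Each factor-of-10 drop in the probability assigned to the true class adds a constant $\log 10 \approx 2.3$ to the loss, so confidence in the wrong answer is penalized without bound.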

Formal Description

Binary CE (paired with a sigmoid output $\hat{y} = \sigma(z)$, label $y \in \{0, 1\}$):

$$L = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]$$

Multi-class CE (paired with softmax, one-hot label $y$ over $K$ classes):

$$L = -\sum_{k=1}^{K} y_k \log \hat{y}_k, \qquad \hat{y} = \mathrm{softmax}(z)$$

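Both forms can be sketched directly from the definitions, assuming the inputs are already valid probabilities (the function names here are illustrative):

```python
import math

def binary_ce(y, y_hat):
    """Binary cross-entropy for one example; y in {0, 1}, y_hat in (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def multiclass_ce(y_onehot, y_hat):
    """Multi-class cross-entropy; y_onehot is one-hot, y_hat a softmax output."""
    return -sum(y * math.log(p) for y, p in zip(y_onehot, y_hat))

# With a one-hot label, only the true class's log-probability contributes.
print(multiclass_ce([0, 1, 0], [0.1, 0.7, 0.2]))  # ≈ 0.3567 (= -log 0.7)
```
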
Combined gradient (softmax + CE): unusually clean, because the gradient with respect to the logits is simply the residual:

$$\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k$$

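The residual form of the gradient is easy to verify with finite differences. A minimal sketch (plain Python, no autodiff framework assumed):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ce_from_logits(z, y):
    """Cross-entropy of one-hot label y against softmax(z)."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, softmax(z)))

z, y = [2.0, -1.0, 0.5], [0.0, 1.0, 0.0]
analytic = [p - yi for p, yi in zip(softmax(z), y)]  # the residual y_hat - y

# Central differences should match the analytic gradient component-wise.
eps = 1e-6
for k in range(len(z)):
    z_hi = z[:]; z_hi[k] += eps
    z_lo = z[:]; z_lo[k] -= eps
    numeric = (ce_from_logits(z_hi, y) - ce_from_logits(z_lo, y)) / (2 * eps)
    assert abs(numeric - analytic[k]) < 1e-5
```

Note that the residual is the gradient with respect to the logits $z$, not the softmax outputs; differentiating through the softmax is what cancels everything else.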
Dataset loss (mean over $N$ examples):

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} L^{(i)}$$

Connection to KL divergence:

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$$

Since the entropy $H(p)$ is fixed given the data, minimizing cross-entropy is equivalent to minimizing $D_{\mathrm{KL}}(p \,\|\, q)$, i.e., making the model distribution $q$ close to the true distribution $p$.
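
The decomposition $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$ can be checked numerically on any pair of distributions (the distributions below are arbitrary examples):

```python
import math

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.3, 0.1]    # "true" distribution (fixed by the data)
q = [0.5, 0.25, 0.25]  # model distribution
# H(p, q) = H(p) + KL(p || q)
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
```

Because $H(p)$ does not depend on the model, gradient descent on cross-entropy and on the KL term move the parameters identically.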

Applications

  • Binary and multi-class classification
  • Language modeling (next-token prediction)
  • Any task with categorical outputs

Trade-offs

  • Sensitive to class imbalance — minority-class errors contribute little to the average; consider focal loss or per-class reweighting
  • Numerically unstable if $\hat{y}_k \to 0$ (the log diverges); use the log-sum-exp trick when computing softmax + CE together
  • The clean softmax-CE gradient makes implementation straightforward, but it hides the numerical care needed inside the fused softmax + log step
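
The log-sum-exp point is worth seeing in code: computing `log(softmax(z))` naively underflows for large logits, while fusing the two operations stays finite. A minimal sketch of the fused form (the function name is illustrative):

```python
import math

def stable_ce_from_logits(z, true_idx):
    """Cross-entropy from raw logits via log-sum-exp, avoiding log(0) underflow."""
    m = max(z)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
    # log softmax_k = z_k - logsumexp(z), so CE = logsumexp(z) - z_true
    return log_sum_exp - z[true_idx]

# softmax([1000, 0, -1000]) underflows to [1, 0, 0] in float64, so the naive
# path would compute log(0) for the other classes; the fused path is finite.
z = [1000.0, 0.0, -1000.0]
print(stable_ce_from_logits(z, 0))  # → 0.0
```

This is the same fusion that deep-learning frameworks apply when a loss function accepts raw logits rather than probabilities.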