Dropout
Definition
Dropout is a regularisation technique that randomly sets each neuron’s output to zero with probability during training, independently for each forward pass. At test time, all neurons are active but their outputs are scaled by to match the expected activation magnitude.
Intuition
Dropout prevents neurons from co-adapting — relying on specific other neurons being present. Every training step produces a “thinned” network (a random sub-network), forcing each neuron to be useful on its own. At test time, using all neurons with scaled outputs approximates averaging over all possible thinned networks — an implicit ensemble.
Think of it as training a team where no member knows who else will show up: every team member must be able to pull their weight independently.
Formal Description
Inverted dropout (standard implementation):
During training, for each layer activation :
The scaling keeps the expected value of equal to , so no test-time correction is needed.
Effect on gradient: the backward pass zeroes out the same units that were zeroed in the forward pass. Only the surviving units receive gradient updates.
# PyTorch: nn.Dropout(p=0.5) applied inside forward()
import torch.nn as nn
class MLP(nn.Module):
def __init__(self):
super().__init__()
self.fc1 = nn.Linear(128, 64)
self.drop = nn.Dropout(p=0.5)
self.fc2 = nn.Linear(64, 10)
def forward(self, x):
return self.fc2(self.drop(torch.relu(self.fc1(x))))
# drop is a no-op at eval time (model.eval())Typical dropout rates:
- Fully connected layers:
- Input layer: –
- Convolutional layers: lower values or no dropout (spatial dropout instead)
- Transformers: applied to attention weights and FFN activations
Dropout as regularisation: the expected loss with dropout is approximately the -regularised loss; specifically, for linear models, dropout is equivalent to an adaptive penalty that is larger for features with higher variance.
MC Dropout (Monte Carlo Dropout): leaving dropout active at test time and running forward passes:
The variance across passes estimates epistemic uncertainty (uncertainty about model parameters). Used in Bayesian deep learning for uncertainty quantification.
Spatial Dropout / DropBlock: drops entire feature map channels (Spatial Dropout) or contiguous spatial regions (DropBlock) — more effective for convolutional networks where adjacent activations are highly correlated.
Stochastic Depth: randomly drops entire residual blocks during training (used in EfficientNet, DeiT). Reduces training time and acts as regulariser.
Applications
| Use case | Notes |
|---|---|
| MLP regularisation | Classic use; for large hidden layers |
| Transformer pre-training | Light dropout on attention/FFN; often |
| Uncertainty estimation | MC Dropout — cheap Bayesian approximation |
| Small datasets | High dropout rate compensates for limited data |
| Convolutional nets | Prefer DropBlock or Stochastic Depth over elementwise dropout |
Trade-offs
- Slows convergence: training effectively requires 2×–3× more steps to compensate for dropped neurons.
- Less effective for batch-normalised networks: BN already provides regularisation; dropout and BN can interact poorly (the variance shift problem).
- Modern large models (transformers) rely more on weight decay and attention dropout than dense dropout.
- Not appropriate for output layers or when computing exact posterior inference.
Links
- Weight Initialization — inverted dropout requires re-tuning learning rate and init scheme
- Batch Normalization — potential conflict: BN normalises statistics that dropout distorts
- Layer Normalization — LN is not affected by dropout the same way BN is
- Generalization Bounds — dropout reduces effective model complexity
- Regularization — dropout is equivalent to adaptive regularisation