Residual Connections
Definition
A residual connection (skip connection) adds the input of a sub-layer directly to its output:

$$y = x + F(x)$$

where \(F\) is the sub-layer transformation (e.g., a convolutional block or a transformer FFN) and \(x\) passes through the identity shortcut. The block learns the residual \(F(x) = y - x\) rather than a direct mapping \(y = H(x)\).
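The definition can be sketched in a few lines of numpy (an illustrative example, not from the source; `residual_block` and the two-layer MLP branch are assumed names/shapes):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """y = x + F(x), where F is a small two-layer MLP (the residual branch)."""
    f = relu(x @ W1) @ W2   # F(x): the learned residual
    return x + f            # identity shortcut adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# Zero-initialised residual branch: the block starts as the identity mapping.
W1 = np.zeros((8, 16))
W2 = np.zeros((16, 8))
y = residual_block(x, W1, W2)
assert np.allclose(y, x)  # with F == 0, the block is exactly the identity
```

Note how zero-initialising the residual branch makes the block an exact identity at the start of training, which is why deep residual stacks are trainable from scratch.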
Intuition
Training very deep networks (more than roughly 20 layers) with plain stacked layers was empirically nearly impossible before residual connections: both gradients and activations degraded. The residual shortcut creates a gradient highway: during backpropagation, gradient can flow directly through the identity path without passing through any non-linearities. This mitigates the vanishing gradient problem along the depth dimension.
The “residual” framing is also conceptually appealing: if the identity is already a good approximation of the desired mapping, \(F\) only needs to learn small corrections, which is easier than learning the full mapping from scratch.
Formal Description
ResNet block (He et al., 2016):

$$y = F(x, \{W_i\}) + x$$

where the basic two-layer block uses \(F(x) = W_2\,\sigma(W_1 x)\). When dimensions don’t match (different channel/feature sizes), a projection shortcut is used:

$$y = F(x, \{W_i\}) + W_s x$$
Gradient flow analysis: with \(x_{l+1} = x_l + F(x_l)\), unrolling from layer \(l\) to layer \(L\) gives

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$$

so backpropagation through the residual blocks yields

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i)\right)$$

The “1” term means gradient flows to layer \(l\) directly from any deeper layer \(L\) without multiplicative shrinkage. In a plain stack the gradient is a product of per-layer Jacobians \(\prod_i \frac{\partial F_i}{\partial x_i}\), which vanishes when the factors are small; in the residual stack each factor is \(1 + \frac{\partial F}{\partial x}\), so the product can no longer vanish as long as the individual \(\frac{\partial F}{\partial x}\) terms are small.
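The gradient-flow argument can be checked with a toy scalar model (an illustrative sketch, not from the source) in which every layer has the same small derivative \(w\):

```python
import numpy as np

depth = 50
w = 0.1  # small per-layer derivative dF/dx in the toy model

# Plain stack: d(out)/d(in) = prod over layers of w -> vanishes with depth
plain_grad = w ** depth
# Residual stack x_{l+1} = x_l + F(x_l): d(out)/d(in) = prod of (1 + w)
residual_grad = (1.0 + w) ** depth

assert plain_grad < 1e-40    # effectively zero at depth 50
assert residual_grad > 1.0   # the identity path keeps the gradient alive
```

Even though each layer contributes only a tiny derivative, the additive 1 in every residual factor keeps the end-to-end gradient well away from zero.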
Ensemble interpretation (Veit et al., 2016): unrolling a residual network with \(n\) blocks produces \(2^n\) paths through the network (each block is either taken or skipped). The output is effectively an ensemble of networks of varying depth. Most of the gradient flows through shorter paths, making the network robust to depth.
Transformer residual structure:
```python
# Pre-LN transformer block
y = x + Attention(LN(x))
z = y + FFN(LN(y))
```
Every transformer block has two residual connections: one around the attention sub-layer and one around the FFN. This is why transformers can scale to hundreds of layers.
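A self-contained numpy sketch of a Pre-LN block (single-head attention; the function names, shapes, and parameter layout are illustrative assumptions, not the canonical implementation):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # Single-head scaled dot-product self-attention over a (seq, d) input.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (scores @ v) @ Wo

def ffn(x, W1, W2):
    return np.maximum(0.0, x @ W1) @ W2

def pre_ln_block(x, params):
    # First residual: around the attention sub-layer
    y = x + attention(layer_norm(x), *params["attn"])
    # Second residual: around the FFN sub-layer
    return y + ffn(layer_norm(y), *params["ffn"])

d = 8
x = np.random.default_rng(1).normal(size=(5, d))
# Zero-initialised sub-layers: the whole block reduces to the identity,
# because both residual branches output zero.
zeros = {
    "attn": tuple(np.zeros((d, d)) for _ in range(4)),
    "ffn": (np.zeros((d, 4 * d)), np.zeros((4 * d, d))),
}
assert np.allclose(pre_ln_block(x, zeros), x)
```

Because LN is applied *inside* each residual branch, the shortcut path from input to output is a pure sum of sub-layer outputs, which is exactly the property that lets gradients reach early layers in very deep stacks.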
Dense connections (DenseNet): every layer receives input from all previous layers:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$

where \([\cdot]\) denotes concatenation along the channel dimension. A more extreme form of skip connection; useful for tasks requiring multi-scale features (segmentation, detection).
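The dense-connectivity pattern can be sketched as follows, assuming a simple fully connected \(H_l\) with ReLU in place of DenseNet's BN-ReLU-Conv composite (a hypothetical simplification for illustration):

```python
import numpy as np

def dense_block(x, weights):
    """Each layer sees the concatenation of the input and all earlier outputs."""
    features = [x]
    for W in weights:
        inp = np.concatenate(features, axis=-1)   # [x0, x1, ..., x_{l-1}]
        features.append(np.maximum(0.0, inp @ W)) # H_l with a fixed growth rate
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(2)
x = rng.normal(size=(2, 4))       # input with 4 features
growth = 3                        # each layer adds `growth` new feature maps
weights = [rng.normal(size=(4, growth)),          # layer 1 sees 4 features
           rng.normal(size=(4 + growth, growth))] # layer 2 sees 4 + 3
out = dense_block(x, weights)
assert out.shape == (2, 4 + 2 * growth)  # features accumulate additively
```

Note the characteristic DenseNet cost: the input width of each layer grows linearly with depth, which is why dense blocks are kept short and separated by transition layers in practice.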
Highway networks: learned gating of the skip connection:

$$y = T(x) \odot H(x) + (1 - T(x)) \odot x$$

where \(T(x) = \sigma(W_T x + b_T)\) is a trainable gate. A precursor to ResNets; less popular because ungated residuals work equally well.
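A minimal highway-layer sketch (tanh for the transform \(H\) and a strongly negative gate bias are illustrative choices; the negative bias makes the layer start close to the identity, mirroring how ungated residual blocks behave at initialisation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, WH, WT, bT):
    H = np.tanh(x @ WH)            # candidate transform H(x)
    T = sigmoid(x @ WT + bT)       # gate T(x) in (0, 1)
    return T * H + (1.0 - T) * x   # gated mix of transform and identity

rng = np.random.default_rng(3)
d = 5
x = rng.normal(size=(3, d))
WH = rng.normal(size=(d, d)) * 0.1
WT = rng.normal(size=(d, d)) * 0.1
bT = -20.0  # gate ~ 0 everywhere -> layer ~ identity at initialisation
y = highway_layer(x, WH, WT, bT)
assert np.allclose(y, x, atol=1e-4)
```

With the gate learned per-dimension, the network can interpolate between copying the input and fully transforming it; the ResNet simplification fixes the gate at 1 for the transform and 1 for the shortcut.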
Applications
| Architecture | Residual variant |
|---|---|
| ResNet (image classification) | Standard blocks |
| Transformer (all LLMs) | Two residuals per block (attn + FFN) |
| U-Net (segmentation) | Skip connections between encoder/decoder |
| DenseNet | Dense skip connections across all layers |
| EfficientNet | MBConv blocks with residuals |
Trade-offs
- Residual connections add essentially zero parameter cost (the identity shortcut has no parameters unless a projection is needed).
- Memory cost: during backpropagation, activations at each residual junction must be stored. Gradient checkpointing trades memory for recompute.
- For very shallow networks (2–3 layers), residuals provide minimal benefit.
- Pre-LN placement combined with residuals is currently the most stable configuration for large language models.
Links
- Layer Normalization — always paired with residuals in transformers (Pre-LN or Post-LN)
- Batch Normalization — paired with residuals in ResNets (BN→ReLU→Conv→BN→ReLU→Conv + shortcut)
- Backpropagation — residuals fundamentally change the gradient flow computation graph
- Weight Initialization — initialise the residual branch close to zero so each block starts as a near-identity
- Activation Functions — GELU/ReLU used inside residual blocks