Residual Connections

Definition

A residual connection (skip connection) adds the input of a sub-layer directly to its output:

y = F(x) + x

where F is the sub-layer transformation (e.g., a convolutional block or a transformer FFN) and x is the identity shortcut. The block learns the residual F(x) rather than a direct mapping of x to y.
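The definition can be sketched in a few lines of plain Python; the sub-layer here is a hypothetical stand-in for any transformation:

```python
def residual(f, x):
    """Apply sub-layer f with an identity skip: y = f(x) + x."""
    return [fi + xi for fi, xi in zip(f(x), x)]

# Hypothetical sub-layer: a fixed elementwise transformation.
def sublayer(x):
    return [0.5 * xi for xi in x]

y = residual(sublayer, [1.0, 2.0])  # -> [1.5, 3.0]
```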

Intuition

Training very deep networks (tens of layers or more) with plain stacked layers was empirically nearly impossible before residual connections — both gradients and activations degraded with depth. The residual shortcut creates a gradient highway: during backpropagation, gradient can flow directly through the identity path without passing through any non-linearities. This largely resolves the vanishing-gradient problem that otherwise comes with depth.

The “residual” framing is also conceptually appealing: if the identity is already a good approximation of the desired mapping, F only needs to learn small corrections, which is easier than learning the full mapping from scratch.

Formal Description

ResNet block (He et al., 2016):

y = F(x, {W_i}) + x

When dimensions don’t match (different channel/feature sizes), a projection shortcut W_s is used:

y = F(x, {W_i}) + W_s x
Gradient flow analysis: for a stack of residual blocks, x_L = x_l + Σ_{i=l}^{L-1} F(x_i), so backpropagation through the stack gives:

∂Loss/∂x_l = ∂Loss/∂x_L · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i))

The “1” term means gradient flows to layer l directly from any deeper layer L without multiplicative shrinkage. Written per block, the factor chain is:

∂x_L/∂x_l = Π_{i=l}^{L-1} (1 + ∂F(x_i)/∂x_i)

Unlike the plain-network product of raw layer Jacobians, this product can no longer vanish as long as the individual ∂F/∂x_i terms are small, since each factor stays close to 1.
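The contrast can be checked numerically. The derivative magnitudes below are made-up illustration values, not measurements from a real network:

```python
import math

# Toy setup: 50 stacked layers, each with a small local derivative eps.
n, eps = 50, 0.05

# Plain stack: gradient is a product of per-layer Jacobians of size eps.
plain = math.prod([eps] * n)       # ~ 8.9e-66, vanishes

# Residual stack: each factor is (1 + dF/dx) with |dF/dx| = eps.
resid = math.prod([1 + eps] * n)   # ~ 11.5, does not vanish

print(plain, resid)
```

The residual product may still grow or shrink, but it cannot collapse to zero the way the plain product does.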

Ensemble interpretation (Veit et al., 2016): unrolling a residual network with n blocks produces 2^n paths through the network, since each block can be traversed through F or skipped via the identity. The output is effectively an ensemble of networks of varying depth. Most of the gradient flows through shorter paths, making the network robust to depth.
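The path count follows directly from the unrolling: each of the n blocks is independently either taken through F or skipped, so enumerating the binary choices recovers 2^n paths. A small sketch:

```python
from itertools import product

def count_paths(n_blocks):
    # Each block contributes a binary choice: 1 = go through F, 0 = skip.
    return sum(1 for _ in product([0, 1], repeat=n_blocks))

count_paths(4)  # 2**4 = 16 paths
```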

Transformer residual structure:

# Pre-LN transformer block
y = x + Attention(LN(x))
z = y + FFN(LN(y))

Every transformer block has two residual connections: one around the attention sub-layer and one around the FFN. This is why transformers can scale to hundreds of layers.
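The Pre-LN block above can be made concrete in plain Python. The attention and FFN arguments here are hypothetical stand-ins (any vector-to-vector function), since the point is the placement of LN and the two skips, not the sub-layers themselves:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance."""
    m = sum(x) / len(x)
    var = sum((xi - m) ** 2 for xi in x) / len(x)
    return [(xi - m) / math.sqrt(var + eps) for xi in x]

def pre_ln_block(x, attention, ffn):
    # y = x + Attention(LN(x)); z = y + FFN(LN(y))
    y = [a + b for a, b in zip(x, attention(layer_norm(x)))]
    z = [a + b for a, b in zip(y, ffn(layer_norm(y)))]
    return z
```

One useful property is visible immediately: if both sub-layers output zero, the block is exactly the identity, which is part of why Pre-LN stacks train stably at depth.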

Dense connections (DenseNet): every layer receives input from all previous layers:

x_l = H_l([x_0, x_1, ..., x_{l-1}])

where [·] denotes concatenation along the feature dimension. This is a more extreme form of skip connection, useful for tasks requiring multi-scale features (segmentation, detection).
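A minimal sketch of the dense-connectivity pattern, with each layer modeled as a hypothetical function on the concatenated feature list:

```python
def dense_forward(x, layers):
    """Each layer receives the concatenation of all previous feature lists."""
    features = [x]
    for layer in layers:
        concat = [v for f in features for v in f]  # concatenate everything so far
        features.append(layer(concat))
    return features[-1]

# Hypothetical layers that just sum their (growing) input.
layers = [lambda c: [sum(c)] for _ in range(3)]
dense_forward([1.0, 2.0], layers)  # -> [12.0]
```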

Highway networks: learned gating of the skip connection:

y = T(x) ⊙ H(x) + (1 − T(x)) ⊙ x

where T(x) = sigmoid(W_T x + b_T) is a trainable gate. A precursor to ResNets; less popular because ungated residuals work equally well.
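The gating equation, sketched elementwise in plain Python (the transform H and the gate logits are hypothetical inputs; a real highway layer would learn W_T and b_T):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def highway(x, h, gate_logits):
    """y = T(x) * H(x) + (1 - T(x)) * x, with T = sigmoid(gate logits)."""
    return [sigmoid(g) * hi + (1 - sigmoid(g)) * xi
            for g, hi, xi in zip(gate_logits, h(x), x)]
```

With very negative gate logits, T ≈ 0 and the layer passes x through almost unchanged; with very positive logits, it behaves like a plain (non-residual) layer.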

Applications

Architecture                     Residual variant
ResNet (image classification)    Standard blocks
Transformer (all LLMs)           Two residuals per block (attn + FFN)
U-Net (segmentation)             Skip connections between encoder/decoder
DenseNet                         Dense skip connections across all layers
EfficientNet                     MBConv blocks with residuals

Trade-offs

  • Residual connections add essentially zero parameter cost (the identity shortcut has no parameters unless a projection is needed).
  • Memory cost: during backpropagation, activations at each residual junction must be stored. Gradient checkpointing trades memory for recompute.
  • For very shallow networks (2–3 layers), residuals provide minimal benefit.
  • Pre-LN placement combined with residuals is currently the most stable configuration for large language models.