Residual Connections
Definition
A residual connection (skip connection) adds the input of a sub-layer directly to its output:

$$y = x + F(x)$$

where \(F\) is the sub-layer transformation (e.g., a convolutional block or a transformer FFN) and \(x\) passes through the identity shortcut. The block learns the residual \(F(x) = y - x\) rather than a direct mapping \(y = H(x)\).
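The definition can be sketched in a few lines of numpy (an illustrative example, not from the source; `residual_block` and the two-layer MLP branch are assumed names/shapes):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """y = x + F(x), where F is a small two-layer MLP (the residual branch)."""
    f = relu(x @ W1) @ W2   # F(x): the learned residual
    return x + f            # identity shortcut adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# Zero-initialised residual branch: the block starts as the identity mapping.
W1 = np.zeros((8, 16))
W2 = np.zeros((16, 8))
y = residual_block(x, W1, W2)
assert np.allclose(y, x)  # with F == 0, the block is exactly the identity
```

Note how zero-initialising the residual branch makes the block an exact identity at the start of training, which is why deep residual stacks are trainable from scratch.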
Intuition
Training very deep networks (more than roughly 20 layers) with plain stacked layers was empirically nearly impossible before residual connections: both gradients and activations degraded. The residual shortcut creates a gradient highway: during backpropagation, gradient can flow directly through the identity path without passing through any non-linearities. This mitigates the vanishing gradient problem along the depth dimension.
The “residual” framing is also conceptually appealing: if the identity is already a good approximation of the desired mapping, \(F\) only needs to learn small corrections, which is easier than learning the full mapping from scratch.
Formal Description
ResNet block (He et al., 2016):

$$y = F(x, \{W_i\}) + x$$

where the basic two-layer block uses \(F(x) = W_2\,\sigma(W_1 x)\). When dimensions don’t match (different channel/feature sizes), a projection shortcut is used:

$$y = F(x, \{W_i\}) + W_s x$$
Gradient flow analysis: with \(x_{l+1} = x_l + F(x_l)\), unrolling from layer \(l\) to layer \(L\) gives

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$$

so backpropagation through the residual blocks yields

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i)\right)$$

The “1” term means gradient flows to layer \(l\) directly from any deeper layer \(L\) without multiplicative shrinkage. In a plain stack the gradient is a product of per-layer Jacobians \(\prod_i \frac{\partial F_i}{\partial x_i}\), which vanishes when the factors are small; in the residual stack each factor is \(1 + \frac{\partial F}{\partial x}\), so the product can no longer vanish as long as the individual \(\frac{\partial F}{\partial x}\) terms are small.
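The gradient-flow argument can be checked with a toy scalar model (an illustrative sketch, not from the source) in which every layer has the same small derivative \(w\):

```python
import numpy as np

depth = 50
w = 0.1  # small per-layer derivative dF/dx in the toy model

# Plain stack: d(out)/d(in) = prod over layers of w -> vanishes with depth
plain_grad = w ** depth
# Residual stack x_{l+1} = x_l + F(x_l): d(out)/d(in) = prod of (1 + w)
residual_grad = (1.0 + w) ** depth

assert plain_grad < 1e-40    # effectively zero at depth 50
assert residual_grad > 1.0   # the identity path keeps the gradient alive
```

Even though each layer contributes only a tiny derivative, the additive 1 in every residual factor keeps the end-to-end gradient well away from zero.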
Ensemble interpretation (Veit et al., 2016): unrolling a residual network with \(n\) blocks produces \(2^n\) paths through the network (each block is either taken or skipped). The output is effectively an ensemble of networks of varying depth. Most of the gradient flows through shorter paths, making the network robust to depth.
Transformer residual structure:
```python
# Pre-LN transformer block
y = x + Attention(LN(x))
z = y + FFN(LN(y))
```
Every transformer block has two residual connections: one around the attention sub-layer and one around the FFN. This is why transformers can scale to hundreds of layers.
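A self-contained numpy sketch of a Pre-LN block (single-head attention; the function names, shapes, and parameter layout are illustrative assumptions, not the canonical implementation):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # Single-head scaled dot-product self-attention over a (seq, d) input.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (scores @ v) @ Wo

def ffn(x, W1, W2):
    return np.maximum(0.0, x @ W1) @ W2

def pre_ln_block(x, params):
    # First residual: around the attention sub-layer
    y = x + attention(layer_norm(x), *params["attn"])
    # Second residual: around the FFN sub-layer
    return y + ffn(layer_norm(y), *params["ffn"])

d = 8
x = np.random.default_rng(1).normal(size=(5, d))
# Zero-initialised sub-layers: the whole block reduces to the identity,
# because both residual branches output zero.
zeros = {
    "attn": tuple(np.zeros((d, d)) for _ in range(4)),
    "ffn": (np.zeros((d, 4 * d)), np.zeros((4 * d, d))),
}
assert np.allclose(pre_ln_block(x, zeros), x)
```

Because LN is applied *inside* each residual branch, the shortcut path from input to output is a pure sum of sub-layer outputs, which is exactly the property that lets gradients reach early layers in very deep stacks.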
Dense connections (DenseNet): every layer receives input from all previous layers:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$

where \([\cdot]\) denotes concatenation along the channel dimension. A more extreme form of skip connection; useful for tasks requiring multi-scale features (segmentation, detection).
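The dense-connectivity pattern can be sketched as follows, assuming a simple fully connected \(H_l\) with ReLU in place of DenseNet's BN-ReLU-Conv composite (a hypothetical simplification for illustration):

```python
import numpy as np

def dense_block(x, weights):
    """Each layer sees the concatenation of the input and all earlier outputs."""
    features = [x]
    for W in weights:
        inp = np.concatenate(features, axis=-1)   # [x0, x1, ..., x_{l-1}]
        features.append(np.maximum(0.0, inp @ W)) # H_l with a fixed growth rate
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(2)
x = rng.normal(size=(2, 4))       # input with 4 features
growth = 3                        # each layer adds `growth` new feature maps
weights = [rng.normal(size=(4, growth)),          # layer 1 sees 4 features
           rng.normal(size=(4 + growth, growth))] # layer 2 sees 4 + 3
out = dense_block(x, weights)
assert out.shape == (2, 4 + 2 * growth)  # features accumulate additively
```

Note the characteristic DenseNet cost: the input width of each layer grows linearly with depth, which is why dense blocks are kept short and separated by transition layers in practice.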
Highway networks: learned gating of the skip connection:

$$y = T(x) \odot H(x) + (1 - T(x)) \odot x$$

where \(T(x) = \sigma(W_T x + b_T)\) is a trainable gate. A precursor to ResNets; less popular because ungated residuals work equally well.
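A minimal highway-layer sketch (tanh for the transform \(H\) and a strongly negative gate bias are illustrative choices; the negative bias makes the layer start close to the identity, mirroring how ungated residual blocks behave at initialisation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, WH, WT, bT):
    H = np.tanh(x @ WH)            # candidate transform H(x)
    T = sigmoid(x @ WT + bT)       # gate T(x) in (0, 1)
    return T * H + (1.0 - T) * x   # gated mix of transform and identity

rng = np.random.default_rng(3)
d = 5
x = rng.normal(size=(3, d))
WH = rng.normal(size=(d, d)) * 0.1
WT = rng.normal(size=(d, d)) * 0.1
bT = -20.0  # gate ~ 0 everywhere -> layer ~ identity at initialisation
y = highway_layer(x, WH, WT, bT)
assert np.allclose(y, x, atol=1e-4)
```

With the gate learned per-dimension, the network can interpolate between copying the input and fully transforming it; the ResNet simplification fixes the gate at 1 for the transform and 1 for the shortcut.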
Applications
| Architecture | Residual variant |
|---|---|
| ResNet (image classification) | Standard blocks |
| Transformer (all LLMs) | Two residuals per block (attn + FFN) |
| U-Net (segmentation) | Skip connections between encoder/decoder |
| DenseNet | Dense skip connections across all layers |
| EfficientNet | MBConv blocks with residuals |
Trade-offs
- Residual connections add essentially zero parameter cost (the identity shortcut has no parameters unless a projection is needed).
- Memory cost: during backpropagation, activations at each residual junction must be stored. Gradient checkpointing trades memory for recompute.
- For very shallow networks (2–3 layers), residuals provide minimal benefit.
- Pre-LN placement combined with residuals is currently the most stable configuration for large language models.
Links
- Layer Normalization — always paired with residuals in transformers (Pre-LN or Post-LN)
- Batch Normalization — paired with residuals in ResNets (BN→ReLU→Conv→BN→ReLU→Conv + shortcut)
- Backpropagation — residuals fundamentally change the gradient flow computation graph
- Weight Initialization — initialise the residual branch close to zero so each block starts as a near-identity
- Activation Functions — GELU/ReLU used inside residual blocks