Convolution, Padding, Stride, and Pooling

Definition

The core spatial operations in CNNs: convolution applies a learnable filter over local receptive fields; padding preserves spatial dimensions; stride controls the filter's step size; pooling downsamples feature maps.

Intuition

Convolution detects local patterns (edges, textures) using shared weights, making the operation translation-equivariant and vastly more parameter-efficient than fully connected layers. Pooling introduces translation invariance and reduces computation. Many deep learning libraries implement cross-correlation (no kernel flip), though the term “convolution” is used throughout.

Formal Description

Convolution (cross-correlation in practice): the output element at position (i, j) is

S(i, j) = Σ_m Σ_n I(i + m, j + n) · K(m, n)

where I is the input and K is the filter. The filter K is shared across all spatial positions.
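As a sketch, this cross-correlation can be written as a few explicit loops in NumPy (a minimal illustration, not an optimized implementation; the vertical-edge kernel and step-edge image are assumed examples):

```python
import numpy as np

def cross_correlate2d(image, kernel, stride=1):
    """Naive valid cross-correlation: slide the kernel over the image,
    multiply elementwise, and sum each window."""
    f = kernel.shape[0]
    n = image.shape[0]
    out = (n - f) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i * stride:i * stride + f, j * stride:j * stride + f]
            result[i, j] = np.sum(patch * kernel)
    return result

# A vertical-edge detector applied to an image with a step edge.
img = np.zeros((6, 6))
img[:, 3:] = 1.0                      # left half 0, right half 1
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])
out = cross_correlate2d(img, edge_kernel)
print(out.shape)   # (4, 4): valid convolution shrinks 6 to 6 - 3 + 1 = 4
```

The nonzero responses in `out` line up with the edge column, illustrating how shared weights detect the same local pattern wherever it appears.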

Output size:

o = ⌊(n + 2p − f) / s⌋ + 1

where n = input size, p = padding, f = filter size, s = stride.
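A one-line helper makes the formula concrete (illustrative only; the 28-pixel input and 5×5 filter are assumed example values):

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    """Output size: floor((n + 2p - f) / s) + 1."""
    return floor((n + 2 * p - f) / s) + 1

print(conv_output_size(28, 5))            # 24: valid, stride 1
print(conv_output_size(28, 5, p=2))       # 28: same padding
print(conv_output_size(28, 5, p=2, s=2))  # 14: floor(27 / 2) + 1
```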

Padding modes:

  • “Valid” convolution: p = 0, output shrinks to n − f + 1
  • “Same” convolution: p = (f − 1)/2 (for odd f), output same size as input (for stride 1)

Multiple output channels: n_c′ filters of size f × f × n_c, each producing one output channel; total parameters = n_c′ × (f × f × n_c + 1) (with biases).
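The parameter count can be checked with a small helper (an illustrative sketch; n_c and n_c′ map to `c_in` and `c_out` below, and the 3×3 RGB example is assumed):

```python
def conv_layer_params(f, c_in, c_out, bias=True):
    """Each of the c_out filters has f * f * c_in weights plus one bias."""
    per_filter = f * f * c_in + (1 if bias else 0)
    return c_out * per_filter

# 10 filters of size 3x3 over an RGB (3-channel) input:
print(conv_layer_params(3, 3, 10))   # 10 * (3*3*3 + 1) = 280
```

Note the count is independent of the input's spatial size, which is the parameter efficiency mentioned above.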

3D input (color images): input n_h × n_w × n_c (n_c = 3 for RGB), filter f × f × n_c; the filter depth must match the input depth, and each filter still produces one 2D output channel.

Pooling: max pooling takes the maximum over an f × f window; average pooling takes the mean; a 2 × 2 window with stride 2 (the common choice) halves each spatial dimension; no learnable parameters.
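A minimal max-pooling sketch in NumPy, assuming a square input and the common 2 × 2 window with stride 2 (the input matrix is an assumed example):

```python
import numpy as np

def max_pool2d(x, f=2, s=2):
    """Max pooling: take the maximum over each f x f window, stepping by s."""
    n = x.shape[0]
    out = (n - f) // s + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            result[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return result

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 0.],
              [9., 2., 1., 8.],
              [3., 7., 4., 6.]])
print(max_pool2d(x))   # [[6. 5.] [9. 8.]]: 4x4 halved to 2x2
```

Shifting the input by a pixel often leaves the pooled maxima unchanged, which is the translation invariance pooling provides.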

Applications

All CNN-based vision models; feature extraction for images. Modern variants include depthwise separable convolution (MobileNet), dilated convolution, and transposed convolution (upsampling).
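To see why depthwise separable convolution (as in MobileNet) is attractive, the standard parameter-count comparison can be sketched (illustrative arithmetic; the 3 × 3 filter and 64-to-128 channel sizes are assumed example values):

```python
def standard_conv_params(f, c_in, c_out):
    # One f x f x c_in filter per output channel.
    return f * f * c_in * c_out

def depthwise_separable_params(f, c_in, c_out):
    # One f x f filter per input channel (depthwise),
    # then a 1x1 pointwise convolution mixing channels.
    return f * f * c_in + c_in * c_out

std = standard_conv_params(3, 64, 128)        # 9 * 64 * 128 = 73728
sep = depthwise_separable_params(3, 64, 128)  # 576 + 8192  = 8768
print(std, sep, round(std / sep, 1))          # roughly an 8x reduction
```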

Trade-offs

  • Parameter sharing assumes the same pattern is useful at every spatial position; this fails for tasks with position-specific structure
  • Pooling loses spatial precision (problematic for detection/segmentation, which use feature pyramids instead)
  • Max pooling is not differentiable at ties (but in practice this is ignored)