Convolution, Padding, Stride, and Pooling

Definition

The core spatial operations in CNNs: convolution applies a learnable filter over local receptive fields; padding preserves spatial dimensions; stride controls the filter's step size; pooling downsamples feature maps.

Intuition

Convolution detects local patterns (edges, textures) using shared weights, making the operation translation-equivariant and vastly more parameter-efficient than fully connected layers. Pooling introduces translation invariance and reduces computation. Many deep learning libraries implement cross-correlation (no kernel flip), though the term “convolution” is used throughout.

Formal Description

Convolution (cross-correlation in practice): the output element at position (i, j) is

S(i, j) = Σ_m Σ_n I(i + m, j + n) · K(m, n)

where I is the input and K is the filter. The filter K is shared across all spatial positions.
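As a sketch, this cross-correlation can be written as a few explicit loops in NumPy (a minimal illustration, not an optimized implementation; the vertical-edge kernel and step-edge image are assumed examples):

```python
import numpy as np

def cross_correlate2d(image, kernel, stride=1):
    """Naive valid cross-correlation: slide the kernel over the image,
    multiply elementwise, and sum each window."""
    f = kernel.shape[0]
    n = image.shape[0]
    out = (n - f) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i * stride:i * stride + f, j * stride:j * stride + f]
            result[i, j] = np.sum(patch * kernel)
    return result

# A vertical-edge detector applied to an image with a step edge.
img = np.zeros((6, 6))
img[:, 3:] = 1.0                      # left half 0, right half 1
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])
out = cross_correlate2d(img, edge_kernel)
print(out.shape)   # (4, 4): valid convolution shrinks 6 to 6 - 3 + 1 = 4
```

The nonzero responses in `out` line up with the edge column, illustrating how shared weights detect the same local pattern wherever it appears.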

Output size:

o = ⌊(n + 2p − f) / s⌋ + 1

where n = input size, p = padding, f = filter size, s = stride.
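A one-line helper makes the formula concrete (illustrative only; the 28-pixel input and 5×5 filter are assumed example values):

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    """Output size: floor((n + 2p - f) / s) + 1."""
    return floor((n + 2 * p - f) / s) + 1

print(conv_output_size(28, 5))            # 24: valid, stride 1
print(conv_output_size(28, 5, p=2))       # 28: same padding
print(conv_output_size(28, 5, p=2, s=2))  # 14: floor(27 / 2) + 1
```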

Padding modes:

  • “Valid” convolution: p = 0, output shrinks to n − f + 1
  • “Same” convolution: p = (f − 1)/2 (for odd f), output same size as input (for stride 1)

Multiple output channels: n_c′ filters of size f × f × n_c, each producing one output channel; total parameters = n_c′ × (f × f × n_c + 1) (with biases).
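The parameter count can be checked with a small helper (an illustrative sketch; n_c and n_c′ map to `c_in` and `c_out` below, and the 3×3 RGB example is assumed):

```python
def conv_layer_params(f, c_in, c_out, bias=True):
    """Each of the c_out filters has f * f * c_in weights plus one bias."""
    per_filter = f * f * c_in + (1 if bias else 0)
    return c_out * per_filter

# 10 filters of size 3x3 over an RGB (3-channel) input:
print(conv_layer_params(3, 3, 10))   # 10 * (3*3*3 + 1) = 280
```

Note the count is independent of the input's spatial size, which is the parameter efficiency mentioned above.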

3D input (color images): input n_h × n_w × n_c (n_c = 3 for RGB), filter f × f × n_c; the filter depth must match the input depth, and each filter still produces one 2D output channel.

Pooling: max pooling takes the maximum over an f × f window; average pooling takes the mean; a 2 × 2 window with stride 2 (the common choice) halves each spatial dimension; no learnable parameters.
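A minimal max-pooling sketch in NumPy, assuming a square input and the common 2 × 2 window with stride 2 (the input matrix is an assumed example):

```python
import numpy as np

def max_pool2d(x, f=2, s=2):
    """Max pooling: take the maximum over each f x f window, stepping by s."""
    n = x.shape[0]
    out = (n - f) // s + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            result[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return result

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 0.],
              [9., 2., 1., 8.],
              [3., 7., 4., 6.]])
print(max_pool2d(x))   # [[6. 5.] [9. 8.]]: 4x4 halved to 2x2
```

Shifting the input by a pixel often leaves the pooled maxima unchanged, which is the translation invariance pooling provides.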

Applications

All CNN-based vision models; feature extraction for images. Modern variants include depthwise separable convolution (MobileNet), dilated convolution, and transposed convolution (upsampling).
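To see why depthwise separable convolution (as in MobileNet) is attractive, the standard parameter-count comparison can be sketched (illustrative arithmetic; the 3 × 3 filter and 64-to-128 channel sizes are assumed example values):

```python
def standard_conv_params(f, c_in, c_out):
    # One f x f x c_in filter per output channel.
    return f * f * c_in * c_out

def depthwise_separable_params(f, c_in, c_out):
    # One f x f filter per input channel (depthwise),
    # then a 1x1 pointwise convolution mixing channels.
    return f * f * c_in + c_in * c_out

std = standard_conv_params(3, 64, 128)        # 9 * 64 * 128 = 73728
sep = depthwise_separable_params(3, 64, 128)  # 576 + 8192  = 8768
print(std, sep, round(std / sep, 1))          # roughly an 8x reduction
```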

Trade-offs

  • Parameter sharing assumes the same pattern is useful at every spatial position; this fails for tasks with position-specific structure
  • Pooling loses spatial precision (problematic for detection/segmentation, which use feature pyramids instead)
  • Max pooling is not differentiable at ties (but in practice this is ignored)