Convolution, Padding, Stride, and Pooling
Definition
The core spatial operations in CNNs — convolution applies a learnable filter over local receptive fields; padding adds a border (usually zeros) so spatial dimensions can be preserved; stride controls the step size of the filter; pooling downsamples feature maps.
Intuition
Convolution detects local patterns (edges, textures) using shared weights, making the operation translation-equivariant and vastly more parameter-efficient than fully connected layers. Pooling introduces translation invariance and reduces computation. Many deep learning libraries implement cross-correlation (no kernel flip), though the term “convolution” is used throughout.
Formal Description
Convolution (cross-correlation in practice): the output element at position (i, j) is computed as:
output[i, j] = Σ_{a=0}^{f−1} Σ_{b=0}^{f−1} input[i·s + a, j·s + b] · filter[a, b]
The f × f filter is shared across all spatial positions.
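A minimal NumPy sketch of this operation, assuming a square, single-channel input and filter and "valid" (no-padding) mode; the function name `conv2d` is illustrative, not a library API:

```python
import numpy as np

def conv2d(x, w, stride=1):
    """Valid-mode cross-correlation of 2D input x with 2D filter w."""
    f = w.shape[0]
    out = (x.shape[0] - f) // stride + 1
    y = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            # element-wise product of the local patch with the shared filter
            patch = x[i * stride:i * stride + f, j * stride:j * stride + f]
            y[i, j] = np.sum(patch * w)
    return y

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3)) / 9.0          # 3x3 box (mean) filter
y = conv2d(x, w)                   # shape (2, 2); y[0, 0] == 5.0
```

The same weights `w` slide over every position, which is what makes the operation translation-equivariant.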
Output size: ⌊(n + 2p − f)/s⌋ + 1
where n = input size, p = padding, f = filter size, s = stride.
Padding modes:
- “Valid” convolution: p = 0, output shrinks to n − f + 1 (for stride 1)
- “Same” convolution: p = (f − 1)/2 (requires odd f), output same size as input (for stride 1)
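The output-size formula and both padding modes can be checked with a one-line helper (the name `conv_output_size` is my own, not a library function):

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial output size: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

conv_output_size(28, 5)             # valid, p=0: shrinks to 24
conv_output_size(28, 5, p=2)        # same, p=(f-1)/2=2: stays 28
conv_output_size(28, 5, p=2, s=2)   # stride 2: downsamples to 14
```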
Multiple output channels: n_c′ filters, each producing one output channel; total parameters = n_c′ · (f · f · n_c + 1) (with biases).
3D input (color images): input is n × n × n_c (n_c = 3 for RGB), filter is f × f × n_c.
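As a worked example of the parameter count (the helper name is hypothetical), a first layer with 64 filters of size 3 × 3 over an RGB input gives the familiar 1 792 parameters:

```python
def conv_layer_params(f, n_c_in, n_c_out):
    """Each filter holds f*f*n_c_in weights plus one bias."""
    return n_c_out * (f * f * n_c_in + 1)

conv_layer_params(3, 3, 64)   # 64 * (3*3*3 + 1) = 1792
```

Note the count is independent of the input's spatial size — the consequence of weight sharing.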
Pooling: max pooling takes the maximum over an f × f window; average pooling takes the mean; a 2 × 2 window with stride 2 (the common choice) halves the spatial dimensions; no learnable parameters.
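Max pooling can be sketched the same way as the convolution above, assuming a square single-channel input (`max_pool2d` is an illustrative name, not a library call):

```python
import numpy as np

def max_pool2d(x, f=2, s=2):
    """Max pooling with an f x f window and stride s; no parameters."""
    out = (x.shape[0] - f) // s + 1
    y = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            y[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return y

x = np.array([[1., 3., 2., 1.],
              [4., 6., 5., 0.],
              [1., 2., 9., 8.],
              [0., 3., 4., 7.]])
max_pool2d(x)   # [[6., 5.], [3., 9.]] — each 2x2 block reduced to its max
```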
Applications
All CNN-based vision models; feature extraction for images. Modern variants include depthwise separable convolution (MobileNet), dilated convolution, and transposed convolution (upsampling).
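To illustrate why depthwise separable convolution matters, the parameter counts of the two variants can be compared directly (helper names are my own; biases omitted for simplicity):

```python
def standard_conv_params(f, c_in, c_out):
    # c_out filters, each f x f x c_in
    return c_out * f * f * c_in

def depthwise_separable_params(f, c_in, c_out):
    depthwise = c_in * f * f        # one f x f filter per input channel
    pointwise = c_out * c_in        # 1 x 1 conv mixes channels
    return depthwise + pointwise

standard_conv_params(3, 128, 128)            # 147456
depthwise_separable_params(3, 128, 128)      # 1152 + 16384 = 17536, ~8x fewer
```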
Trade-offs
- Parameter sharing assumes translation invariance — fails for tasks with position-specific patterns
- Pooling loses spatial precision (problematic for detection/segmentation, which use feature pyramids instead)
- Max pooling is not differentiable at ties (but in practice this is ignored)
Links
- cnn_architecture
- backpropagation (in 01_foundations/deep_learning_theory)