CNN Architecture

Definition

A feedforward architecture for processing grid-structured data (images, time-frequency spectrograms) using stacked convolution, pooling, and fully connected layers, often augmented with residual connections to enable very deep networks.

Intuition

Early layers learn low-level features (edges, colors); middle layers combine them into textures and parts; late layers form semantic concepts. Residual connections allow gradients to flow directly to early layers, making depth tractable.

Formal Description

Canonical structure: INPUT → [CONV → BN → ReLU]* → POOL → … → FC → OUTPUT
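One [CONV → BN → ReLU] stage can be sketched in plain numpy. This is a minimal illustration, not a production implementation: single-channel input, "valid" cross-correlation, and an inference-style batch norm with no learned scale/shift (the helper names are mine).

```python
import numpy as np

def conv2d(x, w):
    """'Valid' 2-D cross-correlation of a single-channel image x with kernel w."""
    kh, kw = w.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * w)
    return out

def batch_norm(x, eps=1e-5):
    """Normalize to zero mean / unit variance (no learned gamma/beta here)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy single-channel "image"
w = rng.standard_normal((3, 3))   # one 3x3 filter

y = relu(batch_norm(conv2d(x, w)))   # one CONV -> BN -> ReLU stage
assert y.shape == (6, 6) and (y >= 0).all()
```

Stacking such stages, interleaved with pooling, and ending in fully connected layers gives the canonical pipeline above.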

Feature map dimensions: as depth increases, spatial resolution H × W decreases (via pooling or strided convolution) while channel count C increases; a typical progression (e.g., VGG-style): 224×224×3 → 112×112×64 → 56×56×128 → 28×28×256 → 14×14×512 → 7×7×512.
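The spatial sizes follow the standard convolution output formula, floor((n + 2p − k) / s) + 1 for input size n, kernel k, stride s, padding p. A quick sanity check:

```python
def conv_out_size(n, k, s=1, p=0):
    """Output spatial size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# A 3x3 conv with padding 1 and stride 1 preserves spatial size:
assert conv_out_size(224, k=3, s=1, p=1) == 224
# A 2x2 max pool with stride 2 halves it:
assert conv_out_size(224, k=2, s=2, p=0) == 112
```

The same formula applies independently to height and width.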

Why convolutions work:

  • Parameter sharing: the same filter detects the same pattern everywhere → massive parameter reduction vs. FC
  • Sparse connections: each output unit depends only on a local receptive field
  • Translation equivariance: convolving then shifting = shifting then convolving
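Translation equivariance can be checked numerically in 1-D: shifting the input and then convolving gives the same result as convolving and then shifting the output (away from the borders of the "valid" region). A small sketch, with an illustrative helper of my own naming:

```python
import numpy as np

def conv1d_valid(x, w):
    """'Valid' 1-D cross-correlation: slide w over x with stride 1."""
    k = len(w)
    return np.array([np.dot(x[i:i+k], w) for i in range(len(x) - k + 1)])

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w = rng.standard_normal(3)

# Shift input by 2, then convolve ...
lhs = conv1d_valid(np.roll(x, 2), w)[2:]   # drop windows touching wrapped values
# ... equals convolve, then shift the output by 2.
rhs = conv1d_valid(x, w)[:-2]
assert np.allclose(lhs, rhs)
```

The same filter weights respond identically wherever the pattern appears, which is exactly the parameter-sharing argument above.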

Residual (skip) connections (ResNet):

y = F(x) + x

where F(x) is a stack of convolutions. If F(x) = 0, the block reduces to the identity, making it easy for optimization to preserve the input. Enables training networks with 50–1000+ layers without vanishing gradients.
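The identity-preserving behavior is easy to verify directly: if the learned residual F is the zero function, the block passes its input through unchanged. A minimal sketch:

```python
import numpy as np

def residual_block(x, F):
    """y = F(x) + x: the skip connection adds the input back to the residual."""
    return F(x) + x

x = np.array([1.0, 2.0, 3.0])

# With a zero residual, the block is exactly the identity:
y = residual_block(x, lambda v: np.zeros_like(v))
assert np.allclose(y, x)
```

This is why depth becomes tractable: a block that has learned nothing yet does no harm, and gradients flow through the skip path unattenuated.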

Projection shortcut: when dimensions change (F(x) and x have different shapes), use a 1×1 convolution W_s on the skip connection, giving y = F(x) + W_s·x.
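A 1×1 convolution is just a per-pixel linear map over channels, so the projection shortcut can be modeled as a matrix multiply on the channel axis. A sketch with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8, 64))    # H x W x C_in feature map
W_s = rng.standard_normal((64, 128))   # 1x1 conv weights: C_in -> C_out

# Projection shortcut: match the channel count of F(x) so the
# elementwise addition F(x) + shortcut is well-defined.
shortcut = x @ W_s                     # applied independently at every pixel
assert shortcut.shape == (8, 8, 128)
```

When the spatial size also changes, the 1×1 convolution is applied with the same stride as the residual branch.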

Applications

Image classification (ResNet, VGG, EfficientNet), object detection backbones, feature extraction for downstream tasks.

Trade-offs

  • Deep CNNs require significant compute and memory
  • Fixed-size receptive fields miss very long-range dependencies (Vision Transformers address this)
  • ResNets still dominate many production vision tasks despite ViT advances