Multi-Layer Perceptron (MLP)

Definition

A fully-connected feedforward neural network consisting of an input layer, one or more hidden layers, and an output layer; each neuron in a layer is connected to every neuron in the adjacent layers via learned weights, and non-linear activation functions are applied at each hidden layer.

Intuition

A single linear layer can only represent linear decision boundaries. Stacking layers with non-linear activations (ReLU, sigmoid, tanh) allows the network to compose simple transformations into arbitrarily complex functions. Each hidden layer learns a new representation of the data; deeper layers capture increasingly abstract features. Given enough neurons, even a single hidden layer can approximate any continuous function — but depth makes this far more parameter-efficient.
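The need for non-linearity can be verified directly: stacking linear layers without an activation collapses to a single linear map, while inserting a ReLU breaks that equivalence. A minimal NumPy sketch (matrix sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers with no activation in between
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x)      # stacked linear maps
collapsed = (W2 @ W1) @ x      # a single equivalent linear map
print(np.allclose(two_layer, collapsed))  # True: no added expressiveness

# A ReLU between the layers breaks the collapse into one matrix
with_relu = W2 @ np.maximum(0.0, W1 @ x)
print(with_relu)               # generally differs from two_layer
```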

Formal Description

Architecture: an L-layer network with weight matrices W^(l) ∈ ℝ^(n_l × n_(l−1)) and bias vectors b^(l) ∈ ℝ^(n_l) for l = 1, …, L.

Forward pass:

a^(l) = σ_l(W^(l) a^(l−1) + b^(l)),   l = 1, …, L

where a^(0) = x (the input) and σ_l is the activation function at layer l.
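The recursion above can be sketched in a few lines of NumPy; layer widths, the small init scale, and the ReLU/linear-output choice here are illustrative assumptions, not part of the definition:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Forward pass: a^(l) = relu(W^(l) a^(l-1) + b^(l)); linear output layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    return weights[-1] @ a + biases[-1]   # no activation on the final layer here

rng = np.random.default_rng(0)
sizes = [3, 8, 8, 2]   # hypothetical layer widths n_0, ..., n_L
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

y = forward(rng.normal(size=3), weights, biases)
print(y.shape)  # (2,)
```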

Common activation functions:

  • ReLU: ReLU(x) = max(0, x) — most widely used in hidden layers; avoids vanishing gradients for positive inputs
  • Sigmoid: σ(x) = 1 / (1 + e^(−x)) — used for binary output; saturates at extremes
  • Tanh: tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) — zero-centered; historically preferred over sigmoid for hidden layers
  • Softmax: softmax(z)_i = e^(z_i) / Σ_j e^(z_j) — used for multi-class output
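The four activations above are one-liners in NumPy; the only subtlety worth showing is the max-subtraction trick that keeps softmax numerically stable for large logits:

```python
import numpy as np

def relu(z):    return np.maximum(0.0, z)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def tanh(z):    return np.tanh(z)

def softmax(z):
    # Subtracting the max leaves the result unchanged but avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 0.0, -1.0]))
print(p.sum())  # 1.0 (outputs form a probability distribution)
```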

Backpropagation: apply the chain rule recursively from output to input to compute ∂L/∂W^(l) and ∂L/∂b^(l):

δ^(L) = ∇_(a^(L)) L ⊙ σ′_L(z^(L)),   δ^(l) = (W^(l+1))ᵀ δ^(l+1) ⊙ σ′_l(z^(l)),
∂L/∂W^(l) = δ^(l) (a^(l−1))ᵀ,   ∂L/∂b^(l) = δ^(l)

where z^(l) = W^(l) a^(l−1) + b^(l) is the pre-activation at layer l.

Weights are updated via gradient descent (or Adam): W^(l) ← W^(l) − η ∂L/∂W^(l), with learning rate η.
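For a one-hidden-layer network with MSE loss, the full backward pass and one gradient-descent step fit in a short sketch. The toy weights and inputs below are hypothetical; the gradients follow the chain rule exactly as described above:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1
    h = np.maximum(0.0, z1)          # ReLU hidden layer
    return z1, h, W2 @ h + b2        # linear output

def mse_loss(y_hat, y):
    return 0.5 * np.sum((y_hat - y) ** 2)

# Tiny deterministic network (toy values for illustration)
x = np.array([1.0, 2.0]); y = np.array([1.0])
W1 = np.array([[0.1, 0.2], [0.3, -0.4]]); b1 = np.zeros(2)
W2 = np.array([[0.5, -0.5]]);             b2 = np.zeros(1)

z1, h, y_hat = forward(x, W1, b1, W2, b2)
loss_before = mse_loss(y_hat, y)

# Backward pass: chain rule from output to input
d_out = y_hat - y                    # dL/dy_hat for MSE
dW2 = np.outer(d_out, h); db2 = d_out
dh  = W2.T @ d_out
dz1 = dh * (z1 > 0)                  # ReLU derivative as a 0/1 mask
dW1 = np.outer(dz1, x);   db1 = dz1

# One gradient-descent step with learning rate eta
eta = 0.1
W1 -= eta * dW1; b1 -= eta * db1
W2 -= eta * dW2; b2 -= eta * db2

loss_after = mse_loss(forward(x, W1, b1, W2, b2)[2], y)
print(loss_before, loss_after)       # the loss decreases after the step
```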

Universal Approximation Theorem: a single hidden layer MLP with a non-polynomial activation and sufficiently many neurons can approximate any continuous function on a compact domain to arbitrary precision. The theorem guarantees expressibility, not trainability or generalization — depth is needed for practical efficiency.

Parameter count: for a network with layer sizes n_0, n_1, …, n_L, the total number of parameters (weights plus biases) is Σ_(l=1…L) (n_(l−1) n_l + n_l).
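The count is easy to compute from the list of layer widths; the example widths below (a hypothetical MNIST-sized network) are purely illustrative:

```python
def mlp_param_count(sizes):
    """Total parameters for layer widths [n_0, ..., n_L]:
    each layer contributes n_l * n_(l-1) weights plus n_l biases."""
    return sum(m * (n + 1) for n, m in zip(sizes, sizes[1:]))

print(mlp_param_count([784, 256, 128, 10]))  # 235146
```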

Applications

Tabular data classification and regression (often outperformed by gradient boosted trees), final classification head in CNNs and Transformers, function approximation, reinforcement learning value/policy networks, autoencoders (encoder/decoder components).

Trade-offs

  • No inductive bias for spatial structure (CNNs) or sequential structure (RNNs/Transformers) — requires more data when structure is present
  • Fully connected layers have O(n_(l−1) · n_l) parameters per layer — expensive for high-dimensional inputs (e.g., raw images)
  • Susceptible to vanishing gradients without careful initialization (Xavier/He) and normalization (BatchNorm)
  • Depth helps expressiveness but complicates optimization; residual connections (ResNet) extend the MLP idea to very deep networks
  • Overfitting risk grows with model size; mitigated by dropout, L2 regularization, early stopping
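The Xavier/He point above is concrete enough to sketch: He initialization draws weights with standard deviation sqrt(2 / fan_in), which keeps pre-activation variance roughly constant across ReLU layers. The sizes below are illustrative:

```python
import numpy as np

def he_init(fan_in, fan_out, rng):
    """He (Kaiming) initialization for ReLU layers: std = sqrt(2 / fan_in)."""
    return rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = he_init(512, 256, rng)
print(W.std())  # close to sqrt(2/512) ~= 0.0625
```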