Multi-Layer Perceptron (MLP)
Definition
A fully-connected feedforward neural network consisting of an input layer, one or more hidden layers, and an output layer; each neuron in a layer is connected to every neuron in the adjacent layers via learned weights, and non-linear activation functions are applied at each hidden layer.
Intuition
A single linear layer can only represent linear decision boundaries. Stacking layers with non-linear activations (ReLU, sigmoid, tanh) allows the network to compose simple transformations into arbitrarily complex functions. Each hidden layer learns a new representation of the data; deeper layers capture increasingly abstract features. Given enough neurons, even a single hidden layer can approximate any continuous function — but depth makes this far more parameter-efficient.
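A concrete illustration of this composition: XOR is not linearly separable, yet a single ReLU hidden layer of two units computes it exactly. The weights below are hand-set for illustration (an assumption, not learned values):

```python
import numpy as np

# Hand-crafted 2-2-1 ReLU network computing XOR.
# h1 = ReLU(x1 + x2), h2 = ReLU(x1 + x2 - 1), y = h1 - 2*h2
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])

def xor_net(x):
    h = np.maximum(0.0, W1 @ x + b1)   # hidden ReLU layer
    return W2 @ h                      # linear readout

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_net(np.array(x, dtype=float)))  # → 0, 1, 1, 0
```

No single linear layer can produce this input-output table, which is exactly the gap the hidden nonlinearity closes.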
Formal Description
Architecture: an $L$-layer network with weight matrices $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ and bias vectors $b^{(l)} \in \mathbb{R}^{n_l}$ for $l = 1, \dots, L$.
Forward pass:
$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma^{(l)}(z^{(l)})$
where $a^{(0)} = x$ (input) and $\sigma^{(l)}$ is the activation function at layer $l$.
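The forward recursion above can be sketched in a few lines of NumPy; the function name and the choice to leave the output layer linear are illustrative assumptions:

```python
import numpy as np

def forward(x, layers, act):
    """MLP forward pass. layers: list of (W, b) pairs; act: hidden nonlinearity."""
    a = x                             # a^(0) = x
    for W, b in layers[:-1]:
        a = act(W @ a + b)            # z^(l) = W a^(l-1) + b, then sigma(z^(l))
    W, b = layers[-1]
    return W @ a + b                  # output layer kept linear here

# toy usage: a 3-4-2 network with ReLU hidden units and random weights
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
y = forward(rng.normal(size=3), layers, lambda z: np.maximum(0.0, z))
print(y.shape)  # (2,)
```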
Common activation functions:
- ReLU: $\mathrm{ReLU}(z) = \max(0, z)$ — most widely used in hidden layers; avoids vanishing gradients for positive inputs
- Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$ — used for binary output; saturates at extremes
- Tanh: $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ — zero-centered; preferred over sigmoid for hidden layers historically
- Softmax: $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ — used for multi-class output
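These activations are one-liners in NumPy; the only subtlety worth showing is the max-subtraction trick for softmax, which avoids overflow without changing the result:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# tanh is available directly as np.tanh

def softmax(z):
    # subtracting max(z) cancels in the ratio, so the output is
    # identical but np.exp never overflows
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])).sum())  # → 1.0
```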
Backpropagation: apply the chain rule recursively from output to input to compute $\frac{\partial \mathcal{L}}{\partial W^{(l)}}$ and $\frac{\partial \mathcal{L}}{\partial b^{(l)}}$:
$\delta^{(L)} = \nabla_{a^{(L)}} \mathcal{L} \odot \sigma'(z^{(L)}), \qquad \delta^{(l)} = \big((W^{(l+1)})^{\top} \delta^{(l+1)}\big) \odot \sigma'(z^{(l)})$
$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)}$
Weights are updated via gradient descent (or Adam): $W^{(l)} \leftarrow W^{(l)} - \eta \, \frac{\partial \mathcal{L}}{\partial W^{(l)}}$.
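The recursion and update rule can be written out explicitly for a small network. The sketch below assumes a 2-3-1 MLP with a sigmoid hidden layer, a linear output, and squared-error loss; all names and the specific sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, x):
    W1, b1, W2, b2 = params
    a1 = sigmoid(W1 @ x + b1)                # hidden activation a^(1)
    z2 = W2 @ a1 + b2                        # linear output
    return a1, z2

def loss(params, x, y):
    _, z2 = forward(params, x)
    return 0.5 * np.sum((z2 - y) ** 2)

def backprop(params, x, y):
    W1, b1, W2, b2 = params
    a1, z2 = forward(params, x)
    delta2 = z2 - y                          # dL/dz2 for squared error
    dW2, db2 = np.outer(delta2, a1), delta2  # delta^(l) (a^(l-1))^T
    delta1 = (W2.T @ delta2) * a1 * (1 - a1) # sigmoid'(z) = a(1-a)
    dW1, db1 = np.outer(delta1, x), delta1
    return dW1, db1, dW2, db2

# one gradient-descent step: W <- W - eta * dL/dW
params = [np.full((3, 2), 0.1), np.zeros(3), np.full((1, 3), 0.1), np.zeros(1)]
x, y, eta = np.array([1.0, -1.0]), np.array([1.0]), 0.5
grads = backprop(params, x, y)
params = [p - eta * g for p, g in zip(params, grads)]
```

A finite-difference check (perturb one weight, recompute the loss) is the standard way to verify such hand-derived gradients.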
Universal Approximation Theorem: a single hidden layer MLP with a non-polynomial activation and sufficiently many neurons can approximate any continuous function on a compact domain to arbitrary precision. The theorem guarantees expressibility, not trainability or generalization — depth is needed for practical efficiency.
Parameter count: for a network with layer sizes $n_0, n_1, \dots, n_L$: $\sum_{l=1}^{L} \left( n_{l-1} n_l + n_l \right)$ (weights plus biases).
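The sum is a one-liner to evaluate; for example, a 784-128-10 network (MNIST-sized, used here purely as an illustration) has 101,770 parameters:

```python
def mlp_param_count(sizes):
    # sizes = [n0, n1, ..., nL]; layer l contributes n_{l-1}*n_l weights + n_l biases
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

print(mlp_param_count([784, 128, 10]))  # → 101770
```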
Applications
Tabular data classification and regression (often outperformed by gradient boosted trees), final classification head in CNNs and Transformers, function approximation, reinforcement learning value/policy networks, autoencoders (encoder/decoder components).
Trade-offs
- No inductive bias for spatial structure (CNNs) or sequential structure (RNNs/Transformers) — requires more data when structure is present
- Fully connected layers have $O(n_{l-1} n_l)$ parameters per layer — expensive for high-dimensional inputs (e.g., raw images)
- Susceptible to vanishing gradients without careful initialization (Xavier/He) and normalization (BatchNorm)
- Depth helps expressiveness but complicates optimization; residual connections (ResNet) extend the MLP idea to very deep networks
- Overfitting risk grows with model size; mitigated by dropout, L2 regularization, early stopping
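Two of the mitigations above fit in a few lines each. The sketch below shows He initialization (variance $2/n_{\text{in}}$, suited to ReLU layers) and inverted dropout (scaling at train time so inference code is unchanged); function names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    # He initialization: std sqrt(2/n_in) keeps ReLU activation
    # variance roughly constant across layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

def dropout(a, p, training):
    # inverted dropout: zero each unit with probability p and scale
    # survivors by 1/(1-p), so expected activations match at test time
    if not training or p == 0.0:
        return a
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

W = he_init(100, 50)                      # 50x100 weight matrix
a = dropout(np.ones(10), 0.5, training=True)
```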