# Representation Learning

## Core Idea
Representation learning discovers compact, informative encodings of raw data that capture structure useful for downstream tasks. Rather than hand-engineering features, the model learns what to extract. Autoencoders do this through reconstruction; contrastive methods do it by pulling similar examples together and pushing dissimilar examples apart.
## Mathematical Formulation

### Autoencoder
An autoencoder comprises an encoder $f_\theta$ and a decoder $g_\phi$, trained to minimise reconstruction error:

$$\mathcal{L}(\theta, \phi) = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - g_\phi(f_\theta(x_i)) \right\|^2$$

The bottleneck layer $z = f_\theta(x)$ is the latent representation.
```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim)
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z
```

### Variational Autoencoder (VAE)
The VAE places a prior $p(z) = \mathcal{N}(0, I)$ on the latent space and trains an approximate posterior $q_\phi(z \mid x)$ alongside the decoder $p_\theta(x \mid z)$.

Evidence Lower Bound (ELBO):

$$\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

The KL term forces the posterior to match the prior, regularising the latent space. The reparameterisation trick enables backprop through the sampling step: $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
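As a minimal sketch of the above, a PyTorch VAE with the reparameterisation trick and the negative ELBO as loss. Layer widths and the MSE reconstruction term are illustrative choices, not prescribed by the text:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)       # predicts mu
        self.fc_logvar = nn.Linear(256, latent_dim)   # predicts log sigma^2
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterisation trick: z = mu + sigma * eps, eps ~ N(0, I)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.decoder(z), mu, logvar

def vae_loss(x_hat, x, mu, logvar):
    # Negative ELBO: reconstruction error plus KL(q(z|x) || N(0, I)),
    # using the closed-form KL between two Gaussians.
    recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Predicting $\log \sigma^2$ rather than $\sigma$ keeps the variance positive without a constraint on the network output.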
### Contrastive Learning
Learns representations such that similar (positive) pairs have nearby embeddings and dissimilar (negative) pairs have distant embeddings.
InfoNCE / NT-Xent loss (SimCLR), for a positive pair $(i, j)$ in a batch of $N$ augmented pairs:

$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

where $\mathrm{sim}(u, v) = u^\top v / (\|u\|\,\|v\|)$ is cosine similarity and $\tau$ is a temperature parameter.
Positive pairs are created via data augmentation (crop, flip, colour jitter for images; token masking for text).
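A batch-wise sketch of the NT-Xent loss, under the convention that the positive for row $i$ sits at index $i + N$ in the concatenated batch; the temperature default is illustrative:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    # z1[i] and z2[i] are embeddings of two augmented views of example i.
    n = z1.shape[0]
    # Normalise so the dot product equals cosine similarity.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2N, d)
    sim = z @ z.T / tau                                 # pairwise similarities
    # Exclude self-similarity from the denominator.
    sim.fill_diagonal_(float("-inf"))
    # The positive for index i is i + n (and vice versa).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)
```

Treating each row as a (2N − 1)-way classification problem, with the positive as the correct class, is exactly the softmax form of the loss above.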
## Inductive Bias
- Autoencoder: assumes the data lies on a low-dimensional manifold; all information must pass through the bottleneck.
- VAE: assumes a smooth, continuous latent space with a Gaussian prior; supports generation by sampling $z \sim p(z)$ and decoding.
- Contrastive: assumes that augmented views of the same instance are semantically equivalent; requires careful negative mining.
## Training Objective
| Method | Objective |
|---|---|
| Autoencoder | Minimise reconstruction loss (MSE or BCE) |
| VAE | Maximise ELBO = reconstruction − KL divergence |
| Contrastive (SimCLR) | Maximise agreement between augmented views |
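The autoencoder row of the table can be sketched as a single training step; `train_step` is a hypothetical helper name, and the tuple-returning model interface follows the `Autoencoder` class above:

```python
import torch
import torch.nn as nn

def train_step(model, x, optimizer):
    # One gradient step minimising MSE reconstruction loss.
    # Assumes model(x) returns (reconstruction, latent), as the
    # Autoencoder class above does.
    optimizer.zero_grad()
    x_hat, _ = model(x)
    loss = nn.functional.mse_loss(x_hat, x)
    loss.backward()
    optimizer.step()
    return loss.item()
```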
## Strengths
- Pre-trained representations transfer to downstream tasks with little labelled data.
- Autoencoders provide interpretable compression and can detect anomalies via high reconstruction error.
- VAEs enable controllable generation and latent-space interpolation.
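The anomaly-detection use mentioned above can be sketched as a scoring helper; the function name and the callable interface are illustrative assumptions:

```python
import torch

@torch.no_grad()
def anomaly_scores(reconstruct, x):
    # Per-example mean squared reconstruction error; higher = more anomalous.
    # `reconstruct` is any x -> x_hat map, e.g. lambda t: autoencoder(t)[0].
    x_hat = reconstruct(x)
    return ((x - x_hat) ** 2).mean(dim=1)
```

In practice a threshold is chosen from a high percentile of scores on held-out normal data; examples scoring above it are flagged as anomalies.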
## Weaknesses
- Representations may not capture task-relevant structure if the reconstruction/contrastive objective is misaligned with the downstream task.
- VAEs often produce blurry reconstructions, because the expected reconstruction objective averages over plausible outputs.
- Contrastive learning requires large batch sizes or memory banks for sufficient negatives.
## Variants
- Denoising autoencoder: corrupts inputs before encoding; forces the encoder to learn robust representations.
- Sparse autoencoder: adds L1 penalty on activations; encourages sparse, disentangled representations.
- β-VAE: scales the KL term by a factor $\beta > 1$; encourages disentanglement.
- BYOL / SimSiam: self-supervised methods that avoid the need for negative pairs by predicting one view's representation from another.
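The denoising-autoencoder corruption step can be sketched as follows; the noise level and masking probability are illustrative defaults:

```python
import torch

def corrupt(x, noise_std=0.3, mask_prob=0.0):
    # Corrupt inputs with Gaussian noise and optional random masking;
    # the denoising autoencoder is trained to reconstruct the clean x
    # from corrupt(x).
    noisy = x + noise_std * torch.randn_like(x)
    if mask_prob > 0:
        keep = (torch.rand_like(x) > mask_prob).float()
        noisy = noisy * keep
    return noisy
```

The loss compares the reconstruction against the clean input, not the corrupted one, which is what forces the encoder to learn robust structure.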
## References
- Kingma, D.P. & Welling, M. (2014). “Auto-Encoding Variational Bayes.” ICLR.
- Chen, T. et al. (2020). “A Simple Framework for Contrastive Learning of Visual Representations.” ICML.
- Bengio, Y. et al. (2013). “Representation Learning: A Review and New Perspectives.” IEEE TPAMI.