Representation Learning

Core Idea

Representation learning discovers compact, informative encodings of raw data that capture structure useful for downstream tasks. Rather than hand-engineering features, the model learns what to extract. Autoencoders do this through reconstruction; contrastive methods do it by pulling similar examples together and pushing dissimilar examples apart.

Mathematical Formulation

Autoencoder

An autoencoder comprises an encoder $f_\theta: \mathbb{R}^d \to \mathbb{R}^k$ and a decoder $g_\phi: \mathbb{R}^k \to \mathbb{R}^d$, trained to minimise reconstruction error:

$$\mathcal{L}(\theta, \phi) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - g_\phi(f_\theta(x_i)) \rVert^2$$

The bottleneck layer $z = f_\theta(x)$ is the latent representation.

import torch
import torch.nn as nn
 
class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim)
        )
 
    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

Variational Autoencoder (VAE)

The VAE places a prior $p(z) = \mathcal{N}(0, I)$ on the latent space and trains an approximate posterior $q_\phi(z \mid x)$.

Evidence Lower Bound (ELBO):

$$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

The KL term forces the posterior to match the prior, regularising the latent space. The reparameterisation trick enables backprop through the sampling step: $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
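The reparameterised forward pass and the negative ELBO can be sketched in the same PyTorch style as the autoencoder above; the single-linear-layer encoder/decoder and the `negative_elbo` helper are illustrative choices, not a canonical implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE sketch: the encoder outputs mu and log-variance."""
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Linear(input_dim, 2 * latent_dim)  # -> (mu, logvar)
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        eps = torch.randn_like(mu)              # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps  # reparameterisation trick
        return self.decoder(z), mu, logvar

def negative_elbo(x, x_hat, mu, logvar):
    # Reconstruction term plus the closed-form KL(q(z|x) || N(0, I))
    # for a diagonal-Gaussian posterior.
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Minimising `negative_elbo` is equivalent to maximising the ELBO; because sampling is rewritten as a deterministic function of $(\mu, \sigma)$ plus noise, gradients flow into the encoder.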

Contrastive Learning

Contrastive methods learn representations in which similar (positive) pairs have nearby embeddings and dissimilar (negative) pairs have distant embeddings.

InfoNCE / NT-Xent loss (SimCLR):

$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

where $\mathrm{sim}(u, v) = u^\top v / (\lVert u \rVert \lVert v \rVert)$ and $\tau$ is a temperature parameter.

Positive pairs are created via data augmentation (crop, flip, colour jitter for images; token masking for text).
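A compact NT-Xent sketch, assuming `z1[i]` and `z2[i]` are the embeddings of two augmented views of example `i` (the `nt_xent` name and batch layout are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent sketch: z1[i] and z2[i] embed two views of example i."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)  # 2N unit-norm embeddings
    sim = z @ z.T / tau                          # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))            # exclude self-similarity
    # The positive for row i is row i + N (and vice versa); every other
    # embedding in the batch serves as a negative.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)
```

Treating each row as a classification over the other $2N-1$ embeddings, with the paired view as the correct class, reproduces the loss above via standard cross-entropy.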

Inductive Bias

  • Autoencoder: assumes the data lies on a low-dimensional manifold; all information must pass through the bottleneck.
  • VAE: assumes a smooth, continuous latent space with a Gaussian prior; supports generation via sampling $z \sim p(z)$.
  • Contrastive: assumes that augmented views of the same instance are semantically equivalent; requires careful negative mining.

Training Objective

Method                  Objective
Autoencoder             Minimise reconstruction loss (MSE or BCE)
VAE                     Maximise ELBO = reconstruction − KL divergence
Contrastive (SimCLR)    Maximise agreement between augmented views

Strengths

  • Pre-trained representations transfer to downstream tasks with little labelled data.
  • Autoencoders provide interpretable compression and can detect anomalies via high reconstruction error.
  • VAEs enable controllable generation and latent-space interpolation.
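The anomaly-detection use noted above can be sketched as a per-sample reconstruction-error score; `anomaly_scores` is a hypothetical helper assuming a model whose forward returns `(reconstruction, latent)`, like the Autoencoder class earlier.

```python
import torch

def anomaly_scores(model, x):
    # Hypothetical helper: score each sample by its mean squared
    # reconstruction error. Samples the autoencoder reconstructs
    # poorly (high score) are flagged as lying off the learned manifold.
    with torch.no_grad():
        x_hat, _ = model(x)
    return ((x - x_hat) ** 2).mean(dim=1)
```

In practice a threshold on these scores (e.g. a high percentile of scores on held-out normal data) separates inliers from anomalies.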

Weaknesses

  • Representations may not capture task-relevant structure if the reconstruction/contrastive objective is misaligned with the downstream task.
  • VAEs often produce blurry reconstructions: the expected reconstruction loss averages over plausible outputs, washing out fine detail.
  • Contrastive learning requires large batch sizes or memory banks for sufficient negatives.

Variants

  • Denoising autoencoder: corrupts inputs before encoding; forces the encoder to learn robust representations.
  • Sparse autoencoder: adds L1 penalty on activations; encourages sparse, disentangled representations.
  • β-VAE: scales the KL term by a factor $\beta$ (typically $\beta > 1$); encourages disentanglement.
  • BYOL / SimSiam: negative-free methods; they learn by matching representations of two augmented views without explicit negative pairs.
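The denoising variant's corruption step can be sketched with additive Gaussian noise (one common choice; masking noise is another); the `corrupt` helper below is illustrative.

```python
import torch

def corrupt(x, noise_std=0.3):
    # Additive Gaussian corruption: the denoising autoencoder is then
    # trained to reconstruct the clean x from corrupt(x), so the
    # reconstruction target stays uncorrupted.
    return x + noise_std * torch.randn_like(x)
```

During training, only the encoder input is corrupted; the loss is still computed against the clean `x`.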

References

  • Kingma, D.P. & Welling, M. (2014). “Auto-Encoding Variational Bayes.” ICLR.
  • Chen, T. et al. (2020). “A Simple Framework for Contrastive Learning of Visual Representations.” ICML.
  • Bengio, Y. et al. (2013). “Representation Learning: A Review and New Perspectives.” IEEE TPAMI.