Transformers Overview

Definition

The Transformer is a neural network architecture based entirely on self-attention mechanisms, replacing recurrence and convolution for sequence modeling. It has become the dominant architecture across NLP, vision, audio, and multimodal AI.

Intuition

Before Transformers, sequence models (RNNs, LSTMs) processed tokens one at a time, bottlenecking information through a sequential chain that degraded over long distances. Transformers instead allow every token to directly attend to every other token in a single layer, making long-range dependencies as easy to learn as local ones. Combined with large-scale pretraining, this enabled a paradigm shift: train once on massive data, fine-tune cheaply for specific tasks.

Formal Description

Core building block — self-attention:

All tokens are processed in parallel; queries, keys, and values are learned linear projections of the input.
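As a concrete illustration, here is a minimal single-head sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V, in NumPy. The token count, dimensions, and random projection matrices are illustrative only, not taken from any real model:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (T, T): each query scored against each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over keys
    return weights @ V                               # weighted sum of value vectors

# Toy example: 3 tokens, model dimension 4 (hypothetical sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                          # token representations
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))  # learned linear projections
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
```

Every output row is a convex combination of the value vectors, which is what lets any token pull in information from any other token in one step.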

Encoder architecture (BERT-style):

  • Tokenize input → token embeddings + positional encodings
  • Stack identical blocks: [LayerNorm → Multi-Head Self-Attention → Residual] → [LayerNorm → FFN → Residual]
  • Output: contextualized representation for each token
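The block structure above can be sketched in NumPy as follows. This uses the pre-norm layout and a single attention head for brevity; all shapes and weight matrices are hypothetical placeholders, and real implementations use multiple heads plus learned scale/bias in LayerNorm:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean, unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attention(x, Wq, Wk, Wv):
    d_k = Wq.shape[1]
    s = (x @ Wq) @ (x @ Wk).T / np.sqrt(d_k)
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ (x @ Wv)

def encoder_block(x, attn_w, ffn_w):
    # [LayerNorm -> Self-Attention -> Residual]
    x = x + attention(layer_norm(x), *attn_w)
    # [LayerNorm -> FFN -> Residual], ReLU feed-forward
    W1, W2 = ffn_w
    x = x + np.maximum(layer_norm(x) @ W1, 0) @ W2
    return x

# Hypothetical sizes: 5 tokens, model dim 8, FFN hidden dim 16
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
attn_w = tuple(rng.normal(size=(8, 8)) for _ in range(3))
ffn_w = (rng.normal(size=(8, 16)), rng.normal(size=(16, 8)))
y = encoder_block(x, attn_w, ffn_w)
```

Because every sub-layer maps (T, d) to (T, d), identical blocks can be stacked to arbitrary depth.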

Decoder architecture (GPT-style):

  • Same as encoder but with causal masking on self-attention (tokens only attend to past positions)
  • Output: probability distribution over vocabulary at each position for next-token prediction
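Causal masking is implemented by setting attention scores for future positions to negative infinity before the softmax, so they receive zero weight. A small sketch with uniform (zero) scores over 4 hypothetical positions:

```python
import numpy as np

T = 4
scores = np.zeros((T, T))                           # toy pre-softmax attention scores
mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # True above the diagonal = future positions
scores[mask] = -np.inf                              # forbid attending to the future
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
# Row i now distributes attention only over positions 0..i
```

With all scores equal, row i is uniform over its i+1 visible positions; the last row attends equally (0.25 each) to all four.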

Encoder-decoder (T5/original Transformer):

  • Encoder produces representations; decoder cross-attends to encoder outputs while autoregressively generating output tokens

Pretraining objectives:

  • Masked Language Modeling (BERT): randomly mask 15% of tokens; predict the masked tokens
  • Causal Language Modeling (GPT): predict the next token given all previous tokens
  • Span denoising (T5): mask contiguous spans and reconstruct them
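The first two objectives reduce to simple input/target constructions over a token sequence. A sketch with hypothetical token ids (real pipelines also avoid masking special tokens and mix in random/kept tokens for MLM):

```python
import numpy as np

tokens = np.array([11, 42, 7, 99, 3, 58])    # hypothetical token ids

# Causal LM (GPT-style): at each position, predict the next token
clm_inputs, clm_targets = tokens[:-1], tokens[1:]

# Masked LM (BERT-style): replace ~15% of tokens with a [MASK] id, predict the originals
MASK_ID = 0                                   # hypothetical mask id
rng = np.random.default_rng(1)
is_masked = rng.random(len(tokens)) < 0.15
mlm_inputs = np.where(is_masked, MASK_ID, tokens)
mlm_targets = tokens[is_masked]               # loss is computed only at masked positions
```

The key difference: causal LM supervises every position but sees only the past; MLM sees the full context but supervises only the masked positions.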

Scaling: performance scales predictably with model size, data, and compute (Chinchilla scaling laws); models now range from millions to trillions of parameters.
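Two widely used rules of thumb make this concrete: the Chinchilla analysis suggests roughly 20 training tokens per parameter for compute-optimal training, and training compute is commonly approximated as C ≈ 6ND FLOPs for N parameters and D tokens. Both are rough approximations, not exact laws:

```python
def chinchilla_optimal_tokens(params):
    # Rough Chinchilla rule of thumb: ~20 training tokens per parameter
    return 20 * params

def training_flops(params, tokens):
    # Standard approximation: C ~= 6 * N * D
    return 6 * params * tokens

N = 70e9                              # e.g. a 70B-parameter model
D = chinchilla_optimal_tokens(N)      # ~1.4 trillion tokens
C = training_flops(N, D)              # ~5.9e23 FLOPs
```

Under these approximations, a 70B-parameter model is compute-optimally trained on about 1.4T tokens.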

Applications

  • NLP: BERT (classification, NER, QA), GPT (text generation), T5 (seq2seq tasks)
  • Vision: Vision Transformer (ViT) patches images into sequences; CLIP aligns vision and language
  • Multimodal: GPT-4V, Gemini — unified architecture across modalities
  • Science: AlphaFold2 (protein structure), code generation (Codex, GitHub Copilot)

Trade-offs

| Property | Transformer | RNN/LSTM |
|---|---|---|
| Parallelism | Full (training) | Sequential |
| Long-range dependencies | O(1) path length | O(T) path |
| Memory | O(T²) for attention | O(T) |
| Locality bias | None (needs positional encoding) | Built-in |
| Pretrained models available | Yes (vast ecosystem) | Limited |

Key limitations:

  • Quadratic O(T²) attention cost limits context length; addressed by sparse attention (Longformer), linear attention (Performer), and state-space models (Mamba)
  • No built-in inductive bias for spatial or sequential locality; typically needs more data than CNNs/RNNs on small datasets
  • Large pretrained models are expensive to train; fine-tuning and parameter-efficient fine-tuning (PEFT) methods (LoRA, adapters) make deployment feasible
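The LoRA idea can be sketched in a few lines: freeze the pretrained weight W and learn only a low-rank update BA added to it. Dimensions and the rank here are hypothetical toy values; real LoRA also applies a scaling factor and targets specific attention projections:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                          # model dim and low rank (r << d), toy values
W = rng.normal(size=(d, d))          # frozen pretrained weight (not updated)
A = rng.normal(size=(d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d))                 # trainable up-projection; zero init makes the adapter a no-op

def lora_forward(x):
    # Frozen path plus low-rank trainable correction: x W + x A B
    return x @ W + x @ A @ B

x = rng.normal(size=(1, d))
y = lora_forward(x)                  # initially identical to the frozen model's output
```

Only A and B (2·d·r parameters) are trained instead of the full d² weights, which is why adapters make fine-tuning large models cheap.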