04 — Transformers
Attention-based architectures that replaced recurrent models as the dominant approach for sequence modelling and have become the foundation of large language models.
Notes
- Attention Mechanism — scaled dot-product attention, multi-head attention, self-attention vs cross-attention
- Transformer Architecture — encoder, decoder, positional encoding, layer norm, feed-forward blocks
- Transformers Overview — BERT, GPT, T5, scaling laws, pre-training and fine-tuning paradigm
- Word Embeddings — Word2Vec, GloVe, subword tokenization, positional encodings
- Sequence-to-Sequence Models — encoder-decoder attention, beam search, BLEU evaluation
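The scaled dot-product attention named in the first note above is the core operation all of these topics build on. A minimal NumPy sketch (illustrative only; the function name, shapes, and self-attention demo are my own choices, not from the notes):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax: rows sum to 1
    return weights @ V                               # each output row is a weighted
                                                     # average of the value rows

# Self-attention: queries, keys, and values all derive from the same sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                          # 5 tokens, model dimension 8
out = scaled_dot_product_attention(x, x, x)          # out.shape == (5, 8)
```

In cross-attention (the encoder-decoder case from the last note), Q comes from one sequence while K and V come from another; the same function applies unchanged.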