04 — Transformers
Attention-based architectures that replaced recurrent models as the dominant approach for sequence modelling and have become the foundation of large language models.
Notes
- Attention Mechanism — scaled dot-product attention, multi-head attention, self-attention vs cross-attention
- Transformer Architecture — encoder, decoder, positional encoding, layer norm, feed-forward blocks
- Transformers Overview — BERT, GPT, T5, scaling laws, pre-training and fine-tuning paradigm
- Word Embeddings — Word2Vec, GloVe, subword tokenization, positional encodings
- Sequence-to-Sequence Models — encoder-decoder attention, beam search, BLEU evaluation
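The scaled dot-product attention named in the first note above is the core operation all of these topics build on. A minimal NumPy sketch (illustrative only; the function name, shapes, and self-attention demo are my own choices, not from the notes):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax: rows sum to 1
    return weights @ V                               # each output row is a weighted
                                                     # average of the value rows

# Self-attention: queries, keys, and values all derive from the same sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                          # 5 tokens, model dimension 8
out = scaled_dot_product_attention(x, x, x)          # out.shape == (5, 8)
```

In cross-attention (the encoder-decoder case from the last note), Q comes from one sequence while K and V come from another; the same function applies unchanged.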