Word Embeddings
Definition
Dense, low-dimensional real-valued vector representations of tokens where geometric proximity encodes semantic and syntactic similarity; learned from large corpora without explicit supervision.
Intuition
One-hot vectors have no notion of similarity; word embeddings capture that “king” and “queen” are more similar than “king” and “table”; vector arithmetic such as vec(king) − vec(man) + vec(woman) ≈ vec(queen) reveals that linear structure encodes semantic relationships.
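The analogy above can be made concrete with a toy sketch. The vectors below are hand-crafted for illustration (axis 0 loosely "gender", axis 1 loosely "royalty"); real embeddings are learned, not designed.

```python
import numpy as np

# Invented 2-d toy vectors; real embeddings have 100s of learned dimensions.
vecs = {
    "man":   np.array([ 1.0,  0.0]),
    "woman": np.array([-1.0,  0.0]),
    "king":  np.array([ 1.0,  1.0]),
    "queen": np.array([-1.0,  1.0]),
    "table": np.array([ 0.0, -1.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = vecs[b] - vecs[a] + vecs[c]
    return max((w for w in vecs if w not in (a, b, c)),
               key=lambda w: cosine(vecs[w], target))

result = analogy("man", "king", "woman")   # king - man + woman -> "queen"
```

With these vectors, "king" is closer to "queen" than to "table", and the analogy query resolves to "queen".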
Formal Description
Word2Vec: two tasks for training embeddings — (a) Skip-gram: predict context words given the center word; (b) CBOW: predict the center word from its context; skip-gram objective: maximize Σ_t Σ_{−m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t), where p(o | c) = exp(u_o^T v_c) / Σ_{w ∈ V} exp(u_w^T v_c); embedding lookup: v = E x with embedding matrix E ∈ R^{d×|V|}, where x is one-hot.
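The skip-gram softmax can be sketched directly from the formula; a minimal NumPy version, with toy sizes and random (untrained) embedding matrices assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                                # toy vocab size and embedding dim (assumed)
E_in = rng.normal(scale=0.1, size=(V, d))   # center-word ("input") embeddings v_c
E_out = rng.normal(scale=0.1, size=(V, d))  # context-word ("output") embeddings u_o

def skipgram_prob(center, context):
    """p(context | center) via the full softmax over the vocabulary."""
    scores = E_out @ E_in[center]           # u_w . v_c for every w in V
    scores -= scores.max()                  # subtract max for numerical stability
    p = np.exp(scores) / np.exp(scores).sum()
    return p[context]

probs = np.array([skipgram_prob(3, o) for o in range(V)])  # distribution over contexts
```

Note that each query scores the center vector against all |V| output vectors; this O(|V|) cost is exactly what negative sampling avoids.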
Negative sampling: approximate the full softmax over |V| words by sampling K “negative” words for each positive (center, context) pair; loss: −log σ(u_o^T v_c) − Σ_{k=1}^{K} E_{w_k ∼ P_n(w)} log σ(−u_{w_k}^T v_c); K ≈ 5–20 for small datasets, 2–5 for large ones; dramatically reduces per-pair computation from O(|V|) to O(K).
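The per-pair loss can be computed in a few lines. A sketch with assumed toy dimensions and random vectors standing in for the learned embeddings and sampled negatives:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 4, 5                      # embedding dim and number of negatives (assumed)
v_c = rng.normal(size=d)         # center-word vector
u_o = rng.normal(size=d)         # true-context ("positive") output vector
U_neg = rng.normal(size=(K, d))  # K sampled negative output vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# loss = -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)
loss = -(np.log(sigmoid(u_o @ v_c)) + np.log(sigmoid(-U_neg @ v_c)).sum())
```

Only K + 1 dot products are needed per pair, versus |V| for the full softmax.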
GloVe: instead of local window predictions, factorize the global co-occurrence matrix X; objective: J = Σ_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)², where f is a weighting function that down-weights rare pairs and caps frequent ones; captures global statistics directly; typically faster to train on large corpora.
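The GloVe objective is a weighted least-squares fit and vectorizes cleanly. A sketch evaluating J for a toy random co-occurrence matrix and untrained parameters (all sizes assumed; f uses the common x_max = 100, α = 0.75 choice):

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 6, 3                                          # toy sizes (assumed)
X = rng.integers(1, 50, size=(V, V)).astype(float)   # toy co-occurrence counts
W = rng.normal(scale=0.1, size=(V, d))               # word vectors w_i
Wt = rng.normal(scale=0.1, size=(V, d))              # context vectors w~_j
b, bt = np.zeros(V), np.zeros(V)                     # bias terms b_i, b~_j

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function: (x / x_max)^alpha, capped at 1 for frequent pairs."""
    return np.minimum((x / x_max) ** alpha, 1.0)

# J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
err = W @ Wt.T + b[:, None] + bt[None, :] - np.log(X)
J = (f(X) * err ** 2).sum()
```

Training would minimize J by gradient descent on W, Wt, b, bt; only the objective is shown here.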
Debiasing: embeddings trained on real text encode human biases (gender, racial); mitigation steps: (1) identify the bias direction via PCA/SVD on difference vectors of gender-definitional pairs (e.g. he−she); (2) neutralize: project non-definitional words onto the orthogonal complement of the bias direction; (3) equalize: make definitional pairs equidistant from the bias direction; limitations: debiasing is incomplete and can introduce other artifacts.
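The neutralize and equalize steps reduce to simple projections. A simplified sketch in the style of hard debiasing, assuming a unit-norm bias direction g and (for equalize) approximately unit-norm word vectors; the vectors here are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
g = rng.normal(size=d)
g /= np.linalg.norm(g)          # assumed unit-norm bias direction

def neutralize(w, g):
    """Remove the component of w along the bias direction g."""
    return w - (w @ g) * g

def equalize(a, b, g):
    """Make the pair (a, b) differ only along g, equidistant from it."""
    mu_perp = neutralize((a + b) / 2, g)            # shared bias-free part
    s = np.sqrt(max(1.0 - mu_perp @ mu_perp, 0.0))  # split the remaining norm along g
    return mu_perp + s * g, mu_perp - s * g
```

After neutralizing, a word's dot product with g is zero; after equalizing, the pair shares its off-bias component and sits at opposite, equal offsets along g.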
Applications
Feature representations for NLP models, semantic search, recommendation systems (item embeddings), drug discovery (molecular embeddings); static embeddings (Word2Vec, GloVe) largely replaced by contextual embeddings (BERT, GPT) but still useful as lightweight baselines.
Trade-offs
Static embeddings assign one vector per word regardless of context (“bank” as in river vs. finance); out-of-vocabulary words have no representation (subword methods such as fastText mitigate this); dimension is a hyperparameter (typically 100–300); GloVe and Word2Vec often give similar downstream performance.