Word Embeddings
Definition
Dense, low-dimensional real-valued vector representations of tokens where geometric proximity encodes semantic and syntactic similarity; learned from large corpora without explicit supervision.
Intuition
One-hot vectors have no notion of similarity; word embeddings capture that “king” and “queen” are more similar than “king” and “table”; vector arithmetic such as vec(king) − vec(man) + vec(woman) ≈ vec(queen) reveals that linear structure encodes semantic relationships.
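The analogy above can be made concrete with a toy sketch. The vectors below are hand-crafted for illustration (axis 0 loosely "gender", axis 1 loosely "royalty"); real embeddings are learned, not designed.

```python
import numpy as np

# Invented 2-d toy vectors; real embeddings have 100s of learned dimensions.
vecs = {
    "man":   np.array([ 1.0,  0.0]),
    "woman": np.array([-1.0,  0.0]),
    "king":  np.array([ 1.0,  1.0]),
    "queen": np.array([-1.0,  1.0]),
    "table": np.array([ 0.0, -1.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = vecs[b] - vecs[a] + vecs[c]
    return max((w for w in vecs if w not in (a, b, c)),
               key=lambda w: cosine(vecs[w], target))

result = analogy("man", "king", "woman")   # king - man + woman -> "queen"
```

With these vectors, "king" is closer to "queen" than to "table", and the analogy query resolves to "queen".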
Formal Description
Word2Vec: two tasks for training embeddings — (a) Skip-gram: predict context words given the center word; (b) CBOW: predict the center word from its context; skip-gram objective: maximize Σ_t Σ_{−m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t), where p(o | c) = exp(u_o^T v_c) / Σ_{w ∈ V} exp(u_w^T v_c); embedding lookup: v = E x with embedding matrix E ∈ R^{d×|V|}, where x is one-hot.
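The skip-gram softmax can be sketched directly from the formula; a minimal NumPy version, with toy sizes and random (untrained) embedding matrices assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                                # toy vocab size and embedding dim (assumed)
E_in = rng.normal(scale=0.1, size=(V, d))   # center-word ("input") embeddings v_c
E_out = rng.normal(scale=0.1, size=(V, d))  # context-word ("output") embeddings u_o

def skipgram_prob(center, context):
    """p(context | center) via the full softmax over the vocabulary."""
    scores = E_out @ E_in[center]           # u_w . v_c for every w in V
    scores -= scores.max()                  # subtract max for numerical stability
    p = np.exp(scores) / np.exp(scores).sum()
    return p[context]

probs = np.array([skipgram_prob(3, o) for o in range(V)])  # distribution over contexts
```

Note that each query scores the center vector against all |V| output vectors; this O(|V|) cost is exactly what negative sampling avoids.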
Negative sampling: approximate the full softmax over |V| words by sampling K “negative” words for each positive (center, context) pair; loss: −log σ(u_o^T v_c) − Σ_{k=1}^{K} E_{w_k ∼ P_n(w)} log σ(−u_{w_k}^T v_c); K ≈ 5–20 for small datasets, 2–5 for large ones; dramatically reduces per-pair computation from O(|V|) to O(K).
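The per-pair loss can be computed in a few lines. A sketch with assumed toy dimensions and random vectors standing in for the learned embeddings and sampled negatives:

```python
import numpy as np

rng = np.random.default_rng(1)
d, K = 4, 5                      # embedding dim and number of negatives (assumed)
v_c = rng.normal(size=d)         # center-word vector
u_o = rng.normal(size=d)         # true-context ("positive") output vector
U_neg = rng.normal(size=(K, d))  # K sampled negative output vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# loss = -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)
loss = -(np.log(sigmoid(u_o @ v_c)) + np.log(sigmoid(-U_neg @ v_c)).sum())
```

Only K + 1 dot products are needed per pair, versus |V| for the full softmax.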
GloVe: instead of local window predictions, factorize the global co-occurrence matrix X; objective: J = Σ_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)², where f is a weighting function that down-weights rare pairs and caps frequent ones; captures global statistics directly; typically faster to train on large corpora.
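The GloVe objective is a weighted least-squares fit and vectorizes cleanly. A sketch evaluating J for a toy random co-occurrence matrix and untrained parameters (all sizes assumed; f uses the common x_max = 100, α = 0.75 choice):

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 6, 3                                          # toy sizes (assumed)
X = rng.integers(1, 50, size=(V, V)).astype(float)   # toy co-occurrence counts
W = rng.normal(scale=0.1, size=(V, d))               # word vectors w_i
Wt = rng.normal(scale=0.1, size=(V, d))              # context vectors w~_j
b, bt = np.zeros(V), np.zeros(V)                     # bias terms b_i, b~_j

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function: (x / x_max)^alpha, capped at 1 for frequent pairs."""
    return np.minimum((x / x_max) ** alpha, 1.0)

# J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
err = W @ Wt.T + b[:, None] + bt[None, :] - np.log(X)
J = (f(X) * err ** 2).sum()
```

Training would minimize J by gradient descent on W, Wt, b, bt; only the objective is shown here.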
Debiasing: embeddings trained on real text encode human biases (gender, racial); mitigation steps: (1) identify the bias direction via PCA/SVD on difference vectors of gender-definitional pairs (e.g. he−she); (2) neutralize: project non-definitional words onto the orthogonal complement of the bias direction; (3) equalize: make definitional pairs equidistant from the bias direction; limitations: debiasing is incomplete and can introduce other artifacts.
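The neutralize and equalize steps reduce to simple projections. A simplified sketch in the style of hard debiasing, assuming a unit-norm bias direction g and (for equalize) approximately unit-norm word vectors; the vectors here are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
g = rng.normal(size=d)
g /= np.linalg.norm(g)          # assumed unit-norm bias direction

def neutralize(w, g):
    """Remove the component of w along the bias direction g."""
    return w - (w @ g) * g

def equalize(a, b, g):
    """Make the pair (a, b) differ only along g, equidistant from it."""
    mu_perp = neutralize((a + b) / 2, g)            # shared bias-free part
    s = np.sqrt(max(1.0 - mu_perp @ mu_perp, 0.0))  # split the remaining norm along g
    return mu_perp + s * g, mu_perp - s * g
```

After neutralizing, a word's dot product with g is zero; after equalizing, the pair shares its off-bias component and sits at opposite, equal offsets along g.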
Applications
Feature representations for NLP models, semantic search, recommendation systems (item embeddings), drug discovery (molecular embeddings); static embeddings (Word2Vec, GloVe) largely replaced by contextual embeddings (BERT, GPT) but still useful as lightweight baselines.
Trade-offs
Static embeddings assign one vector per word regardless of context (“bank” as in river vs. finance); out-of-vocabulary words have no representation (subword methods such as fastText mitigate this); dimension is a hyperparameter (typically 100–300); GloVe and Word2Vec often give similar downstream performance.