Recurrent Networks

Definition

Neural architectures that process sequences by maintaining a hidden state that is updated at each time step; GRUs and LSTMs are gated variants that address the vanishing gradient problem in plain RNNs.

Intuition

A plain RNN passes a single state vector forward through time, but gradients vanish over long sequences; gates in GRU/LSTM selectively remember or forget information, allowing gradients to flow across hundreds of time steps.

Formal Description

RNN: a_t = g(W_aa a_{t-1} + W_ax x_t + b_a), ŷ_t = g(W_ya a_t + b_y); the same weights are shared across all time steps; types: many-to-many (machine translation via seq2seq), many-to-one (sentiment), one-to-many (text generation).
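The recurrence above can be sketched in a few lines of NumPy; the dimensions and the ×0.1 initialization scale are illustrative assumptions, not part of the definition.

```python
import numpy as np

def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
    """One vanilla RNN step: a_t = tanh(W_aa a_{t-1} + W_ax x_t + b_a)."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)  # hidden-state update
    y_t = W_ya @ a_t + b_y                           # output (pre-softmax logits)
    return a_t, y_t

# Unroll over a toy sequence; the SAME weights are reused at every step.
rng = np.random.default_rng(0)
n_a, n_x, n_y = 4, 3, 2                     # illustrative sizes
W_aa = rng.standard_normal((n_a, n_a)) * 0.1
W_ax = rng.standard_normal((n_a, n_x)) * 0.1
W_ya = rng.standard_normal((n_y, n_a)) * 0.1
b_a, b_y = np.zeros(n_a), np.zeros(n_y)

a_t = np.zeros(n_a)                         # initial hidden state
for x_t in rng.standard_normal((5, n_x)):   # sequence of length 5
    a_t, y_t = rnn_step(a_t, x_t, W_aa, W_ax, W_ya, b_a, b_y)
```

Because the only path from step 1 to step T runs through repeated multiplications by W_aa inside tanh, gradients shrink as T grows, which is the vanishing-gradient problem the gated variants address.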

GRU: introduces update gate Γ_u = σ(W_u[c_{t-1}, x_t] + b_u) and relevance gate Γ_r = σ(W_r[c_{t-1}, x_t] + b_r); candidate: c̃_t = tanh(W_c[Γ_r ⊙ c_{t-1}, x_t] + b_c); update: c_t = Γ_u ⊙ c̃_t + (1 − Γ_u) ⊙ c_{t-1}; gate values near 0/1 allow exact copying of state across many steps (bypasses vanishing gradient).
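A minimal sketch of one GRU step, including a demonstration of the copying behaviour: driving the update gate toward 0 with a large negative bias (the −50 here is an artificial choice for the demo) makes c_t pass through numerically unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, W_u, W_r, W_c, b_u, b_r, b_c):
    """One GRU step; gate names follow the text (update gate, relevance gate)."""
    z = np.concatenate([c_prev, x_t])
    gamma_u = sigmoid(W_u @ z + b_u)                     # update gate in (0, 1)
    gamma_r = sigmoid(W_r @ z + b_r)                     # relevance gate in (0, 1)
    zr = np.concatenate([gamma_r * c_prev, x_t])
    c_tilde = np.tanh(W_c @ zr + b_c)                    # candidate state
    return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev  # interpolate old/new

# Demo: a saturated update gate (Γ_u ≈ 0) copies the state exactly.
rng = np.random.default_rng(0)
n_c, n_x = 3, 2                                  # illustrative sizes
W_u, W_r, W_c = (rng.standard_normal((n_c, n_c + n_x)) for _ in range(3))
b_u = np.full(n_c, -50.0)                        # force Γ_u ≈ 0 -> copy state
b_r, b_c = np.zeros(n_c), np.zeros(n_c)
c_prev, x_t = rng.standard_normal(n_c), rng.standard_normal(n_x)
c_t = gru_step(c_prev, x_t, W_u, W_r, W_c, b_u, b_r, b_c)
```

Since c_t ≈ c_prev whenever Γ_u ≈ 0, the gradient with respect to c_prev is ≈ 1 along that path, which is exactly the bypass of the vanishing gradient mentioned above.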

LSTM: three gates — forget Γ_f = σ(W_f[a_{t-1}, x_t] + b_f), update Γ_u = σ(W_u[a_{t-1}, x_t] + b_u), output Γ_o = σ(W_o[a_{t-1}, x_t] + b_o); separate cell state c_t and hidden state a_t; c_t = Γ_u ⊙ c̃_t + Γ_f ⊙ c_{t-1} with candidate c̃_t = tanh(W_c[a_{t-1}, x_t] + b_c); a_t = Γ_o ⊙ tanh(c_t); LSTM has more parameters than GRU but historically stronger on many tasks.
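The same kind of sketch for one LSTM step. The saturated biases in the demo (+50 forget, −50 update) are artificial values chosen to show the cell state carrying through unchanged; real models learn these gates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, W, b):
    """One LSTM step; W and b are dicts keyed by gate name ('f', 'u', 'o', 'c')."""
    z = np.concatenate([a_prev, x_t])
    gamma_f = sigmoid(W['f'] @ z + b['f'])       # forget gate
    gamma_u = sigmoid(W['u'] @ z + b['u'])       # update (input) gate
    gamma_o = sigmoid(W['o'] @ z + b['o'])       # output gate
    c_tilde = np.tanh(W['c'] @ z + b['c'])       # candidate cell state
    c_t = gamma_u * c_tilde + gamma_f * c_prev   # separate cell state...
    a_t = gamma_o * np.tanh(c_t)                 # ...feeds the hidden state
    return a_t, c_t

# Demo: forget gate saturated open, update gate saturated shut
# -> the cell state is carried through this step unchanged.
rng = np.random.default_rng(0)
n_a, n_x = 3, 2                                  # illustrative sizes
W = {k: rng.standard_normal((n_a, n_a + n_x)) for k in 'fuoc'}
b = {'f': np.full(n_a, 50.0), 'u': np.full(n_a, -50.0),
     'o': np.zeros(n_a), 'c': np.zeros(n_a)}
a_prev, c_prev = rng.standard_normal(n_a), rng.standard_normal(n_a)
x_t = rng.standard_normal(n_x)
a_t, c_t = lstm_step(a_prev, c_prev, x_t, W, b)
```

Note the structural difference from the GRU: the forget and update gates are independent here (Γ_f and Γ_u), whereas the GRU ties them together as Γ_u and 1 − Γ_u.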

Applications

Time series forecasting, language modeling (token-level), speech recognition (before Transformers), NER, part-of-speech tagging; largely superseded by Transformers for NLP but still used in streaming/low-latency applications.

Trade-offs

Sequential processing limits parallelization (Transformer attention is fully parallel across positions); LSTMs have more parameters and are typically slower to train than GRUs; both still suffer from vanishing gradients at very long contexts (hundreds to thousands of steps); Transformers pay O(T²) attention cost but have O(1) maximum path length between positions vs. O(T) for RNNs.