Transfer Learning
Definition
Leveraging representations or parameters learned on a source task/domain to improve learning on a related target task/domain, typically by fine-tuning a pretrained model.
Intuition
Lower-level features (edges, textures, syntax) are shared across many tasks, so relearning them from scratch is wasteful. The more similar the source and target, the more transferable the learned representations.
Formal Description
Full fine-tuning: initialize with the pretrained weights, then train all layers on the target task with a small learning rate.
Head-only fine-tuning (feature extraction): freeze pretrained layers, train only a new output head; useful when target data is small.
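The two strategies can be sketched in PyTorch. This is a minimal illustration using a toy backbone in place of a real pretrained model (e.g., a torchvision ResNet); the layer sizes and names are purely illustrative.

```python
import torch.nn as nn

# Toy stand-in for a pretrained backbone (illustrative sizes only).
backbone = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
)
head = nn.Linear(64, 10)  # new output head for the target task

# Head-only fine-tuning (feature extraction): freeze every backbone
# parameter so gradients flow only into the new head.
for p in backbone.parameters():
    p.requires_grad = False

model = nn.Sequential(backbone, head)

# Only the head's weight and bias remain trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
```

For full fine-tuning, simply skip the freezing loop and train all parameters with a small learning rate.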
When to use transfer:
- Source and target have overlapping low-level features
- Source dataset is much larger than target
- Target data is scarce
- Compute budget is limited
Pre-training → fine-tuning workflow:
- Train on large source dataset
- Replace/reinitialize output layer for target task
- Optionally freeze early layers
- Fine-tune with lower learning rate than initial training
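The workflow above can be sketched with per-parameter-group learning rates, a common way to fine-tune pretrained layers more gently than a freshly initialized head. The 10x gap between the two rates is a rough heuristic, not a fixed rule, and the model here is a toy stand-in for a real pretrained network.

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # "pretrained" layers
head = nn.Linear(64, 5)  # output layer reinitialized for the target task
model = nn.Sequential(backbone, head)

# Discriminative learning rates: a small LR for the pretrained backbone,
# a larger LR for the new head.
optimizer = torch.optim.SGD([
    {"params": backbone.parameters(), "lr": 1e-4},
    {"params": head.parameters(), "lr": 1e-3},
])
```

Freezing early layers (as in head-only fine-tuning) is the limiting case of this scheme, where the backbone learning rate is set to zero.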
Negative transfer: if source and target distributions are too different, pretrained features may hurt rather than help.
Applications
- Vision: ImageNet → medical imaging, autonomous driving
- NLP: BERT/GPT → downstream classification, QA, summarization
- Speech: large ASR model → specific accents or domains
Trade-offs
- Requires compatible architectures between source and target
- Fine-tuning all layers needs non-trivial target data to avoid catastrophic forgetting
- Hyperparameter sensitivity (learning rate, number of layers to freeze)
- Pretrained models can encode biases from source data