05 — Multimodal Models
Architectures that jointly process and align information from multiple modalities: primarily images and text, but also audio, video, and sensor data.
Notes
- Multimodal Models — vision-language models, cross-modal alignment, contrastive pre-training (CLIP), fusion strategies
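
To make the contrastive pre-training item above concrete, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss. It assumes a batch of N matched image/text embedding pairs that some encoders have already produced. The function names (`l2_normalize`, `log_softmax`, `clip_style_loss`) and the fixed `temperature=0.07` are illustrative choices, not taken from these notes; CLIP itself learns the temperature during training.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N image/text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair;
    every other row in the batch serves as a negative.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_emb, dtype=np.float64))
    logits = img @ txt.T / temperature          # shape (N, N)
    idx = np.arange(logits.shape[0])            # the correct match lies on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    aligned_text = images.copy()                      # perfectly aligned pairs
    shuffled_text = np.roll(images, shift=1, axis=0)  # every pair mismatched
    print("aligned loss: ", clip_style_loss(images, aligned_text))
    print("shuffled loss:", clip_style_loss(images, shuffled_text))
```

The loss is low when matched pairs are the most similar entries in their row and column, and high when they are not, which is the alignment pressure that contrastive pre-training applies to the two encoders.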