May 31, 2026

05 — Multimodal Models

Architectures that jointly process and align information from multiple modalities — images and text being the primary pair, but extending to audio, video, and sensor data.

Notes

Multimodal Models — vision-language models, cross-modal alignment, contrastive pre-training (CLIP), fusion strategies
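
As a rough illustration of the contrastive pre-training idea named above, the following is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss. It assumes a batch of paired image and text embeddings that the encoders have already produced. The function names, the batch layout, and the fixed temperature of 0.07 are illustrative assumptions, not details taken from these notes.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matching
    pair; every other row in the batch serves as a negative.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # shape: (batch, batch)
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: the same requirement, applied column-wise.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Hypothetical data: text embeddings are noisy copies of the image embeddings.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))
text_emb = image_emb + 0.01 * rng.normal(size=(8, 16))

aligned = contrastive_loss(image_emb, text_emb)
# Rolling the text batch by one breaks every pairing.
misaligned = contrastive_loss(image_emb, np.roll(text_emb, 1, axis=0))
```

In this sketch, correctly paired batches should score a lower loss than batches whose pairings have been shuffled, which is the signal that pulls matching image and text embeddings together during training.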

Links

← 04 — Transformers → 05 — Time Series

Convolutional Networks — vision encoder
Transformers — text and cross-modal encoder
