Part 2 - Week 3
Mrinmaya Sachan
Multimodal Models
Motivation
- Language-only limitations:
- Even with massive corpora, purely textual pretraining may be insufficient to build a robust world model.
- Many real-world concepts are grounded in perception and action; language alone may not capture these.
- Bender & Koller, 2020 (“Climbing towards NLU”) argue that meaning cannot be learned from linguistic form alone; without grounding, models risk becoming “stochastic parrots” (Bender et al., 2021).
- Practical drivers:
- Internet content is inherently multimodal (images, audio, video, text).
- Non-text modalities (vision, audio) have higher information bandwidth than text.
- Multimodal prompts enable richer conditioning for downstream tasks.
Vision–Language Models (VLMs)
Task Categories
- Image–Text tasks:
- Visual Question Answering (VQA) and visual reasoning.
- Image captioning.
- Image–text retrieval (both directions).
- Visual grounding (localizing objects in images from text queries).
- Text-to-image generation.
- Computer Vision tasks with language supervision:
- Object detection, segmentation, classification — improved via language-aligned supervision (e.g., CLIP).
- Video–Text tasks:
- Video captioning, video QA, temporal grounding — require modeling temporal dynamics.
Core Architectural Components
- Text encoder:
- Produces \(N\) textual feature vectors (one per token).
- Often a Transformer encoder (BERT, RoBERTa, etc.).
- Image encoder:
- Produces \(M\) visual feature vectors (e.g., one per patch or region proposal).
- Architectures:
- Object detection–based: R-CNN, Faster R-CNN — extract region features + bounding boxes.
- CNN-based: ResNet, EfficientNet — global or patch-level features.
- Vision Transformers (ViT, Dosovitskiy et al., 2020):
- Split the image into fixed-size patches → flatten each patch → linear projection → add positional encodings → Transformer encoder (see the code sketch after this list).
- Treats patches as “visual tokens.”
- Multimodal fusion module:
- Learns cross-modal interactions.
- Decoder:
- Generates output text (for generative tasks) or classification logits (for discriminative tasks).
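A minimal PyTorch sketch of the ViT patch-embedding step described above; the class name and hyperparameters are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each patch to an embedding vector."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to "flatten each patch + linear projection".
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, images):                    # images: (B, 3, H, W)
        x = self.proj(images)                     # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)          # (B, M, D) sequence of "visual tokens"
        return x + self.pos_embed                 # add learned positional encodings
```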
Multimodal Fusion Techniques
- Merged attention (single-stream):
- Concatenate text and image tokens → feed into a single Transformer stack.
- All self-attention layers operate jointly over both modalities.
- Example: VisualBERT (Li et al., 2019).
- Co-attention (dual-stream):
- Separate Transformer stacks for each modality.
- Cross-attention layers exchange information between the two streams (both fusion styles are sketched in code after this list).
- Examples:
- LXMERT (Tan & Bansal, 2019).
- Flamingo (Alayrac et al., DeepMind 2022): interleaves gated cross-attention layers into a frozen pretrained LM so that language layers can attend to visual features.
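A schematic PyTorch sketch contrasting the two fusion styles; module and tensor names are illustrative, and real models stack many such layers with modality-specific parameters.

```python
import torch
import torch.nn as nn

D = 768
text_tokens  = torch.randn(2, 32, D)   # (batch, N text tokens, dim)
image_tokens = torch.randn(2, 49, D)   # (batch, M visual tokens, dim)

# Merged attention (single-stream): concatenate, then joint self-attention over both modalities.
single_stream = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True), num_layers=2)
fused = single_stream(torch.cat([text_tokens, image_tokens], dim=1))    # (2, N + M, D)

# Co-attention (dual-stream): each stream keeps its own layers and exchanges information
# via cross-attention (a single shared module is used here purely for brevity).
cross_attn = nn.MultiheadAttention(embed_dim=D, num_heads=12, batch_first=True)
text_to_image, _ = cross_attn(query=text_tokens, key=image_tokens, value=image_tokens)
image_to_text, _ = cross_attn(query=image_tokens, key=text_tokens, value=text_tokens)
```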
Pretraining Strategies
Alignment Objectives
Language modeling with visual context:
- Predict next token (causal LM) or masked tokens (MLM) given both text and image features.
- Extends BERT/GPT-style objectives to multimodal inputs (a minimal sketch of the causal variant follows).
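A minimal sketch of the causal-LM variant, in which projected visual features are prepended as a prefix to the text embeddings. It assumes a Hugging Face-style causal LM that accepts `inputs_embeds`; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def visual_lm_loss(lm, visual_tokens, text_embeds, text_ids):
    """visual_tokens: (B, M, D) image features projected to the LM width,
    text_embeds: (B, N, D) text token embeddings, text_ids: (B, N) target token ids."""
    inputs = torch.cat([visual_tokens, text_embeds], dim=1)   # (B, M + N, D) visual prefix + text
    logits = lm(inputs_embeds=inputs).logits                  # causal LM over the full sequence
    M = visual_tokens.size(1)
    text_logits = logits[:, M - 1:-1, :]                      # positions whose next token is a text token
    return F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)), text_ids.reshape(-1))
```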
Image–Text Matching (ITM):
- Binary classification: does this image match this caption?
- Use the [CLS] token representation → binary classifier (a minimal head is sketched below).
- Often trained with hard negatives (mismatched but semantically similar pairs).
Image–Text Contrastive Learning (ITC):
- Learn a joint embedding space for images and text.
- Given \(N\) image–caption pairs in a batch, treat the \(N\) correct pairs as positives and the \(N^2 - N\) others as negatives.
- Optimize a symmetric InfoNCE loss; the image→text direction is \[ \mathcal{L}_{i2t} = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^N \exp(\text{sim}(v_i, t_j)/\tau)}, \] where \(\text{sim}\) is cosine similarity and \(\tau\) is a temperature parameter; the text→image term swaps the roles of images and captions, and the two terms are averaged (see the code sketch below).
- CLIP (Radford et al., OpenAI 2021) is the canonical example.
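A compact sketch of the symmetric loss above, assuming the batch of image and text embeddings has already been projected and L2-normalized.

```python
import torch
import torch.nn.functional as F

def itc_loss(image_embeds, text_embeds, temperature=0.07):
    """image_embeds, text_embeds: (N, D), L2-normalized, so the dot product is cosine similarity."""
    logits = image_embeds @ text_embeds.t() / temperature    # (N, N) scaled similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)              # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)          # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```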
Masked Image Modeling (MIM):
- Mask random image patches and predict them from remaining patches + text context.
- Analogous to MLM in language.
- Implementation depends on the visual encoder (e.g., ViT with MAE-style reconstruction); a patch-masking sketch follows.
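A minimal sketch of random patch masking in the MAE style; the mask ratio and names are illustrative, and the decoder that reconstructs masked patches from visible patches plus text context is omitted.

```python
import torch

def mask_patches(patch_tokens, mask_ratio=0.75):
    """patch_tokens: (B, M, D). Returns the visible tokens and the indices of masked patches."""
    B, M, D = patch_tokens.shape
    num_masked = int(M * mask_ratio)
    scores = torch.rand(B, M, device=patch_tokens.device)    # one random score per patch
    order = scores.argsort(dim=1)
    masked_idx, keep_idx = order[:, :num_masked], order[:, num_masked:]
    visible = torch.gather(patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, masked_idx   # a decoder predicts the masked patches from `visible` (+ text)
```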
Notable Models
- VisualBERT (Li et al., 2019):
- Single-stream Transformer.
- Pretrained with MLM + ITM.
- LXMERT (Tan & Bansal, 2019):
- Dual-stream with cross-attention.
- Pretrained with MLM, ITM, and visual feature prediction.
- CLIP (Radford et al., 2021):
- Dual encoders (image + text) trained with ITC.
- Projects both modalities into a shared embedding space.
- Enables zero-shot transfer to many vision tasks via text prompt engineering (see the sketch after this list).
- ALIGN (Jia et al., 2021):
- Similar to CLIP but trained on larger noisy web data.
- Flamingo (Alayrac et al., 2022):
- Frozen pretrained LM (e.g., Chinchilla) + vision encoder + cross-attention layers.
- Supports interleaved image–text sequences.
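A sketch of zero-shot image classification with CLIP's dual encoders, using the Hugging Face `transformers` CLIP checkpoint; the label set, prompt template, and image path are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]   # simple prompt engineering
image = Image.open("example.jpg")                         # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)          # image-text similarities -> class probabilities
print(dict(zip(labels, probs[0].tolist())))
```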
Benefits of Multimodal Pretraining
- Improves performance on multimodal tasks and can transfer to unimodal (pure text) tasks.
- Provides grounding for language models, potentially improving factuality and reasoning about the physical world.
- Enables zero-shot and few-shot capabilities in vision tasks via text prompts.