Week 9
Diffusion Models
Definition: Diffusion models are explicit generative models that gradually add noise to data and learn to reverse this process to generate new samples.
Core Processes:
- Forward Process: Systematically adds Gaussian noise to the original data over \(T\) steps.
- Reverse Process: Learns to denoise step-by-step, generating new samples from noise.
Training Objective: The model is trained to predict the noise added at each step, not the denoised data directly.
Advantages Over Other Generative Approaches
- Normalizing Flows: Require invertibility and tractable Jacobians; diffusion models do not have this constraint.
- Variational Autoencoders (VAEs) and Autoregressive Models: Tend to produce lower-quality samples than diffusion models.
- Generative Adversarial Networks (GANs): GANs are difficult to train and prone to mode collapse; diffusion models offer more stable training and higher sample diversity.
- Hybridization: Diffusion models can be combined with GANs to enhance realism while maintaining stability.
Training Methodology
- Self-supervised Paradigm: The process of adding controlled noise acts as a natural supervision signal.
- Noise Scheduling:
- Progressive Approach: Start with minimal noise and gradually increase it.
- Schedulers: Linear and cosine schedules are common. A linear schedule can drive the image to near-pure noise too early, leaving little learnable signal in the later steps; the cosine schedule destroys information more gradually (see the schedule sketch after this list).
- Computational Characteristics:
- Generation requires multiple passes (one per denoising step), similar to autoregressive models.
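A minimal sketch of the two schedules mentioned above, assuming a PyTorch setup; the constants (the \(\beta\) range and the offset \(s\)) are commonly used defaults and should be treated as assumptions, not required values:

```python
import math
import torch

def linear_beta_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> torch.Tensor:
    """Linearly spaced noise levels beta_1..beta_T."""
    return torch.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T: int, s: float = 0.008) -> torch.Tensor:
    """Cosine schedule: define alpha_bar_t directly, then recover beta_t from its ratios."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    alpha_bar = torch.cos((steps / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()

betas = cosine_beta_schedule(T=1000)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)   # \bar{alpha}_t, used throughout below
```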
Architecture
- Model Core: U-Net architecture is used, shared across all timesteps.
- Time Representation: The timestep is encoded (e.g., sinusoidal encoding) and input to the model.
- Conditioning Approaches:
- Classifier Guidance:
- Adjusts the noise prediction using the gradient of a classifier trained on noisy images.
- Can be added post-training.
- Limitations:
- Requires a classifier trained for all noise levels.
- May introduce additional noise.
- Limited to predefined class sets.
- Direct Conditioning:
- Embeds class labels or text directly into the model.
- More flexible and effective than classifier guidance.
- Must be implemented during training.
- Classifier-Free Guidance (CFG):
- During training, the conditioning signal (e.g., a text embedding) is provided only a fraction of the time (e.g., 80%); otherwise it is replaced by a null embedding.
- Model learns both conditional and unconditional denoising.
- Increases output diversity.
- For text conditioning, the embedding typically comes from a CLIP text encoder (Contrastive Language-Image Pre-training).
- A hyperparameter controls the guidance strength.
- Drawback: Slower generation, since both a conditional and an unconditional prediction are needed at each step (see the sketch after this list).
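A minimal sketch of classifier-free guidance at sampling time, assuming a hypothetical noise-prediction network `model(x_t, t, cond)` and a `null_cond` embedding that stands in for the dropped conditioning:

```python
import torch

@torch.no_grad()
def cfg_noise_prediction(model, x_t, t, cond, null_cond, guidance_scale: float = 7.5):
    """Combine conditional and unconditional predictions (two forward passes per step)."""
    eps_cond = model(x_t, t, cond)          # conditional prediction
    eps_uncond = model(x_t, t, null_cond)   # unconditional prediction
    # Push the prediction away from the unconditional direction; guidance_scale
    # is the hyperparameter controlling guidance strength.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```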
Efficiency Enhancements
- Latent Diffusion Models (LDMs):
- Diffusion is performed in a compressed latent space, not pixel space.
- Pipeline: Encoder \(\rightarrow\) Diffusion in latent space \(\rightarrow\) Decoder.
- Focuses on semantic content, not pixel-level details.
- Key Idea: A Variational Autoencoder (VAE) is pre-trained to map images to a low-dimensional latent space. The VAE is frozen during diffusion model training.
- Benefits: Operating in the lower-dimensional latent space substantially reduces the computational cost of training and sampling (see the sketch after this list).
- Applications: Powers state-of-the-art systems such as Stable Diffusion, DALL-E, and Sora. [Verification Needed: Sora’s exact architecture.]
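A hedged sketch of the latent-diffusion pipeline described above; `vae` (with a `decode` method) and `latent_sampler` are hypothetical stand-ins, not a specific library API:

```python
import torch

@torch.no_grad()
def generate_with_ldm(vae, latent_sampler, latent_shape, cond=None):
    """Encoder/decoder stay frozen; diffusion runs entirely in latent space."""
    z_T = torch.randn(latent_shape)     # start from Gaussian noise in latent space
    z_0 = latent_sampler(z_T, cond)     # reverse diffusion over latents (e.g., the DDPM loop below)
    return vae.decode(z_0)              # map the denoised latent back to pixel space
```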
Advanced Applications
- Image-Conditioned Generation:
- Conditioning on images (e.g., sketches, structures).
- ControlNet:
- Feature modulation approach.
- Base diffusion model is frozen.
- A trainable parallel adaptation branch is attached to the frozen model through “zero convolution” layers (1x1 convolutions initialized to zero; see the sketch after this list).
- Zero initialization means the new branch contributes nothing at the start of training, so adaptation is introduced gradually without disrupting the base model.
- Photo Relighting: Modifies lighting while preserving content.
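A small sketch of a ControlNet-style “zero convolution”: a 1x1 convolution whose weights and bias start at zero, so the adaptation branch initially contributes nothing and the frozen base model is undisturbed at the start of training:

```python
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero (its output is initially all zeros)."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv
```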
Mathematical Foundation
Forward Diffusion Process
- Modeled as a Markov process with \(T\) steps (typically \(T \approx 1000\)).
- At each step \(t\), Gaussian noise is added: \[ q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t\mathbf{I}) \] where \(\beta_t\) is the noise schedule parameter.
- Define \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i\).
- Using the reparameterization trick, sample \(x_t\) given \(x_0\): \[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \] \[ q(x_t | x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) \mathbf{I}) \]
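A sketch of sampling \(x_t\) directly from \(x_0\) via the closed-form expression above, assuming image tensors and the `alpha_bar` vector from the schedule sketch earlier:

```python
import torch

def q_sample(x0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor):
    """Forward process in one shot: return noisy x_t and the noise eps used to make it."""
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast \bar{alpha}_t over the batch
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    return x_t, eps
```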
Reverse Diffusion Process
- The goal is to learn \(p_\theta(x_{t-1}|x_t)\) to approximate the true reverse process.
- By Bayes’ rule: \[ q(x_{t-1}|x_t,x_0) = \frac{q(x_t|x_{t-1},x_0)q(x_{t-1}|x_0)}{q(x_t|x_0)} \]
- All terms are Gaussian, so \(q(x_{t-1}|x_t,x_0)\) is Gaussian with:
- Mean: \[ \tilde{\mu}_t(x_t,x_0) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 \]
- Variance: \[ \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t \]
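A direct transcription of these two formulas into code, assuming the schedule tensors defined earlier and zero-based indexing (index \(i\) corresponds to timestep \(t = i + 1\)):

```python
import torch

def posterior_mean_variance(x_t, x0, i: int, betas, alphas, alpha_bar):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for timestep t = i + 1."""
    ab_t = alpha_bar[i]
    ab_prev = alpha_bar[i - 1] if i > 0 else torch.tensor(1.0)   # \bar{alpha}_0 := 1
    mean = (alphas[i].sqrt() * (1 - ab_prev) / (1 - ab_t)) * x_t \
         + (ab_prev.sqrt() * betas[i] / (1 - ab_t)) * x0
    var = (1 - ab_prev) / (1 - ab_t) * betas[i]                  # \tilde{beta}_t
    return mean, var
```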
Training Objective
Model \(p_\theta(x_{t-1}|x_t)\) as a Gaussian with:
- Fixed variance \(\tilde{\beta}_t\) (from the forward process).
- Learned mean \(\mu_\theta(x_t,t)\) (predicted by a neural network).
ELBO Derivation:
- Maximize the log-likelihood \(\log p_\theta(x_0)\) via the Evidence Lower Bound (ELBO): \[ \log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] = \mathcal{L}_\text{ELBO} \]
- Expand the joint distributions: \[ p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t) \]
- Rewrite the forward process using Bayes’ rule:
- Start with the chain rule factorization: \[ q(x_{1:T}|x_0) = q(x_T|x_0) \cdot q(x_{T-1}|x_T, x_0) \cdot q(x_{T-2}|x_{T-1}, x_T, x_0) \cdots q(x_1|x_2, \ldots, x_T, x_0) \]
- Apply the Markov property: given \(x_{t+1}\) and \(x_0\), \(x_t\) is independent of future states: \[ q(x_t|x_{t+1}, x_{t+2}, \ldots, x_T, x_0) = q(x_t|x_{t+1}, x_0) \]
- This gives us the key factorization: \[ q(x_{1:T}|x_0) = q(x_T|x_0) \prod_{t=1}^{T-1} q(x_t|x_{t+1}, x_0) \]
- Substitute and rearrange: \[ \mathcal{L}_\text{ELBO} = \mathbb{E}_q\left[\log\frac{p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)}{q(x_T|x_0) \prod_{t=1}^{T-1} q(x_t|x_{t+1}, x_0)}\right] \] \[ = \mathbb{E}_q\left[\log\frac{p(x_T)}{q(x_T|x_0)} + \log p_\theta(x_0|x_1) + \sum_{t=1}^{T-1}\log\frac{p_\theta(x_t|x_{t+1})}{q(x_t|x_{t+1}, x_0)}\right] \]
- Decomposed into three terms: \[
\mathcal{L}_\text{ELBO} = \mathcal{L}_T + \mathcal{L}_0 + \mathcal{L}_{1:T-1}
\]
- Prior Matching Term (\(\mathcal{L}_T\)): \[
\mathcal{L}_T = \mathbb{E}_q\left[\log \frac{p(x_T)}{q(x_T|x_0)}\right] = -D_\text{KL}(q(x_T|x_0) \| p(x_T))
\]
- For large \(T\), \(q(x_T|x_0) \approx \mathcal{N}(0,\mathbf{I})\) and \(p(x_T) = \mathcal{N}(0,\mathbf{I})\), so this term is negligible.
- Reconstruction Term (\(\mathcal{L}_0\)): \[
\mathcal{L}_0 = \mathbb{E}_{q(x_1|x_0)}[\log p_\theta(x_0|x_1)]
\]
- Measures how well the final denoising step reconstructs the original data.
- Denoising Matching Terms (\(\mathcal{L}_{1:T-1}\)): \[
\mathcal{L}_{1:T-1} = \sum_{t=1}^{T-1}\mathbb{E}_q\left[\log \frac{p_\theta(x_t|x_{t+1})}{q(x_t|x_{t+1}, x_0)}\right] = -\sum_{t=1}^{T-1}D_\text{KL}(q(x_t|x_{t+1}, x_0) \| p_\theta(x_t|x_{t+1}))
\]
- Measures how well the learned reverse process matches the true denoising transitions.
Simplified Training Objective:
- If \(p_\theta(x_{t-1}|x_t)\) is Gaussian with the same variance as \(q(x_{t-1}|x_t,x_0)\), the KL divergence in \(\mathcal{L}_{1:T-1}\) reduces to the squared error between means.
- The training loss becomes: \[ \mathcal{L} = \mathbb{E}_{t,x_0,\epsilon} \left[ \left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\right)\right\|^2 \right] \]
- \(\epsilon_\theta(x_t, t)\) is the network’s prediction of the noise component.
Training Algorithm
\[ \begin{aligned} &\text{Algorithm: Diffusion Model Training} \\ &\text{while not converged do} \\ &\quad \mathbf{x}_0 \sim q(\mathbf{x}_0) \\ &\quad t \sim \mathcal{U}(\{1, \ldots, T\}) \\ &\quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ &\quad \mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \epsilon \\ &\quad \text{Take gradient step on } \nabla_{\Theta}\left\|\epsilon-\epsilon_{\Theta}\left(\mathbf{x}_t, t\right)\right\|^2 \\ &\text{end while} \end{aligned} \]
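A minimal PyTorch transcription of this training loop, assuming a hypothetical noise-prediction network `model(x_t, t)` (e.g., a time-conditioned U-Net) and image-shaped batches:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0, alpha_bar, T: int):
    """One gradient step of the algorithm above (zero-based timestep index)."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)   # t ~ Uniform({1,...,T})
    eps = torch.randn_like(x0)                                   # eps ~ N(0, I)
    ab = alpha_bar[t].view(-1, 1, 1, 1)                          # broadcast over (B, C, H, W)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps               # closed-form forward process
    loss = F.mse_loss(model(x_t, t), eps)                        # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```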
Sampling Algorithm
- Start from \(x_T \sim \mathcal{N}(0, \mathbf{I})\).
- For \(t = T\) down to \(1\):
- Predict noise: \(\hat{\epsilon} = \epsilon_\theta(x_t, t)\).
- Compute mean: \[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\hat{\epsilon}\right) \]
- If \(t > 1\), sample: \[ x_{t-1} = \mu_\theta(x_t, t) + \sqrt{\tilde{\beta}_t}z, \quad z \sim \mathcal{N}(0, \mathbf{I}) \]
- Else, set \(x_0 = \mu_\theta(x_1, 1)\).
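A sketch of this ancestral sampling loop (DDPM sampling), with zero-based indexing so that index \(i\) corresponds to timestep \(t = i + 1\):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas, alphas, alpha_bar):
    """Generate a sample by iteratively denoising pure Gaussian noise."""
    x = torch.randn(shape)                                       # x_T ~ N(0, I)
    for i in reversed(range(len(betas))):                        # t = T, ..., 1
        t = torch.full((shape[0],), i, dtype=torch.long)
        eps_hat = model(x, t)                                    # predict the noise
        mean = (x - betas[i] / (1 - alpha_bar[i]).sqrt() * eps_hat) / alphas[i].sqrt()
        if i > 0:
            var = (1 - alpha_bar[i - 1]) / (1 - alpha_bar[i]) * betas[i]   # \tilde{beta}_t
            x = mean + var.sqrt() * torch.randn_like(x)
        else:
            x = mean                                             # final step: no added noise
    return x
```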
Additional Resources
- What are diffusion models?
- Tutorial on vision diffusion models
- Understanding diffusion models
Foundation Models
General Concept
- Train a large network to perform well on a wide range of tasks, often with minimal or no task-specific fine-tuning.
Historical Progression
- First Generation: ELMo, BERT, ERNIE
- The model is pre-trained, then a task-specific decoder (head) is added and trained for each downstream task.
- Second Generation: GPT-3
- Generalized model, fine-tuned for specific tasks.
- Third Generation: ChatGPT, DeepSeek
- Zero-shot and few-shot learning; models are instruction-tuned and can generalize to new tasks with little or no additional training.
Vision Foundation Models
- Tokenization: Images are split into patches, which are linearly projected and fed into a transformer.
- Vision Transformer (ViT, 2020):
- Linear projection of flattened patches.
- Uses a transformer encoder.
- Includes a learnable class embedding token (CLS token) to represent the whole image.
- Masked Autoencoder (MAE) Pretraining:
- Randomly masks patches and trains the model to reconstruct them.
- Enables joint prediction, depth estimation, and image inpainting. [Verification Needed: “joint prediction” is ambiguous.]
- SAM (Segment Anything Model, Meta):
- Promptable segmentation model.
- Requires large-scale labeled data, often refined with human-in-the-loop annotation.
- CLIP:
- Shared embedding space for image and text.
- Trained with contrastive learning on image-text pairs.
- Enables dot-product similarity for retrieval and zero-shot classification (a zero-shot classification sketch follows this list).
- DINO:
- Self-distillation without labels.
- The student matches the output of a teacher whose weights are an exponential moving average (EMA) of the student's.
- The teacher sees the whole image (global views); the student sees crops.
- The teacher receives no gradients; it is updated only through the EMA of the student, which stabilizes training.
- Enables extraction of general features with zero supervision.
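A hedged sketch of CLIP-style zero-shot classification via dot-product similarity in the shared embedding space; `image_encoder`, `text_encoder`, and `tokenize` are hypothetical stand-ins for whichever CLIP implementation is used:

```python
import torch

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    """Rank candidate classes by similarity between image and prompt embeddings."""
    prompts = [f"a photo of a {name}" for name in class_names]
    img_emb = image_encoder(image)                        # shape (1, d)
    txt_emb = text_encoder(tokenize(prompts))             # shape (num_classes, d)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = img_emb @ txt_emb.T                          # dot-product (cosine) similarity
    return class_names[scores.argmax().item()]
```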
Multimodal Models
- 4M (EPFL) - Massively Multimodal Masked Modeling:
- Any-to-any model: can handle image-to-text, normals, depth, 3D, etc.
- All modalities are tokenized, randomly masked, and the model is trained to predict all modalities.
- Requires supervision (labels or targets) for each modality.
- Pseudo-labeling can be used, leveraging state-of-the-art models for each task.
- DreamBooth:
- Customized text-to-image generation.
- Fine-tunes a diffusion model on a few examples of a specific subject, associating it with a unique token.
- Zero1to3:
- Fine-tunes a diffusion model with 3D data.
- From a single image, generates multiple views (novel view synthesis).