Week 9

Published

Wednesday, April 16, 2025

Diffusion Models

  • Definition: Diffusion models are explicit generative models that gradually add noise to data and learn to reverse this process to generate new samples.

  • Core Processes:

    • Forward Process: Systematically adds Gaussian noise to the original data over \(T\) steps.
    • Reverse Process: Learns to denoise step-by-step, generating new samples from noise.
  • Training Objective: The model is trained to predict the noise added at each step, not the denoised data directly.

  • Advantages Over Other Generative Approaches

    • Normalizing Flows: Require invertibility and tractable Jacobians; diffusion models do not have this constraint.
    • Variational Autoencoders (VAEs) and Autoregressive Models: Often produce lower-quality samples than diffusion models.
    • Generative Adversarial Networks (GANs): GANs are difficult to train and prone to mode collapse; diffusion models offer more stable training and higher sample diversity.
    • Hybridization: Diffusion models can be combined with GANs to enhance realism while maintaining stability.
  • Training Methodology

    • Self-supervised Paradigm: The process of adding controlled noise acts as a natural supervision signal.
    • Noise Scheduling:
      • Progressive Approach: Start with minimal noise and gradually increase it.
      • Schedulers: Linear and cosine schedules are common (see the sketch after this list). A linear schedule can turn the image into near-pure noise too quickly, so the later, high-noise steps carry little information and are hard to learn from; a cosine schedule degrades the signal more gradually.
    • Computational Characteristics:
      • Generation requires multiple passes (one per denoising step), similar to autoregressive models.
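
As a sketch of the two schedulers mentioned above, here is a minimal NumPy comparison of a linear and a cosine \(\beta_t\) schedule. The cosine form follows the common \(\bar{\alpha}_t \propto \cos^2\!\big(\tfrac{t/T+s}{1+s}\cdot\tfrac{\pi}{2}\big)\) parameterization; the exact constants are illustrative defaults, not prescribed by the notes.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly increasing noise levels beta_1..beta_T."""
    return np.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Cosine schedule: define alpha_bar(t) via a squared cosine, then recover the betas."""
    t = np.arange(T + 1) / T
    alpha_bar = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]   # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
    return np.clip(betas, 0.0, max_beta)         # avoid beta_t -> 1 at the very end

T = 1000
for name, betas in [("linear", linear_beta_schedule(T)), ("cosine", cosine_beta_schedule(T))]:
    alpha_bar = np.cumprod(1 - betas)
    # alpha_bar_t is the fraction of signal remaining at step t; the linear schedule
    # drives it toward zero much earlier than the cosine schedule.
    print(name, alpha_bar[T // 2], alpha_bar[-1])
```
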
  • Architecture

    • Model Core: U-Net architecture is used, shared across all timesteps.
    • Time Representation: The timestep is encoded (e.g., sinusoidal encoding) and input to the model.
    • Conditioning Approaches:
      • Classifier Guidance:
        • Adjusts noise prediction using a classifier trained on noisy images.
        • Can be added post-training.
        • Limitations:
          • Requires a classifier trained for all noise levels.
          • Classifier gradients on noisy inputs can themselves be noisy, degrading sample quality.
          • Limited to predefined class sets.
      • Direct Conditioning:
        • Embeds class labels or text directly into the model.
        • More flexible and effective than classifier guidance.
        • Must be implemented during training.
      • Classifier-Free Guidance (CFG):
        • Conditioning signal (e.g., text embedding) is applied only a fraction (e.g., 80%) of the time during training.
        • Model learns both conditional and unconditional denoising.
        • Increases output diversity.
        • Typically uses a CLIP embedding (Contrastive Language-Image Pre-training).
        • A guidance-scale hyperparameter controls how strongly the conditioning is enforced (see the sketch after this list).
        • Drawback: Slower generation, as both conditional and unconditional predictions are needed at each step.
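
To make the guidance-scale hyperparameter concrete, the sketch below shows how classifier-free guidance is typically combined at sampling time: the network is queried once with the conditioning embedding and once with the "null" embedding used when conditioning was dropped during training, and the two noise predictions are mixed. The function and argument names (`eps_model`, `null_cond`) are placeholders, not a specific library's API.

```python
import torch

def cfg_noise_prediction(eps_model, x_t, t, cond, null_cond, guidance_scale=7.5):
    """Classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    eps_cond = eps_model(x_t, t, cond)          # prediction with the conditioning signal
    eps_uncond = eps_model(x_t, t, null_cond)   # prediction with the "empty" conditioning
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Setting the scale to 1 recovers the purely conditional prediction; larger values push samples toward the conditioning at the cost of diversity. The two forward passes per step are exactly the slowdown noted above.
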
  • Efficiency Enhancements

    • Latent Diffusion Models (LDMs):
      • Diffusion is performed in a compressed latent space, not pixel space.
      • Pipeline: Encoder \(\rightarrow\) Diffusion in latent space \(\rightarrow\) Decoder.
      • Focuses on semantic content, not pixel-level details.
      • Key Idea: A Variational Autoencoder (VAE) is pre-trained to map images to a low-dimensional latent space. The VAE is frozen during diffusion model training.
      • Benefits: Operating in latent space reduces dimensionality and computational cost, improving efficiency.
      • Applications: Powers state-of-the-art systems such as Stable Diffusion; diffusion-based generation also underlies DALL-E and, reportedly, Sora (a minimal latent-space training sketch follows below).
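
As a minimal sketch of the frozen-VAE idea (the `vae.encode`/`vae.decode` and `unet(z_t, t)` interfaces are assumed placeholders, not a particular library's API), a latent-diffusion training step differs from pixel-space training only in that the image is first encoded; `alphas_bar` is a precomputed \((T,)\) tensor of \(\bar{\alpha}_t\) values.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def encode_to_latent(vae, x):
    """The VAE is frozen: no gradients flow into it during diffusion training."""
    return vae.encode(x)                          # e.g., a 256x256x3 image -> a 32x32x4 latent

def ldm_training_step(unet, vae, x0, alphas_bar, T):
    z0 = encode_to_latent(vae, x0)                # diffusion operates on latents, not pixels
    t = torch.randint(1, T + 1, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    ab = alphas_bar[t - 1].view(-1, 1, 1, 1)
    z_t = ab.sqrt() * z0 + (1 - ab).sqrt() * eps  # closed-form forward process in latent space
    return F.mse_loss(unet(z_t, t), eps)          # same noise-prediction loss as in pixel space

# At sampling time, run the reverse process entirely on latents,
# then decode once at the end: x0_hat = vae.decode(z0_hat).
```
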
  • Advanced Applications

    • Image-Conditioned Generation:
      • Conditioning on images (e.g., sketches, edge maps, depth maps, poses).
      • ControlNet:
        • Feature modulation approach.
        • Base diffusion model is frozen.
        • Parallel adaptation block with “zero convolution” layers (1x1 conv, zero-initialized).
        • Zero initialization ensures gradual adaptation without disrupting the base model (a minimal sketch follows this list).
      • Photo Relighting: Modifies lighting while preserving content.
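
To illustrate the zero-convolution idea (a simplified sketch, not ControlNet's full architecture, which copies the U-Net encoder and injects features at several resolutions): a zero-initialized 1x1 convolution added residually contributes exactly nothing at initialization, so training starts from the unmodified frozen base model.

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """1x1 convolution whose weights and bias start at exactly zero."""
    def __init__(self, in_channels, out_channels):
        super().__init__(in_channels, out_channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

class ControlBranch(nn.Module):
    """Trainable side branch added to a frozen base block."""
    def __init__(self, adapter, channels):
        super().__init__()
        self.adapter = adapter                     # trainable adaptation block (any nn.Module)
        self.zero_out = ZeroConv2d(channels, channels)

    def forward(self, base_features, control_input):
        # At initialization zero_out returns zeros, so base_features pass through unchanged;
        # the control signal blends in gradually as the zero conv's weights move off zero.
        return base_features + self.zero_out(self.adapter(control_input))
```
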
  • Mathematical Foundation

    • Forward Diffusion Process

      • Modeled as a Markov process with \(T\) steps (typically \(T \approx 1000\)).
      • At each step \(t\), Gaussian noise is added: \[ q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t\mathbf{I}) \] where \(\beta_t\) is the noise schedule parameter.
      • Define \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i\).
      • Using the reparameterization trick, sample \(x_t\) given \(x_0\): \[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \] \[ q(x_t | x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) \mathbf{I}) \]
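
The closed-form expression for \(q(x_t|x_0)\) is what makes training cheap: any timestep can be sampled in one shot, without simulating the chain. A minimal sketch (shapes assume image batches; `alphas_bar` holds the precomputed \(\bar{\alpha}_1,\ldots,\bar{\alpha}_T\)):

```python
import torch

def q_sample(x0, t, alphas_bar):
    """Draw x_t ~ q(x_t | x_0) directly via the reparameterization trick.

    x0         -- clean data, shape (B, C, H, W)
    t          -- integer timesteps in {1, ..., T}, shape (B,)
    alphas_bar -- tensor of cumulative products alpha_bar_1..alpha_bar_T, shape (T,)
    """
    eps = torch.randn_like(x0)                     # epsilon ~ N(0, I)
    ab = alphas_bar[t - 1].view(-1, 1, 1, 1)       # broadcast alpha_bar_t over pixels
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps   # sqrt(ab) * x0 + sqrt(1 - ab) * eps
    return x_t, eps                                # eps becomes the regression target
```
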
    • Reverse Diffusion Process

      • The goal is to learn \(p_\theta(x_{t-1}|x_t)\) to approximate the true reverse process.
      • By Bayes’ rule: \[ q(x_{t-1}|x_t,x_0) = \frac{q(x_t|x_{t-1},x_0)q(x_{t-1}|x_0)}{q(x_t|x_0)} \]
      • All terms are Gaussian, so \(q(x_{t-1}|x_t,x_0)\) is Gaussian with:
        • Mean: \[ \tilde{\mu}_t(x_t,x_0) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 \]
        • Variance: \[ \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t \]
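
A useful intermediate step, since it connects this posterior mean to the sampling formula used later: solving the closed-form forward equation for \(x_0\) and substituting into \(\tilde{\mu}_t\) expresses the mean purely in terms of \(x_t\) and the noise \(\epsilon\),

\[ x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon\right) \quad\Longrightarrow\quad \tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right), \]

which is why it suffices for the network to predict \(\epsilon\).
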
  • Training Objective

    • Model \(p_\theta(x_{t-1}|x_t)\) as a Gaussian with:

      • Fixed variance \(\tilde{\beta}_t\) (from the forward process).
      • Learned mean \(\mu_\theta(x_t,t)\) (predicted by a neural network).
    • ELBO Derivation:

      • Maximize the log-likelihood \(\log p_\theta(x_0)\) via the Evidence Lower Bound (ELBO): \[ \log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] = \mathcal{L}_\text{ELBO} \]
      • Expand the joint distributions: \[ p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t) \]
      • Rewrite the forward process using Bayes’ rule:
        • Start with the chain rule factorization: \[ q(x_{1:T}|x_0) = q(x_T|x_0) \cdot q(x_{T-1}|x_T, x_0) \cdot q(x_{T-2}|x_{T-1}, x_T, x_0) \cdots q(x_1|x_2, \ldots, x_T, x_0) \]
        • Apply the Markov property: given \(x_{t+1}\) and \(x_0\), \(x_t\) is independent of future states: \[ q(x_t|x_{t+1}, x_{t+2}, \ldots, x_T, x_0) = q(x_t|x_{t+1}, x_0) \]
        • This gives us the key factorization: \[ q(x_{1:T}|x_0) = q(x_T|x_0) \prod_{t=1}^{T-1} q(x_t|x_{t+1}, x_0) \]
      • Substitute and rearrange: \[ \mathcal{L}_\text{ELBO} = \mathbb{E}_q\left[\log\frac{p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)}{q(x_T|x_0) \prod_{t=1}^{T-1} q(x_t|x_{t+1}, x_0)}\right] \] \[ = \mathbb{E}_q\left[\log\frac{p(x_T)}{q(x_T|x_0)} + \log p_\theta(x_0|x_1) + \sum_{t=1}^{T-1}\log\frac{p_\theta(x_t|x_{t+1})}{q(x_t|x_{t+1}, x_0)}\right] \]
      • Decomposed into three terms: \[ \mathcal{L}_\text{ELBO} = \mathcal{L}_T + \mathcal{L}_0 + \mathcal{L}_{1:T-1} \]
        • Prior Matching Term (\(\mathcal{L}_T\)): \[ \mathcal{L}_T = \mathbb{E}_q\left[\log \frac{p(x_T)}{q(x_T|x_0)}\right] = -D_\text{KL}(q(x_T|x_0) \| p(x_T)) \]
          • For large \(T\), \(q(x_T|x_0) \approx \mathcal{N}(0,\mathbf{I})\) and \(p(x_T) = \mathcal{N}(0,\mathbf{I})\), so this term is negligible.
        • Reconstruction Term (\(\mathcal{L}_0\)): \[ \mathcal{L}_0 = \mathbb{E}_{q(x_1|x_0)}[\log p_\theta(x_0|x_1)] \]
          • Measures how well the final denoising step reconstructs the original data.
        • Denoising Matching Terms (\(\mathcal{L}_{1:T-1}\)): \[ \mathcal{L}_{1:T-1} = \sum_{t=1}^{T-1}\mathbb{E}_q\left[\log \frac{p_\theta(x_t|x_{t+1})}{q(x_t|x_{t+1}, x_0)}\right] = -\sum_{t=1}^{T-1}D_\text{KL}(q(x_t|x_{t+1}, x_0) \| p_\theta(x_t|x_{t+1})) \]
          • Measures how well the learned reverse process matches the true denoising transitions.
    • Simplified Training Objective:

      • If \(p_\theta(x_{t-1}|x_t)\) is Gaussian with the same variance as \(q(x_{t-1}|x_t,x_0)\), the KL divergence in \(\mathcal{L}_{1:T-1}\) reduces to the squared error between means.
      • The training loss becomes: \[ \mathcal{L}_\text{simple} = \mathbb{E}_{t,x_0,\epsilon} \left[ \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t \right) \right\|^2 \right] \] where the network's first argument is exactly \(x_t\) written via the closed-form forward process.
      • \(\epsilon_\theta(x_t, t)\) is the network’s prediction of the noise component.
    • Training Algorithm

      \[ \begin{aligned} &\text{Algorithm: Diffusion Model Training} \\ &\text{while not converged do} \\ &\quad \mathbf{x}_0 \sim q(\mathbf{x}_0) \\ &\quad t \sim \mathcal{U}(\{1, \ldots, T\}) \\ &\quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ &\quad \mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \epsilon \\ &\quad \text{Take gradient step on } \nabla_{\Theta}\left\|\epsilon-\epsilon_{\Theta}\left(\mathbf{x}_t, t\right)\right\|^2 \\ &\text{end while} \end{aligned} \]
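
A minimal PyTorch rendering of the training algorithm above; the data loader (assumed to yield batches of images), the noise-prediction network `model(x_t, t)`, and the optimizer are placeholders, and `alphas_bar` is a \((T,)\) tensor of \(\bar{\alpha}_t\) values on the same device.

```python
import torch
import torch.nn.functional as F

def train_epoch(model, dataloader, alphas_bar, T, optimizer, device="cuda"):
    model.train()
    for x0 in dataloader:                                            # x0 ~ q(x0)
        x0 = x0.to(device)
        t = torch.randint(1, T + 1, (x0.shape[0],), device=device)   # t ~ Uniform{1, ..., T}
        eps = torch.randn_like(x0)                                   # eps ~ N(0, I)
        ab = alphas_bar[t - 1].view(-1, 1, 1, 1)
        x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                 # closed-form forward step
        loss = F.mse_loss(model(x_t, t), eps)                        # ||eps - eps_theta(x_t, t)||^2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
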

    • Sampling Algorithm

      • Start from \(x_T \sim \mathcal{N}(0, \mathbf{I})\).
      • For \(t = T\) down to \(1\):
        • Predict noise: \(\hat{\epsilon} = \epsilon_\theta(x_t, t)\).
        • Compute mean: \[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\hat{\epsilon}\right) \]
        • If \(t > 1\), sample: \[ x_{t-1} = \mu_\theta(x_t, t) + \sqrt{\tilde{\beta}_t}z, \quad z \sim \mathcal{N}(0, \mathbf{I}) \]
        • Else, set \(x_0 = \mu_\theta(x_1, 1)\).
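
A matching sketch of the sampling loop, assuming the same noise-prediction network `model(x_t, t)` and precomputed tensors `betas`, `alphas`, `alphas_bar`, and `post_var` for \(\beta_t\), \(\alpha_t\), \(\bar{\alpha}_t\), and \(\tilde{\beta}_t\):

```python
import torch

@torch.no_grad()
def sample(model, shape, betas, alphas, alphas_bar, post_var, device="cuda"):
    T = len(betas)
    x = torch.randn(shape, device=device)                       # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_hat = model(x, t_batch)                              # predict the noise component
        coef = betas[t - 1] / (1 - alphas_bar[t - 1]).sqrt()
        mean = (x - coef * eps_hat) / alphas[t - 1].sqrt()       # mu_theta(x_t, t)
        if t > 1:
            x = mean + post_var[t - 1].sqrt() * torch.randn_like(x)   # add noise scaled by beta_tilde_t
        else:
            x = mean                                             # last step: return the mean as x_0
    return x
```
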
  • Additional Resources

    • What are diffusion models?
    • Tutorial on vision diffusion models
    • Understanding diffusion models

Foundation Models

  • General Concept

    • Train a large network to perform well on a wide range of tasks, often with minimal or no task-specific fine-tuning.
  • Historical Progression

    • First Generation: ELMo, BERT, ERNIE
      • Model is pre-trained, then a task-specific decoder is added.
    • Second Generation: GPT-3
      • Generalized model, fine-tuned for specific tasks.
    • Third Generation: ChatGPT, DeepSeek
      • Zero-shot and few-shot learning; models are instruction-tuned and can generalize to new tasks with little or no additional training.
  • Vision Foundation Models

    • Tokenization: Images are split into patches, which are linearly projected and fed into a transformer.
    • Vision Transformer (ViT, 2020):
      • Linear projection of flattened patches.
      • Uses a transformer encoder.
      • Includes a learnable class embedding token (CLS token) to represent the whole image.
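
A minimal sketch of the patch-tokenization step described above (the patch size, dimensions, and the strided-convolution trick are illustrative; positional embeddings and the transformer encoder itself are omitted):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches, linearly project each patch, and prepend a CLS token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and applying a linear layer.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x):                                  # x: (B, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, 196, dim) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # learnable class token
        return torch.cat([cls, tokens], dim=1)             # (B, 197, dim), fed to the encoder
```
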
    • Masked Autoencoder (MAE) Pretraining:
      • Randomly masks patches and trains the model to reconstruct them.
      • Enables joint prediction, depth estimation, and image inpainting. [Verification Needed: “joint prediction” is ambiguous.]
    • SAM (Segment Anything Model, Meta):
      • Promptable segmentation model.
      • Requires large-scale labeled data, often refined with human-in-the-loop annotation.
    • CLIP:
      • Shared embedding space for image and text.
      • Trained with contrastive learning on image-text pairs.
      • Enables dot product similarity for retrieval and zero-shot classification.
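
A sketch of how the shared embedding space is used for zero-shot classification; `image_encoder` and `text_encoder` are placeholders standing in for CLIP's two towers, and the key operation is the dot product between normalized embeddings.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_encoder, text_encoder, image, class_prompts):
    """Score an image against prompts like 'a photo of a dog' via cosine similarity."""
    img = F.normalize(image_encoder(image), dim=-1)           # (1, d), unit-norm image embedding
    txt = F.normalize(text_encoder(class_prompts), dim=-1)    # (num_classes, d), unit-norm text embeddings
    logits = img @ txt.T                                      # dot products = cosine similarities
    return logits.softmax(dim=-1)                             # distribution over the class prompts
```
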
    • DINO:
      • Self-distillation without labels.
      • Student matches the output of a teacher (an EMA of the student).
      • Teacher sees the whole image; student sees a crop.
      • The teacher receives no gradient updates; it is updated only as an exponential moving average of the student, which stabilizes training (see the sketch after this list).
      • Enables extraction of general features with zero supervision.
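
A sketch of the teacher update described above: the teacher's parameters are an exponential moving average (EMA) of the student's and receive no gradients of their own (the momentum value is illustrative).

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.996):
    """teacher <- m * teacher + (1 - m) * student; the teacher is never trained by backprop."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1 - momentum)
```
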
  • Multimodal Models

    • 4M (EPFL) - Massively Multimodal Masked Modeling:
      • Any-to-any model: can handle image-to-text, normals, depth, 3D, etc.
      • All modalities are tokenized, randomly masked, and the model is trained to predict all modalities.
      • Requires knowledge of labels for each modality.
      • Pseudo-labeling can be used, leveraging state-of-the-art models for each task.
    • DreamBooth:
      • Customized text-to-image generation.
      • Fine-tunes a diffusion model on a few examples of a specific subject, associating it with a unique token.
    • Zero1to3:
      • Fine-tunes a diffusion model with 3D data.
      • From a single image, generates multiple views (novel view synthesis).