Part 1 - Week 3
Ryan Cotterell
Modeling Foundations
- The decisions involved when we want to build a distribution over strings and estimate its parameters.
- How can a sequence model be parameterized?
- \(p_{\omega}: \Sigma^* \rightarrow \mathbb{R}\), where \(\Sigma^*\) denotes the Kleene closure of the alphabet \(\Sigma\).
- Here \(\omega \in \Theta\) are the free parameters of the function.
- Given a parameterized model, how can the parameters be chosen to best reflect the data?
Representation-Based Language Models
- Most modern language models are defined as locally normalized (LN) models.
- However, we first define a sequence model and prove that a specific parameterization encodes a tight LN model.
- We parameterize to enable optimization with respect to some objective.
- One way is to embed individual symbols and possible contexts into a Hilbert space, enabling a notion of similarity.
- Principle (good representations): The success of a machine learning (ML) task depends on the chosen representations; in our case, representations of symbols and contexts.
- Definition: Vector space over a field \(\mathbb{F}\): A set \(V\) with two operations—addition and scalar multiplication—satisfying 8 axioms (associativity, commutativity, identity element of addition, inverse elements of addition, compatibility of scalar multiplication with field multiplication, identity element of scalar multiplication, distributivity of scalar multiplication with respect to vector addition, and distributivity of scalar multiplication with respect to field addition).
- A \(D\)-dimensional vector space has \(D\) basis vectors; any vector can be represented by its coordinates in this basis as an element of \(\mathbb{F}^D\).
- An inner product is a map \(\langle \cdot, \cdot \rangle: V \times V \rightarrow \mathbb{F}\) satisfying (conjugate) symmetry, linearity in the first argument, and positive-definiteness.
- The inner product induces a norm \(\|\cdot\|: V \rightarrow \mathbb{R}_{\geq 0}\) defined as \(\|v\| = \sqrt{\langle v, v \rangle}\).
- A Hilbert space is a complete inner product space: every Cauchy sequence (equivalently, every absolutely convergent series in the norm) converges to a limit in the space.
- An inner product space can be completed into a Hilbert space.
- We need a Hilbert space because limits of infinite sequences of representations (e.g., those produced by a recurrent neural network, RNN) must lie in the space; with the rationals as the field, for instance, a limit might be irrational (such as \(\sqrt{2}\)) and fall outside the space.
- A representation function from a set \(\mathcal{S}\) to a Hilbert space \(V\) maps each element of \(\mathcal{S}\) to its representation in \(V\).
- If \(\mathcal{S}\) is finite, it can be represented as a matrix \(E \in \mathbb{R}^{|\mathcal{S}| \times D}\), where each row is the representation of a symbol.
- This is the case for embedding symbols from the extended alphabet \(\bar{\Sigma} = \Sigma \cup \{\mathrm{bos}, \mathrm{eos}\}\) via an embedding function \(e: \bar{\Sigma} \rightarrow \mathbb{R}^D\).
- E.g., a one-hot encoding assigns the \(n\)th symbol the \(n\)th basis vector (so \(E\) is an identity-like matrix); the representations are sparse, and the cosine similarity between any two distinct symbols is 0 (see the sketch after this list).
- We want to capture semantic information.
- For contexts, define an encoding function \(\mathrm{enc}: \bar{\Sigma}^* \rightarrow \mathbb{R}^D\) mapping the context to a vector in the same space.
- For now, treat the function as a black box.
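- A minimal sketch of these definitions, assuming a toy alphabet; the symbol list, the one-hot embedding matrix, and the bag-of-symbols encoder are illustrative stand-ins (a real encoder would be an RNN or a transformer).

```python
import numpy as np

# Toy extended alphabet Σ̄ = Σ ∪ {bos, eos} (illustrative only).
SYMBOLS = ["a", "b", "c", "BOS", "EOS"]
D = len(SYMBOLS)  # with one-hot embeddings, D = |Σ̄|

# One-hot embedding matrix E ∈ R^{|Σ̄| × D}: row n is the n-th basis vector.
E = np.eye(len(SYMBOLS))

def e(symbol):
    """Embedding function e: Σ̄ → R^D (here: a row of E)."""
    return E[SYMBOLS.index(symbol)]

def cosine(u, v):
    """Cosine similarity induced by the inner product and its norm."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# With one-hot embeddings, distinct symbols have cosine similarity 0.
print(cosine(e("a"), e("b")))  # 0.0
print(cosine(e("a"), e("a")))  # 1.0

def enc(context):
    """Black-box encoder enc: Σ̄* → R^D; here a simple bag-of-symbols
    average, purely for illustration."""
    if not context:
        return e("BOS")
    return np.mean([e(y) for y in context], axis=0)
```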
Compatibility of Symbol and Context
- Cosine similarity between a symbol embedding and a context encoding: \(\cos(\theta) = \frac{\langle e(y), \mathrm{enc}(\boldsymbol{y}) \rangle}{\|e(y)\| \|\mathrm{enc}(\boldsymbol{y})\|}\).
- Traditionally, use the inner product as the compatibility function.
- For a symbol and context: \(\langle e(y), \mathrm{enc}(\boldsymbol{y}) \rangle\).
- All symbol compatibilities can be computed at once via the matrix multiplication \(E \cdot \mathrm{enc}(\boldsymbol{y})\).
- Entries are often called scores or logits.
- To formulate \(p_{\mathrm{SM}}\), project the scores into a probability simplex.
- Definition: Probability simplex \(\Delta^{D-1} = \{\mathbf{x} \in \mathbb{R}^D \mid x_d \geq 0, \sum_{d=1}^D x_d = 1\}\).
- \(p_{\mathrm{SM}}: V \rightarrow \Delta^{|\bar{\Sigma}|-1}\).
- Need a projection function from logits to the simplex.
- The projection should preserve the ordering of the scores (monotonicity), be positive everywhere, and sum to 1.
- Definition: Softmax function:
- \(\mathrm{softmax}: \mathbb{R}^D \rightarrow \Delta^{D-1}\).
- \(\mathrm{softmax}(\mathbf{x})_d = \frac{\exp(x_d / \varepsilon)}{\sum_{j=1}^D \exp(x_j / \varepsilon)}\), where \(\varepsilon > 0\) is the temperature.
- Temperature \(\varepsilon\) controls the entropy of the resulting distribution by scaling the logits; the name comes from the Boltzmann distribution in statistical mechanics, where temperature controls randomness.
- High \(\varepsilon\): more uniform distribution; low \(\varepsilon\): more peaked; in the limit \(\varepsilon \to 0^+\): argmax (one-hot at the maximum). These effects are illustrated in the sketch at the end of this list.
- Variational characterization: \[ \mathrm{softmax}(\mathbf{x}) = \underset{\mathbf{p} \in \Delta^{D-1}}{\mathrm{argmax}} \left( \mathbf{p}^\top \mathbf{x} - \varepsilon \sum_{d=1}^D p_d \log p_d \right) = \underset{\mathbf{p} \in \Delta^{D-1}}{\mathrm{argmax}} \left( \mathbf{p}^\top \mathbf{x} + \varepsilon H(\mathbf{p}) \right), \] where \(H(\mathbf{p}) = -\sum_d p_d \log p_d\) is entropy.
- Interpreted as the vector with maximal similarity to \(\mathbf{x}\) while regularized for high entropy.
- Non-sparse (zeros only if input is \(-\infty\)).
- Invariant to constant addition: \(\mathrm{softmax}(\mathbf{x} + c \mathbf{1}) = \mathrm{softmax}(\mathbf{x})\).
- Derivative is computable and function is smooth.
- Maintains ordering for all temperatures: If \(x_i > x_j\), then \(\mathrm{softmax}(\mathbf{x})_i > \mathrm{softmax}(\mathbf{x})_j\).
- Other projections like sparsemax: \(\mathrm{sparsemax}(\mathbf{x}) = \underset{\mathbf{p} \in \Delta^{D-1}}{\mathrm{argmin}} \|\mathbf{p} - \mathbf{x}\|_2^2\) (can be sparse, but not everywhere differentiable).
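- A small numerical sketch of the softmax and the properties listed above (shift invariance, temperature behavior, order preservation, non-sparsity); the logits are made up for illustration.

```python
import numpy as np

def softmax(x, temp=1.0):
    """Softmax with temperature ε > 0; subtracting the max exploits the
    shift-invariance property for numerical stability."""
    z = (x - x.max()) / temp
    expz = np.exp(z)
    return expz / expz.sum()

logits = np.array([2.0, 1.0, 0.1])

# Shift invariance: adding a constant to all logits changes nothing.
assert np.allclose(softmax(logits), softmax(logits + 5.0))

# Temperature: high ε flattens the distribution, low ε sharpens it.
print(softmax(logits, temp=10.0))  # close to uniform
print(softmax(logits, temp=0.05))  # close to one-hot at the argmax

# Order preservation: probabilities keep the ordering of the logits.
p = softmax(logits)
assert np.all(np.argsort(p) == np.argsort(logits))

# Non-sparsity: every entry is strictly positive for finite logits.
assert np.all(p > 0)
```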
Representation-Based Locally Normalized Language Model
- Definition: \[ p_{\mathrm{SM}}(y_t \mid \boldsymbol{y}_{<t}) \stackrel{\text{def}}{=} f_{\Delta^{|\bar{\Sigma}|-1}} \left( E \cdot \mathrm{enc}(\boldsymbol{y}_{<t}) \right)_{y_t}, \] where \(f_{\Delta^{|\bar{\Sigma}|-1}}\) is a projection onto the probability simplex, typically the softmax.
- This defines the probability of a whole string \(\boldsymbol{y} = y_1 \cdots y_T\): \(p_{\mathrm{LN}}(\boldsymbol{y}) = p_{\mathrm{SM}}(\mathrm{eos} \mid \boldsymbol{y}) \prod_{t=1}^T p_{\mathrm{SM}}(y_t \mid \boldsymbol{y}_{<t})\), with \(y_0 = \mathrm{bos}\) prepended to every context (see the sketch at the end of this section).
- Encoding functions carry all information about symbol likelihoods given contexts.
- Theorem: Let \(p_{\mathrm{SM}}\) be a representation-based sequence model over alphabet \(\Sigma\).
- \(s \stackrel{\text{def}}{=} \sup_{y \in \Sigma} \|e(y) - e(\mathrm{eos})\|_2\) (largest distance to eos representation).
- \(z_{\max}^t = \max_{\boldsymbol{y} \in \Sigma^t} \|\mathrm{enc}(\boldsymbol{y})\|_2\) (maximum context norm for length \(t\)).
- The LN model induced by \(p_{\mathrm{SM}}\) is tight if \(s z_{\max}^t \leq \log t\).
- Theorem: Representation-based LN models with uniformly bounded \(\|\mathrm{enc}(\boldsymbol{y})\|_p\) for some \(p \geq 1\) are tight.
- This often holds in practice because bounded activation functions keep the context encodings bounded, so the softmax assigns \(\mathrm{eos}\) a probability bounded away from zero.
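- Putting the pieces together, a minimal sketch of a representation-based locally normalized model; the random embedding matrix and the averaging encoder are illustrative stand-ins for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

SYMBOLS = ["a", "b", "c", "BOS", "EOS"]  # toy Σ̄ = Σ ∪ {bos, eos}
D = 4
E = rng.normal(size=(len(SYMBOLS), D))   # one embedding row per symbol

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def enc(context):
    """Toy encoder enc: Σ̄* → R^D (mean of symbol embeddings); a real model
    would use an RNN or a transformer here."""
    return np.mean([E[SYMBOLS.index(y)] for y in context], axis=0)

def p_sm(context):
    """p_SM(· | context) = softmax(E · enc(context)), a distribution over Σ̄."""
    return softmax(E @ enc(context))

def p_ln(string):
    """p_LN(y) = p_SM(eos | y) · ∏_t p_SM(y_t | y_<t), with y_0 = bos."""
    prob, context = 1.0, ["BOS"]
    for y in string:
        prob *= p_sm(context)[SYMBOLS.index(y)]
        context.append(y)
    return prob * p_sm(context)[SYMBOLS.index("EOS")]

print(p_ln(["a", "b", "a"]))  # probability of the string "aba"
```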
Estimating a Language Model
- The dataset \(\mathcal{D}\) is a collection of \(N\) i.i.d. strings from \(\Sigma^*\) generated according to the unknown distribution \(p_{\mathrm{LM}}\).
- Estimation is often posed as an optimization problem, i.e., a search for the best parameters; to make this tractable, we must limit the space of possible models.
- We limit ourselves to a parameterized family of models \(\{p_{\omega} \mid \omega \in \Theta\}\), where \(\Theta\) is the parameter space.
- General Framework
- We search for parameters \(\hat{\omega}\) that maximize a chosen objective, or alternatively, minimize some loss function \(\varsigma: \Theta \times \Theta \rightarrow \mathbb{R}_{\geq 0}\).
- We seek the model \(\hat{\omega} = \arg\min_{\omega \in \Theta} \varsigma(\omega^*, \omega)\).
- We don’t know \(\omega^*\), but we can approximate \(p_{\omega^*}\) pointwise using the empirical distribution of the corpus.
- Definition: \(\hat{p}_{\omega^*}(\boldsymbol{y}) = \frac{1}{N} \sum_{n=1}^N \delta_{\boldsymbol{y}^{(n)}}(\boldsymbol{y})\).
- This can also be decomposed at the level of individual symbols.
- We focus on the generative paradigm, where we model the full distribution rather than just decision boundaries between classes; this is the usual setting for language models, which are trained on unannotated data with self-supervised learning.
- We can now measure the distance between these distributions using cross-entropy, which has its roots in information theory.
- \(\varsigma(\omega^*, \omega) = H(\hat{p}_{\omega^*}, p_{\omega})\).
- In general, cross-entropy is defined as \(H(p_1, p_2) = -\sum_{\boldsymbol{y} \in \mathcal{Y}} p_1(\boldsymbol{y}) \log p_2(\boldsymbol{y})\), where the sum is over the support \(\mathcal{Y}\) of \(p_1\).
- For locally normalized models, \(\log p_2(\boldsymbol{y})\) decomposes over symbols: \(H(p_1, p_2) = -\sum_{\boldsymbol{y} \in \mathcal{Y}} p_1(\boldsymbol{y}) \sum_{t=1}^{T+1} \log p_2(y_t \mid \boldsymbol{y}_{<t})\), where \(T = |\boldsymbol{y}|\) and \(y_{T+1} = \mathrm{eos}\).
- CE is not symmetric; CE between two distributions is the average number of bits needed to encode samples from \(p_1\) using a code optimized for \(p_2\).
- The optimal encoding for \(p_1\) uses \(-\log p_1(y)\) bits to encode an event with probability \(p_1(y)\), so the cross-entropy is minimized when \(p_1 = p_2\) (where it equals the entropy of \(p_1\)).
- Cross-entropy is related to the KL divergence, defined as \(D_{\mathrm{KL}}(p_1 \| p_2) = H(p_1, p_2) - H(p_1, p_1)\), where \(H(p_1, p_1)\) is the entropy of \(p_1\); the KL divergence could be used as a loss with the same results, since the entropy term doesn’t depend on the parameters.
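- A numerical sketch, on a toy support, of the cross-entropy between the empirical distribution and a model; the corpus and the model probabilities are made up for illustration. It also checks that \(H(\hat{p}_{\omega^*}, p_{\omega})\) equals the average negative log-likelihood of the corpus, anticipating the MLE connection below.

```python
import math
from collections import Counter

# Toy corpus of strings and a made-up model distribution over them.
corpus = ["ab", "ab", "b", "aab"]
model_probs = {"ab": 0.4, "b": 0.3, "aab": 0.1, "a": 0.2}  # illustrative only

# Empirical distribution p̂(y) = (1/N) Σ_n δ_{y^(n)}(y).
N = len(corpus)
p_hat = {y: c / N for y, c in Counter(corpus).items()}

# Cross-entropy H(p̂, p_ω) = −Σ_y p̂(y) log p_ω(y), summed over the support of p̂.
cross_entropy = -sum(p * math.log(model_probs[y]) for y, p in p_hat.items())

# Equivalently, the average negative log-likelihood of the corpus.
avg_nll = -sum(math.log(model_probs[y]) for y in corpus) / N

assert math.isclose(cross_entropy, avg_nll)
print(cross_entropy)
```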
- Relationship to MLE
- Data likelihood \(\mathcal{L}(\omega) = \prod_{n=1}^N p_{\omega}(\boldsymbol{y}^{(n)})\).
- Principle: The optimal parameters are those that maximize the likelihood of the data observed.
- We usually work with the log-likelihood: the logarithm is monotone (so the maximizer is unchanged), turns the product into a sum, and is numerically more stable.
- The optimal parameters found by MLE are the same as those found by minimizing the cross-entropy loss against the empirical distribution (as the sketch above checks numerically).
- These multiple perspectives show that we want the model to be close to the data distribution with respect to a chosen divergence and to place probability mass on the observed data.
- Theorem: MLE is consistent: Consider the loss \(\varsigma(\omega^*, \omega) = H(\hat{p}_{\omega^*}, p_{\omega})\) and data generated i.i.d. from the language model. Under regularity conditions on the parametric family and assuming the minimizer is unique, the estimator \(\hat{\omega}\) is consistent, i.e., it converges in probability to the true parameters \(\omega^*\) as \(N \rightarrow \infty\).
- Up to here we assumed that the optimal model lies in the parametric family, which in practice is not the case; one can still show that the estimated model is close to the projection of the true model onto the parametric family.
- Problems with CE: To avoid infinite loss, the model has to place probability mass on every string the data distribution supports; this is called mean-seeking behavior.
- It is unclear whether this is desirable or not.
- Teacher forcing: The loss functions considered so far condition on the previous ground-truth symbols, as in a time series: even when the model predicts the wrong symbol, the next step conditions on the ground truth. Training with CE mandates this, but it can hurt performance when the model is scored on generating whole sequences from its own predictions; this mismatch is called exposure bias. On the other hand, conditioning on the model’s previous predictions during training can lead to instability (see the sketch below).
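- A minimal sketch of teacher forcing: the per-step loss always conditions on the ground-truth prefix, regardless of what the model would have predicted. The tiny next-symbol model is made up for illustration and conditions only on the previous symbol.

```python
import math

# Made-up next-symbol distributions p(· | previous symbol), for illustration.
P_NEXT = {
    "BOS": {"a": 0.6, "b": 0.3, "EOS": 0.1},
    "a":   {"a": 0.2, "b": 0.5, "EOS": 0.3},
    "b":   {"a": 0.4, "b": 0.1, "EOS": 0.5},
}

def teacher_forced_nll(string):
    """CE training: at every step, condition on the ground-truth prefix
    (here reduced to its last symbol), even if the model's own best guess
    would have been a different symbol."""
    nll, prev = 0.0, "BOS"
    for y in list(string) + ["EOS"]:
        nll -= math.log(P_NEXT[prev][y])
        prev = y  # feed the ground truth, not the model's prediction
    return nll

print(teacher_forced_nll("ab"))
# At inference time the model instead conditions on its own previous
# predictions, so errors can compound: this mismatch is the exposure bias.
```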
- Masked Language Modeling
- Sequential (left-to-right) conditioning is useful for generation.
- But conditioning on both the preceding and following symbols is useful for classification.
- Sometimes called pseudo-log-likelihood. \[ \ell_{\mathrm{MLM}}(\omega)=\sum_{n=1}^N\sum_{t=1}^T\log p_\omega(y_t^{(n)}\mid\boldsymbol{y}_{<t}^{(n)},\boldsymbol{y}_{>t}^{(n)}). \]
- BERT (Devlin et al., 2019) is the best-known masked language model.
- They are not LMs in the strict sense since they don’t model a valid distribution over \(\Sigma^*\).
- Popular for language tasks and fine-tuning.
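- A minimal sketch of the pseudo-log-likelihood objective: each position is predicted from both its left and right context. The conditional `p_masked` is a made-up stand-in for a masked model such as BERT.

```python
import math

def p_masked(left, right, symbol):
    """Made-up conditional p_ω(y_t | y_<t, y_>t) for illustration; a real
    masked LM would compute this from the two-sided context with a network."""
    vocab = ["a", "b"]
    # Toy scoring: prefer symbols that also occur in the surrounding context.
    score = {v: 1.0 + (left + right).count(v) for v in vocab}
    return score[symbol] / sum(score.values())

def pseudo_log_likelihood(string):
    """ℓ_MLM for one string: Σ_t log p_ω(y_t | y_<t, y_>t)."""
    return sum(
        math.log(p_masked(string[:t], string[t + 1:], string[t]))
        for t in range(len(string))
    )

print(pseudo_log_likelihood("abba"))
```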
- Other Divergence Measures
- There has been work on other divergence measures that better capture the tails of the distribution (language is long-tailed), in contrast to the mean-seeking behavior of CE.
- Other divergence measures, however, raise computational issues.
- For example, the forward KL can be estimated with Monte Carlo using only samples from the data, whereas other divergences require evaluating the true distribution, to which we don’t have access.
- Scheduled Sampling (Bengio et al., 2015)
- After an initial period, some predictions are conditioned on the model’s own previous predictions rather than on the ground truth.
- This is done to avoid exposure bias.
- The estimator is, however, no longer consistent.
- These and other methods aim to specify an alternative target distribution of high-quality text rather than average text, and use RL methods such as the REINFORCE algorithm to optimize the model.
- Other auxiliary tasks can also be intertwined with training, such as next-sentence prediction, frequency prediction, or sentence-order prediction; these usually invalidate the formal requirements for a true LM.
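- A minimal sketch of scheduled sampling: with some probability, the next step conditions on a symbol sampled from the model instead of the ground truth, and that probability is typically increased over training. The toy next-symbol model and the schedule are illustrative.

```python
import math
import random

random.seed(0)

# Made-up next-symbol distributions p(· | previous symbol), for illustration.
P_NEXT = {
    "BOS": {"a": 0.6, "b": 0.3, "EOS": 0.1},
    "a":   {"a": 0.2, "b": 0.5, "EOS": 0.3},
    "b":   {"a": 0.4, "b": 0.1, "EOS": 0.5},
}

def scheduled_sampling_nll(string, p_model_feed):
    """At each step, with probability p_model_feed condition the next step on
    a symbol sampled from the model rather than on the ground truth;
    p_model_feed is typically annealed upwards during training."""
    nll, prev = 0.0, "BOS"
    for y in list(string) + ["EOS"]:
        dist = P_NEXT[prev]
        nll -= math.log(dist[y])  # the loss still targets the ground truth
        if random.random() < p_model_feed:
            sampled = random.choices(list(dist), weights=list(dist.values()))[0]
            # Fall back to the ground truth if the model samples eos mid-string.
            prev = y if sampled == "EOS" else sampled
        else:
            prev = y  # standard teacher forcing
    return nll

print(scheduled_sampling_nll("ab", p_model_feed=0.25))
```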
Parameter Estimation
- Finding the best parameters analytically or by hand is not possible.
- We have to resort to numerical optimization.
- To have an unbiased estimate of model performance, we have to split the data into training, validation, and test sets.
- Most optimization methods used are gradient-based.
- Backprop = reverse-mode autodiff.
- We can compute gradients with the same complexity as the forward pass.
- Normal GD is over the whole dataset, but this is not feasible for large datasets.
- Usually we use SGD or mini-batch SGD, which estimates the gradient from a subset of the data at the cost of higher variance.
- Curriculum learning: Start with easy training examples and gradually increase the difficulty.
- In practice, Adam is usually used; it maintains estimates of the first and second moments of the gradients.
- Parameter initialization is very important: a bad initialization can lead to poor local minima or slow convergence; common schemes are Xavier (Glorot) initialization and He initialization (designed for ReLU activations).
- Early stopping: Stop training when the validation loss stops improving, to avoid overfitting; recently observed phenomena such as double descent and grokking suggest that training longer may be beneficial in specific cases (see the training-loop sketch at the end of this section).
- Regularization: Any modification of training intended to improve generalization performance, possibly at the cost of training performance, i.e., making training harder in some way.
- Most common in language models:
- Weight decay: Sometimes called L2 regularization, penalizes large weights by adding a term to the loss function that is proportional to the square of the magnitude of the weights.
- Entropy regularization: Based on the principle of maximum entropy (the best prior is the least informative one); label smoothing and the confidence penalty add loss terms that penalize overly peaked distributions.
- Dropout: Randomly sets a fraction of the activations to zero during training; this prevents over-reliance on any single signal and can also be viewed as an ensemble method.
- Batch and layer norm:
- Batch norm: Normalizes the activations of a layer across the batch; this helps with convergence and stability.
- Layer norm: Normalizes the activations across the features; this is often used in transformer models.
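- A minimal, framework-free sketch of the training-loop ingredients above (mini-batch SGD, weight decay, early stopping) on a toy least-squares problem; the model, data split, and hyperparameters are all illustrative, and in practice one would use an autodiff framework with an optimizer such as Adam.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: linear regression stands in for "a differentiable model + loss".
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=200)

# Train/validation split (a test set would be held out for final evaluation).
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

def loss(w, Xb, yb):
    return float(np.mean((Xb @ w - yb) ** 2))

def grad(w, Xb, yb):
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(5)  # (for neural networks, use e.g. Xavier/He initialization)
lr, weight_decay = 0.05, 1e-3
best_val, best_w, patience, bad_epochs = float("inf"), w.copy(), 5, 0

for epoch in range(200):
    perm = rng.permutation(len(y_tr))
    for start in range(0, len(y_tr), 32):                    # mini-batches of 32
        idx = perm[start:start + 32]
        g = grad(w, X_tr[idx], y_tr[idx]) + weight_decay * w  # L2 / weight decay
        w -= lr * g                                           # SGD step
    val = loss(w, X_va, y_va)
    if val < best_val:                                        # early stopping
        best_val, best_w, bad_epochs = val, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

print("best validation loss:", best_val)
```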