Part 1 - Week 1

Ryan Cotterell
Published: Tuesday, February 18, 2025

Measure Theory Recap

  • It is not trivial to define a language model over an uncountable set of sequences.
  • We need some measure-theoretic tools to avoid paradoxes.
  • Most technical papers aren’t rigorous enough and assume things will work out.
  • A probability space is a triple \((\Omega, \mathcal{F}, \mathbb{P})\), where \(\Omega\) is the sample space (the set of outcomes), \(\mathcal{F}\) is a sigma-algebra over \(\Omega\) (the measurable sets, i.e., the events), and \(\mathbb{P}\) is a probability measure.
    • Definition: Sigma-algebra. \(\mathcal{F} \subseteq 2^\Omega\) is a sigma-algebra if:
      • \(\emptyset \in \mathcal{F}\).
      • \(A \in \mathcal{F} \implies A^c \in \mathcal{F}\).
      • \(A_1, A_2, \ldots\) is a countable collection in \(\mathcal{F} \implies \bigcup_{i=1}^\infty A_i \in \mathcal{F}\).
    • Examples: The trivial sigma-algebra \(\{\emptyset, \Omega\}\) or the whole power set (discrete sigma-algebra).
  • Definition: Probability measure over a measurable space \((\Omega, \mathcal{F})\) is a function \(\mathbb{P}: \mathcal{F} \to [0, 1]\) such that:
    • \(\mathbb{P}(\Omega) = 1\).
    • If \(A_1, A_2, \ldots\) is a countable collection of pairwise disjoint sets in \(\mathcal{F}\), then \(\mathbb{P}(\bigcup_{i=1}^\infty A_i) = \sum_{i=1}^\infty \mathbb{P}(A_i)\); this property is called sigma-additivity (countable additivity).
  • When \(\Omega\) is uncountable, we usually cannot take \(\mathcal{F} = 2^\Omega\), because it is then provably impossible (assuming the axiom of choice) to define a probability measure on every subset that has the desired properties.
  • Definition: Algebra. Like a sigma-algebra, but closed only under finite unions (it still contains \(\emptyset\) and is closed under complements).
  • Definition: Probability pre-measure. Like a probability measure, but defined on an algebra, so sigma-additivity is required only for those countable unions of disjoint sets that happen to lie in the algebra.
  • Carathéodory’s extension theorem: It suffices to construct an algebra over the set together with a probability pre-measure; the pre-measure then extends to a probability measure on the sigma-algebra generated by the algebra, and this extension is unique.
  • Definition: Random variable. A function \(X: \Omega \to \mathbb{R}\) between two measurable spaces \((\Omega, \mathcal{F})\) and \((\mathbb{R}, \mathcal{B})\) (Borel sigma-algebra) is a random variable or a measurable mapping if for all sets \(B \in \mathcal{B}\), we have \(X^{-1}(B) \in \mathcal{F}\).
  • Each random variable defines a new pushforward probability measure \(\mathbb{P}_X\) on \((\mathbb{R}, \mathcal{B})\) such that \(\mathbb{P}_X(B) = \mathbb{P}(X^{-1}(B))\) for all \(B \in \mathcal{B}\).
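
Below is a minimal sketch of these definitions on a finite sample space, where we may safely take \(\mathcal{F} = 2^\Omega\): a probability measure, sigma-additivity on disjoint events, and the pushforward measure induced by a random variable. The names (`Omega`, `X`, the die example) are illustrative, not from the lecture.

```python
# Finite sample space: a fair six-sided die. Since Omega is finite,
# we may take the sigma-algebra to be the full power set 2^Omega.
Omega = {1, 2, 3, 4, 5, 6}
P = {omega: 1 / 6 for omega in Omega}      # probability of each outcome

def measure(event):
    """P(A) for an event A (here: any subset of Omega)."""
    return sum(P[omega] for omega in event)

# A random variable X: Omega -> R, here the parity of the roll.
def X(omega):
    return omega % 2

def pushforward(B):
    """Pushforward measure P_X(B) = P(X^{-1}(B)) for a set B of values of X."""
    preimage = {omega for omega in Omega if X(omega) in B}
    return measure(preimage)

# Sigma-additivity on disjoint events: P({1,2}) + P({3,4}) = P({1,2,3,4}).
assert abs(measure({1, 2}) + measure({3, 4}) - measure({1, 2, 3, 4})) < 1e-12
print(pushforward({0}), pushforward({1}))  # 0.5 0.5
```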

Language Models: Distributions over Strings

  • We take inspiration from formal language theory, which works with sets of structures; the simplest such structure is a string.
  • Alphabet: Finite non-empty set of symbols \(\Sigma\).
  • A string over an alphabet \(\Sigma\) is a finite sequence of symbols from \(\Sigma\), denoted as \(s = s_1 s_2 \ldots s_n\) where \(s_i \in \Sigma\).
  • A string is also called a word.
  • \(|s|\) is the length of the string.
  • Only one string of length 0, the empty string \(\epsilon\).
  • New strings are formed by concatenation: the concatenation of \(s_1\) and \(s_2\) is the string \(s_1 s_2\) of length \(|s_1| + |s_2|\); concatenation is associative.
  • Concatenation with \(\epsilon\) is the identity, i.e., \(\epsilon\) is the unit; together with associativity, this makes \(\Sigma^*\) with concatenation a monoid.
  • Kleene star: \(\Sigma^* = \bigcup_{n=0}^\infty \Sigma^n\), where \(\Sigma^n = \Sigma \times \Sigma \times \ldots \times \Sigma\) (\(n\) times) and \(\Sigma^0 = \{\epsilon\}\).
  • We also define \(\Sigma^+ = \bigcup_{n=1}^\infty \Sigma^n\), which is the set of all non-empty strings over \(\Sigma\).
  • We also define the set of all infinite strings over \(\Sigma\) as \(\Sigma^\infty = \Sigma \times \Sigma \times \ldots\) (infinitely many times).
  • In computer science (CS), strings are usually taken to be finite, though not in every context.
  • The Kleene star contains all finite strings, including the empty string, and is countably infinite (see the enumeration sketch after this list).
  • The set of infinite strings is uncountable for alphabets with more than one symbol.
  • Definition: Language. Let \(\Sigma\) be an alphabet, then a language over \(\Sigma\) is a subset of \(\Sigma^*\).
  • If not specified, we usually assume the language is the whole Kleene star.
  • Subsequence: A string \(s\) is a subsequence of a string \(t\) if we can obtain \(s\) by deleting some symbols from \(t\) without changing the order of the remaining symbols.
  • Substring: A contiguous subsequence.
  • Prefix and suffix are special cases of substrings that share the start or end of the string.
  • Prefix notation: \(y_{<n}\) denotes the prefix \(y_1 y_2 \ldots y_{n-1}\) of a string \(y\), i.e., its prefix of length \(n-1\).
  • We write \(y \triangleleft y'\) to denote that \(y\) is a suffix of \(y'\).
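
To make the countability of \(\Sigma^*\) concrete (as referenced above), here is a short sketch that enumerates \(\Sigma^*\) in length-lexicographic order; the toy alphabet and names are illustrative assumptions, not from the notes.

```python
from itertools import count, product

SIGMA = ("a", "b")                # toy alphabet; any finite non-empty set works

def kleene_star(alphabet):
    """Yield every string in Sigma* exactly once, in length-lexicographic order."""
    for n in count(0):            # lengths 0, 1, 2, ...  (Sigma^0 = {epsilon})
        for tup in product(alphabet, repeat=n):
            yield "".join(tup)

gen = kleene_star(SIGMA)
print([next(gen) for _ in range(7)])   # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

Every finite string appears at some finite index of this enumeration, which is exactly what makes \(\Sigma^*\) countably infinite; the infinite strings in \(\Sigma^\infty\) never appear.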

Definition of a Language Model

  • Now we can write the definition of a language model.
  • Definition: Let \(\Sigma\) be an alphabet; a language model is a discrete probability distribution \(p_\mathrm{LM}\) over \(\Sigma^*\) (a concrete example follows this list).
  • Definition: The weighted language of \(p_\mathrm{LM}\) is \(L(p_\mathrm{LM}) = \{(s, p_\mathrm{LM}(s)) \mid s \in \Sigma^* \}\).
  • We haven’t yet said how to represent such a distribution.
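
Even before discussing general representations, here is a concrete example (an illustrative construction, not from the notes) of a language model over \(\Sigma = \{a, b\}\): pick a length \(n\) with probability \(2^{-(n+1)}\), then pick a string uniformly from \(\Sigma^n\), i.e., \(p_\mathrm{LM}(\mathbf{y}) = 2^{-(|\mathbf{y}|+1)} \, |\Sigma|^{-|\mathbf{y}|}\). The sketch below checks numerically that the truncated mass approaches 1.

```python
from itertools import product

SIGMA = ("a", "b")

def p_lm(y):
    # p_LM(y) = 2^{-(|y|+1)} * |Sigma|^{-|y|}; summing over all strings of a
    # given length n gives 2^{-(n+1)}, and sum_n 2^{-(n+1)} = 1, so this is a
    # proper distribution over Sigma*.
    return 2.0 ** -(len(y) + 1) * len(SIGMA) ** -len(y)

# Sum the probability mass of all strings up to length N.
N = 15
mass = sum(p_lm("".join(t)) for n in range(N + 1) for t in product(SIGMA, repeat=n))
print(mass)    # 1 - 2^{-(N+1)} ~ 0.99998, approaching 1 as N grows
```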

Global and Local Normalization

  • By definition, a language model is always tight.
  • So when we speak of non-tight language models, which appear often in practice, we mean distributions whose total mass over \(\Sigma^*\) is less than one; technically such a model is not a language model at all, and “non-tight” acts as a non-intersective adjective.
  • Such non-tight models do arise in practice, for example from recurrent neural networks (RNNs) trained on data.
  • Depending on whether we model the probability of the whole string or the next symbol, we can have global or local normalization.
  • The BOS token is a special token that indicates the start of a sequence; we set \(y_0 = \mathrm{BOS}\).
  • Augmented alphabet: \(\bar{\Sigma} = \Sigma \cup \{\mathrm{EOS}\}\).
  • The Kleene star of this augmented alphabet is \(\bar{\Sigma}^*\).
  • Models that leak probability mass to infinite sequences will be called sequence models (a numeric illustration of such leakage follows this list).
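
To illustrate what “leaking probability mass to infinite sequences” means, consider a hypothetical sequence model over \(\Sigma = \{a\}\) whose EOS probability shrinks with the length of the history, \(p_\mathrm{SM}(\mathrm{EOS} \mid \mathbf{y}) = 2^{-(|\mathbf{y}|+2)}\); this particular model is an assumption chosen only for illustration.

```python
# EOS probability after a history of length t (illustrative choice).
def p_eos(t):
    return 2.0 ** -(t + 2)

# Total probability assigned to the finite strings a^n EOS, for n = 0..N.
N = 60
mass, survive = 0.0, 1.0          # survive = P(no EOS emitted in the first t steps)
for t in range(N + 1):
    mass += survive * p_eos(t)    # terminate exactly at step t
    survive *= 1.0 - p_eos(t)     # keep generating "a"

print(mass)   # ~0.42: the remaining ~0.58 leaks to the infinite string aaa...
```

Every conditional distribution here is perfectly normalized, yet the total mass over \(\Sigma^*\) stays below 1, so this is a sequence model but not a (tight) language model.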

Globally Normalized Language Model or Energy Model

  • Definition: An energy function is a function \(\hat{p}: \Sigma^* \to \mathbb{R}\).
  • Inspired by statistical physics, an energy function can be used to define a probability distribution by exponentiating its negative value and normalizing.
  • Definition: Globally normalized language model is a function \(p_\mathrm{LM}: \Sigma^* \to [0, 1]\) such that: \[p_\mathrm{LM}(s) = \frac{e^{-\hat{p}(s)}}{\sum_{s' \in \Sigma^*} e^{-\hat{p}(s')}}.\]
  • They are easy to define, but the normalization constant is hard to compute (a truncated-sum sketch follows this list).
  • Definition: An energy function is normalizable if its normalizer (the sum in the denominator above) is finite; otherwise the induced model is ill-defined.
  • Any normalizable energy function induces a language model over \(\Sigma^*\).
  • In general, it is undecidable whether an energy function is normalizable; even when it is, computing the normalizer can be computationally infeasible, and sampling from the distribution is also hard.
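
Here is a minimal sketch of a globally normalized model with a toy energy function (an illustrative assumption, not from the lecture); since the exact normalizer is an infinite sum over \(\Sigma^*\), the sketch approximates it by truncating at a maximum string length.

```python
import math
from itertools import product

SIGMA = ("a", "b")

def energy(s):
    # Toy energy: longer strings and strings with more "b"s get higher energy.
    # The induced model is normalizable because the per-length mass decays
    # geometrically: (e^{-1.5} * (1 + e^{-0.5}))^n with ratio < 1.
    return 1.5 * len(s) + 0.5 * s.count("b")

def truncated_normalizer(max_len):
    """Approximate Z = sum over Sigma* of exp(-energy) by summing up to max_len."""
    return sum(
        math.exp(-energy("".join(t)))
        for n in range(max_len + 1)
        for t in product(SIGMA, repeat=n)
    )

Z = truncated_normalizer(15)
p = lambda s: math.exp(-energy(s)) / Z    # approximate p_LM(s)
print(p(""), p("a"), p("ab"))
```

For energy functions given by an arbitrary program, we cannot in general even decide whether such a sum converges, which is the undecidability issue mentioned above.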

Locally Normalized Models

  • We need to know when a string terminates, so we work with the augmented alphabet \(\bar{\Sigma} = \Sigma \cup \{\mathrm{EOS}\}\).
  • Definition: A (locally normalized) sequence model \(p_\mathrm{SM}\) over an alphabet \(\Sigma\) is a collection of conditional probability distributions \(p_\mathrm{SM}(y \mid \mathbf{y})\), where \(\mathbf{y} \in \Sigma^*\) is the history and \(y \in \Sigma\) is the next symbol.
  • A sequence model can also be defined over \(\bar{\Sigma}\).
  • Definition: Locally normalized language model is a sequence model over \(\bar{\Sigma}\) such that: \[p_\mathrm{LM}(\mathbf{y}) = p_\mathrm{SM}(\mathrm{EOS} \mid \mathbf{y}) \prod_{i=1}^{T} p_\mathrm{SM}(y_i \mid y_{<i}),\] for all \(\mathbf{y} \in \Sigma^*\), with \(|\mathbf{y}| = T\).
  • The model is tight if \(1 = \sum_{\mathbf{y} \in \Sigma^*} p_\mathrm{LM}(\mathbf{y})\).
  • So again, a locally normalized (LN) language model is not necessarily a language model; local normalization is about how the distribution is written, not a guarantee that it is tight.
  • When can an LM be locally normalized?
  • Answer: Every language model can be locally normalized.
  • Definition: Prefix probability. \(\pi(\mathbf{y}) = \sum_{\mathbf{y}' \in \Sigma^*} p_\mathrm{LM}(\mathbf{y} \mathbf{y}')\).
  • This is the probability that \(\mathbf{y}\) is a prefix of a string drawn from the model, i.e., the cumulative probability of all strings that start with \(\mathbf{y}\) (including \(\mathbf{y}\) itself, since \(\mathbf{y}' = \epsilon\) is allowed).
  • Theorem: Let \(p_\mathrm{LM}\) be a language model. Then there exists a locally normalized model such that for all \(\mathbf{y} \in \Sigma^*\) with \(|\mathbf{y}| = T\), \(p_\mathrm{LM}(\mathbf{y}) = p_\mathrm{LN}(\mathbf{y}) = p_\mathrm{SM}(\mathrm{EOS} \mid \mathbf{y}) \prod_{i=1}^{T} p_\mathrm{SM}(y_i \mid y_{<i})\).
  • Sketch of proof: For \(\pi(\mathbf{y}) > 0\), we define \(p_\mathrm{SM}(y \mid \mathbf{y}) = \frac{\pi(\mathbf{y} y)}{\pi(\mathbf{y})}\) and \(p_\mathrm{SM}(\mathrm{EOS} \mid \mathbf{y}) = \frac{p_\mathrm{LM}(\mathbf{y})}{\pi(\mathbf{y})}\) (see the numeric sketch after this list).
  • Writing out \(p_\mathrm{LN}(\mathbf{y})\) for \(\pi(\mathbf{y}) > 0\), the prefix probabilities telescope and we are left with \(\frac{p_\mathrm{LM}(\mathbf{y})}{\pi(\epsilon)} = p_\mathrm{LM}(\mathbf{y})\), since \(\pi(\epsilon) = 1\).
  • And for \(\pi(\mathbf{y}) = 0\), we also have \(p_\mathrm{LM}(\mathbf{y}) = 0\) and \(p_\mathrm{LN}(\mathbf{y}) = 0\), so the construction is well-defined.
  • Locally normalized models (LNMs) are very common in natural language processing (NLP), even though they are not necessarily language models.
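
The following minimal sketch works through the construction in the proof on a toy language model with finite support (the distribution itself is an illustrative assumption): it computes prefix probabilities \(\pi\), derives the conditionals \(p_\mathrm{SM}\), and checks that the product recovers \(p_\mathrm{LM}\) exactly.

```python
# Toy language model with finite support over Sigma = {"a", "b"}.
p_lm = {"": 0.1, "a": 0.3, "ab": 0.2, "b": 0.25, "ba": 0.15}
EOS = "<eos>"

def prefix_prob(y):
    """pi(y): total probability of all strings that have y as a prefix."""
    return sum(p for s, p in p_lm.items() if s.startswith(y))

def p_sm(symbol, history):
    """Conditionals of the induced locally normalized model."""
    pi = prefix_prob(history)
    if pi == 0.0:
        return 0.0                              # arbitrary choice; never reached
    if symbol == EOS:
        return p_lm.get(history, 0.0) / pi      # p_SM(EOS | y) = p_LM(y) / pi(y)
    return prefix_prob(history + symbol) / pi   # p_SM(y' | y) = pi(y y') / pi(y)

def p_ln(y):
    """p_LN(y) = p_SM(EOS | y) * prod_i p_SM(y_i | y_{<i})."""
    prob = 1.0
    for i, sym in enumerate(y):
        prob *= p_sm(sym, y[:i])
    return prob * p_sm(EOS, y)

# The locally normalized model reproduces the original language model.
for s, prob in p_lm.items():
    assert abs(p_ln(s) - prob) < 1e-9
print("p_LN matches p_LM on all strings in the support")
```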