Part 1 - Week 1
Ryan Cotterell
Measure Theory Recap
- It is not trivial to define a language model over an uncountable set of sequences.
- We need some measure-theoretic tools to avoid paradoxes.
- Most technical papers aren’t rigorous enough and assume things will work out.
- A probability space is a triple \((\Omega, \mathcal{F}, \mathbb{P})\) where:
- \(\Omega\) is the sample space (the set of outcomes), \(\mathcal{F}\) is the sigma-algebra (the collection of measurable sets, i.e., events), and \(\mathbb{P}\) is the probability measure.
- Definition: Sigma-algebra. \(\mathcal{F} \subseteq 2^\Omega\) is a sigma-algebra if:
- \(\emptyset \in \mathcal{F}\).
- \(A \in \mathcal{F} \implies A^c \in \mathcal{F}\).
- \(A_1, A_2, \ldots\) is a countable collection in \(\mathcal{F} \implies \bigcup_{i=1}^\infty A_i \in \mathcal{F}\).
- Examples: The trivial sigma-algebra \(\{\emptyset, \Omega\}\) or the whole power set (discrete sigma-algebra).
- Definition: Measurable space. A pair \((\Omega, \mathcal{F})\) where \(\mathcal{F}\) is a sigma-algebra over \(\Omega\).
- Definition: Probability measure over a measurable space \((\Omega, \mathcal{F})\) is a function \(\mathbb{P}: \mathcal{F} \to [0, 1]\) such that:
- \(\mathbb{P}(\Omega) = 1\).
- If \(A_1, A_2, \ldots\) are countable disjoint sets in \(\mathcal{F}\), then \(\mathbb{P}(\bigcup_{i=1}^\infty A_i) = \sum_{i=1}^\infty \mathbb{P}(A_i)\), which is called sigma-additivity.
- For uncountable \(\Omega\), we usually cannot take \(\mathcal{F} = 2^\Omega\), because it is then provably impossible to define a probability measure on all of \(2^\Omega\) with the desired properties.
- Definition: Algebra. Like a sigma-algebra, but closed only under finite (rather than countable) unions.
- Definition: Probability pre-measure. Like a probability measure, but defined over an algebra, so sigma-additivity is only required for countable disjoint unions whose union happens to lie in the algebra.
- Carathéodory’s extension theorem: Allows us to first construct an algebra with a probability pre-measure (which is simpler) and then extend the pre-measure to a probability measure on the sigma-algebra generated by that algebra; this extension exists and is unique.
- Definition: Random variable. A function \(X: \Omega \to \mathbb{R}\) between two measurable spaces \((\Omega, \mathcal{F})\) and \((\mathbb{R}, \mathcal{B})\) (Borel sigma-algebra) is a random variable or a measurable mapping if for all sets \(B \in \mathcal{B}\), we have \(X^{-1}(B) \in \mathcal{F}\).
- Each random variable defines a new pushforward probability measure \(\mathbb{P}_X\) on \((\mathbb{R}, \mathcal{B})\) such that \(\mathbb{P}_X(B) = \mathbb{P}(X^{-1}(B))\) for all \(B \in \mathcal{B}\).
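To make the measure-theoretic definitions above concrete, here is a minimal sketch (not from the lecture; all names are hypothetical) of a finite probability space, where the full power set is a valid sigma-algebra, together with the pushforward measure induced by a random variable.

```python
# Toy illustration: a finite sample space (a fair die), where the power set is a
# valid sigma-algebra, and the pushforward measure induced by a random variable.
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}                      # sample space
P_point = {w: Fraction(1, 6) for w in omega}    # point masses P({w})

def P(event):
    """Probability measure: sum the point masses over a measurable set."""
    return sum(P_point[w] for w in event)

def X(w):
    """Random variable X(w) = w mod 2 (indicator of an odd roll)."""
    return w % 2

def pushforward(B):
    """P_X(B) = P(X^{-1}(B)) for a set B of values of X."""
    return P({w for w in omega if X(w) in B})

print(P({2, 4, 6}))      # 1/2
print(pushforward({1}))  # P_X({1}) = P({1, 3, 5}) = 1/2
```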
Language Models: Distributions over Strings
- Inspiration from formal language theory which works with sets of structures, the simplest of which is a string.
- Alphabet: Finite non-empty set of symbols \(\Sigma\).
- A string over an alphabet \(\Sigma\) is a finite sequence of symbols from \(\Sigma\), denoted as \(s = s_1 s_2 \ldots s_n\) where \(s_i \in \Sigma\).
- A string can be called a word.
- \(|s|\) is the length of the string.
- Only one string of length 0, the empty string \(\epsilon\).
- New strings are formed by concatenation: for strings \(s\) and \(t\), the concatenation \(st\) has length \(|s| + |t|\), and concatenation is associative.
- Concatenation with \(\epsilon\) is the identity, i.e., \(\epsilon\) is the unit, so strings with concatenation form a monoid.
- Kleene star: \(\Sigma^* = \bigcup_{n=0}^\infty \Sigma^n\) where \(\Sigma^n = \Sigma \times \Sigma \times \ldots \times \Sigma\) (\(n\) times).
- We also define \(\Sigma^+ = \bigcup_{n=1}^\infty \Sigma^n\), which is the set of all non-empty strings over \(\Sigma\).
- We also define the set of all infinite strings over \(\Sigma\) as \(\Sigma^\infty = \Sigma \times \Sigma \times \ldots\) (infinitely many times).
- In computer science (CS), strings are usually assumed to be finite, though not everywhere.
- Kleene star contains all finite strings, including the empty string, and is countably infinite (see the enumeration sketch after this list).
- The set of infinite strings is uncountable for alphabets with more than one symbol.
- Definition: Language. Let \(\Sigma\) be an alphabet, then a language over \(\Sigma\) is a subset of \(\Sigma^*\).
- If not specified, we usually assume the language is the whole Kleene star.
- Subsequence: A string \(s\) is a subsequence of a string \(t\) if we can obtain \(s\) by deleting some symbols from \(t\) without changing the order of the remaining symbols.
- Substring: A contiguous subsequence.
- Prefix and suffix are special cases of substrings that share the start or end of the string.
- Prefix notation: \(y_{<n}\) is a prefix of length \(n-1\) of a string \(y\).
- \(y \triangleleft y'\) to denote that \(y\) is a suffix of \(y'\).
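As a quick illustration (not part of the lecture) of the claim that the Kleene star is countably infinite, one can lazily enumerate \(\Sigma^*\) in order of increasing length:

```python
# Sketch: lazily enumerate the Kleene star of a finite alphabet, shortest strings
# first, which exhibits a bijection between Sigma^* and the natural numbers.
from itertools import count, islice, product

def kleene_star(sigma):
    """Yield every finite string over the alphabet `sigma`, in order of length."""
    for n in count(0):                        # n = 0 yields only the empty string
        for tup in product(sigma, repeat=n):
            yield "".join(tup)

sigma = ("a", "b")
print(list(islice(kleene_star(sigma), 7)))    # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```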
Definition of a Language Model
- Now we can write the definition of a language model.
- Definition: Let \(\Sigma\) be an alphabet. A language model is a discrete probability distribution \(p_\mathrm{LM}\) over \(\Sigma^*\).
- Definition: Weighted language is \(L(p_\mathrm{LM}) = \{(s, p_\mathrm{LM}(s)) \mid s \in \Sigma^* \}\).
- We haven’t yet said how to represent such a distribution.
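As a minimal toy example of this definition (not from the lecture), a language model with finite support can be represented as a dictionary from strings to probabilities; its weighted language pairs each string with its probability.

```python
# Toy language model over Sigma = {"a", "b"} with finite support; every string
# not listed implicitly has probability 0.
p_lm = {"": 0.5, "a": 0.25, "ab": 0.25}

assert abs(sum(p_lm.values()) - 1.0) < 1e-12   # a valid (tight) distribution

# The weighted language restricted to the support (formally it also contains
# every other string paired with probability 0).
weighted_language = {(s, p) for s, p in p_lm.items()}
print(weighted_language)
```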
Global and Local Normalization
- By definition, a language model is always tight.
- So when we speak of non-tight language models, which appear often in practice, we mean that the distribution does not sum to 1 over \(\Sigma^*\); such a model is technically not a language model, and “non-tight” acts as a non-intersective adjective.
- These non-tight models do appear, for example, from recurrent neural networks (RNNs) trained from data.
- Depending on whether we model the probability of the whole string or the next symbol, we can have global or local normalization.
- The BOS token is a special token that marks the start of a sequence; we take it to be \(y_0\).
- Augmented alphabet: \(\bar{\Sigma} = \Sigma \cup \{\mathrm{EOS}\}\).
- The Kleene star of this augmented alphabet is \(\bar{\Sigma}^*\).
- Models that leak probability mass to infinite sequences will be called sequence models (rather than language models).
Globally Normalized Language Model or Energy Model
- Definition: Energy function \(\hat{p}: \Sigma^* \to \mathbb{R}\).
- Inspired by statistical physics: an energy function can be used to define a probability distribution by exponentiating its negated value and normalizing.
- Definition: Globally normalized language model is a function \(p_\mathrm{LM}: \Sigma^* \to [0, 1]\) such that: \[p_\mathrm{LM}(s) = \frac{e^{-\hat{p}(s)}}{\sum_{s' \in \Sigma^*} e^{-\hat{p}(s')}}.\]
- They are easy to define, but their normalization constant is hard to compute.
- Definition: A normalizable energy function has a finite normalizer \(\sum_{s' \in \Sigma^*} e^{-\hat{p}(s')}\); if the normalizer is infinite, the induced model is ill-defined.
- Any normalizable energy function induces a language model over \(\Sigma^*\).
- In general, it is undecidable whether an energy function is normalizable; even when it is, computing the normalizer can be intractable, and sampling from the distribution is also hard.
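The following sketch (not from the lecture; the energy function is hypothetical) builds a globally normalized model, but it can only compute the normalizer exactly by truncating \(\Sigma^*\) to strings of length at most 3, which hints at why computing the true normalizer over all of \(\Sigma^*\) is hard.

```python
# Sketch: a globally normalized (energy-based) model. The normalizer is a sum
# over all of Sigma^*, so here we truncate to strings of length <= 3; the result
# is a proper distribution only over that finite language.
import math
from itertools import product

sigma = ("a", "b")

def energy(s):
    """Hypothetical energy function: longer strings and more 'b's cost more."""
    return len(s) + s.count("b")

finite_language = ["".join(t) for n in range(4) for t in product(sigma, repeat=n)]
Z = sum(math.exp(-energy(s)) for s in finite_language)   # (truncated) normalizer

def p_lm(s):
    return math.exp(-energy(s)) / Z

print(p_lm(""), p_lm("ab"))
print(abs(sum(p_lm(s) for s in finite_language) - 1.0) < 1e-12)  # sums to 1
```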
Locally Normalized Models
- We have to know when generation terminates, so we again use the augmented alphabet \(\bar{\Sigma} = \Sigma \cup \{\mathrm{EOS}\}\).
- Definition: A (locally normalized) sequence model \(p_\mathrm{SM}\) over an alphabet \(\Sigma\) is a collection of conditional probability distributions \(p_\mathrm{SM}(y \mid \mathbf{y})\), one for each history \(\mathbf{y} \in \Sigma^*\).
- A sequence model can also be defined over \(\bar{\Sigma}\).
- Definition: Locally normalized language model is a sequence model over \(\bar{\Sigma}\) such that: \[p_\mathrm{LM}(\mathbf{y}) = p_\mathrm{SM}(\mathrm{EOS} \mid \mathbf{y}) \prod_{i=1}^{T} p_\mathrm{SM}(y_i \mid y_{<i}),\] for all \(\mathbf{y} \in \Sigma^*\), with \(|\mathbf{y}| = T\).
- The model is tight if \(1 = \sum_{\mathbf{y} \in \Sigma^*} p_\mathrm{LM}(\mathbf{y})\).
- So again, a locally normalized LM isn’t necessarily a language model (it need not be tight); local normalization is about how the distribution is written, not whether it sums to 1.
- When can an LM be locally normalized?
- Answer: Every model can be locally normalized.
- Definition: Prefix probability. \(\pi(\mathbf{y}) = \sum_{\mathbf{y}' \in \Sigma^*} p_\mathrm{LM}(\mathbf{y} \mathbf{y}')\).
- This is the probability that \(\mathbf{y}\) is a prefix of a string drawn from the model, i.e., the cumulative probability of all strings that start with \(\mathbf{y}\).
- Theorem: Let \(p_\mathrm{LM}\) be a language model. Then there exists a locally normalized model such that for all \(\mathbf{y} \in \Sigma^*\) with \(|\mathbf{y}| = T\), \(p_\mathrm{LM}(\mathbf{y}) = p_\mathrm{LN}(\mathbf{y}) = p_\mathrm{SM}(\mathrm{EOS} \mid \mathbf{y}) \prod_{i=1}^{T} p_\mathrm{SM}(y_i \mid y_{<i})\).
- Sketch of proof: We define \(p_\mathrm{SM}(y \mid \mathbf{y}) = \frac{\pi(\mathbf{y} y)}{\pi(\mathbf{y})}\) and \(p_\mathrm{SM}(\mathrm{EOS} \mid \mathbf{y}) = \frac{p_\mathrm{LM}(\mathbf{y})}{\pi(\mathbf{y})}\).
- When we then expand \(p_\mathrm{LN}(\mathbf{y})\) for \(\pi(\mathbf{y}) > 0\), the product telescopes and we recover \(p_\mathrm{LM}(\mathbf{y})\) (see the sketches after this list).
- And for \(\pi(\mathbf{y}) = 0\), we have \(p_\mathrm{LN}(\mathbf{y}) = 0\), so it’s well-defined.
- LNMs are very common in natural language processing (NLP) even though they are not necessarily language models.
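Two small sketches (not from the lecture; all names are hypothetical) illustrate the points above. First, a toy locally normalized sequence model over \(\Sigma = \{a\}\) with a constant EOS probability at every step; summing the induced string probabilities shows numerically that essentially all mass sits on finite strings, i.e., the model is tight.

```python
# Sketch: a locally normalized sequence model over Sigma = {"a"} with
# p_SM(EOS | history) = 0.2 and p_SM("a" | history) = 0.8 at every step.
EOS_PROB = 0.2

def p_ln(length):
    """Probability of the string 'a' * length: pick 'a' `length` times, then EOS."""
    return (1 - EOS_PROB) ** length * EOS_PROB

total = sum(p_ln(n) for n in range(200))
print(total)   # ~= 1.0: essentially no probability mass leaks to infinite strings

# If instead p_SM(EOS | history) shrank fast with the history length (e.g.
# 2 ** -(len(history) + 2)), the total would stay strictly below 1 and the model
# would be a non-tight sequence model rather than a language model.
```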
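Second, a sketch of the local-normalization construction from the proof sketch: given a finite-support language model, compute the prefix probabilities \(\pi\) and the derived conditionals, and check that the resulting locally normalized model reproduces \(p_\mathrm{LM}\) (this assumes \(\pi(\text{history}) > 0\) for every history queried).

```python
# Sketch: locally normalizing a finite-support language model via prefix
# probabilities, following the construction in the proof sketch above.
EOS = "</s>"

p_lm = {"": 0.5, "a": 0.25, "ab": 0.25}          # toy LM over Sigma = {"a", "b"}

def prefix_prob(prefix):
    """pi(prefix): total probability of strings that start with `prefix`."""
    return sum(p for s, p in p_lm.items() if s.startswith(prefix))

def p_sm(symbol, history):
    """Derived conditional p_SM(symbol | history); assumes pi(history) > 0."""
    if symbol == EOS:
        return p_lm.get(history, 0.0) / prefix_prob(history)
    return prefix_prob(history + symbol) / prefix_prob(history)

def p_ln(y):
    """Locally normalized probability: product of conditionals, then EOS."""
    prob = 1.0
    for i, symbol in enumerate(y):
        prob *= p_sm(symbol, y[:i])
    return prob * p_sm(EOS, y)

for s in p_lm:
    assert abs(p_ln(s) - p_lm[s]) < 1e-12        # recovers the original LM
print("local normalization reproduces p_LM on its support")
```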