Part 1 - Week 5
Neural Network Language Models
Recurrent Neural Networks (RNNs)
Sequential Processing and Infinite Context
- RNNs are designed to process sequences one element at a time, maintaining a hidden state that summarizes all previous inputs.
- They can, in theory, make decisions based on an unbounded (potentially infinite) context.
Turing Completeness
- RNNs are Turing complete under certain assumptions (e.g., infinite precision, unbounded computation time), meaning they can simulate any computation a Turing machine can perform.
Tightness
- The concept of “tightness” for language models (LMs) is important for ensuring that the probability mass does not “leak” to infinite-length sequences.
- We will analyze under what conditions RNN-based LMs are tight.
Human Language is Not Context-Free
- Human languages exhibit phenomena (e.g., cross-serial dependencies in Swiss German) that cannot be captured by context-free grammars (CFGs).
- Example: In Swiss German, certain dependencies between verbs and their arguments “cross” in a way that CFGs cannot model.
- These dependencies require more expressive power than CFGs provide.
- RNNs, due to their flexible and potentially infinite state space, can model such dependencies.
RNNs as State Transition Systems
- RNNs can be viewed as systems that transition between (possibly infinitely many) hidden states.
- The hidden state in an RNN is analogous to the state in a finite-state automaton (FSA), but in language modeling, we are typically interested in the output probabilities, not the hidden state itself.
Formal Definition: Real-Valued RNN
- A real-valued RNN is a 4-tuple \((\Sigma, D, f, h_0)\):
- \(\Sigma\): Alphabet of input symbols.
- \(D\): Dimension of the hidden state vector.
- \(f: \mathbb{R}^D \times \Sigma \to \mathbb{R}^D\): Dynamics map (transition function).
- \(h_0 \in \mathbb{R}^D\): Initial hidden state.
- Rational RNNs are defined analogously, with hidden states in \(\mathbb{Q}^D\).
- In practice, RNNs use floating-point numbers, so they are rational-valued.
- Defining RNNs over \(\mathbb{R}\) is useful for theoretical analysis (e.g., differentiability for gradient-based learning).
RNNs as Encoders in Language Modeling
- RNNs are used to compute an encoding function \(\mathrm{enc}_{\mathcal{R}}\), which produces the hidden state for a given input prefix: \[ \mathbf{h}_t = \mathrm{enc}_{\mathcal{R}}(\mathbf{y}_{<t+1}) \overset{\mathrm{def}}{=} \mathbf{f}(\mathrm{enc}_{\mathcal{R}}(\mathbf{y}_{<t}), y_t) \in \mathbb{R}^D \]
- This encoding is then used in a general language modeling framework (see Chapter 3), typically with an output matrix \(E\) to define a sequence model (SM).
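To make the definition concrete, here is a minimal Python/numpy sketch of a real-valued RNN \((\Sigma, D, f, h_0)\) and the induced encoding function; the particular dynamics map (a tanh update with random matrices) is an illustrative assumption, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = ["a", "b"]          # alphabet
D = 4                       # hidden dimension
h0 = np.zeros(D)            # initial hidden state
A = rng.normal(scale=0.5, size=(D, D))
B = rng.normal(scale=0.5, size=(D, len(Sigma)))

def f(h, y):
    """Dynamics map f: R^D x Sigma -> R^D (an arbitrary tanh update here)."""
    one_hot = np.zeros(len(Sigma))
    one_hot[Sigma.index(y)] = 1.0
    return np.tanh(A @ h + B @ one_hot)

def enc(prefix):
    """enc(y_{<t+1}) = f(enc(y_{<t}), y_t), with enc of the empty prefix = h0."""
    h = h0
    for y in prefix:
        h = f(h, y)
    return h

h_t = enc(["a", "b", "b"])  # hidden state after reading the prefix "abb"
```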
One-Hot Encoding Notation
- For a symbol \(y \in \Sigma\), the one-hot encoding is: \[
[y] \overset{\mathrm{def}}{=} \mathbf{d}_{n(y)}
\]
- Here, \(n(y)\) is the index of \(y\) in the alphabet, and \(\mathbf{d}_{n(y)}\) is the corresponding basis vector.
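A tiny sketch of this notation in code (the alphabet and indexing are illustrative):

```python
import numpy as np

alphabet = ["a", "b", "c"]      # an illustrative alphabet

def one_hot(y):
    """[y] = d_{n(y)}: the basis vector at the index of y."""
    d = np.zeros(len(alphabet))
    d[alphabet.index(y)] = 1.0
    return d

print(one_hot("b"))             # [0. 1. 0.]
```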
Activation Functions
- An activation function in an RNN operates elementwise on the hidden state vector.
- Common choices: \(\tanh\), sigmoid, ReLU.
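For concreteness, the common choices applied elementwise to a hidden state (values are illustrative):

```python
import numpy as np

h = np.array([-2.0, 0.0, 3.0])      # a hidden state vector
print(np.tanh(h))                   # tanh, elementwise
print(1.0 / (1.0 + np.exp(-h)))     # sigmoid, elementwise
print(np.maximum(h, 0.0))           # ReLU, elementwise
```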
General Results on Tightness
- Translation from Chapter 3
- The tightness results for general LMs are now applied to RNNs with softmax output.
- Tightness Condition for Softmax RNNs
- A softmax RNN sequence model is tight if, for all time steps \(t\), \[
s \| h_t \|_2 \leq \log t
\]
- \(s = \max_{y \in \Sigma} \| e(y) - e(\text{EOS}) \|_2\)
- \(e(y)\) is the embedding of symbol \(y\); EOS is the end-of-sequence symbol.
- Bounded Dynamics Implies Tightness
- Theorem: RNNs with a bounded dynamics map \(f\) are tight.
- If \(|f(\mathbf{x})_d| \leq M\) for all \(d\) and \(\mathbf{x}\), then \(\|\mathbf{h}_t\|_2 \leq \sqrt{D}\, M\) for all \(t\).
- All RNNs with bounded activation functions (e.g., \(\tanh\), sigmoid) are tight.
- RNNs with unbounded activation functions (e.g., ReLU) may not be tight.
- Example: If the hidden state norm grows linearly with \(t\) (possible with ReLU), the sufficient condition fails and the model can fail to be tight.
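A small numeric sketch of the sufficient condition above, assuming made-up embeddings and a hidden-state norm bound of \(\sqrt{D}\,M\) coming from a bounded activation:

```python
import numpy as np

# Made-up symbol embeddings, including EOS.
emb = {"a": np.array([0.2, -0.1]),
       "b": np.array([-0.3, 0.4]),
       "EOS": np.array([0.0, 0.0])}
s = max(np.linalg.norm(emb[y] - emb["EOS"]) for y in ("a", "b"))

# With a bounded activation (|f(x)_d| <= M), the norm is bounded by sqrt(D) * M.
D, M = 2, 1.0
h_norm_bound = np.sqrt(D) * M

for t in [10, 100, 1000]:
    print(t, s * h_norm_bound <= np.log(t))   # True once log t exceeds s * bound
```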
Elman and Jordan Networks
Elman Network (Vanilla RNN)
- One of the earliest and simplest RNN architectures, originally used for sequence transduction.
- Dynamics Map: \[
\mathbf{h}_t = \sigma \left( \mathbf{U} \mathbf{h}_{t-1} + \mathbf{V} e(y_t) + \mathbf{b} \right)
\]
- \(\sigma\): Elementwise nonlinearity (e.g., \(\tanh\), sigmoid).
- \(e(y_t)\): Static symbol embedding (can be learned or fixed).
- \(\mathbf{U}\): Recurrence matrix (applies to previous hidden state).
- \(\mathbf{V}\): Input matrix (applies to input embedding).
- \(\mathbf{b}\): Bias vector.
- Notes:
- \(e(y)\) can be learned directly, so \(\mathbf{V}\) is theoretically superfluous, but allows flexibility (e.g., when \(e\) is fixed).
- The input embedding and output embedding matrices can be different.
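A minimal numpy sketch of the Elman dynamics map; the dimensions, random initialization, and choice of \(\tanh\) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, E_dim, vocab = 8, 8, 5                         # hidden size, embedding size, |Sigma|

U = rng.normal(scale=0.1, size=(D, D))            # recurrence matrix
V = rng.normal(scale=0.1, size=(D, E_dim))        # input matrix
b = np.zeros(D)                                   # bias
emb = rng.normal(scale=0.1, size=(vocab, E_dim))  # static symbol embeddings e(y)

def elman_step(h_prev, y):
    """h_t = sigma(U h_{t-1} + V e(y_t) + b), with sigma = tanh."""
    return np.tanh(U @ h_prev + V @ emb[y] + b)

# Encode a prefix by folding the dynamics map over the symbols.
h = np.zeros(D)                    # h_0
for y in [3, 1, 4]:                # some symbol indices
    h = elman_step(h, y)
```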
Jordan Network
- Similar to Elman, but the hidden state is computed based on the previous output (logits), not the previous hidden state.
- Dynamics Map: \[
\begin{aligned}
&\mathbf{h}_t = \sigma\left(\mathbf{U} \mathbf{r}_{t-1} + \mathbf{V} e'(y_t) + \mathbf{b}_h\right) \\
&\mathbf{r}_t = \sigma_o(\mathbf{E} \mathbf{h}_t)
\end{aligned}
\]
- \(\mathbf{r}_{t-1}\): Transformed output from previous time step.
- \(\mathbf{E}\): Output matrix.
- \(\sigma_o\): Output nonlinearity (often softmax).
- Notes:
- This architecture “feeds back” the output logits into the hidden state.
- Conditional probabilities are computed by applying softmax to the output.
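A matching sketch of the Jordan update, feeding the transformed output \(r_{t-1}\) back into the hidden state; the shapes (in particular, taking \(r\) to have dimension \(|\Sigma|\)) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
D, E_dim, vocab = 8, 8, 5
U = rng.normal(scale=0.1, size=(D, vocab))    # recurrence over the previous output r
V = rng.normal(scale=0.1, size=(D, E_dim))
b_h = np.zeros(D)
E = rng.normal(scale=0.1, size=(vocab, D))    # output matrix
emb = rng.normal(scale=0.1, size=(vocab, E_dim))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def jordan_step(r_prev, y):
    h = np.tanh(U @ r_prev + V @ emb[y] + b_h)   # hidden state from previous output
    r = softmax(E @ h)                           # output fed back at the next step
    return h, r

r = np.zeros(vocab)   # r_0
for y in [2, 0, 3]:
    h, r = jordan_step(r, y)
```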
Matrix Naming Conventions
- \(\mathbf{U}\): Recurrence matrix (applies to previous hidden state or output).
- \(\mathbf{V}\): Input matrix (applies to input embedding).
- \(\mathbf{E}\): Output matrix (applies to hidden state to produce logits).
Variations on RNNs
Theoretical Expressiveness vs. Practical Issues
- In theory (under the idealized assumptions discussed later: infinite precision and unbounded computation time), simple RNNs (Elman/Jordan) are sufficient to model any computable language.
- In practice, they suffer from:
- Difficulty learning long-term dependencies.
- Vanishing and exploding gradients.
- Limited interaction between input and hidden state (linear, summed transformation at each step).
Gating Mechanisms
- Gating allows the network to control, for each dimension of the hidden state, how much to retain, update, or forget, based on the previous hidden state and current input.
- Gates:
- Real-valued vectors in \([0, 1]^D\).
- Computed by gating functions (typically using sigmoid activations).
- Act as “soft switches” for each hidden state component.
- If a gate value is close to 0, that dimension is “forgotten” (zeroed out after elementwise multiplication).
- Benefits:
- Helps mitigate vanishing gradient problem by allowing gradients to flow more easily.
- Enables selective memory and forgetting.
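A toy illustration of a gate acting as a soft switch on each dimension (the values are made up):

```python
import numpy as np

h = np.array([2.0, -1.5, 0.3, 4.0])        # some hidden-state values
gate = np.array([0.99, 0.01, 0.5, 0.0])    # a sigmoid output in [0, 1]^D
print(gate * h)                            # [ 1.98  -0.015  0.15   0.   ]
```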
Most Common Gated RNNs: LSTM and GRU
LSTM (Long Short-Term Memory) [Hochreiter & Schmidhuber, 1997]
- Maintains an additional memory cell \(c_t\).
- Uses input, forget, and output gates to control information flow.
- Equations: \[ \begin{aligned} &i_t = \sigma(U^i h_{t-1} + V^i e'(y_t) + b^i) \\ &f_t = \sigma(U^f h_{t-1} + V^f e'(y_t) + b^f) \\ &o_t = \sigma(U^o h_{t-1} + V^o e'(y_t) + b^o) \\ &g_t = \tanh(U^g h_{t-1} + V^g e'(y_t) + b^g) \\ &c_t = f_t \odot c_{t-1} + i_t \odot g_t \\ &h_t = o_t \odot \tanh(c_t) \end{aligned} \]
- \(i_t\): Input gate, \(f_t\): Forget gate, \(o_t\): Output gate, \(g_t\): Candidate vector, \(c_t\): Memory cell, \(h_t\): Hidden state.
- LSTMs have significantly more parameters than vanilla RNNs.
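A minimal numpy sketch of one LSTM step following the equations above; the parameter shapes and helper names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
D, E_dim = 8, 8

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One (U, V, b) triple per gate / candidate: i, f, o, g.
params = {k: (rng.normal(scale=0.1, size=(D, D)),
              rng.normal(scale=0.1, size=(D, E_dim)),
              np.zeros(D)) for k in "ifog"}

def lstm_step(h_prev, c_prev, x):
    """x is the input embedding e'(y_t); returns (h_t, c_t)."""
    def lin(k):
        U, V, b = params[k]
        return U @ h_prev + V @ x + b
    i = sigmoid(lin("i"))            # input gate
    f = sigmoid(lin("f"))            # forget gate
    o = sigmoid(lin("o"))            # output gate
    g = np.tanh(lin("g"))            # candidate vector
    c = f * c_prev + i * g           # memory cell
    h = o * np.tanh(c)               # hidden state
    return h, c

h, c = lstm_step(np.zeros(D), np.zeros(D), rng.normal(size=E_dim))
```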
GRU (Gated Recurrent Unit)
- Simpler than LSTM, combines input and forget gates into an update gate.
- No separate memory cell.
- Equations: \[ \begin{aligned} &r_t = \sigma(U^r h_{t-1} + V^r e'(y_t) + b^r) \quad \text{(reset gate)} \\ &z_t = \sigma(U^z h_{t-1} + V^z e'(y_t) + b^z) \quad \text{(update gate)} \\ &g_t = \tanh(U^g (r_t \odot h_{t-1}) + V^g e'(y_t) + b^g) \quad \text{(candidate vector)} \\ &h_t = (1 - z_t) \odot g_t + z_t \odot h_{t-1} \end{aligned} \]
- The update gate \(z_t\) determines the mixing proportion between the candidate and previous hidden state.
- The reset gate \(r_t\) controls how much of the previous hidden state to forget.
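And the corresponding sketch of one GRU step (same illustrative conventions):

```python
import numpy as np

rng = np.random.default_rng(3)
D, E_dim = 8, 8

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

params = {k: (rng.normal(scale=0.1, size=(D, D)),
              rng.normal(scale=0.1, size=(D, E_dim)),
              np.zeros(D)) for k in "rzg"}

def gru_step(h_prev, x):
    def lin(k, h):
        U, V, b = params[k]
        return U @ h + V @ x + b
    r = sigmoid(lin("r", h_prev))                # reset gate
    z = sigmoid(lin("z", h_prev))                # update gate
    g = np.tanh(lin("g", r * h_prev))            # candidate (reset applied to h)
    return (1 - z) * g + z * h_prev              # mix candidate and previous state

h = gru_step(np.zeros(D), rng.normal(size=E_dim))
```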
Parallelizability: The Achilles’ Heel of RNNs
- Inherent Sequentiality
- RNNs process input strictly sequentially: to compute \(h_t\), all previous \(h_{t-1}, h_{t-2}, \ldots, h_0\) must be computed first.
- This serial dependency leads to:
- Slow training times, especially for long sequences.
- Inability to parallelize computation across time steps during training.
- For prediction, to compute the next token given a context, the RNN must process the entire context sequentially.
- By contrast, transformer architectures can parallelize across sequence positions during training (but not during generation, which is inherently sequential for all locally normalized LMs).
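The serial dependency in code: each \(h_t\) needs \(h_{t-1}\), so the loop over time cannot be parallelized across positions (sketch only; `f` stands for any dynamics map):

```python
def rnn_scan(f, h0, symbols):
    """Fold the dynamics map over the input; step t must wait for step t-1."""
    h, states = h0, []
    for y in symbols:
        h = f(h, y)
        states.append(h)
    return states

# e.g. with a toy scalar dynamics map:
print(rnn_scan(lambda h, y: 0.5 * h + y, 0.0, [1, 2, 3]))  # [1.0, 2.5, 4.25]
```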
Representational Capacity of Recurrent Neural Networks
RNNs and Weighted Regular Languages
Computational Expressivity of Elman RNNs:
- Elman RNNs, also known as vanilla RNNs, are the simplest recurrent architecture and serve as a baseline for understanding RNN capabilities due to their core recurrent update mechanism.
- In practical settings with finite precision arithmetic (like floating-point numbers on computers) and real-time computation (constant number of operations per input symbol), Elman RNNs are equivalent to weighted finite-state automata, meaning they can only recognize regular languages and their weighted variants.
- This limitation arises because finite precision leads to a finite number of possible hidden states, effectively making the RNN behave like a large but finite automaton.
- Key detail: Despite being finite-state, Elman RNNs can represent automata with an exponential number of states relative to the hidden dimension (e.g., \(2^D\) possible binary configurations), providing a compact, learnable parameterization of potentially huge finite-state models.
What Language Models Can Heaviside RNNs Represent?:
- Heaviside RNNs (HRNNs) are Elman RNNs using the Heaviside step function as activation: it outputs 1 if the input is positive and 0 otherwise, creating binary hidden states.
- HRNNs exactly represent the weighted regular languages defined by deterministic probabilistic finite-state automata (PFSAs), which are locally normalized weighted finite-state automata.
- In other words, any distribution an HRNN computes is regular (can be modeled by a PFSA), and vice versa, any deterministic PFSA can be simulated by an HRNN.
- Limitation: HRNNs cannot represent non-regular languages, such as context-free ones (e.g., balanced parentheses or \(\{a^n b^n | n \in \mathbb{N}\}\)), because their binary states and deterministic transitions keep them within the regular class.
- Practical relevance: Models finite-precision RNNs theoretically; all real-world RNNs (with finite bits) are ultimately regular, though with vast state spaces.
- Proof intuition: From HRNN to PFSA by enumerating finite binary hidden states; from PFSA to HRNN via encoding states into hidden vectors (detailed below).
Intuition Behind the Minsky Construction, Space Complexity, and Implications of Determinism:
- Intuition of Minsky’s Construction:
- To simulate a deterministic PFSA with states \(Q\) and alphabet \(\Sigma\), represent the current state-symbol pair \((q_t, y_t)\) as a one-hot vector in a hidden space of size \(|Q| \cdot |\Sigma|\).
- The recurrence matrix \(U\) has one column per (state, symbol) pair; the column for \((q, y)\) encodes the out-neighborhood of \(q\) (all pairs \((q', y')\) with a transition \(q \xrightarrow{y'} q'\)), repeated for every incoming symbol \(y\), since the symbol used to enter \(q\) does not affect future transitions.
- The input matrix \(V\) encodes, for each symbol \(y\), all states reachable by \(y\) from any state.
- The bias vector of \(-1\)s turns the Heaviside activation into an element-wise conjunction (AND): adding \(U h_t\) (out-neighborhood mask) and \(V [y_{t+1}]\) (reachable-by-\(y_{t+1}\) mask), only units active in both masks survive the threshold, and by determinism exactly one such unit exists: the next state-symbol pair.
- Output matrix \(E\) encodes transition probabilities: For softmax, use log weights (with \(\log(0) = -\infty\) to exclude impossible transitions); projection yields the same local distributions as the PFSA.
- Overall: Each RNN step simulates one PFSA transition by “looking up” and conjoining possibilities, ensuring the hidden state tracks the PFSA state.
- Space Complexity:
- Hidden dimension \(D = |Q| \cdot |\Sigma|\), linear in both the number of states and alphabet size.
- This is compact compared to explicitly storing a huge automaton graph, as parameters (\(U\), \(V\), \(E\)) are shared and learnable via gradients.
- More efficient constructions exist for unweighted cases: Dewdney achieves \(O(|\Sigma| \cdot |Q|^{3/4})\) using square-root encodings and matrix decompositions; Indyk achieves optimal \(O(|\Sigma| \cdot \sqrt{|Q|})\) for binary alphabets using four-hot encodings and non-decreasing decompositions.
- For weighted (probabilistic) cases, linearity in \(|Q|\) is necessary because independent state distributions require the output matrix to span \(\mathbb{R}^{|Q|}\).
- Alphabet-wise: Generally linear in \(|\Sigma|\) (dense encodings fail for overlapping transitions); logarithmic possible for “log-separable” automata (at most one transition per state pair), achievable via a separation algorithm that expands states to \(O(|Q| \cdot |\Sigma|)\) but allows overall compression to \(O(\log |\Sigma| \cdot \sqrt{|\Sigma| \cdot |Q|})\).
- Implications of Determinism in RNNs:
- RNNs are inherently deterministic: Each input sequence follows a single hidden state path, matching deterministic PFSAs.
- Cannot directly represent non-deterministic PFSAs, which are more expressive and can be exponentially more compact (non-determinism allows ambiguity resolved by weights).
- To simulate non-deterministic ones, must determinize first, potentially causing exponential state explosion (e.g., some PFSAs have probabilities not expressible as single terms without blowup).
- Key detail: An example non-determinizable PFSA assigns string probabilities of the form \(0.5 \cdot 0.9^n \cdot 0.1 + 0.5 \cdot 0.1^n \cdot 0.9\); reproducing this sum of two exponentials requires an initial non-deterministic choice that a deterministic PFSA cannot replicate efficiently (it would have to commit to one of the two geometric rates up front).
- Takeaway: Determinism restricts RNNs to deterministic regular languages; non-determinism in classical models highlights RNN limitations under finite constraints.
Example: Deterministic PFSA Encoded via Minsky Construction
We define a small deterministic PFSA:
- Alphabet: \(\Sigma = \{a, b\}\)
- States: \(Q = \{q_0, q_1\}\)
- Start state: \(q_0\)
- Transitions:
| Current state | Symbol | Next state | Probability |
|---|---|---|---|
| \(q_0\) | a | \(q_0\) | 0.7 |
| \(q_0\) | b | \(q_1\) | 0.3 |
| \(q_1\) | a | \(q_0\) | 1.0 |
| \(q_1\) | b | \(q_1\) | 0.0 |
Minsky Construction Parameters:
Hidden dimension:
\[ D = |Q| \times |\Sigma| = 2 \times 2 = 4 \]
Hidden units correspond to \((\text{state}, \text{symbol})\) pairs, i.e., the current state together with the symbol just read:
- \((q_0, a)\)
- \((q_0, b)\)
- \((q_1, a)\)
- \((q_1, b)\)
U (out-neighborhoods, \(4 \times 4\)): the column for pair \((q, y)\) marks the out-neighborhood of \(q\). Every transition in the table is included, so both states share the out-neighborhood \(\{(q_0, a), (q_1, b)\}\) and all four columns coincide:
\[ U = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix} \]
V (reachable-by-symbol masks, \(4 \times 2\)): the column for symbol \(y\) marks the pairs \((q', y)\) such that some transition enters \(q'\) by reading \(y\) (only \(q_0\) can be entered by \(a\), only \(q_1\) by \(b\)):
\[ V = \begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 1 \end{bmatrix} \]
Bias:
\[ b = (-1, -1, -1, -1)^\top \]
Activation: Heaviside step function, \(H(x) = 1\) if \(x > 0\), else \(0\).
Output matrix \(E\) (\(4 \times 2\); rows correspond to hidden units, columns to the symbols \(a\) and \(b\); logits are computed as \(E^\top h_t\)):
\[ E = \begin{bmatrix} \log 0.7 & \log 0.3 \\ \log 0.7 & \log 0.3 \\ 0 & -\infty \\ 0 & -\infty \end{bmatrix} \]
Rows 1–2 correspond to \(q_0\), rows 3–4 to \(q_1\).
One Update Step: From \((q_0, a)\) with next symbol \(b\)
Initial hidden state \(h_t\) = one-hot for \((q_0, a)\):
\[ h_t = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \]
Next input symbol: \(b\) → one-hot \([y_{t+1}] = (0, 1)^\top\)
Out-neighborhood mask:
\[ U h_t = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \end{bmatrix} \]
From \(q_0\), the automaton can move to \((q_0, a)\) [unit 1] by reading \(a\), or to \((q_1, b)\) [unit 4] by reading \(b\).
Reachable-by-symbol mask for \(b\):
\[ V[b] = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \]
Only unit \((q_1, b)\) is active: \(q_1\) is the only state that can be entered by reading \(b\).
Add the masks:
\[ U h_t + V[b] = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 2 \end{bmatrix} \]
Add the bias \(\mathbf{b} = -\mathbf{1}\) (i.e., subtract 1):
\[ U h_t + V[b] + \mathbf{b} = \begin{bmatrix} 0 \\ -1 \\ -1 \\ 1 \end{bmatrix} \]
Apply the Heaviside function (1 where the entry is positive): \[ h_{t+1} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \] New hidden state = \((q_1, b)\), matching the transition \(q_0 \xrightarrow{b} q_1\).
One Sampling Step from \((q_1, b)\)
Compute logits:
\[ E^\top h_{t+1} = \text{row 4 of } E = (0, -\infty) \]
Apply softmax:
\[ P(a) = 1, \quad P(b) = 0 \]
Sample according to this distribution:
- With probability \(1\) → emit \(a\) (the \(b\)-loop at \(q_1\) carries probability \(0\), so it is never sampled)
(Had the update landed in a \(q_0\) unit instead, the logits would be \((\log 0.7, \log 0.3)\), giving \(P(a) = 0.7\) and \(P(b) = 0.3\).)
The emitted symbol becomes the next input for the following update step.
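The construction can also be checked mechanically. The following self-contained numpy sketch builds \(U\), \(V\), \(\mathbf{b}\), and \(E\) from the transition table above and reproduces the update and sampling steps; all variable names (UNITS, DELTA, step, ...) are illustrative choices:

```python
import numpy as np

# Hidden units are (state, symbol) pairs, in this order:
UNITS = [("q0", "a"), ("q0", "b"), ("q1", "a"), ("q1", "b")]
SYMS = ["a", "b"]

# Deterministic PFSA transitions: (state, symbol) -> (next state, probability)
DELTA = {
    ("q0", "a"): ("q0", 0.7),
    ("q0", "b"): ("q1", 0.3),
    ("q1", "a"): ("q0", 1.0),
    ("q1", "b"): ("q1", 0.0),
}

D = len(UNITS)  # |Q| * |Sigma| = 4

# U: the column for unit (q, y) marks the out-neighborhood of q,
# i.e. all units (delta(q, y'), y') -- the same for every incoming symbol y.
U = np.zeros((D, D))
for j, (q, _) in enumerate(UNITS):
    for y in SYMS:
        q_next, _ = DELTA[(q, y)]
        U[UNITS.index((q_next, y)), j] = 1.0

# V: the column for symbol y marks all units (q', y) with an incoming y-transition.
V = np.zeros((D, len(SYMS)))
for (q, y), (q_next, _) in DELTA.items():
    V[UNITS.index((q_next, y)), SYMS.index(y)] = 1.0

b = -np.ones(D)  # the -1 bias turns the Heaviside into an elementwise AND

# E: the row for unit (q, y) holds log-probabilities of the next symbol given q.
E = np.full((D, len(SYMS)), -np.inf)
for i, (q, _) in enumerate(UNITS):
    for y in SYMS:
        _, p = DELTA[(q, y)]
        if p > 0:
            E[i, SYMS.index(y)] = np.log(p)

def heaviside(x):
    return (x > 0).astype(float)   # H(x) = 1 iff x > 0

def step(h, y):
    """One RNN update: read symbol y from the configuration encoded in h."""
    one_hot_y = np.zeros(len(SYMS))
    one_hot_y[SYMS.index(y)] = 1.0
    return heaviside(U @ h + V @ one_hot_y + b)

def next_symbol_probs(h):
    """Softmax of the logits E^T h; h is one-hot, so this is just a row of E
    (indexing directly avoids 0 * (-inf) in floating point)."""
    logits = E[int(h.argmax())]
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Reproduce the walkthrough: start in unit (q0, a) and read b.
h = np.zeros(D)
h[UNITS.index(("q0", "a"))] = 1.0
h = step(h, "b")
print(h)                     # [0. 0. 0. 1.]  -> unit (q1, b)
print(next_symbol_probs(h))  # [1. 0.]        -> emit a with probability 1
```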
Turing Completeness of RNNs
What Needs to Change to Get Turing Completeness?:
- Start from Heaviside RNNs (regular expressivity) and make these changes:
- Activation: Switch to saturated sigmoid, which clips to [0,1] but preserves fractional values in (0,1) for encoding unbounded data.
- Precision: Allow infinite (arbitrary) precision in weights and computations, enabling representation of infinite sequences in fixed dimensions via decimals.
- Time: Permit unbounded computation time per input symbol (non-real-time), allowing multi-step updates to simulate complex operations like stack manipulations.
- These relaxations enable RNNs to simulate unbounded memory structures (e.g., stacks) and thus pushdown automata or Turing machines.
- Key detail: Infinite precision stores entire histories as infinite decimals (e.g., stack 00110111 as 0.33313311); unbounded time allows dividing/multiplying fractions without fixed steps.
Core Result: Saturated Sigmoid Elman RNNs are Turing Complete:
- They can simulate any Turing machine by encoding two-stack pushdown automata (known to be Turing equivalent).
- Single-Stack Simulation (Recognizes Deterministic Context-Free Languages):
- Stack encoding: Represent the stack contents as a base-10 fraction (symbol 0 as digit 1, symbol 1 as digit 3); push divides by 10 and adds the new digit, pop multiplies by 10 and subtracts the top digit (see the sketch after this list).
- Hidden state components (21 dimensions): Data (stack + two buffers for temp copies), top flags (empty, top=0, top=1), configs (top + input symbol combos), computations (try all actions: push0/1, pop0/1, no-op), acceptance flag.
- Update in four phases per symbol (unbounded time): (1) Detect top symbol, (2) combine with input to flag config, (3) compute all actions but zero invalid ones (based on PDA transitions), (4) copy valid result to stack.
- Details: Use matrix multiplications for copying/erasing; saturated sigmoid for detection (e.g., AND via sum-threshold, conditional reset).
- Example: Simulates PDA on “ab” with phase-by-phase hidden state tables, showing stack updates from empty to 0.1 then back to empty.
- Two-Stack Extension: Duplicate data/computation components for second stack; updates handle both independently, simulating two-stack PDA.
- Takeaway: RNNs jump from regular to Turing complete with relaxations, encoding infinite memory in finite dimensions via precision.
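A sketch of the stack-as-a-number encoding referenced above, using fractions.Fraction to stand in for infinite precision (symbol 0 as digit 1, symbol 1 as digit 3, top of stack in the first decimal place):

```python
from fractions import Fraction

DIGIT = {"0": 1, "1": 3}
SYMBOL = {1: "0", 3: "1"}

def push(stack, symbol):
    # shift everything one decimal place right and write the new top digit
    return stack / 10 + Fraction(DIGIT[symbol], 10)

def pop(stack):
    top_digit = int(stack * 10)          # first decimal digit
    return SYMBOL[top_digit], stack * 10 - top_digit

# Encode the stack 00110111 (pushed bottom-to-top, left to right):
s = Fraction(0)
for symbol in "00110111":
    s = push(s, symbol)
print(float(s))        # 0.33313311, as in the example above

top, s = pop(s)
print(top, float(s))   # '1' 0.3313311
```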
Is a Real-Time RNN Turing Complete?:
- No: Real-time RNNs (fixed constant operations per symbol) are limited to regular languages, equivalent to finite-state automata.
- Reason: Cannot perform the unbounded computations needed to manipulate infinite-precision encodings (e.g., arbitrary stack ops) within constant time; hidden states remain effectively finite.
- Key detail: Practical RNNs are real-time, hence regular; Turing completeness strictly requires non-real-time multi-phase updates.
- Takeaway: Time bound is crucial; without it, RNNs are universal, but real-world constraints keep them finite-state.
Computational Power of RNN Variants
- LSTMs: Empirically learn to count (e.g., they can recognize \(a^n b^n c^n\)); they behave like counter machines, cutting across the Chomsky hierarchy (handling some context-free and even context-sensitive languages, but not all context-free ones, e.g., Dyck languages); see the toy counter sketch after this list.
- Detail: The memory cell acts as a counter, making LSTMs more expressive than vanilla RNNs and GRUs.
- GRUs and Elman: Rationally recurrent (hidden updates computable by finite-state machines); limited to regular languages.
- LSTMs: Not rationally recurrent, allowing greater expressivity (e.g., counting beyond regular).
- Takeaway: Gating in LSTMs enables mechanisms like counting, surpassing vanilla RNNs but still below full Turing under practical constraints.
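To make the counting intuition concrete, here is a hand-coded counter (not a trained LSTM) showing why a single counter dimension suffices for the simpler language \(a^n b^n\):

```python
def accepts_anbn(string):
    """Accept a^n b^n: count up on 'a', down on 'b', accept iff the count
    returns to zero and the string has the a-block-then-b-block shape."""
    count, seen_b = 0, False
    for ch in string:
        if ch == "a":
            if seen_b:          # an 'a' after a 'b' breaks the a^n b^n shape
                return False
            count += 1          # the role a memory cell plays: c_t = c_{t-1} + 1
        elif ch == "b":
            seen_b = True
            count -= 1          # c_t = c_{t-1} - 1
            if count < 0:
                return False
        else:
            return False
    return count == 0

print(accepts_anbn("aaabbb"), accepts_anbn("aabbb"))  # True False
```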
Consequences of Turing Completeness
- Turing completeness implies undecidability of key properties: No algorithms exist to determine tightness (whether the probability mass on finite strings sums to 1), find the highest-probability string, check equivalence of two RNN LMs, or minimize the hidden size while preserving the language model.
- Reason: Reductions from the halting problem (undecidable: Does a Turing machine halt on input?).
- Details: Construct RNNs simulating arbitrary TMs; property holds iff TM halts (e.g., tightness if simulation ends, else mass leaks to infinity).
- Takeaway: While theoretically powerful, RNNs inherit TM undecidability, making tasks like verification or optimization impossible in general.