Week 1
Neural Network Basics
Biological Motivation
The architecture of artificial neural networks is loosely inspired by biological neural systems.
- Signal Reception and Integration: In biological neurons, dendrites receive chemical signals from other neurons. These signals are integrated in the cell body (soma).
- Thresholding and Activation: If the integrated signal surpasses a certain threshold, the neuron “fires,” generating an action potential. This thresholding mechanism introduces non-linearity into the system, as the neuron only transmits a signal if the input excitation is sufficient.
- Signal Transmission: The action potential travels down the axon to the axon terminals.
- Synaptic Transmission: At the axon terminals, neurotransmitters are released into the synapse (the junction between two neurons), which are then received by the dendrites of a subsequent neuron, continuing the signal propagation.
Perceptron
The Perceptron, introduced by Frank Rosenblatt in 1958, is one of the earliest and simplest types of artificial neural networks, functioning as a linear classifier.
- Definition: For an input vector \(x\), the Perceptron computes an output \(y\) based on weights \(w\) and a bias \(b\). The output is determined by whether the weighted sum \(w^T x + b\) exceeds a threshold (often 0): \[ y = \begin{cases} 1 & \text{if } w^T x + b > 0 \\ 0 & \text{otherwise} \end{cases} \] The decision boundary \(w^T x + b = 0\) is thus a hyperplane in input space.
- Training:
- The Perceptron is trained with the Perceptron learning rule: for each misclassified sample \((x_i, y_i)\), the weights are updated as \(w \leftarrow w + \eta\,(y_i - \hat{y}_i)\,x_i\) (with learning rate \(\eta\), and analogously for \(b\)). This rule can be interpreted as stochastic gradient descent on the perceptron criterion, even though the step activation itself is not differentiable.
- Early Perceptron models used a step (Heaviside) activation function, producing binary outputs (e.g., 0/1 or -1/+1), so weights are updated only on misclassifications. A minimal implementation is sketched after this list.
- Capabilities and Limitations:
- If a dataset is linearly separable, the Perceptron convergence theorem guarantees that the Perceptron learning algorithm will find a separating hyperplane in a finite number of steps.
- However, the Perceptron does not necessarily find an optimal separating hyperplane (e.g., one that maximizes the margin, as Support Vector Machines do). It simply finds a solution.
- If the data is not linearly separable, the basic Perceptron algorithm may not converge.
- Modern Context: Modern neural networks build upon these foundational concepts by employing multiple layers of neurons (forming “deep” networks) and using smooth, differentiable activation functions instead of the Heaviside step function. This allows for the learning of more complex, non-linear decision boundaries and the use of gradient-based optimization, with the gradients computed efficiently via backpropagation.
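To make the learning rule concrete, here is a minimal NumPy sketch of the Perceptron; the function name, hyperparameters, and toy dataset are illustrative choices, not part of the lecture:

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=100):
    """Perceptron learning rule: update weights only on misclassified samples.

    X: (N, d) inputs, y: (N,) labels in {0, 1}.
    Returns weights w (d,) and bias b.
    """
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if w @ xi + b > 0 else 0   # Heaviside step activation
            if y_hat != yi:                       # update only on mistakes
                w += lr * (yi - y_hat) * xi
                b += lr * (yi - y_hat)
                errors += 1
        if errors == 0:                           # converged: all samples separated
            break
    return w, b

# Toy linearly separable data: logical AND (illustrative)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(w, b)
```

Because AND is linearly separable, the convergence theorem guarantees the inner loop eventually makes zero errors.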
Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation (MLE) is a fundamental statistical method used to estimate the parameters \(\theta\) of a model. In the context of supervised learning, we aim to model the conditional probability \(p_{\text{model}}(y | X, \theta)\), where \(y\) represents the target variable(s) and \(X\) represents the input features.
Likelihood Function: Assuming independent and identically distributed (i.i.d.) data samples \((x_i, y_i)\), the likelihood of observing the entire dataset \(\mathcal{D} = \{(x_1, y_1), \dots, (x_N, y_N)\}\) is given by: \[ L(\theta | X, Y) = P(Y | X, \theta) = \prod_{i=1}^{N} p_{\text{model}}(y_i | x_i, \theta) \]
Log-Likelihood: It is often more convenient to work with the log-likelihood, as it converts products into sums and does not change the location of the maximum: \[ \mathcal{L}(\theta | X, Y) = \log L(\theta | X, Y) = \sum_{i=1}^{N} \log p_{\text{model}}(y_i | x_i, \theta) \] In machine learning, the negative log-likelihood (NLL) is commonly used as a loss function to be minimized: \[ \text{Loss}(\theta) = - \mathcal{L}(\theta | X, Y) = - \sum_{i=1}^{N} \log p_{\text{model}}(y_i | x_i, \theta) \]
Gradient for Optimization: To find the parameters \(\theta\) that maximize the likelihood (or minimize the NLL), we typically take the derivative of the log-likelihood with respect to \(\theta\), set it to zero, and solve. For complex models, iterative optimization methods like gradient descent are used:
\[ \nabla_{\theta} \mathcal{L}(\theta | X, Y) = \sum_{i=1}^{N} \nabla_{\theta} \log p_{\text{model}}(y_i | x_i, \theta) \]
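As a standard worked example of the “set the gradient to zero and solve” recipe (not from the original notes): for a Gaussian model of \(y\) with unknown mean \(\mu\) and fixed variance \(\sigma^2\), the MLE recovers the sample mean:

\[ \frac{\partial \mathcal{L}}{\partial \mu} = \sum_{i=1}^{N} \frac{y_i - \mu}{\sigma^2} = 0 \quad \Longrightarrow \quad \hat{\mu}_{MLE} = \frac{1}{N} \sum_{i=1}^{N} y_i \]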
- Gradient of Sigmoid Activation: For a neuron with a sigmoid activation function \(\sigma(z) = \frac{1}{1 + e^{-z}}\), where \(z = w^T x + b\), its derivative with respect to \(z\) is: \[ \frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z)) \]
- Gradient of Softmax Activation: For a softmax output layer \(S(z)_j = \frac{e^{z_j}}{\sum_k e^{z_k}}\), the derivative of the \(j\)-th softmax output with respect to the \(k\)-th input \(z_k\) (before softmax) is: \[ \frac{\partial S_j}{\partial z_k} = S_j (\delta_{jk} - S_k) \] where \(\delta_{jk}\) is the Kronecker delta (1 if \(j=k\), 0 otherwise). When combined with the cross-entropy loss \(L = -\sum_j y_j \log S_j\) for a one-hot target \(y\), this derivative simplifies to \(\frac{\partial L}{\partial z_k} = S_k - y_k\); both results are checked numerically below.
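These derivatives are easy to sanity-check against central finite differences. A small illustrative NumPy sketch (the test point and target are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()

z = np.array([0.5, -1.2, 2.0])
y = np.array([0.0, 1.0, 0.0])          # one-hot target
eps = 1e-6

# Sigmoid: analytic derivative sigma(z)(1 - sigma(z)) vs finite difference
analytic = sigmoid(z) * (1 - sigmoid(z))
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(analytic, numeric))          # True

# Softmax + cross-entropy: gradient w.r.t. the logits simplifies to S - y
def ce_loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

grad_analytic = softmax(z) - y
grad_numeric = np.array([
    (ce_loss(z + eps * np.eye(3)[k], y) - ce_loss(z - eps * np.eye(3)[k], y)) / (2 * eps)
    for k in range(3)
])
print(np.allclose(grad_analytic, grad_numeric))  # True
```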
MLE and Common Loss Functions:
- Gaussian Likelihood and Mean Squared Error (MSE): If we assume that the target variable \(y\) given input \(x\) follows a Gaussian distribution with mean \(f(x, \theta)\) (the model’s prediction) and constant variance \(\sigma^2\): \[ p_{\text{model}}(y_i | x_i, \theta) = \mathcal{N}(y_i | f(x_i, \theta), \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - f(x_i, \theta))^2}{2\sigma^2}\right) \] The negative log-likelihood for one sample is: \[ -\log p_{\text{model}}(y_i | x_i, \theta) = \frac{1}{2}\log(2\pi\sigma^2) + \frac{(y_i - f(x_i, \theta))^2}{2\sigma^2} \] Minimizing the sum over all samples, and ignoring constants, is equivalent to minimizing the Mean Squared Error (MSE): \[ \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i, \theta))^2 \]
- Laplace Likelihood and Mean Absolute Error (MAE): If we assume \(y\) follows a Laplace distribution with mean \(f(x, \theta)\) and scale parameter \(b\): \[ p_{\text{model}}(y_i | x_i, \theta) = \text{Laplace}(y_i | f(x_i, \theta), b) = \frac{1}{2b} \exp\left(-\frac{|y_i - f(x_i, \theta)|}{b}\right) \] The negative log-likelihood for one sample is: \[ -\log p_{\text{model}}(y_i | x_i, \theta) = \log(2b) + \frac{|y_i - f(x_i, \theta)|}{b} \] Minimizing the sum over all samples, and ignoring constants, is equivalent to minimizing the Mean Absolute Error (MAE): \[ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - f(x_i, \theta)| \]
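The “equivalent up to constants” claim can be made concrete numerically: for fixed scale parameters, the Gaussian NLL is an affine function of the MSE and the Laplace NLL is an affine function of the MAE. A short illustrative sketch with made-up data (all names and values are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=100)              # targets (made-up data)
f = y + rng.normal(size=100)          # model predictions (made-up)
sigma, b = 1.5, 0.7                   # fixed scale parameters

# Gaussian NLL per sample: 0.5*log(2*pi*sigma^2) + (y - f)^2 / (2*sigma^2)
gauss_nll = np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (y - f)**2 / (2 * sigma**2))
mse = np.mean((y - f)**2)
print(np.isclose(gauss_nll, 0.5 * np.log(2 * np.pi * sigma**2) + mse / (2 * sigma**2)))  # True

# Laplace NLL per sample: log(2b) + |y - f| / b
laplace_nll = np.mean(np.log(2 * b) + np.abs(y - f) / b)
mae = np.mean(np.abs(y - f))
print(np.isclose(laplace_nll, np.log(2 * b) + mae / b))  # True
```

Since the affine map has a positive slope, minimizing the NLL and minimizing the MSE (resp. MAE) yield the same \(\theta\).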
MLE and Regularization (via Maximum A Posteriori - MAP):
- If we introduce a Gaussian prior distribution over the model weights \(\theta \sim \mathcal{N}(0, \alpha^2 I)\) and perform Maximum A Posteriori (MAP) estimation, this is equivalent to minimizing the NLL plus an \(L_2\) regularization term (Ridge regression): \[ \hat{\theta}_{MAP} = \arg\max_{\theta} \left( \sum_{i=1}^{N} \log p_{\text{model}}(y_i | x_i, \theta) + \log p(\theta) \right) \] This leads to a loss function like: \(\text{NLL} + \lambda ||\theta||_2^2\).
- If we introduce a Laplace prior over the model weights \(\theta \sim \text{Laplace}(0, \beta)\) and perform MAP estimation, this is equivalent to minimizing the NLL plus an \(L_1\) regularization term (Lasso regression): This leads to a loss function like: \(\text{NLL} + \lambda ||\theta||_1\).
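Filling in the intermediate step for the Gaussian-prior case (a standard result): with \(p(\theta) = \mathcal{N}(\theta \mid 0, \alpha^2 I)\), the log-prior is \(\log p(\theta) = -\frac{||\theta||_2^2}{2\alpha^2} + \text{const}\), so MAP estimation minimizes

\[ \text{NLL}(\theta) + \frac{1}{2\alpha^2} ||\theta||_2^2, \qquad \text{i.e. } \lambda = \frac{1}{2\alpha^2}. \]

Analogously, the Laplace prior has \(\log p(\theta) = -\frac{||\theta||_1}{\beta} + \text{const}\), giving \(\lambda = \frac{1}{\beta}\). In both cases, a tighter prior (smaller \(\alpha\) or \(\beta\)) means stronger regularization.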
Properties of MLE:
- Consistency: MLE is consistent, meaning that as the number of samples \(N\) approaches infinity, the MLE \(\hat{\theta}_{MLE}\) converges in probability to the true parameter value \(\theta_0\).
- Asymptotic Efficiency: MLE is asymptotically efficient. It achieves the Cramér-Rao lower bound, meaning it has the lowest possible variance among all asymptotically unbiased estimators as \(N \to \infty\).
- Asymptotic Normality: Under certain regularity conditions, \(\sqrt{N}(\hat{\theta}_{MLE} - \theta_0)\) converges in distribution to \(\mathcal{N}(0, I(\theta_0)^{-1})\) as \(N \to \infty\), where \(I(\theta_0)\) is the Fisher information matrix.
Backpropagation
Backpropagation is the cornerstone algorithm for training multi-layer neural networks. It efficiently computes the gradient of the loss function with respect to the network’s weights by applying the chain rule of calculus.
Chain Rule: For a composite function, e.g., \(L(y, \hat{y})\) where \(\hat{y} = f(z)\) and \(z = g(W, x)\), the gradient of \(L\) with respect to weights \(W\) in an earlier layer is computed by propagating gradients backward from the output layer. Writing \(y^{(l)}\) for the output of layer \(l\), \(W^{(l)}\) for the weights of layer \(l\), and (with a slight abuse of notation) \(L\) for both the loss and the index of the final layer:
\[ \frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial y^{(L)}} \frac{\partial y^{(L)}}{\partial y^{(L-1)}} \dots \frac{\partial y^{(l+1)}}{\partial y^{(l)}} \frac{\partial y^{(l)}}{\partial W^{(l)}} \]
More precisely, for a given layer \(l\), the error \(\delta^{(l)}\) is propagated backwards:
\[ \delta^{(l)} = ((W^{(l+1)})^T \delta^{(l+1)}) \odot \sigma'(z^{(l)}) \]
where \(\sigma'\) is the derivative of the activation function at layer \(l\), \(z^{(l)}\) is the pre-activation output of layer \(l\), and \(\odot\) denotes element-wise product. The gradient with respect to weights \(W^{(l)}\) is then:
\[ \frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T \]
where \(a^{(l-1)}\) is the activation from the previous layer.
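A minimal sketch of these equations for a two-layer network with sigmoid hidden units, a linear output, and MSE loss (NumPy; the layer sizes and data are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))             # single input as a column vector
y = rng.normal(size=(1, 1))             # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

# Forward pass: cache pre-activations z and activations a per layer
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = z2              # linear output layer
loss = 0.5 * np.sum((a2 - y)**2)

# Backward pass: delta^{(l)} = ((W^{(l+1)})^T delta^{(l+1)}) * sigma'(z^{(l)})
delta2 = a2 - y                          # dL/dz2 for MSE with identity output
delta1 = (W2.T @ delta2) * (sigmoid(z1) * (1 - sigmoid(z1)))

# Weight gradients: dL/dW^{(l)} = delta^{(l)} (a^{(l-1)})^T
dW2 = delta2 @ a1.T
dW1 = delta1 @ x.T
db2, db1 = delta2, delta1

# One gradient-descent step
lr = 0.1
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1
```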
Universal Approximation Theorem (UAT): This theorem provides a theoretical justification for the expressive power of neural networks.
- Statement (Cybenko 1989, Hornik et al. 1989): Let \(\sigma : \mathbb{R} \to \mathbb{R}\) be a non-constant, bounded, and continuous activation function. Let \(I_m\) denote the \(m\)-dimensional unit hypercube \([0,1]^m\). The space of continuous real-valued functions on \(I_m\) is denoted by \(C(I_m)\). Then, for any function \(f \in C(I_m)\) and any \(\epsilon > 0\), there exists an integer \(N\) (the number of neurons in the hidden layer), real constants \(v_i, b_i \in \mathbb{R}\), and real vectors \(w_i \in \mathbb{R}^m\) for \(i = 1, \dots, N\), such that if we define: \[ g(x) = \sum_{i=1}^{N} v_i \sigma(w_i^T x + b_i) \] then \(|g(x) - f(x)| < \epsilon\) for all \(x \in I_m\). (More formally, \(\sup_{x \in I_m} |g(x) - f(x)| < \epsilon\).)
- Implications: In essence, the theorem states that a feed-forward neural network with a single hidden layer containing a finite number of neurons and a suitable non-linear continuous activation function can approximate any continuous function defined on a compact subset of \(\mathbb{R}^m\) to an arbitrary degree of precision.
- Caveats:
- The theorem is an existence proof; it does not specify how to find the weights \(w_i, v_i\) or biases \(b_i\), nor does it guarantee that they can be learned by a particular algorithm like gradient descent.
- The number of neurons \(N\) required might be very large.
- The classical statement assumes a bounded, continuous activation. ReLU (Rectified Linear Unit) is continuous but unbounded, so it is not covered by this version; analogous universal approximation results do hold for ReLU and, more generally, for any non-polynomial activation (Leshno et al., 1993).
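Although the theorem is purely existential, the form \(g(x) = \sum_i v_i \sigma(w_i^T x + b_i)\) is easy to instantiate. The sketch below (an illustrative random-feature fit, not the constructive proof: the inner weights \(w_i, b_i\) are drawn at random and only the outer weights \(v_i\) are solved for by least squares) approximates a continuous 1-D function on \([0,1]\):

```python
import numpy as np

def sigma(z):
    return np.tanh(z)                   # bounded, continuous, non-constant

rng = np.random.default_rng(0)
N = 200                                 # number of hidden neurons
x = np.linspace(0.0, 1.0, 500)          # grid on the unit interval I_1
f = np.sin(2 * np.pi * x)               # continuous target function

# Random inner weights/biases; only the outer weights v are fitted
w = rng.normal(scale=10.0, size=N)
b = rng.uniform(-10.0, 10.0, size=N)
Phi = sigma(np.outer(x, w) + b)         # feature matrix, shape (500, N)

v, *_ = np.linalg.lstsq(Phi, f, rcond=None)
g = Phi @ v                             # g(x) = sum_i v_i * sigma(w_i x + b_i)
print(np.max(np.abs(g - f)))            # sup-norm error, shrinks as N grows
```

This also illustrates the caveats: the fit says nothing about how hard the \(w_i, b_i\) would be to learn by gradient descent, and the required \(N\) grows with the desired precision.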