Discrete vs Continuous Variables
- In discrete distributions, the difference between \(\leq\) and \(<\) matters.
- Cumulative Distribution Functions (CDFs) are not defined for categorical random variables with unordered outcomes (like “dog”, “cat”, “none”), since the values cannot be ordered.
Joint and Marginal Distributions
- The expected value of \(g(X)\) is the integral over the domain (a sum for discrete variables): \[
E(g(X)) = \int g(x) p(x) \, dx
\]
- We assume the expected values are finite.
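As a quick numerical check (a minimal Python sketch; the choice \(g(x) = x^2\) and the standard normal density are arbitrary examples, not from the notes):

```python
import numpy as np
from scipy import integrate, stats

# E(g(X)) = integral of g(x) p(x) dx; here g(x) = x**2 and X ~ N(0, 1),
# so the exact value is E(X^2) = Var(X) + mu^2 = 1.
g = lambda x: x**2
p = stats.norm(loc=0, scale=1).pdf

value, _ = integrate.quad(lambda x: g(x) * p(x), -np.inf, np.inf)
print(value)  # approximately 1.0
```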
Variance and Standard Deviation
- Variance: \[
\text{Var}(X) = E((X - \mu)^2) = E(X^2) - \mu^2
\] where \(\mu = E(X)\) is a constant, not part of the integration variable.
- Standard deviation: \[
\sigma = \sqrt{\text{Var}(X)}
\] It has the same units as \(X\).
Covariance and Correlation
- Covariance: \[
\text{Cov}(X, Y) = E((X - \mu_X)(Y - \mu_Y)) = E(XY) - \mu_X \mu_Y
\] where \(\mu_X\) and \(\mu_Y\) are the marginal means.
- If \(X = Y\), then \(\text{Cov}(X, X) = \text{Var}(X)\).
- Correlation: \[
\rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}
\] with range \([-1, 1]\).
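A short simulation checking the covariance identity and the correlation bound (a sketch; the linear relation between x and y is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)     # correlated with x by construction

# Cov(X, Y) = E(XY) - mu_X * mu_Y (sample analogue)
cov_direct = np.mean(x * y) - x.mean() * y.mean()
cov_np = np.cov(x, y, ddof=0)[0, 1]        # same quantity via numpy
corr = np.corrcoef(x, y)[0, 1]             # always in [-1, 1]

print(cov_direct, cov_np, corr)
```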
Bayes’ Rule
- Bayes’ rule follows from the product rule: \[
P(A \cap B) = P(A|B) P(B) = P(B|A) P(A)
\]
- Conditional distribution: \[
f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)}
\]
Independence
- \(X\) and \(Y\) are independent if and only if for all \(x\) and \(y\), \[
f_{X,Y}(x, y) = f_X(x) f_Y(y)
\]
- This is equivalent to: \[
F_{X,Y}(x, y) = F_X(x) F_Y(y)
\]
- In higher dimensions, independence is defined by the same factorization: the joint density (or CDF) is the product of all the marginals.
Properties of Expectations
- If \(X\) and \(Y\) are independent: \[
E(XY) = E(X) E(Y)
\]
- Law of total expectation (expectation of the conditional expectation): \[
E(X) = E(E(X|Y))
\]
Properties of Variance
- For constants \(a\) and \(b\): \[
\text{Var}(aX + b) = a^2 \text{Var}(X)
\]
- Variance of the sum: \[
\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2 \text{Cov}(X, Y)
\]
- If \(X\) and \(Y\) are independent, then \(\text{Cov}(X, Y) = 0\); the converse does not hold in general.
- Law of total variance: \[
\text{Var}(X) = E(\text{Var}(X|Y)) + \text{Var}(E(X|Y))
\]
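Both decompositions can be checked by simulation on a simple hierarchical model (a sketch; the two-component normal mixture below is chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Hierarchical model: Y ~ Bernoulli(0.3); X | Y=0 ~ N(0, 1), X | Y=1 ~ N(5, 4)
y = rng.binomial(1, 0.3, size=n)
x = np.where(y == 0, rng.normal(0, 1, n), rng.normal(5, 2, n))

# E(X) = E(E(X|Y)) = 0.7*0 + 0.3*5 = 1.5
# Var(X) = E(Var(X|Y)) + Var(E(X|Y)) = (0.7*1 + 0.3*4) + 0.3*0.7*25 = 7.15
print(x.mean(), x.var())   # approximately 1.5 and 7.15
```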
Discrete Distributions
Bernoulli Distribution
- Mean: \(p\)
- Variance: \(p(1-p)\)
- Domain: \(\{0, 1\}\)
Binomial Distribution
- Probability mass function: \[
P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}
\]
- Domain: \(x \in \{0, \ldots, n\}\)
- \(X\) counts the number of successes in \(n\) independent Bernoulli trials with probability \(p\).
- Mean: \(np\)
- Variance: \(np(1-p)\)
Poisson Distribution
- Probability mass function: \[
P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}
\]
- Domain: \(x \in \{0, 1, 2, \ldots\}\)
- \(X\) counts the number of events occurring in a fixed interval; both the mean and the variance equal \(\lambda\).
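These moment formulas can be verified directly with scipy.stats (a minimal sketch; the parameter values are arbitrary):

```python
from scipy import stats

n, p, lam = 10, 0.3, 4.0

binom = stats.binom(n, p)
pois = stats.poisson(lam)

print(binom.mean(), binom.var())   # 3.0, 2.1   (np and np(1-p))
print(pois.mean(), pois.var())     # 4.0, 4.0   (both equal lambda)
print(binom.pmf(2), pois.pmf(2))   # P(X = 2) under each model
```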
Continuous Distributions
Normal Distribution
- Probability density function: \[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}
\]
- Domain: \(\mathbb{R}\)
- A linear transformation of a normal variable is again normal: \(aX + b \sim N(a\mu + b, a^2\sigma^2)\).
- Standard normal variable: \[
Z = \frac{X - \mu}{\sigma} \sim N(0, 1)
\]
- Sum of \(n\) i.i.d. \(N(\mu, \sigma^2)\) variables: \[
\sum_{i=1}^n X_i \sim N\left(n\mu, n\sigma^2\right)
\]
- Distribution of the sample mean of \(n\) i.i.d. normal variables: \[
\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)
\]
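A short simulation illustrating standardization and the distribution of the sample mean (a sketch; \(\mu\), \(\sigma\), and \(n\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 10.0, 2.0, 25

# 100000 samples of size n from N(mu, sigma^2)
samples = rng.normal(mu, sigma, size=(100_000, n))

z = (samples - mu) / sigma          # standardized values ~ N(0, 1)
xbar = samples.mean(axis=1)         # sample means ~ N(mu, sigma^2 / n)

print(z.mean(), z.std())            # approximately 0 and 1
print(xbar.mean(), xbar.var())      # approximately 10 and 4/25 = 0.16
```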
Gamma Distribution
- Probability density function with shape \(\alpha\) and rate \(\beta\): \[
f(x; \alpha, \beta) = \frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)}
\]
- Probability density function with shape \(\alpha\) and scale \(\theta\): \[
f(x; \alpha, \theta) = \frac{x^{\alpha-1} e^{-x/\theta}}{\theta^\alpha \Gamma(\alpha)}
\]
- Domain: \(x \geq 0\)
- The exponential distribution is a gamma distribution with shape \(\alpha = 1\).
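scipy.stats.gamma uses the shape/scale parameterization, so a rate \(\beta\) has to be converted to the scale \(\theta = 1/\beta\) (a sketch; \(\alpha\) and \(\beta\) are arbitrary values):

```python
from scipy import stats

alpha, beta = 3.0, 2.0          # shape and rate
theta = 1.0 / beta              # corresponding scale

g = stats.gamma(a=alpha, scale=theta)
print(g.mean(), g.var())        # alpha/beta = 1.5 and alpha/beta^2 = 0.75

# The exponential distribution is the alpha = 1 special case.
e = stats.gamma(a=1.0, scale=1.0 / beta)
print(e.pdf(0.5), stats.expon(scale=1.0 / beta).pdf(0.5))  # identical densities
```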
Exponential Distribution
- Probability density function with rate \(\lambda\): \[
f(x; \lambda) = \lambda e^{-\lambda x}
\]
- Domain: \(x \geq 0\)
Chi-Squared Distribution
- A chi-squared distribution with \(k\) degrees of freedom is a gamma distribution with shape \(k/2\) and rate \(1/2\).
- Sum of independent chi-squared variables is chi-squared with the sum of the degrees of freedom.
t-Distribution
- \(t\)-distribution with \(v\) degrees of freedom: \[
T = \frac{Z}{\sqrt{V/v}}
\] where \(Z \sim N(0, 1)\) and \(V \sim \chi^2(v)\) are independent.
- The \(t\)-distribution has heavier tails than the normal distribution for small degrees of freedom and converges to the normal distribution as \(v\) increases.
Log-Normal Distribution
- If \(\log(X) \sim N(\mu, \sigma^2)\), then \(X\) follows a log-normal distribution.
- Probability density function: \[
f(x; \mu, \sigma) = \frac{1}{x \sigma \sqrt{2\pi}} e^{-\frac{(\log x - \mu)^2}{2\sigma^2}}
\]
Beta Distribution
- Probability density function with parameters \(\alpha\) and \(\beta\): \[
f(x; \alpha, \beta) = \frac{x^{\alpha-1} (1-x)^{\beta-1}}{B(\alpha, \beta)}
\] where \(B(\alpha, \beta)\) is the beta function.
- Alternative parameterization using shape parameters: \[
f(x; \text{shape1}, \text{shape2}) = \frac{x^{\text{shape1}-1} (1-x)^{\text{shape2}-1}}{B(\text{shape1}, \text{shape2})}
\]
- Domain: \(0 < x < 1\)
- For \(\alpha = 1\) and \(\beta = 1\), the beta distribution is the uniform distribution on \([0, 1]\).
Multivariate Normal Distribution
Definition
- The multivariate normal distribution in 2 dimensions is given by: \[
\mathbf{X} \sim N(\mathbf{\mu}, \mathbf{\Sigma})
\] where \[
\mathbf{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \quad \mathbf{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \mathbf{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}
\]
Covariance Matrix
- The covariance matrix \(\mathbf{\Sigma}\) looks like: \[
\mathbf{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}
\]
- If the covariance matrix is diagonal, the components are independent and the joint density separates into a product of univariate normal densities.
Properties
- The covariance matrix is always symmetric and positive semi-definite.
- Linear transformations of a multivariate normal variable are still multivariate normal: \[
\mathbf{Y} = \mathbf{C} \mathbf{X} + \mathbf{b} \sim N(\mathbf{C} \mathbf{\mu} + \mathbf{b}, \mathbf{C} \mathbf{\Sigma} \mathbf{C}^T)
\]
- The dimension can change under linear transformations.
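The transformation rule can be checked numerically (a sketch with an arbitrary 2-dimensional example; \(\mathbf{C}\) maps to one dimension to show that the dimension can change):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

C = np.array([[1.0, -1.0]])      # 1x2 matrix: maps R^2 -> R^1
b = np.array([3.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ C.T + b                  # Y = C X + b, a single column

print(y.mean(), y.var())                 # sample mean and variance of Y
print(C @ mu + b, C @ Sigma @ C.T)       # theoretical values: 2.0 and 2.0
```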
Limit Theorems
- Let \(X_1, X_2, \ldots, X_n\) be i.i.d. with finite mean \(\mu\) and variance \(\sigma^2\) (in particular \(E|X_i| < \infty\)). Then the sample mean \(\bar{X}_n\) converges to \(\mu\) almost surely (strong law of large numbers).
- For binary \(X_i\), \(\bar{X}_n\) is the sample proportion \(\hat{p}\). By the central limit theorem, for large \(n\), approximately \[
\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right)
\]
- Standard error of the sample mean: \[
\frac{\sigma}{\sqrt{n}}
\]
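A simulation of the sample proportion illustrates both results (a sketch; \(p\) and \(n\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 1_000

# 50000 replications of n Bernoulli(p) trials
x = rng.binomial(1, p, size=(50_000, n))
p_hat = x.mean(axis=1)

print(p_hat.mean())                # close to p (law of large numbers)
print(p_hat.std())                 # close to the standard error (CLT)
print(np.sqrt(p * (1 - p) / n))    # sqrt(p(1-p)/n) ~ 0.0145
```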
Statistics
- Big \(X\) represents a random variable.
- Small \(x\) represents observed data.
- A statistic is a function \(T_n(X_1, X_2, \ldots, X_n)\).
- The empirical distribution function (ecdf) is: \[
T_n = F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \leq x)
\] which estimates \(F(x) = P(X \leq x)\).
- Standard error of the ecdf: \[
\sqrt{\frac{F(x)(1-F(x))}{n}} \text{ with upper bound } \frac{0.5}{\sqrt{n}}
\]
- A quantile is the inverse of the CDF: \[
x_q = F^{-1}(q)
\]
- In general \(x_q\) is not unique; the convention \(x_q = \min \{x: F(x) \geq q\}\) makes it well defined.
- \(q \in [0, 1]\), while \(x_q\) lies in the domain of \(X\).
- Standard error for quantiles: \[
\text{se}(x_q) = \sqrt{\frac{q(1-q)}{n f^2(x_q)}}
\] where \(f(x_q)\) is the density at \(x_q\).
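The ecdf and the quantile convention above are easy to compute directly (a minimal sketch using simulated standard normal data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=1_000))
n = len(x)

def ecdf(t):
    """F_n(t) = (1/n) * number of observations <= t."""
    return np.searchsorted(x, t, side="right") / n

def quantile(q):
    """x_q = min{ x : F_n(x) >= q } for q in (0, 1]."""
    return x[int(np.ceil(q * n)) - 1]

print(ecdf(0.0))        # close to F(0) = 0.5 for standard normal data
print(quantile(0.5))    # close to the true median 0
```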
Bias
- A statistic \(T_n\) is an unbiased estimator of \(\theta\) if: \[
E(T_n) = \theta
\]
- It is asymptotically unbiased if: \[
\lim_{n \to \infty} E(T_n) = \theta
\]
- The sample variance with divisor \(n-1\) is an unbiased estimator of \(\sigma^2\); the plug-in estimator with divisor \(n\) (the MLE) is biased.
- Bias: \[
\text{Bias} = E(T_n) - \theta
\]
- Mean Squared Error (MSE): \[
\text{MSE} = E((T_n - \theta)^2) = \text{Var}(T_n) + \text{Bias}^2
\]
- Standard error of an estimated proportion: \[
\text{se}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\]
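The bias of the plug-in variance estimator versus the \(n-1\) version, and the MSE decomposition, can be seen by simulation (a sketch; the true \(\sigma^2 = 4\) and \(n = 10\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n = 4.0, 10

samples = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))

s2_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1
s2_plugin = samples.var(axis=1, ddof=0)     # divides by n (the MLE)

print(s2_unbiased.mean())   # approximately 4.0 (unbiased)
print(s2_plugin.mean())     # approximately (n-1)/n * 4 = 3.6 (biased downward)

# MSE = variance + bias^2 for the plug-in estimator
bias = s2_plugin.mean() - sigma2
print(s2_plugin.var() + bias**2, ((s2_plugin - sigma2) ** 2).mean())
```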
Maximum Likelihood Estimation (MLE)
- The likelihood is the probability (or density) of the observations, viewed as a function of the parameter:
- If \(X_1, X_2, \ldots, X_n\) are i.i.d.: \[
L(\theta) = f(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)
\]
- The MLE is the \(\theta\) that maximizes this.
- Maximization is usually carried out on the log-likelihood \(\log L(\theta)\), which turns the product into a sum.
- MLEs are invariant: if \(\hat{\theta}\) is the MLE of \(\theta\), then \(g(\hat{\theta})\) is the MLE of \(g(\theta)\).
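A small numerical MLE, maximizing the log-likelihood of an exponential sample (a sketch; the exponential model, the true rate 2.5, and the use of a generic optimizer are illustrative choices, and the closed-form answer \(1/\bar{x}\) is printed for comparison):

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0 / 2.5, size=1_000)   # true rate lambda = 2.5

def neg_log_likelihood(lam):
    # log L(lambda) = n*log(lambda) - lambda * sum(x) for i.i.d. Exp(lambda) data
    return -(len(x) * np.log(lam) - lam * x.sum())

res = optimize.minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print(res.x, 1.0 / x.mean())   # numerical MLE vs the closed-form MLE 1 / xbar
```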
Bayes’ Theorem
- Law of total probability:
- If events \(A_1, A_2, \ldots, A_n\) are disjoint and cover the sample space, then: \[
P(B) = \sum_{i=1}^{n} P(A_i)P(B|A_i)
\]
- Bayes’ theorem: \[
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
\]
- For continuous random variables: \[
f(y|x) = \frac{f(x|y)f(y)}{f(x)} = \frac{f(x|y)f(y)}{\int f(x|y)f(y) \, dy}
\]
Frequentist vs Bayesian
- Frequentists think about \(\theta\) as fixed and data as random.
- Bayesians think about \(\theta\) as random and data as fixed.
- Frequentists maximize likelihood.
- Bayesians maximize the posterior.
- Bayesians specify a prior distribution \(f(\theta)\) for the parameter and maximize the posterior: \[
f(\theta|x) = \frac{f(x|\theta)f(\theta)}{f(x)} = \frac{L(\theta)f(\theta)}{\int L(\theta)f(\theta) \, d\theta}
\]
- The denominator does not depend on \(\theta\), so it can be ignored when maximizing.
- Posterior \(\propto\) Likelihood \(\times\) Prior
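Because the normalizing constant does not involve \(\theta\), the posterior follows directly from likelihood \(\times\) prior; the conjugate Beta–Binomial model makes this concrete (a sketch; the prior parameters and data below are arbitrary examples):

```python
from scipy import stats

# Prior: theta ~ Beta(a, b); data: k successes in n Bernoulli(theta) trials.
a, b = 2.0, 2.0
n, k = 20, 7

# Posterior is Beta(a + k, b + n - k), since
# posterior ∝ theta^k (1-theta)^(n-k) * theta^(a-1) (1-theta)^(b-1).
posterior = stats.beta(a + k, b + n - k)

print(posterior.mean())                 # posterior mean (a + k) / (a + b + n)
print((k + a - 1) / (n + a + b - 2))    # posterior mode (MAP estimate)
```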
Markov Chains
- A Markov Chain is a stochastic process \(X_t\) starting at \(X_0\) and making successive transitions.
- States are \(0, 1, 2, \ldots\). The process is a Markov Chain if: \[
P(X_{t+1} = j | X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0) = P(X_{t+1} = j | X_t = i)
\] for all states \(i, j\). Transition only depends on the current state.
- A Markov Chain is irreducible if all states are accessible from all other states.
- Equivalently, from any state there is a non-zero probability of reaching any other state in a finite number of steps.
- It is homogeneous if the transition matrix is constant.
- A state is recurrent if the chain returns to it with probability 1, and transient otherwise.
- If the expected time until the chain returns to the state is finite, then the state is called non-null or positive recurrent.
- A state is aperiodic if returns to it are not restricted to multiples of a fixed period greater than 1.
- Ergodic if it is aperiodic and positive recurrent (returns in finite expected time).
- Stationary distribution:
- For an irreducible, ergodic Markov Chain, the \(n\)-step transition probabilities converge to a stationary distribution \(\pi\) independent of the initial state: \[
\pi_j = \lim_{n \to \infty} P_{ij}^{(n)}
\]
- \(\pi_j\) is the limiting probability of being in state \(j\).
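The stationary distribution of a small irreducible, aperiodic chain can be found either by powering the transition matrix or by solving \(\pi P = \pi\) (a sketch; the 3-state transition matrix is an arbitrary example):

```python
import numpy as np

# Row-stochastic transition matrix of a 3-state chain
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Powering the matrix: every row converges to the stationary distribution pi
print(np.linalg.matrix_power(P, 50)[0])

# Solving pi P = pi: pi is the left eigenvector of P for eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()
print(pi)
```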