02.2-general-math.Rmd

## General Math

### Number Sets

| Notation     | Denotes          | Examples                  |
|--------------|------------------|---------------------------|
| $\emptyset$  | Empty set        | No members                |
| $\mathbb{N}$ | Natural numbers  | $\{1, 2, ...\}$           |
| $\mathbb{Z}$ | Integers         | $\{ ..., -1, 0, 1, ...\}$ |
| $\mathbb{Q}$ | Rational numbers | including fractions       |
| $\mathbb{R}$ | Real numbers     |                           |
| $\mathbb{C}$ | Complex numbers  |                           |

### Summation Notation and Series

**Chebyshev's Inequality** Let $X$ be a random variable with mean $\mu$ and standard deviation $\sigma$. Then for any positive number $k$:

$$
P(|X-\mu| < k\sigma) \ge 1 - \frac{1}{k^2}
$$

Chebyshev's Inequality does not require that $X$ be normally distributed

**Geometric sum**

$$
\sum_{k=0}^{n-1} ar^k = a\frac{1-r^n}{1-r}
$$

where $r \neq 1$

**Geometric series**

$$
\sum_{k=0}^\infty ar^k = \frac{a}{1-r}
$$

where $|r| <1$

**Binomial theorem**

$$
(x + y)^n = \sum_{k=0}^n \binom{n}{k} x^{n-k} y^k
$$

where $n \ge 0$

**Binomial series**

$$
\sum_k \binom{\alpha}{k} x^k  = (1 +x)^\alpha
$$

$|x| < 1$ if $\alpha \neq n \ge 0$

**Telescoping sum**

When terms of a sum cancel each other out, leaving one term (i.e., it collapses like a telescope), we call it a telescoping sum

$$
\sum_{a \le k < b} \Delta F(k) = F(b) - F(a)
$$

where $a \le b$ and $a, b \in \mathbb{Z}$

**Vandermonde convolution**

$$
\sum_k \binom{r}{k} \binom{s}{n-k} = \binom{r+s}{n}
$$

$n \in \mathbb{Z}$

**Exponential series**

$$
\sum_{k=0}^\infty \frac{x^k}{k!} = e^x
$$

where $x \in \mathbb{C}$

**Taylor series**

$$
\sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a)^k = f(x)
$$

where $|x-a| < R =$ radius of convergence

when $a = 0$, we have

**Maclaurin series expansion for**

$$
e^z = 1 + z + \frac{z^2}{2!} + \frac{z^3}{3!} + ...
$$

**Euler's summation formula**

$$\sum_{a \le k < b} f(k) = \int_a^b f(x) dx + \sum_{k=1}^m\frac{B_k}{k!} f^{(k-1)}(x) |_a^b \\+ (-1)^{m+1} \int^b_a \frac{B_m (x-|x|)}{m!} f^{(m)}(x)dx$$ where $a,b, c \in \mathbb{Z}$ and $a \le b, m \ge 1$

when $m = 1$, we have trapezoidal rule

$$
\sum_{a \le k < b} f(k) \approx \int_a^b f(x) dx - \frac{1}{2} (f(b) - f(a))
$$

### Taylor Expansion

A differentiable function, $G(x)$ can be written as an infinite sum of its derivatives.

More specifically, an infinitely differentiable $G(x)$ evaluated at $a$ is

$$
G(x) = G(a) + \frac{G'(a)}{1!} (x-a) + \frac{G''(a)}{2!}(x-a) + \frac{G'''(a)}{3!}(x-a)^3 + \dots
$$

### Law of large numbers

Let $X_1,X_2,...$ be an infinite sequence of independent and identically distributed (i.i.d)

Then, the sample average is

$$
\bar{X}_n =\frac{1}{n} (X_1 + ... + X_n)
$$

converges to the expected value ($\bar{X}_n \rightarrow \mu$) as $n \rightarrow \infty$

$$
Var(X_i) = Var(\frac{1}{n}(X_1 + ... + X_n)) = \frac{1}{n^2}Var(X_1 + ... + X_n)= \frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n}
$$

Note: The connection between the LLN and the normal distribution lies in the [Central Limit Theorem]. The CLT states that, regardless of the original distribution of a dataset, the distribution of the sample means will tend to follow a normal distribution as the sample size becomes larger.

The difference between [Weak Law] and [Strong Law] regards the mode of convergence

#### Weak Law

The sample average converges in probability towards the expected value

$$
\bar{X}_n \rightarrow^{p} \mu
$$

when $n \rightarrow \infty$

$$
\lim_{n\to \infty}P(|\bar{X}_n - \mu| > \epsilon) = 0
$$

The sample mean of an iid random sample ($\{ x_i \}_{i=1}^n$) from any population with a finite mean and finite variance $\sigma^2$ iis a consistent estimator of the population mean $\mu$

$$
plim(\bar{x})=plim(n^{-1}\sum_{i=1}^{n}x_i) =\mu
$$

#### Strong Law

The sample average converges almost surely to the expected value

$$
\bar{X}_n \rightarrow^{a.s} \mu 
$$

when $n \rightarrow \infty$

Equivalently,

$$
P(\lim_{n\to \infty}\bar{X}_n =\mu) =1
$$

### Law of Iterated Expectation

Let $X, Y$ be random variables. Then,

$$
E(X) = E(E(X|Y))
$$

means that the expected value of X can be calculated from the probability distribution of $X|Y$ and $Y$

### Convergence

#### Convergence in Probability

-   $n \rightarrow \infty$, an estimator (random variable) that is close to the true value.
-   The random variable $\theta_n$ converges in probability to a constant $c$ if

$$
\lim_{n\to \infty}P(|\theta_n - c| \ge \epsilon) = 0
$$

for any positive $\epsilon$

Notation

$$
plim(\theta_n)=c 
$$

Equivalently,

$$
\theta_n \rightarrow^p c
$$

**Properties of Convergence in Probability**

-   Slutsky's Theorem: for a continuous function g(.), if $plim(\theta_n)= \theta$ then $plim(g(\theta_n)) = g(\theta)$

-   if $\gamma_n \rightarrow^p \gamma$ then

    -   $plim(\theta_n + \gamma_n)=\theta + \gamma$ + $plim(\theta_n \gamma_n) = \theta \gamma$ + $plim(\theta_n/\gamma_n) = \theta/\gamma$ if $\gamma \neq 0$

-   Also hold for random vectors/ matrices

#### Convergence in Distribution

-   As $n \rightarrow \infty$, the distribution of a random variable may converge towards another ("fixed") distribution.
-   The random variable $X_n$ with CDF $F_n(x)$ converges in distribution to a random variable $X$ with CDF $F(X)$ if

$$
\lim_{n\to \infty}|F_n(x) - F(x)| = 0
$$

at all points of continuity of $F(X)$

Notation $F(x)$ is the limiting distribution of $X_n$ or $X_n \rightarrow^d X$

-   $E(X)$ is the limiting mean (asymptotic mean)
-   $Var(X)$ is the limiting variance (asymptotic variance)

**Note**

$$
\begin{aligned}
E(X) &\neq \lim_{n\to \infty}E(X_n) \\
Avar(X_n) &\neq \lim_{n\to \infty}Var(X_n)
\end{aligned}
$$

**Properties of Convergence in Distribution**

-   Continuous Mapping Theorem: for a continuous function g(.), if $X_n \to^{d} g(X)$ then $g(X_n) \to^{d} g(X)$

-   If $Y_n\to^{d} c$, then

    -   $X_n + Y_n \to^{d} X + c$

    -   $Y_nX_n \to^{d} cX$

    -   $X_nY_n \to^{d} X/c$ if $c \neq 0$

-   also hold for random vectors/matrices

#### Summary

Properties of Convergence

| Probability                                                                                                         | Distribution                                                                                                 |
|-------------------------------------|-----------------------------------|
| Slutsky's Theorem: for a continuous function g(.), if $plim(\theta_n)= \theta$ then $plim(g(\theta_n)) = g(\theta)$ | Continuous Mapping Theorem: for a continuous function g(.), if $X_n \to^{d} g(X)$ then $g(X_n) \to^{d} g(X)$ |
| if $\gamma_n \rightarrow^p \gamma$ then                                                                             | if $Y_n\to^{d} c$, then                                                                                      |
| $plim(\theta_n + \gamma_n)=\theta + \gamma$                                                                         | $X_n + Y_n \to^{d} X + c$                                                                                    |
| $plim(\theta_n \gamma_n) = \theta \gamma$                                                                           | $Y_nX_n \to^{d} cX$                                                                                          |
| $plim(\theta_n/\gamma_n) = \theta/\gamma$ if $\gamma \neq 0$                                                        | $X_nY_n \to^{d} X/c$ if $c \neq 0$                                                                           |

[Convergence in Probability] is stronger than [Convergence in Distribution].

Hence, [Convergence in Distribution] does not guarantee [Convergence in Probability]

### Sufficient Statistics

**Likelihood**

-   describes the extent to which the sample provides support for any particular parameter value.
-   Higher support corresponds to a higher value for the likelihood
-   The exact value of any likelihood is **meaningless**,
-   The relative value, (i.e., comparing two values of $\theta$), is **informative**.

$$
L(\theta_0; y) = P(Y = y | \theta = \theta_0) = f_Y(y;\theta_0)
$$

**Likelihood Ratio**

$$
\frac{L(\theta_0;y)}{L(\theta_1;y)}
$$

**Likelihood Function**

For a given sample, you can create likelihoods for all possible values of $\theta$, which is called *likelihood function*

$$
L(\theta) = L(\theta; y) = f_Y(y;\theta)
$$

In a sample of size n, the likelihood function takes the form of a product

$$
L(\theta) = \prod_{i=1}^{n}f_i (y_i;\theta)
$$

Equivalently, the log likelihood function

$$
l(\theta) = \sum_{i=1}^{n} logf_i(y_i;\theta)
$$

**Sufficient statistics**

-   A statistic, $T(y)$, is any quantity that can be calculated purely from a sample (independent of $\theta$)
-   A statistic is **sufficient** if it conveys all the available information about the parameter.

$$
L(\theta; y) = c(y)L^*(\theta;T(y))
$$

**Nuisance parameters** If we are interested in a parameter (e.g., mean). Other parameters requiring estimation (e.g., standard deviation) are **nuisance** parameters. We can replace nuisance parameters in likelihood function with their estimates to create a **profile likelihood**.

### Parameter transformations

log-odds transformation

$$
Log odds = g(\theta)= ln[\frac{\theta}{1-\theta}]
$$