# Linear Regression

Linear regression is a technique for modelling the relationship between input variables and a continuous target using a linear model. A linear model is: $$ \hat{t}=y(\mathbf{x}, \mathbf{w})=w_0+\sum_{j=1}^{M-1} w_j x_j=\mathbf{w}^{\top} \mathbf{x}$$ where the convention $x_0=1$ absorbs the bias $w_0$ into $\mathbf{w}^{\top}\mathbf{x}$.

A possible loss function (a way to evaluate the "quality" of the model) is the residual sum of squares (RSS): $$\qquad \operatorname{RSS}(\mathbf{w})=\sum_{n=1}^N\left(y\left(\mathbf{x}_n, \mathbf{w}\right)-t_n\right)^2 $$

In fact, the model only needs to be linear in the parameters: we can define a model that is non-linear in the input variables but still linear in the parameter vector, by applying the same formula to a vector of features (basis functions):

$$y=y(\mathbf{x}, \mathbf{w})=w_0+\sum_{j=1}^{D-1} w_j \phi_j(\mathbf{x})=\mathbf{w}^{\top} \boldsymbol{\phi}(\mathbf{x})$$ A feature vector $\boldsymbol{\phi}(\mathbf{x})$ of a 2D point can be, for example:

$\phi([a, b])=[a, b, a b]^{\top}$.
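
As a small sketch (the feature map, weights, and data points below are made up for illustration), this is how a prediction $\mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x})$ and the RSS loss can be computed:

```python
import numpy as np

def phi(x):
    """Hypothetical feature map for a 2-D point [a, b] -> [1, a, b, a*b]."""
    a, b = x
    return np.array([1.0, a, b, a * b])  # the leading 1 absorbs the bias w_0

# Toy dataset: inputs and targets (made up for illustration)
X = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
t = np.array([2.5, 0.1, 3.0])

w = np.array([0.2, 0.5, -0.3, 1.0])        # some candidate weights
y = np.array([w @ phi(x) for x in X])      # predictions y(x, w) = w^T phi(x)

rss = np.sum((y - t) ** 2)                 # residual sum of squares
print(rss)
```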

## Ordinary Least Squares

Ordinary Least Squares (OLS) is based on the idea of minimizing the sum of squared residuals, where the residuals are the differences between the actual and predicted values of the output variable. Squaring treats positive and negative residuals equally.

In compact form:

$$ L(\mathbf{w})=\frac{1}{2} R S S(\mathbf{w})=\frac{1}{2}(\mathbf{t}-\mathbf{\Phi} \mathbf{w})^T(\mathbf{t}-\mathbf{\Phi} \mathbf{w}) $$

Setting the gradient of $L(\mathbf{w})$ to zero and solving for $\mathbf{w}$ (the second derivative confirms it is a minimum) yields the closed-form OLS solution:

$$\mathbf{w}_{OLS}=\left(\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^{\top}\mathbf{t}$$ In this formula (a short NumPy sketch follows the list):

- $\mathbf{w}_{OLS}$ is the vector of estimated coefficients (weights), one per feature.
- $\boldsymbol{\Phi}$ is the design matrix, containing the features of all data points: each row corresponds to one data point and each column to one feature.
- $\mathbf{t}$ is the vector of target (dependent) variable values.
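
A minimal NumPy sketch of the closed-form solution (the design matrix and targets are made-up numbers):

```python
import numpy as np

def ols_fit(Phi, t):
    """Closed-form OLS: w = (Phi^T Phi)^{-1} Phi^T t."""
    # Solve the normal equations rather than forming the inverse explicitly;
    # np.linalg.lstsq is an even more robust alternative in practice.
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Toy design matrix with a bias column and one feature
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.1, 2.9, 5.2, 6.8])

w_ols = ols_fit(Phi, t)
print(w_ols)  # [intercept, slope]
```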

This formula gives the weights that minimize the sum of squared errors between the predicted values $\boldsymbol{\Phi}\mathbf{w}$ and the actual targets $\mathbf{t}$. Because the closed-form solution is not feasible for large datasets (it requires building and inverting $\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}$), in practice we use a sequential learning approach: instead of solving the equation in one shot, we compute the gradient on a single data point or on a batch (subset) of data points at a time.

$$\begin{aligned} L(\mathbf{w})&=\sum_n L_n(\mathbf{w}) \\ \Rightarrow \mathbf{w}^{(n+1)}&=\mathbf{w}^{(n)}-\alpha^{(n)} \nabla L_n\left(\mathbf{w}^{(n)}\right) \\ \Rightarrow \mathbf{w}^{(n+1)}&=\mathbf{w}^{(n)}-\alpha^{(n)}\left(\mathbf{w}^{(n)\top} \boldsymbol{\phi}\left(\mathbf{x}_n\right)-t_n\right) \boldsymbol{\phi}\left(\mathbf{x}_n\right)\end{aligned}$$ where $\alpha^{(n)}$ is the learning rate. This is called stochastic gradient descent (SGD): an iterative optimization algorithm used to minimize an objective function, widely employed in machine learning tasks such as regression and training neural networks.
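
A sketch of this update rule (the toy data matches the OLS example above; the learning rate and number of epochs are arbitrary choices):

```python
import numpy as np

def sgd_linear_regression(Phi, t, alpha=0.05, epochs=200, seed=0):
    """Stochastic gradient descent for least-squares linear regression.

    Update: w <- w - alpha * (w^T phi_n - t_n) * phi_n, one data point at a time.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(t)):    # visit the data in random order
            error = w @ Phi[n] - t[n]        # prediction error on point n
            w -= alpha * error * Phi[n]      # gradient step on a single point
    return w

# Same toy data as the OLS example; the result should be close to w_ols.
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.1, 2.9, 5.2, 6.8])
print(sgd_linear_regression(Phi, t))
```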

Cool resource to visualize OLS: https://kwichmann.github.io/ml_sandbox/linear_regression_diagnostics/

## Bayesian Linear Regression

While OLS is a frequentist approach, Bayesian linear regression treats the parameters themselves as random variables: we define a model with unknown parameters, specify a prior distribution to account for our uncertainty about them, and update it with Bayes' theorem:

$$ p(\text { parameters} \mid \text {data })=\frac{p(\text { data } \mid \text { parameters }) p(\text { parameters })}{p(\text { data })} $$

or written in a compact way:

$$ p(w \mid D)=\frac{p(D \mid w) p({w})}{p(D)} $$

- $p(D \mid \mathbf{w})$ is the likelihood: the probability of observing the data $D$ given the parameters $\mathbf{w}$.
- $p(\mathbf{w} \mid D)$ is the posterior probability of the parameters $\mathbf{w}$ given the training data.
- $p(D)$ is the marginal likelihood and acts as a normalizing constant.
- $p(\mathbf{w})$ is our "prior knowledge" about the parameters. This is one of the fundamental differences from the OLS approach: it is like having a prior hint about the parameters.

We can model the prior $p(\mathbf{w})$ as a Gaussian: $$ p(\mathbf{w})=\mathcal{N}\left(\mathbf{w} \mid \mathbf{w}_0, \mathbf{S}_0\right) $$ and also the likelihood as a Gaussian, so that the posterior $p(\mathbf{w} \mid D)$ is again Gaussian (the Gaussian prior is conjugate to the Gaussian likelihood):

$$ p\left(\mathbf{w} \mid \mathbf{t}, \mathbf{\Phi}, \sigma^2\right) \propto \mathcal{N}\left(\mathbf{w} \mid \mathbf{w}_0, \mathbf{S}_0\right) \mathcal{N}\left(\mathbf{t} \mid \mathbf{\Phi} \mathbf{w}, \sigma^2 \mathbf{I}\right) $$
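
Completing the square gives the standard Gaussian posterior $\mathcal{N}(\mathbf{w} \mid \mathbf{w}_N, \mathbf{S}_N)$ with $\mathbf{S}_N^{-1}=\mathbf{S}_0^{-1}+\sigma^{-2}\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}$ and $\mathbf{w}_N=\mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{w}_0+\sigma^{-2}\boldsymbol{\Phi}^{\top}\mathbf{t}\right)$. A minimal sketch of this update (the prior parameters, noise level, and data below are illustrative assumptions):

```python
import numpy as np

def posterior(Phi, t, w0, S0, sigma2):
    """Gaussian posterior p(w | t, Phi, sigma^2) = N(w | w_N, S_N)."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + (Phi.T @ Phi) / sigma2        # posterior precision
    SN = np.linalg.inv(SN_inv)                      # posterior covariance
    wN = SN @ (S0_inv @ w0 + (Phi.T @ t) / sigma2)  # posterior mean
    return wN, SN

# Illustrative prior (zero mean, unit covariance) and noise variance
D = 2
w0, S0, sigma2 = np.zeros(D), np.eye(D), 0.25

Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.1, 2.9, 5.2, 6.8])

wN, SN = posterior(Phi, t, w0, S0, sigma2)
print(wN)  # posterior mean: shrunk toward w0 compared with the OLS solution
```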

In the same way we can use a Beta distribution as the prior and a Bernoulli as the likelihood, so that the posterior is again a Beta. These conjugate choices make the Bayesian approach convenient for sequential learning: we compute the posterior with the initial data and later, as additional data arrives, the posterior becomes the new prior (recursion).
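
A short sketch of the Beta-Bernoulli case and of sequential updating (the coin-flip batches are made up): with a $\mathrm{Beta}(a, b)$ prior and Bernoulli observations, the posterior is $\mathrm{Beta}(a+s, b+f)$, where $s$ and $f$ count successes and failures, and each posterior serves as the prior for the next batch:

```python
def beta_bernoulli_update(a, b, observations):
    """Conjugate update: Beta(a, b) prior + Bernoulli 0/1 data -> Beta posterior."""
    successes = sum(observations)
    failures = len(observations) - successes
    return a + successes, b + failures

# Sequential learning: the posterior after batch 1 is the prior for batch 2.
a, b = 1, 1                                        # uniform Beta(1, 1) prior (assumption)
a, b = beta_bernoulli_update(a, b, [1, 0, 1, 1])   # first batch of outcomes
a, b = beta_bernoulli_update(a, b, [0, 1, 1])      # second batch, prior = previous posterior
print(a, b)  # Beta(6, 3): same as updating on all the data at once
```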