# (PART\*) Part I: Concepts {-}
```{r setupIntro, include=FALSE}
# knitr::opts_chunk$set(cache.rebuild=F, cache = T)
```
# Introduction
## Beyond the General Linear Model I
### General Linear Model
Let's get started by considering the *standard linear regression model* (SLiM) estimated via ordinary least squares (OLS). We have some response or target variable we wish to study, and believe it to be some function of other variables. In terms of the underlying data generating process, $y$ is the variable of interest, assumed to be normally distributed with mean $\mu$ and variance $\sigma^2$, and the Xs are the features/covariates in this scenario. The features are multiplied by the weights/coefficients ($\beta$) and summed, giving us the linear predictor, which in this case also directly provides us the estimated fitted values. The following is a variant of the way this type of model is presented in introductory texts.
$$y\sim \mathcal{N}(\mu,\sigma^2)$$
$$\mu = b_0+b_1\cdot x_1+b_2\cdot x_2\;...\;+ b_p\cdot x_p$$
Here is an example of what the R code would look like for such a model.
```{r examplelmcode, eval=F}
mymod = lm(y ~ x1 + x2, data = mydata)
```
One of the issues with this model is that, in its basic form, it can be very limiting in its assumptions about the data generating process for the variable we want to study. It will also often fail to capture what is going on in a great many data situations.
### Generalized Linear Model
In that light, we may consider the *generalized linear model*. Generalized linear models incorporate other types of distributions[^expofam], and include a link function $g(.)$ relating the mean $\mu$, or stated differently, the expected values $E(y)$, to the linear predictor $X\beta$, often denoted $\eta$. So the general form is:
$$g(\mu) = \eta = X\beta$$
$$E(y) = \mu = g^{-1}(\eta)$$
Consider again the typical linear regression. In that situation, we assume a Gaussian (i.e. normal) distribution for the response, we assume equal variance for all observations, and that there is a direct link of the linear predictor and the expected value $\mu$, i.e. $\mu = X\beta$. As such, the typical linear regression model is a generalized linear model with a Gaussian distribution and 'identity' link function.
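To make that equivalence concrete, the earlier `lm` example could be written as a GLM with the Gaussian family and identity link stated explicitly. This is only a sketch, using the same hypothetical `mydata`, `x1`, and `x2` as before; it would reproduce the `lm` result.
```{r exampleglmgausscode, eval=F}
# identical to the lm fit; the Gaussian family and identity link are
# simply made explicit
mymod_glm = glm(y ~ x1 + x2, data = mydata, family = gaussian(link = 'identity'))
```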
To further illustrate the generalization, we consider a distribution other than the Gaussian. Here we'll examine a Poisson distribution for some vector of count data. There is only one parameter to be considered, $\mu$, since for the Poisson the mean and variance are equal. For the Poisson, the (canonical) link function $g(.)$, is the natural log, and so relates the log of $\mu$ to the linear predictor. As such we could also write it in terms of exponentiating the right-hand side.
$$y \sim \mathcal{P}(\mu)$$
$$\ln(\mu) = b_0+b_1\cdot x_1+b_2\cdot x_2\;...\;+ b_p\cdot x_p$$
$$\mu = e^{b_0+b_1\cdot x_1+b_2\cdot x_2\;...\;+b_p\cdot x_p}$$
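In R, such a model could be fit along the following lines, again assuming the same hypothetical data setup as the earlier `lm` example.
```{r exampleglmpoiscode, eval=F}
# Poisson regression for count data; the log link relates the mean of
# the counts to the linear predictor
mymod_pois = glm(y ~ x1 + x2, data = mydata, family = poisson)
```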
While there is a great deal more to explore with regard to generalized linear models, the point here is simply to note them as a generalization of the typical linear model that we would all be familiar with after an introductory course in modeling/statistics. As we eventually move to generalized additive models, we can see them as a subsequent step in the generalization.
### Generalized Additive Model
Now let us make another generalization to incorporate nonlinear forms of the features, via a *generalized additive model*. This gives the new setup, relating our now potentially nonlinear predictor to the expected value, with whatever link function may be appropriate.
$$
y \sim ExpoFam(\mu, etc.) \\
E(y) = \mu \\
g(\mu) = b_0 + f(x_1) + f(x_2) \;...\;+f(x_p)
$$
So what's the difference? In short, we are using smooth functions of our feature variables, which can take on a great many forms, with more detail on what that means in the following section. For now, it is enough to note the observed values $y$ are assumed to be of some exponential family distribution, and $\mu$ is still related to the model predictors via a link function. The key difference is that the linear predictor now incorporates smooth functions of at least some (possibly all) features, represented as $f(x)$, and this will allow for nonlinear relationships between the features and the target variable $y$.
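As a preview, fitting such a model in R can look much like the previous examples. The following is only a sketch, using the mgcv package and the same hypothetical `mydata`; smooth terms are specified with `s()`.
```{r examplegamprevcode, eval=F}
library(mgcv)

# smooth functions of the features via s(); the family and link can be
# specified just as with glm
mymod_gam = gam(y ~ s(x1) + s(x2), data = mydata)
```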
## Beyond the General Linear Model II
### Fitting the Standard Linear Model
In many data situations, the relationship between some covariate(s) $X$ and response $y$ is not so straightforward. Consider the following plot. Attempting to fit a standard OLS regression results in the blue line, which doesn't capture the relationship as well as we would like.
```{r fitPlotsDataSetup, echo=F}
set.seed(666)

# simulate data with a quadratic relationship plus noise
x = rgamma(1000, 2)
y = x + x ^ 2 + rnorm(1000)

dxy = data.frame(x, y) %>% arrange(x)

# linear and quadratic fits for the plots that follow
fitlm = lm(y ~ x, data = dxy)$fitted
fitx2 = lm(y ~ poly(x, 2), data = dxy)$fitted

# loess illustration: a local linear fit in a neighborhood around x = 3
marker       = which(round(sort(x), 2) == 3)[1]
neighborhood = (marker - 100):(marker + 100)

neighbormod = lm(y ~ x, dxy[neighborhood, ])
neighborfit = fitted(neighbormod)
# neighborfit[names(neighborfit)==paste(marker)]
fitat3 = predict(neighbormod, newdata = data.frame(x = 3, y = NA))
```
```{r linearFitPlot, echo=F}
dxy %>%
ggplot(aes(x,y)) +
geom_point() +
geom_line(aes(y = fitlm), size = 1)
```
### Polynomial Regression
One common approach we could undertake is to add a transformation of the predictor $X$, and in this case we might consider a quadratic term such that our model looks something like the following.
$$
y \sim \mathcal{N}(\mu,\sigma^2)\\
\mu = b_0 + b_1\cdot x_1+b_2\cdot x_1^2
$$
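Fitting this remains a standard linear model problem; with the simulated data used for these plots, it could be done along these lines.
```{r examplequadcode, eval=F}
# quadratic fit via an orthogonal polynomial; lm(y ~ x + I(x^2), data = dxy)
# would produce the same fitted values
fit_quad = lm(y ~ poly(x, 2), data = dxy)
```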
And here is how the model would fit the data.
```{r quadraticFitPlot, echo=F}
dxy %>%
ggplot(aes(x,y)) +
geom_point() +
geom_line(aes(y = fitx2), size = 1)
```
<br>
We haven't really moved beyond the standard linear model in this case, but we do have a much better fit to the data, as evidenced by the graph.
### Scatterplot Smoothing
There are still other possible routes we could take. Many are probably familiar with *loess* (or lowess) regression, if only in the sense that statistical packages will often, either by default or with relative ease, add a nonparametric fit line to a scatterplot. By default, this is typically *lowess*, or *locally weighted scatterplot smoothing*.
Take a look at the following figure. For every (sorted) $x_0$ value, we choose a neighborhood around it and, for example, fit a simple regression. To illustrate, let's look at $x_0 = 3.0$, and choose a neighborhood of 100 X values below and 100 values above.
```{r loessgraph1, echo=FALSE}
# because plotly can't do calculations on the fly for some things
minx = min(sort(x)[neighborhood])
maxx = max(sort(x)[neighborhood])
maxy = max(y) + 2
dxy %>%
ggplot(aes(x, y)) +
geom_point(color = 'gray', alpha = .25) +
geom_vline(aes(xintercept = x), data = . %>% slice(neighborhood) %>% slice(1)) +
geom_vline(aes(xintercept = x), data = . %>% slice(neighborhood) %>% slice_tail(n =1)) +
geom_point(color = '#B2001D', data = . %>% slice(neighborhood))
```
<br>
Now, for just that range we fit a linear regression model and note the fitted value where $x_0=3.0$.
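This local fit mirrors what was done behind the scenes for these figures; a minimal sketch using the `neighborhood` indices chosen above:
```{r examplelocalfitcode, eval=F}
# simple regression using only the observations in the neighborhood of x = 3
neighbormod = lm(y ~ x, data = dxy[neighborhood, ])

# fitted value at x = 3 from this local model
predict(neighbormod, newdata = data.frame(x = 3))
```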
```{r loessgraph2, echo=FALSE}
dxy %>%
slice(neighborhood) %>%
ggplot(aes(x, y)) +
geom_point(color = '#B2001D', alpha = .25) +
geom_smooth(method = 'lm', se = FALSE, formula = y ~ x)
```
<br>
If we now do this for every $x_0$ value, we'll have fitted values based on a rolling neighborhood that is relative to each value being considered. Other options we could add to this process, and usually do, are to fit a polynomial regression within each neighborhood, and to weight the observations by their distance from the value in question, with less weight given to more distant observations.
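In practice we don't have to do any of this by hand; base R's `lowess` and `loess` functions take care of the neighborhoods, local polynomials, and distance weighting for us. A minimal sketch with the simulated data from before:
```{r exampleloesscode, eval=F}
# local polynomial fits with tricube distance weighting; span controls
# the neighborhood size (these are the default settings)
fit_loess = loess(y ~ x, data = dxy, span = .75, degree = 2)

# the classic lowess scatterplot smoother
fit_lowess = lowess(dxy$x, dxy$y)
```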
```{r loessgraph3, echo=FALSE, cache=FALSE}
dxy %>%
ggplot(aes(x, y)) +
geom_point(alpha = .25) +
geom_smooth(method = 'lm', se = FALSE, formula = y ~ x) +
annotate(x = 10, y = 50, geom = 'text', label = 'lm', color = okabe_ito[2]) +
geom_smooth(method = 'loess', color = okabe_ito[3], se = FALSE, formula = y ~ x) +
annotate(x = 8, y = 90, geom = 'text', label = 'loess', color = okabe_ito[3])
```
<br>
The above plot shows the result from such a fitting process. For comparison, the regular regression fit is also provided. Even without using a lowess approach, we could have fit a model assuming a quadratic relationship, $y = x + x^2$, and it would result in a far better fit than the simpler model[^yquad]. While polynomial regression might suffice in some cases, the issue is that nonlinear relationships are generally not specified so easily, as we'll see next[^popcrash].
```{r mcycleDataSetup, echo=FALSE}
library(splines)

# fit a variety of smoothers to the motorcycle accident data for comparison
data(mcycle, package = 'MASS')
attach(mcycle)

polyfit = fitted(lm(accel ~ poly(times, 3)))     # cubic polynomial
nsfit   = fitted(lm(accel ~ ns(times, df = 5)))  # natural splines
ssfit   = fitted(smooth.spline(times, accel))    # smoothing spline
lowfit  = lowess(times, accel, .2)$y             # lowess, span = .2
ksfit   = ksmooth(times, accel, 'normal', bandwidth = 5, x.points = times)$y  # kernel smoother
# locpoly(times, accel, kernel='normal', bandwidth=10)
gamfit  = fitted(mgcv::gam(accel ~ s(times, bs = "cs")))  # GAM; original drops duplicates

mcycle2 = data.frame(mcycle, polyfit, nsfit, ssfit, lowfit, ksfit, gamfit)

detach('mcycle')
mcycle3 =
mcycle2 %>%
pivot_longer(-c(times, accel), names_to = 'variable') %>%
mutate(
variable = factor(
variable,
labels = c(
'GAM',
'Kernel Smoother',
'Lowess',
'Natural Splines',
'Polynomial Regression (cubic)',
'Smoothing Splines'
)
),
variable = fct_relevel(
variable,
'Polynomial Regression (cubic)',
'Natural Splines',
'Kernel Smoother',
'Lowess',
'Smoothing Splines'
)
)
```
### Generalized Additive Models
The next figure shows a data set consisting of a series of measurements of head acceleration in a simulated motorcycle accident[^mcycle]. Time is in milliseconds, acceleration in g. Here we have data that are probably not going to be captured well by simple transformations of the predictors. We can compare various approaches, and the first is a straightforward cubic polynomial regression, which unfortunately doesn't help us much. We could try higher orders, which would help, but in the end we will need something more flexible, and generalized additive models can help us in such cases.
```{r mcycleFigure, fig.asp = .75, echo=FALSE}
mcycleFigure = ggplot(aes(x = times, y = accel), data = mcycle3) +
geom_point(color = '#D55E00',
alpha = .25,
size = 1) +
geom_line(aes(y = value), color = '#56B4E9', size = .75) +
facet_wrap( ~ variable) +
ylab('Acceleration') +
xlab('Time') +
theme(plot.background = element_blank())
mcycleFigure
# ggplotly(mcycleFigure) %>% # unlike plot_ly, ggplotly can't handle % width, but also just randomly disappears
# theme_plotly()
```
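For reference, the GAM panel in the figure was obtained with mgcv roughly as follows; `bs = "cs"` specifies a particular type of spline basis.
```{r examplemcyclegamcode, eval=F}
# GAM with a smooth function of time, as used for the figure above
data(mcycle, package = 'MASS')

mod_gam = mgcv::gam(accel ~ s(times, bs = "cs"), data = mcycle)
```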
## Summary
Let's summarize our efforts up to this point.
- Standard Linear Model
$$y\sim \mathcal{N}(\mu, \sigma^2)$$
$$\mu = b_0+b_1\cdot x_1$$
- Polynomial Regression
$$y\sim \mathcal{N}(\mu, \sigma^2)$$
$$\mu = b_0+b_1\cdot x_1+b_2\cdot x_1^2$$
- GLM formulation
$$y\sim ExpoFam(\mu, etc.)$$
$$g(\mu) = b_0+b_1\cdot x_1+b_2\cdot x_2$$
- GAM formulation
$$y\sim ExpoFam(\mu, etc.)$$
$$g(\mu) = f(X)$$
Now that some of the mystery has hopefully been removed, let's take a look at GAMs in action.
[^expofam]: Of the [exponential family](https://en.wikipedia.org/wiki/Exponential_family).
[^yquad]: $y$ was in fact constructed as $x + x^2 +$ noise.
[^popcrash]: Bolker 2008 notes an example of fitting a polynomial to 1970s U.S. Census data that would have predicted a population crash to zero by 2015.
[^mcycle]: This is a version of that found in @venables_modern_2002.