# Issues
```{r setupIssues, include=FALSE}
# knitr::opts_chunk$set(cache.rebuild=F, cache = T)
```
Here we'll cover some additional things to consider.
## Estimation
As noted previously, estimation of GAMs in the <span class="pack">mgcv</span> package is conducted via a penalized likelihood approach. More detail can be found in the [technical section][Technical details], but conceptually this amounts to fitting the following model:
$$g(\mu) = f(x_1) + f(x_2) + \dots + f(x_j)$$
But note that each smooth has its own model matrix made up of the basis functions. So for each smooth feature $j$ we have:
$$f_j = \tilde{X}_j \tilde{\beta}_j$$
Given penalty matrices $S_j$ of known coefficients, we can more formally write a penalized likelihood function:
$$l_p(\beta)=\displaystyle l(\beta) - \frac{1}{2}\sum_j\lambda_j \beta^\mathrm{T} S_j\beta$$
where $l(\beta)$ is the usual GLM likelihood function, and the $\lambda_j$ are the smoothing parameters. The part of the function involving $\lambda$ penalizes curvature in the function: $\lambda$ establishes a trade-off between goodness of fit and smoothness, which helps guard against overfitting. As $\lambda\rightarrow\infty$, we'd have a linear estimate for $f_j$, while $\lambda = 0$ would allow any $f$ that interpolates the data[^lambdaset]. Technically we could specify the smoothing parameters explicitly, and the [Appendix][a detailed example] has some 'by-hand' code taken directly from @wood_generalized_2006 with only slight modifications, where the smoothing parameters are chosen and compared.
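As a minimal sketch, smoothing parameters can be fixed by hand via the `sp` argument, one value per smooth in the order the smooths appear (the values here are arbitrary, purely for illustration):
```{r fixed_sp, eval=FALSE}
# fix the smoothing parameters rather than estimate them
mod_fixed_sp = gam(Overall ~ s(Income) + s(Edu), data = pisa, sp = c(.01, .01))
```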
Smoothing parameters, however, are in fact estimated rather than arbitrarily set, and this brings us back to the cross-validation procedure mentioned before. By default, smoothing parameters are selected to minimize the GCV score, though one has other options, e.g. using REML as with mixed models. Note that there are other approaches to estimation, such as backfitting, generalized smoothing splines, and Bayesian estimation.
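Switching the selection criterion is just an argument change; the following sketch uses REML in place of the default:
```{r reml_sketch, eval=FALSE}
# select smoothing parameters by REML instead of the default GCV
mod_reml = gam(Overall ~ s(Income) + s(Edu), data = pisa, method = 'REML')
```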
### Shrinkage & Variable Selection
Some smooths are such that no matter the smoothing parameter, there will always be non-zero coefficients for the basis functions. An extra penalty may be added such that, if the smoothing parameter is large enough, the coefficients will shrink to zero, and some smoothing bases have such alternative approaches available[^selectarg]. In this manner, one can assess whether a predictor is adding anything to the model, i.e. whether its effective degrees of freedom is near zero, and perhaps use the approach as a variable selection technique.
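A sketch of both options follows: `select = TRUE` adds the extra penalty to every term, while a shrinkage basis such as `'ts'` (TPRS with shrinkage; `'cs'` is the cubic analogue) applies it per term.
```{r shrinkage_sketch, eval=FALSE}
# extra penalty on all terms, allowing effective degrees of freedom near zero
mod_select = gam(Overall ~ s(Income) + s(Edu), data = pisa, select = TRUE)

# or use a shrinkage basis for particular terms
mod_shrink = gam(Overall ~ s(Income, bs = 'ts') + s(Edu, bs = 'ts'), data = pisa)
```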
## Choice of Smoothing Function
A number of smooths are available with the <span class="pack">mgcv</span> package, and one can learn more via the help file for <span class="func">smooth.terms</span> ([link](https://stat.ethz.ch/R-manual/R-devel/library/mgcv/html/smooth.terms.html)). In our models, we have used cubic regression splines and thin plate regression splines (TPRS), the latter being the default for a GAM in this package. As a brief summary, TPRS perform well in general, have some particular advantages besides, and have a shrinkage alternative available. One should still feel free to play around, particularly when dealing with multiple smooths, where tensor product smooths would be better for covariates of different scales.
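Swapping smooth types is just a change to the `bs` argument, and tensor products have their own constructors; a brief sketch:
```{r smooth_types, eval=FALSE}
# cubic regression spline instead of the default thin plate regression spline
mod_cr = gam(Overall ~ s(Income, bs = 'cr'), data = pisa)

# a tensor product smooth, suited to covariates on different scales
mod_te = gam(Overall ~ te(Income, Edu), data = pisa)
```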
## Diagnostics
We have some built-in abilities to examine whether there are any particular issues, and we can try it with our second GAM model from the [application section][application using r].
```{r gam_check, fig.show='hide', eval=FALSE}
gam.check(mod_gam2, k.rep = 1000)
```
```{r gam_check-show, fig.show='hide', echo=FALSE}
test = capture.output(gam.check(mod_gam2, k.rep = 1000))
cat(paste0(test, collapse = '\n'))
# par(bg = 'transparent')
# gam.check(
# mod_gam2,
# k.rep = 1000,
# bty = 'n',
# pch = 20,
# cex = 1,
# col = scales::alpha('#D55E00', .75),
# cex.main = .75,
# cex.lab = .75,
# rl.col = scales::alpha('#56B4E9', .5),
# border = 'gray90',
# breaks = 20,
# col.axis = 'gray50',
# col.lab = 'gray25',
# col.main = 'gray10'
# )
```
This visualization is from <span class="pack">visibly's</span> <span class="func">plot_gam_check</span>.
```{r gam_check_show, echo=FALSE, fig.asp=.75}
plot_gam_check(mod_gam2, scatter = TRUE)
```
The plots are of the sort we're used to from a typical regression setting, though it's perhaps a bit difficult to draw any grand conclusion from such a small data set. The printed output, on the other hand, contains unfamiliar information, but is largely concerned with over-smoothing, and so provides tests of whether the basis dimension for a smooth is too low. The p-values are based on simulation, so I bumped up the number of replications with the additional argument. Guidelines are given in the output itself, and at least in this case it does not look like we have an issue. However, if there were a potential problem, the suggestion is to double the `k` parameter[^kpar] and refit; if the effective degrees of freedom increases quite a bit, you would probably want to go with the updated model[^incK].
[^kpar]:The `k` can be set as an argument to `s(var1, k=?)`.
Given the penalization process, the exact choice of $k$ isn't too big of a deal, and can be seen as an upper limit to the flexibility of the term. The actual flexibility is determined via the penalization process. However, the defaults are arbitrary. You want to set $k$ large enough to capture the true effect as best as possible, but in some cases computational efficiency will also be of concern. For example, in fairly complex models with many predictors, interactions, etc., it might be worthwhile to reduce $k$ at the outset. The <span class="func">choose.k</span> help file provides another approach to examining $k$ based on the residuals from the model under consideration, and provides other useful information.
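As a sketch of the refit check, assuming `mod_gam2`'s smooths of Income, Edu, and Health from the application section (per the footnote, this was actually tried for Health), double `k` for the suspect term and compare effective degrees of freedom:
```{r double_k, eval=FALSE}
# the TPRS default is k = 10, so double it for the suspect term and refit
mod_gam2_k = gam(Overall ~ s(Income) + s(Edu) + s(Health, k = 20), data = pisa)

# compare effective degrees of freedom with the original model
summary(mod_gam2_k)$edf
```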
### Concurvity
*Concurvity* refers to the generalization of collinearity to the GAM setting[^noconcurv]. In this case it refers to the situation where a smooth term can be approximated by some combination of the others. It largely results in the same problem as elsewhere, i.e. unstable estimates.
Wood provides three indices related to concurvity via the <span class="func">concurvity</span> function, all of which range from 0 to 1, with 0 suggesting no problem and 1 indicating that the function lies entirely in the space of one or more of the other smooth terms. See `?concurvity` for details.
```{r concurvity, eval=FALSE}
concurvity(mod_gam2)
```
```{r concurvity_show, echo=FALSE}
concurvity(mod_gam2) %>%
as_tibble(rownames = 'type') %>%
gt()
```
<br>
It should probably come as little surprise that we may have an issue here, given the nature of the covariates. We can certainly make assumptions about wealthier nations' education and health status, for example. What can we do? Collinearity does not lead to biased estimates, only less stable ones, and the inflated variance can potentially be overcome with more data. We can't do that here, as we are dealing with country-level data. That fact may also provide the solution though, since there is nothing to generalize to: we have the population of interest (all countries with PISA scores). In other data settings, however, we may need to think hard about what to include in the model to avoid such redundancy, take dimension reduction steps beforehand, or use some other selection technique. For example, if the issue is the result of having several [time- or spatially- covarying predictors][Time and Space], one might be able to retain only those that best capture that effect. However, the <span class="pack">mgcv</span> estimation procedures have been developed with such issues in mind, and one can feel fairly confident in the results even in the presence of concurvity. See @wood2008 for additional detail.
## Prediction
A previous example used the <span class="func">predict</span> function on the data used to fit the model to obtain fitted values on the response scale. Typically we'd use it on new data. I do not cover this further, because the functionality is the same as that of the <span class="func">predict.glm</span> function in base R, and one can just refer to that. It is worth noting again that there is an option, `type = 'lpmatrix'`, which will return the model matrix by which the coefficients must be pre-multiplied to get the values of the linear predictor at the supplied covariate values. This can be particularly useful for opening the [black box][The Black Box] as one learns the technique.
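As a brief sketch of the latter, again assuming `mod_gam2` from the application section (the 'new' data here are just the first couple of complete rows, purely for illustration):
```{r lpmatrix_sketch, eval=FALSE}
# the lpmatrix approach: pre-multiply the coefficients to get the linear predictor
newd = na.omit(pisa)[1:2, ]
Xp   = predict(mod_gam2, newdata = newd, type = 'lpmatrix')

drop(Xp %*% coef(mod_gam2))  # matches predict(mod_gam2, newd) on the link scale
```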
## Model Comparison Revisited
We have talked about automated smoothing parameter and term selection, and in general, potential models are selected based on estimation of the smoothing parameter. Using an extra penalty to allow coefficients to tend toward zero, via the argument `select = TRUE`, is an automatic way to go about it, where some terms could effectively drop out. Otherwise, we could compare models' GCV/AIC scores[^gcvdifffam], and in general, either of these would be viable approaches. Consider the following comparison:
```{r mod_comparison}
mod_1d = gam(Overall ~ s(Income) + s(Edu), data = pisa)
mod_2d = gam(Overall ~ te(Income, Edu, bs = "tp"), data = pisa)
AIC(mod_1d, mod_2d)
```
```{r mod_comparison_show, echo=FALSE}
AIC(mod_1d, mod_2d) %>%
as_tibble(rownames = 'model') %>%
gt()
```
<br>
In some cases, we might prefer to be explicit in comparing models with and without particular terms, and we can go about comparing models as we would with a typical GLM analysis of deviance. We have demonstrated this previously using the <span class="func">anova.gam</span> function, where we compared linear fits to a model with an additional smooth function. While we could construct a scenario that is identical to the GLM situation for a statistical comparison, it should be noted that in the usual situation the test is actually an approximation, though it should be close enough when it is appropriate in the first place. The following provides an example that would nest the main effects of Income and Education within the product smooth, i.e. sets their basis dimension and smoothing function to the defaults employed by the tensor product smooth.
```{r anova_nest, results='hide'}
mod_A = gam(
Overall ~
s(Income, bs = "cr", k = 5)
+ s(Edu, bs = "cr", k = 5),
data = pisa
)
mod_B = gam(
Overall ~
ti(Income, bs = "cr", k = 5)
+ ti(Edu, bs = "cr", k = 5)
+ ti(Income, Edu, bs = 'cr'),
data = pisa
)
anova(mod_A, mod_B, test = "Chisq")
```
```{r anova_nest_show, echo=FALSE}
anova(mod_A, mod_B, test = "Chisq") %>%
as_tibble() %>%
mutate(model = c('mod_A', 'mod_B')) %>%
relocate(model) %>%
gt()
```
<br>
Again though, we could have just used the summary output from the second model.
Such a statistical test is not appropriate within the context of the <span class="pack">mgcv</span> package when terms are able to be penalized to zero; in that case, p-values will be much too low. In addition, when comparing GAMs, the nesting of models is sometimes not so clear when multiple smooths are involved, and additional steps may need to be taken to ensure they are nested before using the statistical test. We must make sure that each smooth term in the null model has no more effective degrees of freedom than the same term in the alternative; otherwise, it's possible that the model with more terms has lower effective degrees of freedom but better fit, rendering the test nonsensical. Wood suggests that, if such model comparison is the ultimate goal, an unpenalized approach[^unpenal] would be best in order to have much confidence in the p-values[^gsspack].
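Following the footnote's suggestion, a sketch of such an unpenalized comparison, using fixed-df regression splines via `fx = TRUE` (the `k` values are left at their defaults here, though they now matter much more):
```{r unpenalized_sketch, eval=FALSE}
# unpenalized fits make the nested comparison better behaved
mod_A_fx = gam(Overall ~ s(Income, fx = TRUE) + s(Edu, fx = TRUE), data = pisa)

mod_B_fx = gam(
  Overall ~ s(Income, fx = TRUE) + s(Edu, fx = TRUE) + ti(Income, Edu, fx = TRUE),
  data = pisa
)

anova(mod_A_fx, mod_B_fx, test = 'Chisq')
```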
## Big Data
Many times you'll come across some notably large data to which you'd still like to fit a GAM. Wood has implemented a lot of speed and memory efficiency into the <span class="func">bam</span> function, which works basically the same as the <span class="func">gam</span> function. Assuming you can get the data into memory, even millions of rows will likely pose little issue for <span class="pack">mgcv</span>.
Where things do start to hurt computationally is in the number of parameters. While <span class="pack">mgcv</span> is great just as a [mixed model tool to incorporate random effects][A comparison to mixed models], as the dimensionality grows, <span class="func">bam</span>'s performance will start to suffer. This would likely not be a big deal for standard implementation of GAM for smooth terms, but when using it for categorical random effects (i.e. `bs = 're'`), having a lot of groups means a lot of parameters to estimate. I have a [demo that provides more details](https://m-clark.github.io/posts/2019-10-20-big-mixed-models/).
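A hypothetical sketch of the above; `big_df` and its columns `y`, `x1`, `x2`, and the grouping factor `g` are stand-ins, not data from this document:
```{r bam_sketch, eval=FALSE}
# bam uses the same formula interface as gam; discrete = TRUE
# discretizes covariate values for additional speed on large data
mod_big = bam(
  y ~ s(x1) + s(x2) + s(g, bs = 're'),  # 're' = categorical random effect
  data     = big_df,
  discrete = TRUE
)
```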
[^lambdaset]: See p. 128, @wood_generalized_2006, or p. 168 in @wood_generalized_2017.
[^selectarg]: See also, the `select` argument to the <span class="func">gam</span> function.
[^incK]: I actually did this for Health, just for demonstration, and there was no change at all; it still reduced to a linear effect.
[^noconcurv]: The topic is strangely absent in Wood's text.
[^gcvdifffam]: GCV scores are not useful for comparing fits of different families; AIC is still okay though.
[^unpenal]: This can be achieved with the argument `s(..., fx=T)`, although now one has to worry more about the k value used, as very high k will lead to low power.
[^gsspack]: One might also examine the <span class="pack">gss</span> package for an ANOVA based approach to generalized smoothing splines.