---
title: "Chapter_7_Non-Linear_Modeling"
author: "Russ Conte"
date: "8/7/2021"
output: html_document
---
## 7.8 Lab: Non-Linear Modeling
In this lab, we re-analyze the Wage data considered in the examples throughout this chapter, in order to illustrate the fact that many of the complex non-linear fitting procedures discussed can be easily implemented in R. We begin by loading the ISLR library, which contains the data.
```{r}
library(ISLR)
attach(Wage)
```
## 7.8.1 Polynomial Regression and Step Functions
```{r}
fit = lm(wage~poly(age, 4), data = Wage)
coef(summary(fit))
```
From the text:
We can also use poly() to obtain age, age^2, age^3, and age^4 directly, if we prefer. We can do this by using the raw = TRUE argument to the poly() function.
```{r}
fit2 = lm(wage~poly(age, 4, raw = T), data = Wage)
coef(summary(fit2))
```
From the text:
There are several other equivalent ways of fitting this model, which showcase the flexibility of the formula language in R. For example:
```{r}
fit2a = lm(wage~age + I(age^2) + I(age^3) + I(age^4), data = Wage)
coef(fit2a)
```
This simply creates the polynomial basis functions on the fly, taking care to protect terms like age^2 via the wrapper function I().
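Yet another equivalent approach uses cbind() inside the formula; any function call within a formula acts as a wrapper, so the powers are evaluated arithmetically:
```{r}
fit2b = lm(wage ~ cbind(age, age^2, age^3, age^4), data = Wage)
coef(fit2b)
```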
We now create a grid of values for age at which we want predictions, and then call the generic predict() function, specifying that we want standard errors as well.
```{r}
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
preds = predict(fit, newdata = list(age = age.grid), se = TRUE)
se.bands = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
# next we plot the data and add the fit from the degree-4 polynomial
par(mfrow = c(1,2),mar = c(4.5, 4.5, 1,1), oma = c(0, 0, 4, 0))
plot(age, wage, xlim = agelims, cex = 0.5, col = "darkgrey")
title("Degree-4 Polynomial", outer = TRUE)
lines(age.grid, preds$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands,lwd = 1, col = "blue", lty = 3)
```
Here the mar and oma arguments to par() allow us to control the margins of the plot, and the title() function creates a figure title that spans both subplots.
In performing a polynomial regression we must decide on the degree of the polynomial to use. One way to do this is by using hypothesis tests. We now fit models ranging from linear to a degree-5 polynomial and seek to determine the simplest model which is sufficient to explain the relationship between wage and age. We use the anova() function, which performs an analysis of variance (ANOVA, using an F-test) in order to test the null hypothesis that a model M1 is sufficient to explain the data against the alternative hypothesis that a more complex model M2 is required. In order to use the anova() function, M1 and M2 must be nested models: the predictors in M1 must be a subset of the predictors in M2. In this case, we fit five different models and sequentially compare the simpler model to the more complex model.
```{r}
fit.1 = lm(wage~age, data = Wage)
fit.2 = lm(wage~poly(age, 2), data = Wage)
fit.3 = lm(wage~poly(age, 3), data = Wage)
fit.4 = lm(wage~poly(age, 4), data = Wage)
fit.5 = lm(wage~poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
```
The p-value comparing the linear Model 1 to the quadratic Model 2 is essentially zero ($<10^{-15}$), indicating that a linear fit is not sufficient. Similarly, the p-value comparing the quadratic Model 2 to the cubic Model 3 is very low (0.0017), so the quadratic fit is also insufficient. The p-value comparing the cubic and degree-4 polynomials, Model 3 and Model 4, is approximately 5%, while the degree-5 polynomial Model 5 seems unnecessary because its p-value is 0.37. Hence, either a cubic or a quartic polynomial appears to provide a reasonable fit to the data, but lower- or higher-order models are not justified.
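Because poly() generates orthogonal polynomials, these same p-values can be obtained more succinctly from the coefficient t-statistics of the largest model:
```{r}
coef(summary(fit.5))
# the square of each t-statistic above equals the corresponding
# F-statistic in the ANOVA table, since the polynomial terms are orthogonal
```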
## Logistic regression using polynomial functions
Next we consider the task of predicting whether an individual earns more than $250,000 per year. We proceed much as before, except that first we create the appropriate response vector, and then apply the glm() function using family="binomial" in order to fit a polynomial logistic regression model.
```{r}
fit = glm(I(wage>250)~poly(age, 4), data = Wage, family = binomial)
```
## Once again, we make predictions using the predict() function.
```{r}
preds = predict(fit, newdata = list(age = age.grid), se = T)
# predict() for a binomial glm returns fits on the logit scale by default,
# so we transform them to probabilities
pfit = exp(preds$fit) / (1 + exp(preds$fit))
se.bands.logit = cbind(preds$fit + 2*preds$se.fit, preds$fit - 2*preds$se.fit)
se.bands = exp(se.bands.logit) / (1 + exp(se.bands.logit))
```
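Note that we could have computed the probabilities directly by selecting type = "response" in the predict() call. However, confidence bands formed on that scale would not be sensible, as they could include negative probabilities, which is why the code above transforms the fit and bands from the logit scale:
```{r}
# direct probabilities; the fitted values match pfit, but fit +/- 2*se on
# this scale can fall outside [0, 1], so the logit-scale transform is preferred
preds2 = predict(fit, newdata = list(age = age.grid), type = "response", se = T)
```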
Finally, the right-hand plot from Figure 7.1 was made as follows:
```{r}
plot(age, I(wage>250), xlim = agelims, type = "n", ylim = c(0, 0.2))
points(jitter(age), I((wage>250) / 5), cex = 0.5, pch = "|", col = "darkgrey")
lines(age.grid, pfit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)
```
## Step function
In order to fit a step function, as discussed in Section 7.2, we use the cut() function.
```{r}
table(cut(age, 4))
fit = lm(wage~cut(age, 4), data = Wage)
coef(summary(fit))
```
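Here cut() chose the cutpoints automatically; we could instead specify our own via its breaks argument. The cutpoints below are illustrative (they must span the observed age range, 18 to 80):
```{r}
# a sketch with hand-picked cutpoints; cut() returns an ordered factor,
# which lm() treats as a qualitative predictor
fit.breaks = lm(wage ~ cut(age, breaks = c(17, 25, 40, 60, 81)), data = Wage)
coef(summary(fit.breaks))
```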
## 7.8.2 Splines
In order to fit regression splines in R, we use the splines library. In Section 7.4, we saw that regression splines can be fit by constructing an appropriate matrix of basis functions. The bs() function generates the entire matrix of basis functions for splines with the specified set of knots. By default, cubic splines are produced. Fitting wage to age using a regression spline is simple:
```{r}
library(splines)
fit = lm(wage~bs(age, knots = c(25, 40, 60)), data = Wage)
pred = predict(fit, newdata = list(age = age.grid), se = T)
plot(age, wage, col = "gray")
lines(age.grid, pred$fit, lwd = 2)
lines(age.grid, pred$fit + 2*pred$se, lty = "dashed")
lines(age.grid, pred$fit - 2*pred$se, lty = "dashed")
```
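Instead of specifying knots directly, we could use the df option to produce a spline with knots placed at uniform quantiles of the data:
```{r}
dim(bs(age, knots = c(25, 40, 60))) # six basis functions
dim(bs(age, df = 6))                # same dimension, knots chosen automatically
attr(bs(age, df = 6), "knots")      # knots at the 25th, 50th, and 75th percentiles
```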
In order to instead fit a natural spline, we use the ns() function. Here we fit a natural spline with four degrees of freedom.
```{r}
fit2 = lm(wage~ns(age, df = 4), data = Wage)
pred2 = predict(fit2, newdata = list(age = age.grid), se = T)
lines(age.grid, pred2$fit, col = "red", lwd = 2)
```
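As with bs(), specifying df = 4 places the knots at uniform quantiles of age; a quick check of the knots attribute shows where they landed:
```{r}
attr(ns(age, df = 4), "knots")
```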
In order to fit a smoothing spline, we use the smooth.spline() function. Figure 7.8 was produced with the following code:
```{r}
plot(age, wage, xlim = agelims, cex = 0.5, col = "darkgrey")
title("Smoothing Splines")
fit = smooth.spline(age, wage, df = 16)
fit2 = smooth.spline(age, wage, cv = TRUE)
fit2$df # effective degrees of freedom selected by cross-validation (about 6.8)
lines(fit, col = "red", lwd = 2)
lines(fit2, col = "blue", lwd = 2)
legend("topright", legend = c("16 DF", "6.8 DF"), col = c("red", "blue"), lty = 1, lwd = 2, cex = 0.8)
```
In order to perform local regression, we use the loess() function.
```{r}
plot(age, wage, xlim = agelims, cex = 0.5, col = "darkgrey")
title("Local Regression")
fit = loess(wage~age, span = 0.2, data = Wage)
fit2 = loess(wage~age, span = 0.5, data = Wage)
lines(age.grid, predict(fit, data.frame(age = age.grid)), col = "red", lwd = 2)
lines(age.grid, predict(fit2, data.frame(age = age.grid)), col = "blue", lwd = 2)
legend("topright", legend = c("Span = 0.2", "Span = 0.5"), col = c("red", "blue"), lty = 1, lwd = 2, cex = 0.8)
```
## 7.8.3 GAMs
We now fit a GAM to predict wage using natural spline functions of year and age, treating education as a qualitative predictor, as in (7.16). Since this is just a big linear regression model using an appropriate choice of basis functions, we can simply do this using the lm() function.
```{r}
gam1 = lm(wage~ns(year, 4) + ns(age, 5) + education, data = Wage)
```
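Even though gam1 is of class lm rather than gam, the plot.Gam() function from the gam package can still be used to display its terms:
```{r}
library(gam)        # provides plot.Gam()
par(mfrow = c(1, 3))
plot.Gam(gam1, se = TRUE, col = "red")
```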
## Fit the model with smoothing splines, rather than natural splines
```{r}
library(gam)
gam.m3 = gam(wage~s(year, 4) + s(age, 5) + education, data = Wage)
par(mfrow = c(1,3))
plot(gam.m3, se = TRUE, col = "blue")
```
In these plots, the function of year looks rather linear. We can perform a series of ANOVA tests in order to determine which of these three models is best: a GAM that excludes year (M1), a GAM that uses a linear function of year (M2), or a GAM that uses a spline function of year (M3).
```{r}
gam.m1 = gam(wage~s(age, 5) + education, data = Wage)
gam.m2 = gam(wage~year + s(age, 5) + education, data = Wage)
anova(gam.m1, gam.m2, gam.m3, test = "F")
```
We find that there is compelling evidence that a GAM with a linear function of year is better than a GAM that does not include year at all (p-value = 0.00014). However, there is no evidence that a non-linear function of year is needed (p-value = 0.349). In other words, based on the results of this ANOVA, M2 is preferred.
The summary() function produces a summary of the GAM fit. The p-values for year and age in the "Anova for Nonparametric Effects" section correspond to a null hypothesis of a linear relationship versus the alternative of a non-linear relationship:
```{r}
summary(gam.m3)
```
We can make predictions from gam objects, just like from lm objects, using the predict() method for the class gam. Here we make predictions on the training set:
```{r}
preds = predict(gam.m2, newdata = Wage)
```
We can also use local regression fits as building blocks in a GAM, using the lo() function.
```{r}
gam.lo = gam(wage~s(year, df = 4) + lo(age, span = 0.7) + education, data = Wage)
plot.Gam(gam.lo, se = TRUE, col = "green") # typo in the text, should be plot.Gam, not plot.gam
```
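The lo() function can also be used to create interactions before calling gam(). For example, the following fits a two-term model in which the first term is an interaction between year and age, fit by a local regression surface. Plotting the resulting two-dimensional surface requires the akima package, so that call is left commented out in case it is not installed:
```{r}
gam.lo.i = gam(wage ~ lo(year, age, span = 0.5) + education, data = Wage)
# library(akima)  # needed to plot the bivariate local regression surface
# plot(gam.lo.i)
```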
In order to fit a logistic regression GAM, we once again use the I() function in constructing the binary response variable, and set family = binomial.
```{r}
gam.lr = gam(I(wage >250) ~ year + s(age, df = 5) + education, family = binomial, data = Wage)
par(mfrow = c(1,3))
plot(gam.lr, se = T, col = "green")
```
It is easy to see that there are no high earners in the <HS category:
```{r}
table(education, I(wage >250))
```
Hence, we fit a logistic regression GAM using all but this category. This provides more sensible results.
```{r}
gam.lr.s = gam(I(wage>250) ~ year + s(age, df = 5) + education, family = binomial, data = Wage, subset = (education!="1. < HS Grad"))
plot(gam.lr.s, se = T, col = "green")
```