---
title: "Ch.5 Exercises: Resampling Methods"
output:
  html_document: default
  pdf_document:
    latex_engine: lualatex
  word_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
#### __Conceptual__
__1.__
$$ \alpha = \frac{\sigma^2_Y - \sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}}\qquad (5.6)$$
We have $\sigma^2_X=Var(X)$, $\sigma^2_Y=Var(Y)$, and $\sigma_{XY}=Cov(X,Y)$.
Using the following property of the variance of a linear combination:
$$Var(aX + bY) = a^2Var(X) + b^2Var(Y) + 2ab\,Cov(X,Y)$$
We can show that $\alpha$ minimizes $Var(\alpha X+(1-\alpha)Y)$.
$$ \begin{aligned}
f := Var(\alpha X+(1-\alpha)Y) &= \alpha^2\sigma^2_X + (1-\alpha)^2\sigma^2_Y + 2\alpha(1-\alpha)\sigma_{XY}\\
&= \alpha^2\sigma^2_X + \sigma^2_Y - 2\alpha\sigma^2_Y + \alpha^2\sigma^2_Y + 2\alpha\sigma_{XY} - 2\alpha^2\sigma_{XY}
\end{aligned} $$
Setting the derivative with respect to $\alpha$ to zero and rearranging:
$$ \begin{aligned}
\frac{\partial f}{\partial\alpha} = 2\alpha\sigma^2_X - 2\sigma^2_Y + 2\alpha\sigma^2_Y + 2\sigma_{XY} - 4\alpha\sigma_{XY} &= 0\\
\alpha &= \frac{\sigma^2_Y - \sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}}
\end{aligned} $$
Since $\frac{\partial^2 f}{\partial\alpha^2} = 2\sigma^2_X + 2\sigma^2_Y - 4\sigma_{XY} = 2\,Var(X-Y) \geq 0$, this critical point is indeed a minimum.
__2.__ __(a)__
A bootstrap sample is obtained by sampling n observations with replacement. As an illustration, take n=3 with observations 1, 2, 3:
Total possible (ordered) bootstrap samples: $n^n = 3^3 = 27$.
Number of samples in which a given observation appears in a particular position (say, first): $n^{n-1} = 3^2 = 9$.
So the probability that the 1st bootstrap observation is the jth observation is $9/27 = 1/n$.
Therefore, the probability that the 1st bootstrap observation is *not* the jth observation is $1 - 1/n$.
__(b)__
$1 - 1/n$
__(c)__
Number of (ordered) bootstrap samples in which the jth observation never appears: $(n-1)^n$, since each of the $n$ draws must come from the other $n-1$ observations.
Therefore, the probability that the jth observation is not in the bootstrap sample is $\frac{(n-1)^n}{n^n} = (1-1/n)^n$.
__(d)__
$p = 1 - (1-1/5)^5 = 0.672$
__(e)__
$p = 1 - (1-1/100)^{100} = 0.634$
__(f)__
$p = 1 - (1-1/10000)^{10000} = 0.632$
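The pattern in (d)-(f) reflects the limit $\lim_{n\to\infty}\left(1-\frac{1}{n}\right)^n = e^{-1}$, so $p \to 1 - e^{-1} \approx 0.632$ as $n$ grows.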
__(g)__
```{r}
n = 1:100000
p = 1-(1-1/n)^n
plot(n,p, main="Probability (p) vs n observations.")
```
- The probability decreases rapidly and then levels off at the asymptote of $1 - 1/e \approx 0.632$ as n increases.
__(h)__
```{r}
store = rep(NA, 10000)
for(i in 1:10000){
  # Draw a bootstrap sample of size 100 and check whether observation 4 appears.
  store[i] = sum(sample(1:100, 100, replace = TRUE) == 4) > 0
}
mean(store)
```
- The result is similar to the value calculated using the formula in 2(e).
__3.__ __(a)__
In k-fold cross-validation, the observations are randomly divided into k groups (folds) of roughly equal size. The first fold is treated as the validation set, the model is fit on the remaining k-1 folds, and the MSE is computed on the held-out fold. This procedure is repeated k times, with a different fold serving as the validation set each time. The final estimate is the average of the k MSEs, as shown below.
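In symbols (equation (5.3) in the text):
$$CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}MSE_i$$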
__(b)__
i. Relative to the validation set approach: a less variable test error estimate, and a more accurate one, since every observation is eventually used for both fitting and validation rather than a single random half being held out.
ii. Relative to LOOCV: a computational advantage, since if k=10 only 10 models need to be fitted, unlike LOOCV where n models need fitting. K-fold CV has higher bias than LOOCV, as fewer observations are used to fit each model, but it tends to have lower variance. A choice of k=5 or k=10 gives a test error estimate with neither excessively high bias nor high variance, and so tends to give more accurate estimates in general.
__4.__
- Write a function that, given a bootstrap sample of the observations, refits the model and returns the prediction of $Y$ at the chosen value of $X$.
- Use the bootstrap: repeatedly resample the observations with replacement and apply this function to each resample.
- The standard deviation of the resulting bootstrap predictions (the bootstrap standard error) is our estimate of the standard deviation of the prediction; see the sketch below.
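A minimal sketch of this procedure with `boot()`, assuming a hypothetical data frame `dat` with response `y` and predictor `x`, a query point `x0`, and a linear model standing in for whichever learning method is used (all names illustrative):
```{r, eval=FALSE}
library(boot)
# dat: data frame with columns y and x (hypothetical).
# x0: the value of X at which we want to predict Y (hypothetical).
pred.fn = function(data, index){
  fit = lm(y ~ x, data = data[index, ])       # refit on the bootstrap sample
  predict(fit, newdata = data.frame(x = x0))  # prediction at X = x0
}
boot(dat, pred.fn, R = 1000)
# The "std. error" column of the output is the estimated
# standard deviation of the prediction.
```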
#### __Applied__
__5.__ __(a)__ __(b)__
```{r}
library(ISLR)
library(MASS)
```
```{r}
set.seed(1)
library(caTools)
df = Default
# Split the data 50/50 into training and validation sets.
sample_data = sample.split(df$default, SplitRatio = 0.50)
training.set = subset(df, sample_data==TRUE)
test.set = subset(df, sample_data==FALSE)
test.default = test.set$default
# Logistic regression model
lr.fit = glm(default ~ income+balance, data=training.set, family=binomial)
lr.probs = predict(lr.fit, test.set, type="response")
lr.preds = rep("No", length(test.set$default))
lr.preds[lr.probs>0.5] = "Yes"
table(lr.preds, test.default)
```
- Test error rate of 2.5%.
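The quoted error rate can also be computed directly, rather than read off the confusion matrix:
```{r}
# Fraction of misclassified observations in the validation set.
mean(lr.preds != test.default)
```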
__(c)__
```{r}
# Assuming the question is referring to separate random splits rather than reducing or increasing the validation set.
set.seed(12)
sample_data = sample.split(df$default, SplitRatio = 0.50)
training.set = subset(df, sample_data==TRUE)
test.set = subset(df, sample_data==FALSE)
test.default = test.set$default
# Logistic regression model
lr.fit = glm(default ~ income+balance, data=training.set, family=binomial)
lr.probs = predict(lr.fit,test.set, type="response")
lr.preds = rep("No", length(test.set$default))
lr.preds[lr.probs>0.5] = "Yes"
table(lr.preds,test.default)
```
- Test error rate of 2.7%
```{r}
set.seed(123)
sample_data = sample.split(df$default, SplitRatio = 0.50)
training.set = subset(df, sample_data==TRUE)
test.set = subset(df, sample_data==FALSE)
test.default = test.set$default
# Logistic regression model
lr.fit = glm(default ~ income+balance, data=training.set, family=binomial)
lr.probs = predict(lr.fit,test.set, type="response")
lr.preds = rep("No", length(test.set$default))
lr.preds[lr.probs>0.5] = "Yes"
table(lr.preds,test.default)
```
- Test error rate of 2.4%
```{r}
set.seed(1234)
sample_data = sample.split(df$default, SplitRatio = 0.50)
training.set = subset(df, sample_data==TRUE)
test.set = subset(df, sample_data==FALSE)
test.default = test.set$default
# Logistic regression model
lr.fit = glm(default ~ income+balance, data=training.set, family=binomial)
lr.probs = predict(lr.fit,test.set, type="response")
lr.preds = rep("No", length(test.set$default))
lr.preds[lr.probs>0.5] = "Yes"
table(lr.preds,test.default)
```
- Test error rate of 2.7%
- The error rates are similar across the four random splits (low variance). The minor differences arise because each split assigns different observations to the training and validation sets.
__(d)__
```{r}
# Logistic regression model, dummy variable used for student.
lr.fit = glm(default ~ income+balance+student, data=training.set, family=binomial)
lr.probs = predict(lr.fit,test.set, type="response")
lr.preds = rep("No", length(test.set$default))
lr.preds[lr.probs>0.5] = "Yes"
contrasts(Default$student)
table(lr.preds,test.default)
```
- Test error rate of 2.7%, unchanged from the equivalent split in (c); adding the student dummy variable does not appear to reduce the test error.
__6.__ __(a)__ __(b)__
```{r}
set.seed(111)
# Logistic regression model: extract the standard errors of the
# income and balance coefficients from the model summary.
lr.fit2 = glm(default ~ income+balance, data=df, family=binomial)
summary(lr.fit2)$coefficients[2:3,2]
```
```{r}
# boot.fn must return the coefficient *estimates* for income and balance;
# boot() then estimates their standard errors as the standard deviation
# of these estimates across bootstrap samples.
boot.fn = function(data, index){
  lr.fit2 = glm(default ~ income + balance, data = data, subset = index, family = binomial)
  return(coef(lr.fit2)[2:3])
}
```
```
```{r}
# Applied to the full dataset, boot.fn reproduces the coefficient estimates.
boot.fn(df, 1:nrow(df))
```
__(c)__
```{r}
# Using boot() with R=100.
library(boot)
boot(df,boot.fn,100)
```
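- Note that with only R=100 resamples the bootstrap standard errors are somewhat noisy; a larger value such as R=1000 (as used in question 9 below) would give more stable estimates.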
__(d)__
- The bootstrap standard errors are very close to the formula-based estimates from `summary()`, and slightly lower for both the income and balance coefficients here. Since the bootstrap does not rely on the model-based assumptions behind the formula standard errors, it provides a useful check on them.
__7.__ __(a)__ __(b)__
```{r}
set.seed(222)
lr.fit3 = glm(Direction ~ Lag1+Lag2, data=Weekly, family=binomial)
summary(lr.fit3)
```
__(c)__
```{r}
# Logistic regression using all observations except the first.
lr.fit3 = glm(Direction ~ Lag1+Lag2, data=Weekly[-c(1),], family=binomial)
# Predicting the direction of the first observation.
lr.probs3 = predict(lr.fit3,Weekly[1,], type="response")
if(lr.probs3 > 0.5) "Up" else "Down"
```
- Prediction is "Up" and this is incorrect.
__(d)__
```{r}
lr.preds = rep(NA, length(Weekly$Direction))
lr.err = rep(NA, length(Weekly$Direction))
for (i in 1:length(Weekly$Direction)){
  # Fit on all observations except the ith, then predict observation i.
  lr.fit = glm(Direction ~ Lag1+Lag2, data=Weekly[-i,], family=binomial)
  lr.probs = predict(lr.fit, Weekly[i,], type="response")
  # Record the prediction, and an error indicator: 1 if incorrect, 0 if correct.
  if(lr.probs > 0.5) lr.preds[i]="Up" else lr.preds[i]="Down"
  if(lr.preds[i]==Weekly$Direction[i]) lr.err[i]=0 else lr.err[i]=1
}
```
__(e)__
Count of 1's divided by n:
```{r}
#LOOCV Error rate
cv.err = sum(lr.err==1)/length(Weekly$Direction)
print(cv.err)
table(Weekly$Direction)
```
- The LOOCV error rate is 44.9%, i.e. the model is correct in about 55% of its predictions. This is only slightly better than random guessing, and comparable to simply always predicting "Up", the majority class (see the class counts above and the check below).
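For comparison, the error rate of a naive strategy that always predicts "Up" is:
```{r}
# Error rate of always predicting "Up" (the majority class in Weekly).
1 - mean(Weekly$Direction == "Up")
```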
__8.__ __(a)__
```{r}
set.seed(1)
x=rnorm(100)
y=x-2*x^2+rnorm(100)
```
- $n=100$, $p=2$, $Y = X - 2X^2 + \epsilon$
__(b)__
```{r}
plot(x,y)
```
- The relationship between x and y is non-linear and roughly quadratic.
__(c)__
```{r}
set.seed(11)
x = rnorm(100)
y = x - 2*x^2 + rnorm(100)
df = data.frame(x,y)
cv.err = rep(0,4)
# Fit polynomials of degree 1 to 4 and record the LOOCV estimate of the test MSE.
for (i in 1:4){
  lr.fit = glm(y ~ poly(x, degree=i, raw=TRUE), data=df)
  cv.err[i] = cv.glm(df, lr.fit)$delta[1]
}
cv.err
```
- As expected, there is a sharp drop in the LOOCV estimate of the test MSE when the quadratic term is added. The error increases slightly as higher-degree polynomial terms are used, again confirming the quadratic nature of the data.
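For reference, the quantity reported in `cv.glm(...)$delta[1]` is the LOOCV estimate (equation (5.1) in the text):
$$CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}MSE_i, \qquad MSE_i = (y_i - \hat{y}_i)^2,$$
where $\hat{y}_i$ is the prediction for observation $i$ from the model fit on the other $n-1$ observations.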
__(d)__
```{r}
set.seed(123)
x = rnorm(100)
y = x - 2*x^2 + rnorm(100)
plot(x,y)
df = data.frame(x,y)
cv.err = rep(0,4)
for (i in 1:4){
  lr.fit = glm(y ~ poly(x, degree=i, raw=TRUE), data=df)
  cv.err[i] = cv.glm(df, lr.fit)$delta[1]
}
cv.err
```
- The results are similar to (c). Note that LOOCV involves no random splitting, so on a fixed dataset it would return exactly the same errors regardless of the seed; the small differences here arise only because we regenerated x and y. As the chart shows, the new dataset has the same quadratic shape, and the model with a quadratic term again gives the lowest error.
__(e)__
- The model with the quadratic term had the lowest LOOCV error. This is expected, as the dataset is roughly quadratic in shape, as shown in (b).
__(f)__
```{r}
summary(glm(y~poly(x,degree=1,raw=TRUE)))
summary(glm(y~poly(x,degree=2,raw=TRUE)))
summary(glm(y~poly(x,degree=4,raw=TRUE)))
```
- Model (ii): the degree-1 and degree-2 coefficients are statistically significant. This matches the CV results, where this model (with predictors $X$ and $X^2$) provides the lowest test error.
- Model (iv): again only the degree-1 and degree-2 coefficients are statistically significant, consistent with model (ii) and with the CV results.
__9.__ __(a)__ __(b)__
```{r}
# (a) Mean of medv
mu = mean(Boston$medv)
# (b) Sample standard deviation of medv
n = length(Boston$medv)
sd.medv = sqrt(sum((Boston$medv - mu)^2)/(n-1))
# Standard error of the mean
se = sd.medv/sqrt(n)
se
```
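- This is the usual plug-in estimate $\widehat{SE}(\hat{\mu}) = \hat{\sigma}/\sqrt{n}$, where $\hat{\sigma}$ is the sample standard deviation.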
__(c)__
```{r}
# Standard error using bootstrapping
boot.fn2 = function(data, index){
X=data$medv[index]
mu2 = mean(X)
return(mu2)
}
```
```{r}
set.seed(1)
boot(Boston,boot.fn2,1000)
```
- The standard error using bootstrapping with R=1000 is 0.41, which is almost the same as the formula-based estimate in (b).
__(d)__
```{r}
mu-2*se # lower bound
mu+2*se # upper bound
```
```{r}
t.test(Boston$medv)
```
- The confidence intervals are nearly the same whether we use the approximate $\hat{\mu} \pm 2\,SE(\hat{\mu})$ formula or a one-sample t-test.
__(e)__ __(f)__
```{r}
# Median value of medv
median.medv = median(Boston$medv)
# Bootstrap estimate of the standard error of the median value.
boot.fn3 = function(data, index){
X=data$medv[index]
Y = median(X)
return(Y)
}
boot(Boston,boot.fn3,1000)
```
- SE(median) = 0.37, which is small relative to the median value of 21.2, so we can be reasonably confident in the estimate.
__(g)__
```{r}
tenth.percentile = quantile(Boston$medv, 0.1)
boot.fn4 = function(data, index){
X=data$medv[index]
Y = quantile(X, 0.1)
return(Y)
}
boot(Boston,boot.fn4,1000)
```
- The standard error is 0.499, which is low relative to the 10th percentile value of 12.75, so this estimate is also fairly precise.