-
Notifications
You must be signed in to change notification settings - Fork 2
/
11_Resampling_LinearModels.Rmd
619 lines (454 loc) · 39.2 KB
/
11_Resampling_LinearModels.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
# Resampling Linear Models
```{r, echo=FALSE}
# Unattach any packages that happen to already be loaded. In general this is unecessary
# but is important for the creation of the book to not have package namespaces
# fighting unexpectedly.
pkgs = names(sessionInfo()$otherPkgs)
if( length(pkgs > 0)){
pkgs = paste('package:', pkgs, sep = "")
for( i in 1:length(pkgs)){
detach(pkgs[i], character.only = TRUE, force=TRUE)
}
}
# Set my default chunk options
knitr::opts_chunk$set( fig.height=3 )
```
```{r, warning=FALSE, message=FALSE}
library(dplyr)
library(ggplot2)
library(ggfortify)
library(car) # for the Boot function
library(boot) # for the boot function
```
The last several chapters have introduced a number of parametric models where we assume that the error terms are normally distributed.
$$\begin{aligned}
\textrm{One-sample t-test:} & \;\;\; Y_{i}=\mu+\epsilon_{i} & \textrm{where} & \;\;\;\epsilon_{i}\stackrel{iid}{\sim}N\left(0,\sigma\right) \\
\textrm{Two-sample t-test:} & \;\;\;Y_{ij}=\mu_{i}+\epsilon_{ij} & \textrm{where} & \;\;\;\epsilon_{ij}\stackrel{iid}{\sim}N\left(0,\sigma\right)\;\;\;i\in\left\{ 1,2\right\} \\
\textrm{ANOVA:} & \;\;\;Y_{ij}=\mu_{i}+\epsilon_{ij} & \textrm{where} & \;\;\;\epsilon_{ij}\stackrel{iid}{\sim}N\left(0,\sigma\right)\;\;\;i\in\left\{ 1,2,\dots,k\right\} \\
\textrm{Regression:} & \;\;\;Y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i} & \textrm{where} & \;\;\;\epsilon_{i}\stackrel{iid}{\sim}N\left(0,\sigma\right)
\end{aligned}$$
We developed hypothesis tests and confidence intervals for the model parameters assuming that the error terms were normally distributed and, in the event that they are normally distributed, those tests and confidence intervals are the best we can do. However, if the errors are not normally distributed, what should we do?
Previously we used bootstrapping to estimate the sampling distribution of the sampling statistic when we didn't know the distribution. We will use the same bootstrapping method, but we'll simplify all of the above cases to the the same simple linear model
$$Y_{i}=E\left(Y_{i}\right)+\epsilon_{i}\;\;\;\textrm{where}\;\;\epsilon_{i}\stackrel{iid}{\sim}N\left(0,\sigma\right)$$
and $E\left(Y_{i}\right)$ takes on some form of the parameters depending on the model specified. It turns out that R can do all of these analyses using the same `lm()` function we used in for regression.
## Using `lm()` for many analyses
### One-sample t-tests
In this model we are concerned with testing
$$\begin{aligned}
H_{0}: & \;\; \mu=\mu_{0}\\
H_{a}: & \;\; \mu\ne\mu_{0}
\end{aligned}$$
for some $\mu_{0}$. For example, suppose we have the following data and we want to test $H_{0}:\mu=5 vs H_{a}:\mu\ne5$. The R code we used previously was
```{r}
# How we previously did a t.test
test.data <- data.frame( y=c(3,5,4,5,7,13) )
t.test( test.data$y, mu=5 )
```
but we can just as easily consider this a linear model with only an intercept term.
```{r}
m1 <- lm(y ~ 1, data=test.data)
summary(m1)
confint(m1)
```
Notice that we get the same point estimate and confidence interval for $\mu$, but the p-value is different because the `t.test()` p-value is testing $H_{0}:\;\mu=5$ vs $H_{a}:\;\mu\ne5$ while the `lm()` function is testing $H_{0}:\;\mu=0$ vs $H_{a}:\;\mu\ne0$.
If we really want the correct p-value, we should test if the difference between the $y$ variable and 5 is zero.
```{r}
m1 <- lm(y-5 ~ 1, data=test.data)
summary(m1)
```
### Two-sample t-tests
This model is concerned with testing
$$\begin{aligned}
H_{0}: & \;\; \mu_{1}=\mu_{2} \\
H_{a}: & \;\; \mu_{1}\ne\mu_{2}
\end{aligned}$$
```{r}
# How we previously did a t.test
test.data <- data.frame( y=c(3, 5, 4, 5, 7, 13,
8, 9, 4, 16, 12, 13 ),
group=rep(c('A','B'), each=6) )
t.test( y ~ group, data=test.data, var.equal=TRUE )
```
This analysis gave use the mean of each group and the confidence interval for the difference $\mu_{2}-\mu_{1}$. We could get the same analysis an ANOVA with $k=2$ groups.
```{r}
m2 <- lm(y ~ group, data=test.data)
summary(m2)
coef(m2)
confint(m2)
```
Aside from `t.test()` reporting $\mu_{2}-\mu_{1}$ while the `lm()` function calculates $\mu_{1}-\mu_{2}$, the estimates are identical.
## Creating Simulated Data
The basic goal of statistics is that we are interested in some population (which is described by some parameter $\mu,\delta,\tau,\beta$, or generally, $\theta$) and we take a random sample of size $n$ from the population of interest and we truly believe that the sample is representative of the population of interest. Then we use some statistic of the data $\hat{\theta}$ as an estimate $\theta$. However we know that this estimates, $\hat{\theta}$, vary from sample to sample. Previously we've used that the Central Limit Theorem gives $$\hat{\theta}\stackrel{\cdot}{\sim}N\left(\theta,\,\sigma_{\hat{\theta}}\right)$$
to construct confidence intervals and perform hypothesis tests, but we don't necessarily like this approximation. If we could somehow take repeated samples (call these repeated samples
$\mathbb{Y}_{j}$ for $j\in1,2,\dots,M$) from the population we would understand the distribution of $\hat{\theta}$ by just examining the distribution of many observed values of $\hat{\theta}_{j}$ where $\hat{\theta}_{j}$ is the statistic calculated from the ith sample data $\mathbb{Y}_{j}$.
However, for practical reasons, we can't just take 1000s of samples of size n from the population. However, because we truly believe that $\mathbb{Y}$ is representative of the entire population, then our best guess of what the population is just many repeated copies of our data.
```{r, echo=FALSE}
# Sample data
data <- data.frame(
x=rep(1:3, times=3), y=rep(1:3, each=3),
shape=c('Sq','Sq','Di', 'Cir','Sq','Cir', 'Tri','Cir','Sq') )
data2 <- rbind(
mutate(data, x=x+0, y=y+0),
mutate(data, x=x+0, y=y+3),
mutate(data, x=x+0, y=y+6),
mutate(data, x=x+3, y=y+0),
mutate(data, x=x+3, y=y+3),
mutate(data, x=x+3, y=y+6),
mutate(data, x=x+6, y=y+0),
mutate(data, x=x+6, y=y+3),
mutate(data, x=x+6, y=y+6))
library(ggdendro)
sample.plot <- ggplot(data=data) +
geom_point(aes(x=x, y=y, shape=shape), size=20, fill='dark grey') +
scale_shape_manual(values=c(21,23,22,24), guide=FALSE) +
coord_cartesian(xlim = c(0.5, 3.5), c(0.5,3.5)) +
theme_dendro() + ggtitle('Sample')
population.plot <- ggplot(data=data2) +
geom_point(aes(x=x, y=y, shape=shape), size=8, fill='dark grey') +
scale_shape_manual(values=c(21,23,22,24), guide=FALSE) +
coord_cartesian(xlim = c(0.5, 10), c(0.5,10)) +
theme_dendro() + ggtitle('Approximate Population')
```
Suppose we were to sample from a population of shapes, and we observed 4/9 of the sample were squares, 3/9 were circles, and a triangle and a diamond. Then our best guess of what the population that we sampled from was a population with 4/9 squares, 3/9 circles, and 1/9 of triangles and diamonds.
```{r, echo=FALSE}
Rmisc::multiplot(sample.plot, population.plot, cols=2)
```
Using this approximated population (which is just many many copies of our sample data), we can take many samples of size $n$. We denote these bootstrap samples as $\mathbb{Y}_{j}^{*}$, where the star denotes that the sample was taken from the approximate population, not the actual population. From each bootstrap sample $\mathbb{Y}_{j}^{*}$ a statistic of interest can be taken $\hat{\theta}_{j}^{*}$.
Because our approximate population is just an infinite number of copies of our sample data, then sampling from the approximate population is equivalent to sampling with replacement from our sample data. If I take $n$ samples from $n$ distinct objects with replacement, then the process can be thought of as mixing the $n$ objects in a bowl and taking an object at random, noting which it is, replace it into the bowl, and then draw the next sample. Practically, this means some objects will be selected more than once and some will not be chosen at all. To sample our observed data with replacement, we'll use the `resample()` function in the `mosaic` package. We see that some rows will be selected multiple times, and some will not be selected at all.
### Observational Studies vs Designed Experiments
The process of collecting data is a time consuming and laborious process but is critical to our understanding of the world. The fundamental goal is to collect a sample of data that is representative of the population of interest and can provide insight into the scientific question at hand. There are two primary classes about how this data could be gathered, observational studies and designed experiments.
In an observational study, a population is identified and a random sample of individuals are selected to be in the sample. Then each subject in the sample has explanatory and response variables measured (fish are weighed and length recorded, people asked their age, gender, occupation etc). The critical part of this data collection method is that the random selection from the population is done in a fashion so that each individual in the population could potentially be in the sample and there is no systematic exclusion of certain parts of the population.
_Simple Random Samples_ - Suppose that we could generate a list of every individual in the population and then we were to randomly select n of those to be our sample. Then each individual would have an equal chance to be in the sample and this selection scheme should result in sample data that is representative of the population of interest. Often though, it is difficult to generate a list of every individual, but other proxies might work. For example if we wanted to understand cougar behavior in the Grand Canyon, we might divide the park up into 100 regions and then random select 20 of those regions to sample and observe whatever cougar(s) are in that region.
_Stratified Random Samples_ - In a stratified random sample, the population can be broken up into different strata and we perform a simple random sample within each strata. For example when sampling lake fish, we might think about the lake having deep and shallow/shore water strata and perhaps our sampling technique is different for those two strata (electro-fishing on shore and trawling in the deep sections). For human populations, we might stratify on age and geographic location (older retired people will answer the phone more readily than younger people). For each of the strata, we often have population level information about the different strata (proportion of the lake that is deep water versus shallow, or proportion of the population 20-29, 30-39, etc. and sample each strata accordingly (e.g. if shallow water is 40% of the fish habitat, then 40% of our sampling effort is spent in the shallows).
Regardless of sample type, the key idea behind an observational study is that we don't apply a treatment to the subject and then observe a response. While we might annoy animal or person, we don't do any long-term manipulations. Instead the individuals are randomly selected and then observed, and it is the random selection from the population that results in a sample that is representative of the population.
_Designed Experiments_ - In an experimental setting, the subjects are taken from the population (usually not at random but rather by convenience) and then subjected to some treatments and we observe the individuals response to the treatment. There will usually be several levels of the treatment and there often is a control level. For example, we might want to understand how to maximize the growth of a type of fungus for a pharmaceutical application and we consider applying different nutrients to the substrate (nothing, +phosphorus, +nitrogen, +both). Another example is researchers looking at the efficacy of smoking cessation methods and taking a set of willing subjects and having them try different methods (no help, nicotine patches, nicotine patches and a support group). There might be other covariates that we expect might affect the success rate (individuals age, length of time smoking, gender) and we might make sure that our study include people in each of these groups (we call these blocks in the experimental design terminology, but they are equivalent to the strata in the observational study terminology). Because even within blocks, we expect variability in the success rates due to natural variation, we randomize the treatment assignment to the individual and it is this randomization that addresses any unrecognized lurking variables that also affect the response.
A designed experiment is vastly superior to an observational experiment because the randomization of the treatment accounts for variables that the researcher might not even suspect to be important. A nice example of the difference between observational studies and experiments is a set of studies done relating breast cancer and hormone replacement therapy (HRT) drugs used by post-menopausal women. Initial observational studies that looked at the rates of breast cancer showed that women taking HRT had lower rates of breast cancer. When these results were first published, physicians happily recommended HRT to manage menopause symptoms and to decrease risk of breast cancer. Unfortunately subsequent observational studies showed a weaker effect and among some populations there was an increase in breast cancer. To answer the question clearly, a massive designed experiment was undertaken where women would be randomly assigned either a placebo or the actual HRT drugs. This study conclusively showed that HRT drugs increased the risk of breast cancer.
Why was there a disconnect between the original observational studies and the experiment? The explanation given is that there was a lurking variable that the observational studies did not control for... socio-economic class. There are many drivers of breast cancer and some of them are strongly correlated with socio-economic class such as where you live (in a polluted area or not). Furthermore because HRT was initially only to relieve symptoms of menopause, it wasn't “medically necessary” and insurance didn't cover it and so mainly wealthy women (with already lower risk for breast cancer) took the HRT drugs and the simple association between lower breast cancer risk and HRT was actually the effect of socio-economic status. By randomly assigning women to the placebo and HRT groups, high socio-economic women ended up in both groups. So even if there was some other lurking variable that the researchers didn't consider, the randomization would cause the unknown variable to be evenly distributed in the placebo and HRT groups.
Because the method of randomization is so different between observational studies and designed experiments, we should make certain that our method of creating bootstrap data sets respects that difference in randomization. So if there was some constraint on the data when it was originally taken, we want the bootstrap datasets to obey that same constraint. If our study protocol was to collect a sample of $n_{1}=10$ men and $n_{2}=10$ women, then we want our bootstrap samples to have $10$ men and $10$ women. If we designed an experiment with $25$ subjects to test the efficacy of a drug and chose to administer doses of $5, 10, 20, 40,$ and $80$ mg with each five subjects for each dose level, then we want those same dose levels to show up in the bootstrap datasets.
There are two common approaches, _case resampling_ and _residual resampling_. In case re-sampling, we consider the data $\left(x_{i,}y_{i}\right)$ pairs as one unit and when creating a bootstrap sample, we re-sample those pairs, but if the $i$th data point is included in the bootstrap sample, then it is included as the $\left(x_{i,}y_{i}\right)$ pair. In contrast, residual re-sampling is done by first fitting a model to the data, finding the residual values, re-sampling those residuals and then adding those bootstrap residuals to the predicted values $\hat{y}_{i}$.
```{r, echo=FALSE}
set.seed(3)
```
```{r}
Testing.Data <- data.frame(
x = c(3,5,7,9),
y = c(3,7,7,11))
Testing.Data
```
```{r}
# Case resampling
Boot.Data <- mosaic::resample(Testing.Data)
Boot.Data
```
Notice that we've sampled $\left\{ x=5,y=7\right\}$ twice and did not get the $\left\{ 7,7\right\}$ data point.
Residual sampling is done by re-sampling the residuals and calling them $\hat{\epsilon}^{*}$ and then the new y-values will be $y_{i}^{*}=\hat{y}_{i}+\hat{\epsilon}_{i}^{*}$
```{r}
# Residual resampling
model <- lm( y ~ x, data=Testing.Data)
Boot.Data <- Testing.Data %>%
mutate( fit = fitted(model),
resid = resid(model),
resid.star = mosaic::resample(resid),
y.star = fit + resid.star )
Boot.Data
```
Notice that the residuals re-sampling results in a data set where each of the x-values is retained, but a new y-value (possibly not seen in the original data) is created from the predicted value $\hat{y}$ and a randomly selected residual.
In general when we design an experiment, we choose which x-values we want to look at and so the bootstrap data should have those same x-values we chose. So for a designed experiment, we typically will create bootstrap data sets via residual re-sampling. For observational studies, we'll create the bootstrap data sets via case re-sampling. In both cases if there is a blocking or strata variable to consider, we will want to do the re-sampling within the block/strata.
## Confidence Interval Types
We want to understand the relationship between the sample statistic $\hat{\theta}$ to the population parameter $\theta$. We create an estimated population using many repeated copies of our data. By examining how the simulated $\hat{\theta}^{*}$ vary relative to $\hat{\theta}$, we will understand how possible $\hat{\theta}$ values vary relative to $\theta$.
```{r, echo=FALSE, cache=TRUE}
M <- 1001
theta <- seq(-10,60, length=M)
temp <- data.frame(
theta = theta,
y = dgamma(theta, 10, 1/2))
temp <- filter(temp, (theta >= 0)&(theta <= 60))
p1 <- ggplot(temp, aes(x=theta, y=y)) +
geom_line() +
geom_vline(xintercept=20, color='red') +
geom_text(data=data.frame(x=22, y=.01, label='theta'),
aes(x=x, y=y, label=label),
parse=TRUE, color='red') +
labs(title=expression(paste('Sampling Distribution of ',hat(theta))),
x=NULL, y='density') +
theme_bw()
temp <- data.frame(
theta = theta,
y = dgamma(theta-7, 10, 1/2))
temp <- filter(temp, (theta >= 0)&(theta <= 60))
p2 <- ggplot(temp, aes(x=theta, y=y)) +
geom_line() +
geom_vline(xintercept=26, color='blue') +
geom_text(data=data.frame(x=28, y=.01, label='hat(theta)'),
aes(x=x, y=y, label=label),
parse=TRUE, color='blue') +
labs(title=expression(paste('Sampling Distribution of ',hat(theta),'*')),
x=NULL, y='density') +
theme_bw()
```
```{r, echo=FALSE}
Rmisc::multiplot(p1, p2, cols=1)
```
We will outline several methods for producing confidence intervals (in the order of most assumptions to fewest).
### Normal intervals
This confidence interval assumes the sampling distribution of $\hat{\theta}$ is approximately normal (which is often true due to the central limit theorem). We can use the bootstrap replicate samples to get an estimate of the standard error of the statistic of interest by just calculating the sample standard deviation of the replicated statistics.
Let $\theta$ be the statistic of interest and $\hat{\theta}$ be the value of that statistic calculated from the observed data. Define $\hat{SE}^{*}$ as the sample standard deviation of the $\hat{\theta}^{*}$ values.
Our first guess as to a confidence interval is
$$\hat{\theta}\pm z_{1-\alpha/2}\hat{SE}^{*}$$
which we could write as
$$\left[\hat{\theta}-z_{1-\alpha/2}\hat{SE}^{*},\;\;\;\hat{\theta}+z_{1-\alpha/2}\hat{SE}^{*}\right]$$
### Percentile intervals
The percentile interval doesn't assume normality but it does assume that the bootstrap distribution is symmetric and unbiased for the population value. This is the method we used to calculate confidences intervals in the first several chapters. It is perhaps the easiest to calculate and understand. This method only uses $\hat{\theta}^{*}$, and is
$$\left[\hat{\theta}_{\alpha/2}^{*}\;,\;\;\hat{\theta}_{1-\alpha/2}^{*}\right]$$
### Basic intervals
Unlike the percentile bootstrap interval, the basic interval does not assume the bootstrap distribution is symmetric but does assume that $\hat{\theta}$ is an unbiased estimate for $\theta$.
To address this, we will using the observed distribution of our replicates $\hat{\theta}^{*}$. Let $\hat{\theta}_{\alpha/2}^{*}$ and $\hat{\theta}_{1-\alpha/2}^{*}$ be the $\alpha/2$ and $1-\alpha/2$ quantiles of the replicates $\hat{\theta}^{*}$. Then another way to form a confidence interval would be $$\left[\hat{\theta}-\left(\hat{\theta}_{1-\alpha/2}^{*}-\hat{\theta}\right),\;\;\;\;\hat{\theta}-\left(\hat{\theta}_{\alpha/2}^{*}-\hat{\theta}\right)\right]$$
where the minus sign on the upper limit is because $\left(\hat{\theta}_{\alpha/2}^{*}-\hat{\theta}\right)$ is already negative. The idea behind this interval is that the sampling variability of $\hat{\theta}$ from $\theta$ is the same as the sampling variability of the replicates $\hat{\theta}^{*}$ from $\hat{\theta}$, and that the distribution of $\hat{\theta}$ is possibly skewed, so we can't add/subtract the same amounts. Suppose we observe the distribution of $\hat{\theta}^{*}$ as
```{r, echo=FALSE}
thetas <- seq(0, 10, length=1000)
plot(thetas, dchisq(thetas, df=3), type='l',
axes=FALSE, xlab='', ylab='',
main=expression(paste('Distribution of ', hat(theta),'*')))
axis(side=1, at=c(.5, 2, 8),
label=c(expression(paste(hat(theta)[.025],'*')),
expression(hat(theta)),
expression(paste(hat(theta)[.975],'*'))))
box()
```
Then any particular value of $\hat{\theta}^{*}$ could be much larger than $\hat{\theta}$. Therefore $\hat{\theta}$ could be much larger than $\theta$. Therefore our confidence interval should be $\left[\hat{\theta}-\textrm{big},\;\hat{\theta}+\textrm{small}\right]$.
This formula can be simplified to
$$\left[\hat{\theta}-\left(\hat{\theta}_{1-\alpha/2}^{*}-\hat{\theta}\right)\;,\,\hat{\theta}+\left(\hat{\theta}-\hat{\theta}_{\alpha/2}^{*}\right)\right]
\left[2\hat{\theta}-\hat{\theta}_{1-\alpha/2}^{*}\;,\;\;2\hat{\theta}-\hat{\theta}_{\alpha/2}^{*}\right]$$
### Towards bias-corrected and accelerated intervals (BCa)
Different schemes for creating confidence intervals can get quite complicated. There is a thriving research community investigating different ways of creating intervals and which are better in what instances. The BCa interval is the most general of the bootstrap intervals and makes the fewest assumptions. Unfortunately is can sometimes fail to converge. The details of this method are too complicated to be presented here but can be found in texts such as chapter 12 in Efron and Tibshirani’s book An Introduction to the Bootstrap (1998).
## Bootstrap Confidence Intervals in R
### Using `car::Boot()` function
For every model we've examined we can create simulated data sets using either case or residual re-sampling and produce confidence intervals for any of the parameters of interest. We won't bother to do this by hand, but rather let R do the work for us. The package that contains most of the primary programs for bootstrapping is the package `boot`. The functions within this package are quite flexible but they are a little complex. While we will use this package directly later, for now we will use the package `car` which has a very convenient function `car::Boot()`.
We return to our ANOVA example of hostility scores after three different treatment methods.
The first thing we will do (as we should do in all data analyses) is to graph our data.
```{r}
# define the data
Hostility <- data.frame(
HLT = c(96,79,91,85,83,91,82,87,
77,76,74,73,78,71,80,
66,73,69,66,77,73,71,70,74),
Method = c( rep('M1',8), rep('M2',7), rep('M3',9) ) )
```
```{r}
ggplot(Hostility, aes(x=Method, y=HLT)) +
geom_boxplot()
```
We can fit the cell-means model and examine the summary statistics using the following code.
```{r}
model <- lm( HLT ~ -1 + Method, data=Hostility )
summary(model)
```
Confidence intervals using the
$$\epsilon_{ij}\stackrel{iid}{\sim}N\left(0,\sigma\right)$$ assumption are given by
```{r}
confint(model)
```
To utilize the bootstrap confidence intervals, we will use the function `car::Boot` from the package `car`. It defaults to using case re-sampling, but `method='residual'` will cause it to use residual re-sampling. We can control the number of bootstrap replicates it using with the R parameter.
```{r, cache=TRUE}
boot.model <- Boot(model, method='case', R=999) # default case resampling
boot.model <- Boot(model, method='residual', R=999) # residual resampling
```
The `car::Boot()` function has done all work of doing the re-sampling and storing values of $\hat{\mu}_{1},\hat{\mu}_{2}$, and $\hat{\mu}_{3}$ for each bootstrap replicate data set created using case re-sampling. To look at the bootstrap estimate of the sampling distribution of these statistics, we use the `hist()` function. The `hist()` function is actually overloaded and will act differently depending on the type of object. We will send it an object of class boot and the `hist()` function looks for a function name `hist.boot()` and when it finds it, just calls it with the function arguments we passed.
```{r}
hist(boot.model, layout=c(1,3)) # 1 row, 3 columns of plots
```
While this plot is aesthetically displeasing (we could do so much better using ggplot2!) this shows the observed bootstrap histogram of $\hat{\mu}_{i}^{*}$, along with the normal distribution centered at $\hat{\mu}_{i}$ with spread equal to the $StdDev\left(\hat{\mu}_{i}^{*}\right)$. In this case, the sampling distribution looks very normal and the bootstrap confidence intervals should line up well with the asymptotic intervals. The function `confint()` will report the BCa intervals by default, but you can ask for “bca”, “norm”, “basic”, “perc”.
```{r}
confint(boot.model)
confint(boot.model, type='perc')
confint(model)
```
In this case we see that the confidence intervals match up very well with asymptotic intervals.
The `car::Boot()` function will work for a regression model as well. In the following example, the data was generated from
$$y_{i}=\beta_{0}+\beta_{1}x_{i}+\epsilon_{i}$$
but the $\epsilon_{i}$ terms have a strong positive skew and are not normally distributed.
```{r}
my.data <- data.frame(
x = seq(0,10, length=20),
y = c( 15.49, 17.42, 15.17, 14.99, 13.96,
14.46, 13.69, 14.30, 13.61, 15.35,
12.94, 13.26, 12.65, 12.33, 12.04,
11.19, 13.76, 10.95, 10.36, 10.63))
ggplot(my.data, aes(x=x, y=y)) + geom_point()
```
Fitting a linear model, we see a problem that the residuals don't appear to be balanced. The large residuals are all positive. The Shapiro-Wilks test firmly rejects normality of the residuals.
```{r}
model <- lm( y ~ x, data=my.data)
plot(model, which=1)
```
```{r}
shapiro.test( resid(model) )
```
As a result, we don't might not feel comfortable using the asymptotic distribution of $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ for the creation of our confidence intervals. The bootstrap procedure can give reasonable good intervals, however.
```{r}
boot.model <- Boot( model ) # by default method='case'
hist( boot.model )
confint( boot.model )
```
Notice that both of the bootstrap distribution for both $\hat{\beta}_{0}^{*}$ and $\hat{\beta}_{1}^{*}$ are skewed, and the BCa intervals are likely to be the most appropriate intervals to use.
### Using the `boot` package
The `car::Boot()` function is very handy, but it lacks flexibility; it assumes that you just want to create bootstrap confidence intervals for the model coefficients. The `car::Boot()` function is actually a nice simple user interface to the boot package which is more flexible, but requires the user to be more precise about what statistic should be stored and how the bootstrap samples should be created. We will next examine how to use this package.
#### Case resampling
Suppose that we have n observations in our sample data. Given some vector of numbers re-sampled from `1:n`, we need to either re-sample those cases or those residuals and then using the new dataset calculate some statistic. The function `boot()` will require the user to write a function that does this.
```{r}
model <- lm( y ~ x, data=my.data )
coef(model)
# Do case resampling with the regression example
# sample.data is the original data frame
# indices - This is a vector of numbers from 1:n which tells
# us which cases to use. It might be 1,3,3,6,7,7,...
my.stat <- function(sample.data, indices){
data.star <- sample.data[indices, ]
model.star <- lm(y ~ x, data=data.star)
output <- coef(model.star)
return(output)
}
# original model coefficients
my.stat(my.data, 1:20)
# one bootstrap replicate
my.stat(my.data, mosaic::resample(1:20))
```
Notice that the function we write doesn't need to determine the random sample of the indices to use. Our function will be told what indices to use (possibly to calculate the statistic of interest $\hat{\theta}$, or perhaps a bootstrap replicate $\hat{\theta}^{*}$. For example, the BCa method needs to know the original sample estimates $\hat{\theta}$ to calculate how far the mean of the $\hat{\theta}^{*}$ values is from $\hat{\theta}$. To avoid the user having to see all of that, we just need to take the set of indices given and calculate the statistic of interest.
```{r, cache=TRUE}
boot.model <- boot(my.data, my.stat, R=10000)
#boot.ci(boot.model, type='bca', index=1) # CI for Intercept
#boot.ci(boot.model, type='bca', index=2) # CI for the Slope
confint(boot.model)
```
#### Residual Resampling
We will now consider the ANOVA problem and in this case we will re-sample the residuals.
```{r, cache=TRUE}
# Fit the ANOVA model to the Hostility Data
model <- lm( HLT ~ Method, data=Hostility )
# now include the predicted values and residuals to the data frame
Hostility <- Hostility %>% mutate(
fit = fitted(model),
resid = resid(model))
# Do residual resampling with the regression example
my.stat <- function(sample.data, indices){
data.star <- sample.data %>% mutate(HLT = fit + resid[indices])
model.star <- lm(HLT ~ Method, data=data.star)
output <- coef(model.star)
return(output)
}
boot.model <- boot(Hostility, my.stat, R=10000)
```
```{r}
confint(boot.model)
```
Fortunately the `hist()` command can print the nice histogram from the output of the `boot()` command.
```{r}
hist( boot.model, layout=c(1,3)) # 1 row, 3 columns)
```
Notice that we don't need to have the model coefficients $\hat{\mu}_{i}$ be our statistic of interest, we could just as easily produce a confidence interval for the residual standard error $\hat{\sigma}$.
```{r, cache=TRUE}
# Do residual resampling with the regression example
model <- lm( y ~ x, data=my.data )
my.data <- my.data %>% mutate(
fitted = fitted(model),
resid = resid(model))
# Define the statisitc I care about
my.stat <- function(sample.data, indices){
data.star <- sample.data %>% mutate(y = fitted + resid[indices])
model.star <- lm(y ~ x, data=data.star)
output <- summary(model.star)$sigma
return(output)
}
boot.model <- boot(my.data, my.stat, R=10000)
```
```{r}
hist(boot.model, layout=c(1,3))
confint(boot.model)
```
#### Including Blocking/Stratifying Variables
When we introduced the ANOVA model we assumed that the groups had equal variance but we don't have to. If we consider the model with unequal variances among groups
$$Y_{ij}=\mu_{i}+\epsilon_{ij}\;\;\;\;\textrm{where}\;\;\;E\left(\epsilon_{ij}\right)=0\;\;Var\left(\epsilon_{ij}\right)=\sigma_{i}^{2}$$
then our usual analysis is inappropriate but we could easily bootstrap our confidence intervals for $\mu_{i}$. If we do case re-sampling, this isn't an issue because each included observation is an $\left(group,\ response\right)$ pair and our groups will have large or small variances similar to the observed data. However if we do residual re-sampling, then we must continue to have this. We do this by only re-sampling residuals within the same group. One way to think of this is if your model has a subscript on the variance term, then your bootstrap samples must respect that.
If you want to perform the bootstrap by hand using dplyr commands, it can be done by using the `group_by()` with whatever the blocking/Stratifying variable is prior to the `mosaic::resample()` command. You could also use the optional group argument to the `mosaic::resample()` command.
```{r}
data <- data.frame(y =c(9.8,9.9,10.1,10.2, 18,19,21,22),
grp=c('A','A','A','A', 'B','B','B','B'),
fit=c( 10,10,10,10, 20,20,20,20 ),
resid=c(-.2,-.1,.1,.2, -2,-1,1,2 ))
data.star <- data %>%
group_by(grp) %>% # do the grouping using dplyr
mutate(resid.star = mosaic::resample(resid),
y.star = fit + resid.star)
data.star
```
```{r}
data.star <- data %>%
mutate(resid.star = mosaic::resample(resid, group=grp), # do the grouping within resample
y.star = fit + resid.star)
data.star
```
Unfortunately the `car::Boot()` command doesn't take a strata option, but the the `boot::boot()` command.
```{r, cache=TRUE}
# Fit the ANOVA model to the Hostility Data
model <- lm( HLT ~ Method, data=Hostility )
# now include the predicted values and residuals to the data frame
Hostility <- Hostility %>% mutate(
fitted = fitted(model),
resid = resid(model))
# Do residual resampling
my.stat <- function(sample.data, indices){
data.star <- sample.data %>% mutate(HLT = fitted + resid[indices])
model.star <- lm(HLT ~ Method, data=data.star)
output <- coef(model.star)
return(output)
}
# strata is a vector of the categorical variable we block/stratify on
boot.model <- boot( Hostility, my.stat, R=1000, strata=Hostility$Method )
```
```{r}
hist(boot.model, layout=c(1,3))
confint(boot.model)
```
## Exercises
1. We will perform a regression analysis on the following data and use different bootstrap re-sampling methods to create a confidence interval for the slope parameter. In this case the residuals are symmetric, though perhaps we don't want to assume normality.
| | | | | | | |
|:---------------------:|:---:|:---:|:----:|:----:|:----:|:----:|
| $x_{i}$ | 3 | 4 | 5 | 6 | 7 | 8 |
| $y_{i}$ | 5 | 9 | 10 | 12 | 15 | 15 |
| $\hat{y}_{i}$ | 6 | 8 | 10 | 12 | 14 | 16 |
| $\hat{\epsilon}_{i}$ | -1 | 1 | 0 | 0 | 1 | -1 |
a) We will first use case resampling.
i. Suppose that the bootstrap indices are selected to be cases 1,3,3,4,6,6. Create a new dataset with those cases and calculate the regression coefficients $\hat{\beta}_{0}^{*}$ and $\hat{\beta}_{1}^{*}$ using R's `lm()` function. Notice that the residual $\hat{\epsilon}_{i}$ is always paired with its value of $x_{i}$ and that in case resampling we don't get the same x-values as our data.
ii. Using the `mosaic::resample()` command, calculate several values of $\hat{\beta}_{1}^{*}$ using case resampling.
iii. Use the `car::Boot()` function to calculate the BCa confidence interval for $\hat{\beta}_{1}$ with case resampling
b) Next we will use Residual Resampling
i. Suppose that the bootstrap indices are selected to be cases 1,3,3,4,6,6. Create a new dataset with those cases and calculate the regression coefficients $\hat{\beta}_{0}^{*}$ and $\hat{\beta}_{1}^{*}$ using R's `lm()` function. Notice that the residual $\hat{\epsilon}_{i}$ is not necessarily paired with its value of $x_{i}$ and that the new data set has the same x-values as the original sampled data.
ii. Using the `mosaic::resample` command, calculate several values of $\hat{\beta}_{i}^{*}$. Hint: We can't do this in one simple command, instead we have to make the new dataset and then fit the regression.
iii. Use the `car::Boot()` function to calculate the BCa confidence interval for $\hat{\beta}_{1}$ using residual resampling.
2. The ratio of DDE (related to DDT) to PCB concentrations in bird eggs has been shown to have had a number of biological implications. The ratio is used as an indication of the movement of contamination through the food chain. The paper “The ratio of DDE to PCB concentrations in Great Lakes herring gull eggs and its us in interpreting contaminants data” reports the following ratios for eggs collected at sites from the five Great Lakes. The eggs were collected from both terrestrial and aquatic feeding birds. Suppose that we are interested in estimating $\rho=\frac{\mu_{terrestial}}{\mu_{aquatic}}$
```{r}
Pollution <- data.frame(
value = c(76.50, 6.03, 3.51, 9.96, 4.24, 7.74, 9.54, 41.70, 1.84, 2.50, 1.54,
0.27, 0.61, 0.54, 0.14, 0.63, 0.23, 0.56, 0.48, 0.16, 0.18 ),
type = c( rep('Terrestrial',11), rep('Aquatic',10) ) )
model <- lm( value ~ -1 + type, data=Pollution)
coef(model)
```
a) Recall that the ANOVA with the cell mean representation will calculate the group means. Use the lm() function to calculate the means of the two groups. Notice that the p-values and any confidence intervals from this model are useless because we are egregiously violating the equal variance and normality assumptions on the residuals.
b) Using R, calculate the ratio $\hat{\rho}=\bar{y}_{T}/\bar{y}_{A}$. _Hint: what is returned by `coef(model)[1]`_?
c) Use the `mosaic::resample()` function to generate several bootstrap datasets using case resampling. Do you get 11 Terrestrial observations in each dataset? Do this ten or twenty times (don't show these computations) and note the most unbalanced data set.
d) Use the `mosaic::resample()` function to generate several bootstrap datasets using residual resampling? Do you get data sets where a simulated aquatic observation has been paired with a huge residual term from the terrestrial. Does this seem appropriate?
e) The `mosaic::resample()` function includes an optional `groups=` argument that does the resampling within a specified group (thus we will always get 11 Terrestrial observations and 10 Aquatic). Use this to generate several bootstrap datasets.
f) The `car::Boot()` function cannot handle the grouping, but `boot::boot()` can.
i. The following function will calculate $\hat{\rho}$, the statistic of interest, given the original data and a set of indices to use. Notice that we've chosen to do case resampling here.
```{r}
calc_rhohat <- function(data, indices){
data.star <- data[indices, ]
model.star <- lm( value ~ -1 + type, data=data.star )
return( coef(model.star)[2] / coef(model.star)[1] )
}
```
ii. Call this function using the Pollution data set and indices 1:21. Notice that this calculates the sample statistic $\hat{\rho}$ that we calculated previously.
iii. Call this function using `indices = resample(1:21, groups=Pollution$type)`. Notice that this calculates the sample statistic $\hat{\rho}^{*}$ where we are doing case resampling within each group.
iv. Use the `boot::boot()` to perform the full bootstrap analysis. Use the option `strata=Pollution$type` option, which is causes R to do the resampling within each group.
v. What is the 95% BCa CI for $\rho$? Show the histogram of the bootstrap estimate of the distribution of $\hat{\rho}$.