---
title: 'Biost 578: Problem Set 1'
author: "Alejandro Hernandez"
date: "Due Friday, April 19, 2024"
output:
pdf_document: default
html_document:
df_print: paged
subtitle: Department of Biostatistics @ University of Washington
editor_options:
markdown:
wrap: 72
---
```{r setup, include=F}
# clear environment
rm(list=ls())
# load relevant packages
library(MASS) # negative binomial distribution
library(tidyverse) # data manipulation
# devtools::install_github("mbannick/RobinCar")
library(RobinCar) # causal models with covariate adjustment
library(rigr) # regression
library(knitr) # table formatting
# setup options
knitr::opts_chunk$set(echo = F, warning = F)
options(knitr.kable.NA = '-')
labs = knitr::all_labels()
labs = labs[!labs %in% c("setup", "llm_appendix", "allcode")]
set.seed(0927)
```
```{r data-generation-function, include=F}
### -----------------------------------------------------------
# This is a function we will use for data generation. A=1 is treated and A=0 is control
Fun_datagen <- function(Fun.n = 500,
Fun.y.type = c("continuous","binary","count"),
Fun.p = 2/3) {
# validate the outcome-type argument (defaults to "continuous")
Fun.y.type <- match.arg(Fun.y.type)
# generate the covariates Xc, Xb
df <- tibble(
Xc = runif(Fun.n),
Xb = rbinom(Fun.n, size = 1, prob = 0.5)
)
# generate treatment assignment A by simple randomization
df <- df %>% mutate(A = rbinom(n=Fun.n, size=1, prob=Fun.p))
# generate outcome y
# if y is continuous
if (Fun.y.type == "continuous") {
df <- df %>%
mutate(y = (1-A)*(Xc+0.1*Xb) + A*(0.3*Xc+0.3*Xb))
# if y is binary
} else if (Fun.y.type == "binary") {
df <- df %>%
mutate(y = rbinom(n = Fun.n, size = 1,
prob = (1-A)*(0.5*Xc+0.25*Xc^2+0.1*Xb) +
A*exp(Xc+Xc^2+0.3*Xb)/(1+exp(Xc+Xc^2+0.3*Xb))))
# if y is positive discrete
} else if (Fun.y.type == "count") {
df <- df %>%
mutate(y = MASS::rnegbin(n = Fun.n,
mu = A*(2*Xc+5*Xc^2+0.1*Xb) + (1-A)*log(6*Xc^3+2+0.3*Xb),
theta = 4))
}
df <- df %>% mutate(A = factor(A))
return(df)
}
```
## Problem 1
```{r problem-1}
### -----------------------------------------------------------
### Problem 1
# Generate a simulated dataset with n = 500 and continuous outcome under simple
# randomization using 1:2 allocation ratio to control and treatment
set.seed(0927)
dfSim <- Fun_datagen(Fun.n = 500, Fun.y.type = "continuous", Fun.p = 2/3)
# dfSim %>%
# mutate(Xb = factor(Xb)) %>%
# summary
#### (1a) ####
# Fit a linear model Y ~ A and obtain the coefficient of A. Compare it with the
# mean outcome difference between the treated and control group
# ANOVA estimate of ATE from SLR model
slr <- rigr::regress("mean", y ~ A, data=dfSim)
# ANOVA estimate of ATE by hand (produces the same ATE estimate as the SLR model)
# dfSim %>%
# group_by(A) %>%
# summarize(mean.outcome = mean(y)) %>%
# reframe(diff(mean.outcome))
#### (1b) ####
# Fit a linear model Y ~ 1 + A + X and obtain the coefficient of A. This is the
# ANCOVA estimate of ATE
mlr <- rigr::regress("mean", y ~ 1 + A + Xc + Xb, data=dfSim)
#### (1c) ####
# Fit a linear model Y ~ 1 + A + (X-Xbar) + A(X-Xbar) and obtain the coefficient
# of A. This is the ANHECOVA estimate of ATE
mlr_int <- dfSim %>%
mutate(centered_Xc = Xc - mean(Xc), centered_Xb = Xb - mean(Xb)) %>%
regress("mean", data=.,
y ~ 1 + A + centered_Xc + centered_Xb + A*centered_Xc + A*centered_Xb)
#### (1d) ####
# Compare the ANHECOVA estimator in (c) with the model Y ~ 1 + A + X + AX
mlr_int_uncentered <- regress("mean", data = dfSim,
y ~ 1 + A + Xc + Xb + A*Xc + A*Xb)
# coef(mlr_int)["A1","Estimate"]
# coef(mlr_int_uncentered)["A1","Estimate"]
#### (1e) ####
# Use the robincar_linear2 function in the RobinCar R package to obtain the
# estimators in (a)-(c). Compare the point estimators and the robust standard
# errors using these three estimation methods
robin_slr <- RobinCar::robincar_linear2(df = dfSim, treat_col = "A",
response_col = "y",
adj_method = "ANOVA", contrast_h="diff")
robin_mlr <- RobinCar::robincar_linear2(df = dfSim, treat_col = "A",
response_col = "y",
covariate_cols = c("Xc", "Xb"),
adj_method = "ANCOVA", contrast_h="diff")
robin_mlr_int <- RobinCar::robincar_linear2(df = dfSim, treat_col = "A",
response_col = "y",
covariate_cols = c("Xc", "Xb"),
adj_method = "ANHECOVA", contrast_h="diff")
# estimates from rigr models
bind_rows(slr$coefficients["A1",],
mlr$coefficients["A1",],
mlr_int$coefficients["A1",])[c(1,3,7)] %>%
mutate(Estimator = c("ANOVA", "ANCOVA", "ANHECOVA"),
.before = "Estimate") %>%
knitr::kable(digits = 4,
caption = "rigr estimates of ATE")
# estimates from RobinCar models
bind_rows(robin_slr$contrast$result,
robin_mlr$contrast$result,
robin_mlr_int$contrast$result)[,-1] %>%
mutate(Estimator = c("ANOVA", "ANCOVA", "ANHECOVA"),
.before = "estimate") %>%
knitr::kable(digits = 4,
caption = "RobinCar estimates of ATE",
col.names = c("Estimator", "Estimate", "Robust SE", "Pr(>|t|)"))
```
#### (a)
The ANOVA estimate of ATE, $\bar{Y_1} - \bar{Y_0}$, is equivalent to the coefficient of $A$ from the simple linear model $Y \sim A$; the short sketch below illustrates the equivalence.
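A minimal sketch of this equivalence, using base R's `lm()` rather than the `rigr` call above (illustrative only; not evaluated when knitting):

```{r anova-slr-sketch, echo=TRUE, eval=FALSE}
# Sketch: the coefficient of A from a simple linear model equals the raw
# difference in group means (the ANOVA estimate of the ATE).
fit_slr <- lm(y ~ A, data = dfSim)
coef(fit_slr)["A1"]                                   # SLR coefficient of A
with(dfSim, mean(y[A == "1"]) - mean(y[A == "0"]))    # difference in group means
# The two values agree up to floating-point error.
```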
#### (b) (c)
See Table 1.
#### (d)
The ANHECOVA estimate of ATE, from the model $Y \sim 1 + A + (X-\bar{X}) + A(X-\bar{X})$, is -0.2251; without centering the covariates $X$, the coefficient of $A$ becomes -7.0728e-16, a wildly different result. The discrepancy arises because, in the uncentered interaction model, the coefficient of $A$ estimates the treatment effect at $X = 0$ (which is essentially zero under this data-generating mechanism), not the average treatment effect. Centering the covariates makes the coefficient of $A$ equal to the ANHECOVA estimate of the ATE; a sketch below shows how the ATE can still be recovered from the uncentered fit.
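A minimal sketch of recovering the ATE from the uncentered interaction model by averaging predicted individual treatment effects over the observed covariates, using base R's `lm()` (illustrative only; not evaluated when knitting):

```{r anhecova-uncentered-sketch, echo=TRUE, eval=FALSE}
# Sketch: in the uncentered interaction model, averaging the predicted
# individual treatment effects over the observed covariates recovers the same
# ATE estimate as the coefficient of A in the centered model.
fit_uncentered <- lm(y ~ A * (Xc + Xb), data = dfSim)
pred1 <- predict(fit_uncentered,
                 newdata = dfSim %>% mutate(A = factor(1, levels = c(0, 1))))
pred0 <- predict(fit_uncentered,
                 newdata = dfSim %>% mutate(A = factor(0, levels = c(0, 1))))
mean(pred1 - pred0)   # equals the centered-model coefficient of A (ANHECOVA)
```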
#### (e)
The ATE point estimates from the `rigr` and `RobinCar` models are identical, as are their robust standard errors, except for the ANHECOVA estimator, whose `rigr` robust standard error is slightly smaller. Within both packages, the covariate-adjusted estimators (ANCOVA and ANHECOVA) have smaller standard errors than the unadjusted ANOVA estimator.
## Problem 2
```{r problem-2}
### -----------------------------------------------------------
### Problem 2
# Generate a simulated dataset with n = 500 and binary outcome under simple
# randomization using 1:2 allocation ratio to control and treatment
set.seed(0927)
dfSim <- Fun_datagen(Fun.n = 500, Fun.y.type = "binary", Fun.p = 2/3)
#### (2a) ####
# Calculate the ANOVA estimator (the mean outcome difference between the treated
# and control)
outcome.means <- dfSim %>% group_by(A) %>% summarize(mean.response = mean(y))
ATE.anova <- (outcome.means[2,2] - outcome.means[1,2])[[1]]
#### (2b) ####
# Fit a logistic model of P(Y=1|A,X) and estimate ATE using g-computation
log.reg <- rigr::regress("odds", y ~ A + Xb + Xc, data = dfSim)
treatment_potential <- dfSim %>% mutate(A = factor(1)) %>%
predict(log.reg, newdata = ., type = "response")
control_potential <- dfSim %>% mutate(A = factor(0)) %>%
predict(log.reg, newdata = ., type = "response")
ATE.gcomp <- mean(treatment_potential - control_potential)
#### (2c) ####
# Use the robincar_linear2 and robincar_glm2 functions in the RobinCar R package
# to obtain the ANOVA and g-computation estimators in (a)-(b), as well as their
# robust standard errors. Compare the point estimators and the robust standard
# errors using these two estimation methods
robin_lr <- RobinCar::robincar_glm2(df = dfSim,
treat_col = "A", response_col = "y",
g_family = stats::binomial,
formula = as.formula("y ~ A + Xc + Xb"),
contrast_h = "diff")
ATE.anova.robin <- diff(robin_lr$main$result$estimate)[[1]]
ATE.gcomp.robin <- robin_lr$contrast$result
# all estimates of average treatment effect (ATE)
data.frame(
Estimator = c("ANOVA", "ANOVA (RobinCar)",
"g-computation (rigr)", "g-computation (RobinCar)"),
Estimate = c(ATE.anova, ATE.anova.robin,
ATE.gcomp, ATE.gcomp.robin$estimate[[1]]),
Robust.SE = c(NA, NA, NA, ATE.gcomp.robin$se[[1]])) %>%
knitr::kable(caption = "Estimators of ATE",
digits = 4)
```
```{r problem-2-supplementary, include=F}
### -----------------------------------------------------------
### Problem 2 Supplementary
# storage for ATE estimates across simulation replicates
ATE_estimates <- c()
# iterate over simulation replicates
set.seed(0927)
for (i in 1:1000) {
# fit model to simulated sample
dfSim <- Fun_datagen(Fun.n = 1000, Fun.y.type = "binary", Fun.p = 2/3)
log.reg <- rigr::regress("odds", y ~ A + Xc + Xb, data=dfSim)
# estimate ATE
treatment_potential <- dfSim %>% mutate(A = factor(1)) %>%
predict(log.reg, newdata = ., type = "response")
control_potential <- dfSim %>% mutate(A = factor(0)) %>%
predict(log.reg, newdata = ., type = "response")
ATE.gcomp <- mean(treatment_potential - control_potential)
ATE_estimates <- c(ATE_estimates, ATE.gcomp)
}
# plot ATE estimates and mark RobinCar estimate
data.frame(ATE_estimates) %>%
ggplot(aes(x = ATE_estimates, y = after_stat(density))) +
geom_histogram(fill = "goldenrod") +
geom_vline(xintercept = ATE.gcomp.robin$estimate[[1]], color = "coral4", lwd = 1) +
xlab("ATE estimates") + ylab("Density") +
theme_bw()
### The plot suggests that g-computation with this logistic regression model
### produces approximately unbiased estimates of the ATE (treating the RobinCar
### estimate as the truth)
```
#### (a) (b)
See Table 3.
#### (c)
The two ANOVA-style estimates of ATE are similar, but the raw difference in group means (computed by hand rather than by `RobinCar`) appears to overestimate the ATE relative to the covariate-adjusted estimates. The g-computation estimates from `rigr` and `RobinCar` agree. A sketch below shows how the unadjusted ANOVA estimator and its robust standard error could also be obtained directly from `RobinCar`.
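A minimal sketch of obtaining the unadjusted ANOVA estimator and its robust standard error directly from `RobinCar`, mirroring the `robincar_linear2` call pattern used in Problem 1 (illustrative only; not evaluated when knitting):

```{r problem-2-anova-sketch, echo=TRUE, eval=FALSE}
# Sketch: the unadjusted ANOVA estimator and its robust SE, obtained directly
# from RobinCar with the same call pattern used in Problem 1.
robin_anova <- RobinCar::robincar_linear2(df = dfSim, treat_col = "A",
                                          response_col = "y",
                                          adj_method = "ANOVA",
                                          contrast_h = "diff")
robin_anova$contrast$result   # unadjusted point estimate and robust SE
```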
## Problem 3
```{r problem-3}
### -----------------------------------------------------------
### Problem 3
# Generate a simulated dataset with n = 500 and count outcome under simple
# randomization using 1:2 allocation ratio to control and treatment
set.seed(0927)
dfSim <- Fun_datagen(Fun.n = 500, Fun.y.type = "count", Fun.p = 2/3)
#### (3a) ####
# Calculate the ANOVA estimator (the mean outcome difference between the treated
# and control)
outcome.means <- dfSim %>% group_by(A) %>% summarize(mean.response = mean(y))
ATE.anova <- (outcome.means[2,2] - outcome.means[1,2])[[1]]
#### (3b) ####
# Fit a negative binomial model of Y ~ A + X with an unknown dispersion param
# and estimate ATE using g-computation. Is this g-computation estimator
# (asymptotically) unbiased?
neg.binom <- MASS::glm.nb(y ~ A + Xc + Xb, data=dfSim)
treatment_potential <- dfSim %>% mutate(A = factor(1)) %>%
predict(neg.binom, newdata = ., type = "response")
control_potential <- dfSim %>% mutate(A = factor(0)) %>%
predict(neg.binom, newdata = ., type = "response")
ATE.gcomp <- mean(treatment_potential - control_potential)
#### (3c) ####
# When the g-computation estimator is biased, the robincar_glm2 automatically
# calculates the AIPW estimator which is a debiased g-computation estimator.
# Use the robincar_linear2 and robincar_glm2 functions to obtain the ANOVA
# estimator in (a) and the AIPW estimator using the negative binomial model in
# (b), as well as their robust standard errors. Compare the point estimators and
# the robust standard errors using these two estimation method
outcome.means.robin <- RobinCar::robincar_linear2(df = dfSim,
treat_col = "A",
response_col = "y")$result
ATE.anova.robin <- diff(outcome.means.robin$estimate)[[1]]
robin_nb <- RobinCar::robincar_glm2(df = dfSim, treat_col = "A",
response_col = "y", g_family = "nb",
formula = as.formula("y ~ A + Xc + Xb"))
ATE.gcomp.robin <- RobinCar::robincar_contrast(result = robin_nb,
contrast_h = "diff")$result
# all estimates of average treatment effect (ATE)
data.frame(
Estimator = c("ANOVA", "ANOVA (RobinCar)",
"g-computation (rigr)", "g-computation (RobinCar)"),
Estimate = c(ATE.anova, ATE.anova.robin,
ATE.gcomp, ATE.gcomp.robin$estimate[[1]]),
Robust.SE = c(NA, NA, NA, ATE.gcomp.robin$se[[1]])) %>%
knitr::kable(caption = "Estimators of ATE", digits = 3)
```
```{r problem-3-supplementary, fig.cap="Sampling distribution of the negative binomial g-computation estimator of the ATE across 1,000 simulated trials (the vertical line marks the AIPW estimate)"}
### -----------------------------------------------------------
### Problem 3 Supplementary
# storage for ATE estimates across simulation replicates
ATE_estimates <- c()
# iterate over simulation replicates
set.seed(0927)
for (i in 1:1000) {
# fit model to simulated sample
dfSim <- Fun_datagen(Fun.n = 1000, Fun.y.type = "count", Fun.p = 2/3)
neg.binom <- MASS::glm.nb(y ~ A + Xc + Xb, data=dfSim)
# estimate ATE
treatment_potential <- dfSim %>% mutate(A = factor(1)) %>%
predict(neg.binom, newdata = ., type = "response")
control_potential <- dfSim %>% mutate(A = factor(0)) %>%
predict(neg.binom, newdata = ., type = "response")
ATE.gcomp <- mean(treatment_potential - control_potential)
ATE_estimates <- c(ATE_estimates, ATE.gcomp)
}
# plot ATE estimates and mark RobinCar estimate
data.frame(ATE_estimates) %>%
ggplot(aes(x = ATE_estimates, y = after_stat(density))) +
geom_histogram(fill = "goldenrod") +
geom_vline(xintercept = ATE.gcomp.robin$estimate[[1]], color = "coral4", lwd = 1) +
xlab("ATE estimates") + ylab("Density") +
theme_bw()
### The plot suggests that g-computation with this negative binomial model
### systematically underestimates the ATE (treating the RobinCar AIPW estimate
### as the truth)
```
#### (a)
See Table 4.
#### (b)
See Table 4. Estimates of ATE from g-computation may have large bias when the working model is not a canonical-link GLM (i.e., something other than linear, logistic, or Poisson regression). This negative binomial model produces asymptotically biased estimates that underestimate the ATE, as illustrated in Figure 1 below; an informal sketch of the reason follows.
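A brief sketch of the standard argument (informal, and not part of the problem statement): for a GLM with canonical link (linear, logistic, or Poisson regression) that includes an intercept and the treatment indicator, the maximum-likelihood score equations force the residuals to sum to zero within each arm,
$$\sum_{i:\,A_i = a}\left\{Y_i - \hat{\mu}_a(X_i)\right\} = 0, \qquad a = 0, 1,$$
so the g-computation estimator coincides with the AIPW estimator that uses $\hat{\pi}_a = n_a/n$, and it remains consistent for the ATE even when the working model is misspecified. For the negative binomial model with log link and estimated dispersion $\hat{\theta}$, the score equations instead weight the residuals by the variance function,
$$\sum_{i:\,A_i = a}\frac{Y_i - \hat{\mu}_a(X_i)}{1 + \hat{\mu}_a(X_i)/\hat{\theta}} = 0,$$
so the within-arm residuals need not average to zero and the plug-in g-computation estimator can be asymptotically biased.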
#### (c)
The ANOVA estimates agree with each other and overestimate the ATE, in contrast to the biased g-computation underestimate. The AIPW estimate from `RobinCar` (a debiased g-computation estimator) accounts for the bias induced by the negative binomial working model; a hand-rolled sketch of the AIPW correction is given below.
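A minimal hand-rolled sketch of the AIPW correction under simple randomization (illustrative only; it treats the 1:2 assignment probabilities as known and is not `RobinCar`'s internal implementation; not evaluated when knitting):

```{r aipw-sketch, echo=TRUE, eval=FALSE}
# Sketch: AIPW debiases g-computation by adding inverse-probability-weighted
# residuals from the working model within each arm. The assignment
# probabilities (2/3 treated, 1/3 control) are treated as known here.
pi1 <- 2/3; pi0 <- 1/3
mu1 <- predict(neg.binom, type = "response",
               newdata = dfSim %>% mutate(A = factor(1, levels = c(0, 1))))
mu0 <- predict(neg.binom, type = "response",
               newdata = dfSim %>% mutate(A = factor(0, levels = c(0, 1))))
a <- as.numeric(as.character(dfSim$A))
aipw1 <- mean(mu1 + a / pi1 * (dfSim$y - mu1))        # debiased arm-1 mean
aipw0 <- mean(mu0 + (1 - a) / pi0 * (dfSim$y - mu0))  # debiased arm-0 mean
aipw1 - aipw0                                         # AIPW estimate of the ATE
```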
## Problem 4 (Ungraded)
In this problem, we consider randomization inference for non-binary
treatments. Consider a setting in which we have n units labelled i = 1,
..., *n*, but instead of the usual binary intervention, we have K
possible treatments, i.e. A~i~ $\in$ {1, ..., *K*}. Consider the
generalization of the completely randomized design seen in class, with K
treatments. That is, for fixed values 0 \< *n~1~*, ..., *n~K~* \< *n* with
*n~1~* + ... + *n~K~* = *n*, we assign exactly *n~1~* units to treatment 1,
*n~2~* units to treatment 2, ..., and *n~K~* units to treatment *K*, such
that every assignment satisfying these arm sizes is equally likely.
### (a)
For *k* $\in$ {1, ..., *K*}, determine $P$(A~i~ = *k*).
The probability that unit *i* receives treatment *k* is $P(A_i = k) = \frac{n_k}{n}$, where $n_k$ is the number of units assigned to treatment *k* and $n$ is the total number of units: by the symmetry of the completely randomized design, each unit is equally likely to occupy any of the $n$ assignment slots, and exactly $n_k$ of them correspond to treatment *k*. A small simulation sketch below illustrates this.
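A minimal simulation sketch of this fact (illustrative only; the arm sizes below are arbitrary choices, and the chunk is not evaluated when knitting):

```{r crd-assignment-sketch, echo=TRUE, eval=FALSE}
# Sketch: simulate a completely randomized design with K = 3 arms and fixed
# arm sizes, then check the empirical probability that unit 1 receives each arm.
n <- 12; sizes <- c(3, 4, 5)                               # n_1, n_2, n_3, summing to n
assign_crd <- function() sample(rep(1:3, times = sizes))   # one random assignment
draws <- replicate(1e4, assign_crd()[1])                   # arm received by unit 1
prop.table(table(draws))                                   # empirical P(A_1 = k)
sizes / n                                                  # theoretical n_k / n
```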
### (b)
Assuming SUTVA, how many potential outcomes does each unit have?
Each unit has $K$ potential outcomes, one for each treatment.
### (c)
For *k* $\not =$ *k'* $\in$ {1, ..., *K*} write down $\tau$~kk'~ for the
sample average treatment effect of *k* vs *k'* (with all the potential
outcomes as fixed). That is, contrasting *k* and *k'* instead of 1 and 0
as in the binary case. *(Hint: refer to the statistical theory for
Neyman repeated sampling inference on pages 33-34 in Lecture 2).*
$\tau_{kk'} = \frac{1}{n}\sum_{i=1}^{n}\left\{Y_i(k)-Y_i(k')\right\} = \bar{Y}(k) - \bar{Y}(k')$, where $\bar{Y}(k) = \frac{1}{n}\sum_{i=1}^{n}Y_i(k)$ is the mean of the potential outcomes under treatment *k*.
### (d)
Propose an analog $\hat{\tau}_{kk'}$ of the difference-in-means estimator
for estimating $\tau_{kk'}$.
**Incomplete**
### (e)
Prove that $\hat{\tau}_{kk'}$ is unbiased for $\tau_{kk'}$.
**Incomplete**
\pagebreak
## Code Appendix
```{r allcode, ref.label = knitr::all_labels(), echo=TRUE, eval=FALSE}
```
**End of document.**