---
title: "Case Study 2 - Solution"
output: html_notebook
editor_options:
  chunk_output_type: inline
---
<!-- This file by Jake Thompson is licensed under a Creative Commons Attribution 4.0 International License, adapted from the original work at https://github.com/rstudio/master-the-tidyverse by RStudio. -->
```{r setup}
library(tidyverse)       # data manipulation and plotting
library(broom)           # tidy model output (e.g., augment)
library(rsample)         # cross-validation resampling
library(tidydscompanion) # provides the `admission` data set
library(here)            # project-relative file paths
```
## Task
Reproduce the figures below, created from an analysis of the `admission` data.
![](`r here("resources", "density.png")`)
![](`r here("resources", "confint.png")`)
## Data
In the `tidydscompanion` package there is a data set containing simulated admissions data for a graduate program.
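The variable names used throughout this notebook (`admit`, `gre_v`, `gre_q`, `gre_w`, `gpa`, and `gender`) come from the model formulas in Your Turn 4. A quick look at the data before we start:
```{r}
# Inspect the simulated admissions data
glimpse(admission)
```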
## Your Turn 1
What are the steps needed to create these plots?
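One reasonable outline: (1) create cross-validation resamples of `admission`, (2) fit each candidate model to every analysis set, (3) generate predictions on the matching assessment sets, (4) score those predictions (correct classification rate and Log Loss), and (5) summarize the scores across repeats and folds to build the density and interval plots. The remaining turns walk through exactly these steps.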
## Your Turn 2
* Create a cross-validation resampling object with 10 folds and 10 repeats
* Save the object as `models`
```{r}
set.seed(32011)
models <- vfold_cv(admission, v = 10, repeats = 10)
```
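The result is a tibble with a `splits` list-column plus two identifier columns, `id` (the repeat) and `id2` (the fold); these are the `id` and `id2` columns we group by in later turns.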
## Your Turn 3
* Complete the function that takes in a `splits` object and a `formula` and returns predictions
* The model should be fit using the **analysis** data
* Predictions should be made on the **assessment** data
* Use `mutate` to add columns to the predictions
  * The predicted acceptance is `1` if `.fitted` is greater than 0.5, and `0` otherwise
  * A prediction is correct if the predicted value (from above) matches `admit`
```{r}
holdout_results <- function(splits, formula) {
  # Fit the model to the analysis set
  mod <- glm(formula, data = analysis(splits), family = binomial)
  
  # Save the assessment data
  holdout <- assessment(splits)
  
  # `augment` will save the predictions with the holdout data set
  res <- broom::augment(mod, newdata = holdout, type.predict = "response") %>%
    mutate(prediction = if_else(.fitted > 0.5, 1, 0),
           correct = prediction == admit)
  
  # Return the assessment data set with the additional columns
  res
}
```
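As a quick sanity check (not part of the task), the function can be run on a single resample; the simple formula here is purely illustrative:
```{r}
# Try the function on the first resample with a toy formula
holdout_results(models$splits[[1]], admit ~ gpa) %>%
  head()
```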
## Your Turn 4
* Use `mutate` and `map` to use the `holdout_results` function to fit each model to the cross validation sets
* Formulas for each model have already been saved for your convenience
```{r}
empty <- as.formula(admit ~ 1)
academics <- as.formula(admit ~ gre_v * gre_q + gre_w + gpa)
full <- as.formula(admit ~ gre_v * gre_q + gre_w + gpa + gender)

all_mods <- models %>%
  mutate(empty_mod = map(splits, holdout_results, formula = empty),
         acadm_mod = map(splits, holdout_results, formula = academics),
         compl_mod = map(splits, holdout_results, formula = full))
```
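Each `*_mod` column is now a list of data frames, one per resample, containing that model's assessment-set predictions.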
## Your Turn 5
* Tidy the data so that the models are all in one column (`results`) with an identifier column (`model`)
* Expand the `results` so we can do calculations on the predictions
```{r}
all_preds <- all_mods %>%
  select(-splits) %>%
  pivot_longer(ends_with("_mod"), names_to = "model", values_to = "results") %>%
  unnest(results)
```
## Your Turn 6
* Calculate the percent of applicants correctly classified for each repeat, fold, and model
* Plot the distributions for each model
```{r}
all_preds %>%
  group_by(id, id2, model) %>%
  summarize(pct_cor = mean(correct)) %>%
  ggplot(aes(x = pct_cor, color = model, fill = model)) +
  geom_density(alpha = 0.4)
```
## Your Turn 7
* Calculate the Log Loss for each repeat, fold, and model
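For reference, the quantity computed below is the mean negative log-likelihood of the observed outcomes, where $y_i$ is `admit` and $\hat{p}_i$ is the predicted probability `.fitted`:

$$\text{Log Loss} = -\frac{1}{n} \sum_{i = 1}^{n} \left[y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i)\right]$$

Lower values indicate better calibrated predictions.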
```{r}
all_preds %>%
  group_by(id, id2, model) %>%
  summarize(logloss = -1 * mean((admit * log(.fitted)) + ((1 - admit) * log(1 - .fitted))))
```
## Your Turn 8
* For each model, calculate the median and the 2.5 and 97.5 percentiles of the Log Loss
  * That is, a 95% confidence interval
* Plot the results using `geom_errorbarh`
  * Hint: look at the required aesthetics using `?geom_errorbarh`
```{r}
all_preds %>%
  group_by(id, id2, model) %>%
  summarize(logloss = -1 * mean((admit * log(.fitted)) + ((1 - admit) * log(1 - .fitted))))
```
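Building on the per-fold Log Loss above, here is a trimmed-down sketch of the remaining summary and plot (the column names `med`, `lb`, and `ub` are my own; the fully themed version appears under Final Graphics below):
```{r}
all_preds %>%
  group_by(id, id2, model) %>%
  summarize(logloss = -1 * mean((admit * log(.fitted)) + ((1 - admit) * log(1 - .fitted)))) %>%
  group_by(model) %>%
  summarize(med = median(logloss),
            lb = quantile(logloss, prob = 0.025),
            ub = quantile(logloss, prob = 0.975)) %>%
  ggplot() +
  geom_errorbarh(aes(y = model, xmin = lb, xmax = ub), height = 0) +
  geom_point(aes(x = med, y = model))
```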
## Extra Challenge
How would you add multiple error bars for varying confidence intervals?
Are there other methods we could use to measure the predictive accuracy of the models?
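One answer to the first question appears in the Log Loss graphic below, which layers 80%, 95%, and 99% intervals with different line sizes and colors. For the second, ROC AUC and the Brier score (the mean squared difference between `.fitted` and `admit`) are common alternatives to the classification rate and Log Loss.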
## Final Graphics
```{r include = FALSE}
library(hrbrthemes)  # theme_ipsum_ps() and scale_x_percent()
library(colorblindr) # Okabe-Ito color scales
library(colorspace)  # darken() and lighten()
```
### Correct Classification
```{r}
all_preds %>%
  group_by(id, id2, model) %>%
  summarize(pct_cor = mean(correct)) %>%
  ggplot(aes(x = pct_cor)) +
  geom_density(aes(color = model, fill = model), alpha = 0.4) +
  scale_color_OkabeIto(limits = c("compl_mod", "acadm_mod", "empty_mod"),
                       breaks = c("empty_mod", "acadm_mod", "compl_mod"),
                       labels = c("Empty", "Academics", "Full")) +
  scale_fill_OkabeIto(limits = c("compl_mod", "acadm_mod", "empty_mod"),
                      breaks = c("empty_mod", "acadm_mod", "compl_mod"),
                      labels = c("Empty", "Academics", "Full")) +
  expand_limits(x = c(0.6, 0.9)) +
  scale_x_percent() +
  labs(x = "Correct Classification Rate", y = "Density", color = "Model",
       fill = "Model",
       title = "Distribution of Applicants Correctly Classified",
       subtitle = "Using 10-fold cross validation with 10 repeats") +
  theme_ipsum_ps() +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(override.aes = list(alpha = 1)))
```
### Log Loss
```{r}
all_preds %>%
  group_by(id, id2, model) %>%
  summarize(logloss = -1 * mean((admit * log(.fitted)) + ((1 - admit) * log(1 - .fitted)))) %>%
  group_by(model) %>%
  summarize(med = median(logloss),
            lb_99 = quantile(logloss, prob = 0.005),
            ub_99 = quantile(logloss, prob = 0.995),
            lb_95 = quantile(logloss, prob = 0.025),
            ub_95 = quantile(logloss, prob = 0.975),
            lb_80 = quantile(logloss, prob = 0.100),
            ub_80 = quantile(logloss, prob = 0.900)) %>%
  pivot_longer(contains("_"), names_to = "boundary", values_to = "value") %>%
  separate(boundary, into = c("bound", "level")) %>%
  pivot_wider(names_from = bound, values_from = value) %>%
  mutate(level = paste0(level, "%")) %>%
  ggplot() +
  geom_errorbarh(
    aes(y = model, xmin = lb, xmax = ub, color = level, size = level),
    height = 0
  ) +
  geom_point(aes(x = med, y = model), color = "#E69F00", size = 3) +
  expand_limits(x = c(0.2, 0.7)) +
  scale_y_discrete(limits = c("compl_mod", "acadm_mod", "empty_mod"),
                   labels = c("Full", "Academics", "Empty")) +
  scale_color_manual(values = c(
    `80%` = darken("#56B4E9", .2),
    `95%` = "#56B4E9",
    `99%` = lighten("#56B4E9", .4)
  )) +
  scale_size_manual(values = c(`80%` = 4, `95%` = 3, `99%` = 2)) +
  labs(x = "Median Log Loss", y = NULL, size = "Confidence Level",
       color = "Confidence Level",
       title = "Log Loss of Competing Models",
       subtitle = "Using 10-fold cross validation with 10 repeats") +
  theme_ipsum_ps() +
  theme(legend.position = "bottom")
```