-
Notifications
You must be signed in to change notification settings - Fork 0
/
r4da-week06.qmd
718 lines (565 loc) · 24.9 KB
/
r4da-week06.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
---
title: "Lab 6: Regression for Describing"
author: "Viktoriia Semenova"
editor: source
date: "October 18, 2023"
format:
html:
theme: slides.scss
code-fold: false
code-tools: true
code-summary: "Show the code"
code-overflow: wrap
pdf:
toc: true
colorlinks: true
geometry:
- top=30mm
- left=30mm
highlight-style: github
---
```{r setup}
#| include: false
# packages needed
# put all packages we use in a vector
p_needed <- c(
"tidyverse", # shortcut for ggplot2, dplyr, readr
"janitor", # clean_names for datasets
"visdat", # visual inspection of datasets
"janitor", # clean variable names
"modelsummary", # descriptive table
"styler", # to style the code
"broom",
"magrittr", # for more pipes
"remotes", # to install packages from GitHub
"styler" # to style the code
)
# check if they are already installed, install if not installed
lapply(
p_needed[!(p_needed %in% rownames(installed.packages()))],
install.packages
)
# install a package from GitHub (if not installed already)
if (!"equatiomatic" %in% rownames(installed.packages())) {
remotes::install_github("datalorax/equatiomatic")
}
# load the packages
lapply(p_needed, library, character.only = TRUE)
# same theme for all plots
ggplot2::theme_set(ggplot2::theme_minimal())
```
## Goals for Today
1. Regression with `R`
- Fitting a linear model
- Interpreting coefficients for continuous predictors
- Interpreting coefficients for categorical predictors\
2. Regression Diagnostics
3. Regression Output
- Make a well-formatted regression table with `modelsummary`
- Write up regression equations nicely with `equatiomatic`
{{< pagebreak >}}
## Dataset
We will be working with data on student evaluations of instructors' beauty and teaching quality for several courses at the University of Texas. The teaching evaluations were conducted at the end of the semester, and the beauty judgments were made later, by six students who had not attended the classes and were not aware of the course evaluations.
| Variable | Description |
|:------------------:|----------------------------------------------------|
| `eval` | Average course rating: (1) very unsatisfactory - (5) excellent |
| `beauty` | Composite beauty rating of 6 students', with 1 being lowest and 10 highest |
| `female` | Instructor's a female |
| `age` | Instructor's age |
| `minority` | Minority status |
| `nonenglish` | Instructor's non-English speaking native |
| `lower` | Lower division course (years 1-2) |
| `rank` | Professor's tenure rank (tenured, tenure track, teaching) |
| `course_id` | Unique course ID |
| `students` | Number of students per class |
| `formal` | Outfit of professor in picture: not formal, formal |
| `blkandwhite` | color of professor's picture: color, black & white |
| `bty_[student_id]` | Beauty rating of professor from one of 6 students |
| `tenure` | Instructor's on tenure track |
| `tenured` | Instructor's on tenure track and already got tenure |
> Is this an observational study or an experiment? The original research question posed in the paper is whether beauty leads directly to the differences in course evaluations. Given the study design, is it possible to answer this question as it is phrased? If not, rephrase the question.
```{r}
evals <- read_csv("data/evals.csv") %>%
clean_names()
glimpse(evals)
```
```{r}
# Descriptive stats table for our datasets
evals %>%
rename_with(str_to_title) %>% # capitalize all columns for table
select(-Course_id, -Prof) %>% # remove ID variables
modelsummary::datasummary_skim(
title = "Descriptive Statistics for Course Evaluations Dataset",
histogram = FALSE
)
```
## Plotting the Relationship
The fundamental phenomenon suggested by the study is that better looking teachers are evaluated more favorably. Let's create a scatterplot to see if this appears to be the case:
```{r}
ggplot(evals, aes(x = beauty, y = eval)) +
geom_point()
```
Before we draw conclusions about the trend, compare the number of observations in the data frame with the approximate number of points on the scatterplot. Is anything awry?
Let's replot the scatterplot, but this time use `geom_jitter` and changing the transparency of the data points to handle the potential overplotting:
```{r}
ggplot(evals, aes(beauty, eval)) +
geom_jitter(
size = 2,
alpha = 0.5,
width = 0.1
) +
labs(
x = "Beauty",
y = "Average Teaching Evaluation"
) +
# geom_smooth(
# method = "lm",
# se = FALSE
# ) +
ylim(1, 5) +
xlim(1, 10)
```
## Fitting a Linear Regression
### Continuous Predictor
Now let's fit the regression and explore the output:
```{r}
m_eval1 <- lm(eval ~ beauty, data = evals)
summary(m_eval1)
```
> What does the value of intercept tell us?\
> What does the value of *beauty* coefficient estimate tell us?
You can also write the same code as follows. Both options are valid, but I will be using the latter one as it is much easier to work with in `tidyverse` functions:
```{r}
lm(evals$eval ~ evals$beauty) %>%
summary()
```
Summary is a useful command for depicting the results when you are woring in `R`, but for plotting and doing manipulations with the estimates, it is much easier to have them in the dataframe format. `tidy()` from `broom` gives us exactly that:
```{r}
tidy(m_eval1)
```
### Practice: Explore the Effect of Age
Plot the relationship between age and course evaluations. Estimate the effect using a linear regression. Interpret the coefficients from the model.
```{r}
```
### Categorical Predictor
Now let's deviate and explore if there is an effect of gender on course evaluations (ignoring the `beauty` and all other variables for now):
```{r}
evals %>%
count(female)
m_eval2 <- lm(eval ~ female, data = evals)
summary(m_eval2)
```
> What does the value of intercept tell us?\
> What does the value of *female* coefficient estimate tell us?
Let's also plot this relationship. Is there anything you'd change about this plot?
```{r}
ggplot(evals, aes(female, eval)) +
geom_jitter(
size = 2,
alpha = 0.5,
width = 0.1
) +
labs(
x = "Female",
y = "Average Teaching Evaluation"
) +
# geom_smooth(
# method = "lm",
# se = FALSE
# ) +
ylim(1, 5)
```
Now let's go one step further and explore another categorical predictor's effect on evaluations. This time, let's look at a categorical/ordinal variable `rank`.
```{r}
evals %>%
count(rank)
```
Let's see what is the default behavior when we put a character variable as a predictor in `R`:
```{r}
m_eval3 <- lm(eval ~ rank, data = evals)
summary(m_eval3)
```
> Why do we only have two categories now? How do we interpret these coefficients?
The default approach creates dummy variables for each category and then includes all but one of these dummy variables into the regression equation (stepping into multiple linear regression territory). Each coefficient for the category-variable thus represents the difference to the baseline category, which missing from the coefficients list but is represented in the intercept value. To illustrate this, let's do the same procedure as `R` but by hand:
```{r}
evals <- evals %>%
mutate(
tenuretrack_d = if_else(rank == "tenure track", 1, 0),
tenure_d = if_else(rank == "tenured", 1, 0),
teaching_d = if_else(rank == "teaching", 1, 0),
)
evals %>%
count(teaching_d, tenuretrack_d, tenure_d, rank)
```
Now if we place all of the new dummy variables in the regression equation, we get almost the same result as above. Why do we see the row with `NA`s for `teaching_d`?
```{r}
m_eval4 <- lm(eval ~ tenuretrack_d + tenure_d + teaching_d, data = evals)
summary(m_eval4)
```
This is a better way to write that same thing (omitting the baseline catgory explicitly):
```{r}
m_eval4 <- lm(eval ~ tenuretrack_d + tenure_d, data = evals)
summary(m_eval4)
```
> Can we use a different baseline category? If yes, how? Will the coefficients change? What about predictions?
This default approach makes the most sense when we have the categories that are not ranked, like nominal variables (such as parties or conflict types). This is also what `R` does by default with character variables. Sometimes, you might want to know the difference between particular categories in the middle, like between *Tenure Track* and *Tenured*. We could change the baseline in this case so that the coefficients reflect the effect of interest for us. But another way is to create dummy variables slightly differently (we already have these variables in the dataset):
```{r}
evals %>%
count(rank, teaching_d, tenuretrack_d, tenure, tenure_d, tenured)
```
As a result, our predicted values for the dependent variable will be calculates as follows:
| | $D_1$ | $D_2$ | |
|-------------------|:-----:|:-----:|----------------------------------------|
| Teaching | 0 | 0 | $\hat Y = \beta_0$ |
| Tenure Track | 1 | 0 | $\hat Y = \beta_0 + \beta_1$ |
| Tenured | 1 | 1 | $\hat Y = \beta_0 + \beta_1 + \beta_2$ |
```{r}
m_eval5 <- lm(eval ~ tenure + tenured, data = evals)
summary(m_eval5)
```
> What is the baseline category here?
> What do the coefficients represent?
Crucially, regression is about _average_ trends in the data. Thus, when interpreting regression coefficients, it is important to indicate this fact:
> A one unit increase in *predictor* is associated with, on average, *coefficient for predictor* change in the dependent variable.
## Predictions with New Data
Let's go back to our model with beauty and course evaluations and generate some fitted values (along with other stats). `augment()` provides us nicely with the fitted values, i.e. predictions for the data points *existing* in our dataset.
```{r}
broom::augment(m_eval1)
```
Having fitted values, we can plot the line without `geom_smooth()`:
```{r}
# What is the difference between these plots?
broom::augment(m_eval1) %>%
ggplot(aes(x = beauty, y = eval)) +
geom_jitter(
size = 2,
alpha = 0.5,
width = 0.1
) +
geom_point(aes(y = .fitted)) +
labs(
x = "Beauty",
y = "Average Teaching Evaluation"
) +
ylim(1, 5) +
xlim(1, 10)
broom::augment(m_eval1) %>%
ggplot(aes(x = beauty, y = eval)) +
geom_jitter(
size = 2,
alpha = 0.5,
width = 0.1
) +
geom_line(aes(y = .fitted)) +
labs(
x = "Beauty",
y = "Average Teaching Evaluation"
) +
ylim(1, 5) +
xlim(1, 10)
```
To recap the process of calculating fitted values, let's obtain fitted values by hand:
```{r}
# manually
# say we want to predict the evaluations of someone with beauty score of, say, 5
m_eval1$coefficients # or coef(m_eval1)
manual_pred <- m_eval1$coefficients[1] + m_eval1$coefficients[2] * 5
manual_pred # note the name of the object is misleading
manual_pred %>% as.numeric() # remove the name
# or for a sequence of beauty scores from 1 to 10 at once
manual_pred <- m_eval1$coefficients[1] + m_eval1$coefficients[2] * seq(1, 10, by = 1)
manual_pred
# with predict() function
# create a tibble with the columns we need (using the basic tibble function)
evals_new <- tibble(
beauty = seq(1, 10, by = 1)
)
# create a tibble with the columns we need
# expand_grid will create all possible combinations of variable values
# will be useful with more than one variable later
evals_nes <- expand_grid(
beauty = seq(1, 10, by = 1)
)
evals_new
# add predictions to this data
evals_new <- evals_new %>%
mutate(pred = predict(m_eval1, newdata = .))
evals_new
# check that manual predictions are the same as predict()
manual_pred == evals_new$pred
```
> Why did we choose `by = 1` in `seq()` function? Can we set it to another value?
```{r}
ggplot(evals_new, aes(x = beauty, y = pred)) +
geom_point() +
geom_line()
```
### Practice: Predictions for Categorical Variables
Work with the model for gender's effect on evaluations (`m_eval2`). Calculate the predictions manually for both males and females:
```{r}
```
## Residuals & Diagnostics
Linear regression comes with a number of assumptions, and inspecting residuals allows us to check whether they are met.
A residuals plot is a scatterplot of the regression residuals against the explanatory variable $X$ or the predicted (fitted) values $\hat Y$. Residual plot is a diagnostic plot as it helps us to detect patterns in the residuals. Patterns in residuals signal that systematic influences on $Y $still have not been captured by our model, or that our model misrepresents the data, or that errors do
not have a constant variance. Ideally, residuals plots should look as if the pattern was generated by pure chance.
Let's have a look at the residuals plot (recall that `augment()` calculates them for us):
```{r}
augment(m_eval1) %>%
ggplot(aes(x = beauty, y = .resid)) +
geom_jitter(alpha = 0.5) +
geom_hline(
yintercept = 0,
color = "#800010"
) +
labs(
x = "Beauty Score",
y = "Residuals"
) +
geom_smooth(se = F) # loess can help identify patterns
```
Let's think about the constant variance, a.k.a. homoskedasticity, assumption:
> What can we say about variance of residuals across the values of our independent variable, *Beauty Score*? Do we seem to be able to predict evaluations for some values of *Beauty Score* better (i.e. they have smaller residuals) than for others (i.e. the residuals are larger)?
Now about the conditional mean assumption:
> Do the residuals, across each value of *Beauty score*, seem to sum up to zero (cancel each other out, are distributed equally above and below the zero line)?
When inspecting the residuals, it is important to identify observations that may be dominating the estimates, and to assess how results change when such dominating data points are dropped. Here is a reminder on the terminology here:
- *Outlier*: An observation with a *large residual*; may indicate a sample peculiarity, a data entry error, or some other problems.
- *Leverage*: An observation with an extreme value on an *independent* variable is said to have high leverage. Leverage is a measure of how far an independent variable deviates from its mean. These leverage points may have an effect on the estimate of regression
coefficients.
- *Influence*: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients.
Now let's check the model for these influential points:
> Which observations (if any) seem to have large residuals, i.e. are potential outliers?
> Which observations seem to be leverage points (if any)?
There is also another useful quantity to evaluate influence, *Cook’s distance*: it is a measure of influence, which aggregates outlier and leverage properties together. The measure of Cook's distance is based on the difference in predicted (fitted) values with
and without particular observations. `augment()` calculates it for us:
```{r}
m_eval1 %>%
augment() %>%
pull(.cooksd) %>%
summary()
```
There are multiple interpretations for this measure, with some researchers recommending to be suspicious of observations with $D_i > 1$ and others goind as far as $\frac{4}{n}$, with $n$ being the sample size.
```{r}
m_eval1 %>%
augment() %>%
filter(.cooksd > 1)
m_eval1 %>%
augment() %>%
filter(.cooksd > 4 / n())
augment(m_eval1) %>%
mutate(large_cd = if_else(.cooksd > 4 / n(), TRUE, FALSE)) %>%
ggplot(aes(x = beauty, y = .resid, color = large_cd)) +
geom_jitter(alpha = 0.5) +
geom_hline(
yintercept = 0,
color = "#800010"
) +
labs(
x = "Beauty Score",
y = "Residuals",
color = "Cook's Distance above 4/n"
) +
theme(legend.position = "bottom") +
scale_color_viridis_d()
```
In general, if we have observations with "large" Cook's distance, we either want to re-specify our model to account for outliers (do they have something in common?) or to exclude them (and acknowledge that) and re-estimate the model to see if the results change.
With this being a bivariate regression, it is quite likely that there are other variables that are not accounted for. For the sake of demonstration, however, we will re-estimate the model here, dropping the observations with high Cook's distance:
```{r}
evals_sample <- m_eval1 %>%
augment() %>%
filter(.cooksd <= 4 / n())
# re-estimate the model
lm(eval ~ beauty, data = evals_sample) %>%
summary()
# original model
summary(m_eval1)
```
## Multiple Linear Regression
```{r}
m_eval6 <- lm(eval ~ beauty + female, data = evals)
res_eval6 <- tidy(m_eval6)
```
> What is missing in this interpretation?\
> - A one-unit increase in `beauty` is associated with, on average, `r round(m_eval6$coefficients["beauty"], 3)` increase in teaching evaluation score.
### Visualizing MLR with Continuous and Categorical Predictor
Let's visualize this model. What seems off on this plot?
1. The regression line does not correspond to the model we estimated (i.e. the line is just the bivariate relationship between y and x)
2. Female seems like a continuous variable (which is not the case in this dataset)
```{r}
ggplot(evals, aes(beauty, eval, color = female)) +
geom_jitter(
size = 2,
alpha = 0.5,
width = 0.1
) +
labs(
x = "Beauty",
y = "Average Teaching Evaluation"
) +
geom_smooth(method = "lm", fullrange = TRUE, se = F)
```
Changing the type of the `female` variable to character or to factor fixed the latter problem, and we have two lines now, one for females and one for males. But according to our model specification, the lines should be parallel, i.e. the effect of beauty is assumed to be the same for both males and females:
```{r}
evals %>%
mutate(female = if_else(female == 1, "Female", "Male")) %>%
ggplot(aes(beauty, y = eval, color = (female))) +
geom_jitter(
size = 2,
alpha = 0.5,
width = 0.1
) +
labs(
x = "Beauty",
y = "Average Teaching Evaluation"
) +
scale_color_viridis_d()
```
We can manually draw to separate lines with `geom_abline()` like this:
```{r}
evals %>%
mutate(female = if_else(female == 1, "Female", "Male")) %>%
ggplot(aes(beauty, y = eval, color = (female))) +
geom_jitter(
size = 2,
alpha = 0.5,
width = 0.1
) +
labs(
x = "Beauty",
y = "Average Teaching Evaluation"
) +
geom_abline(
intercept = res_eval6$estimate[1],
slope = res_eval6$estimate[2],
color = viridis::viridis(2)[1] # set matching color manually
) +
geom_abline(
intercept = res_eval6$estimate[1] + res_eval6$estimate[3],
slope = res_eval6$estimate[2],
color = viridis::viridis(2)[2] # set matching color manually
) +
scale_color_viridis_d() +
labs(color = "") +
theme(legend.position = "top") +
scale_x_continuous(limits = c(1, 10), breaks = 1:10) +
ylim(c(1, 5.5)) + # why not c(1, 5)?
theme(panel.grid.minor = element_blank())
```
Or we can use `geom_line()` and `predict()` function to plot the parallel lines much quicker. Note that in this case, the regression lines are only plotted for the range of data we have, unlike the `geom_abline()`. The latter does what is called *extrapolation* outside of the range of our data.
```{r}
evals %>%
mutate(female = if_else(female == 1, "Female", "Male")) %>%
ggplot(aes(beauty, y = eval, color = (female))) +
geom_jitter(
size = 2,
alpha = 0.5,
width = 0.1
) +
labs(
x = "Beauty",
y = "Average Teaching Evaluation",
title = "Relationship between Instructor Beauty and Course Evaluations",
color = "Instructor is"
) +
geom_line(aes(y = predict(m_eval6))) + # plots the parallel lines
scale_color_viridis_d() +
theme(legend.position = "top") +
scale_x_continuous(limits = c(1, 10), breaks = 1:10) +
ylim(c(1, 5.5)) + # why not c(1, 5)?
theme(panel.grid.minor = element_blank())
```
### Models with Multiple Predictors
```{r}
m_eval7 <- lm(eval ~ beauty + female + age + nonenglish, data = evals)
tidy(m_eval7)
```
- How do we interpret these coefficients?
> For a male English instructor with beauty score of zero and of age zero, the teaching evaluations are expected to be 3.89. This value is meaningless as age of zero is not possible given the assumed DGP.\
> A one point increase in the beauty score is, on average, associated with a 0.08 increase in course evaluations, holding all else constant. Course of female instructors are, on average, evaluated 0.21 points lower than courses of male instructors, holding age, beauty, and native language constant.\
> Having English as their native language is associated with an average increase of 0.33 points in course evaluations, holding all else constant.\
> On average, every additional 10 years in age of the professor are associated with a 0.02 decrease in course evaluations, holding the beauty, sex, and native language constant.\
------------------------------------------------------------------------
## Working with Model Output
### Well-formatted Table
There are plenty of packages that you can use to make nice regression tables. We will primarily use `modelsummary`, which works well for both HTML and PDF formats and requires no extra manipulations.
```{r}
modelsummary(
models = list(m_eval1, m_eval2, m_eval5),
title = "Linear Regression Estimates",
gof_omit = "IC|F|Log.",
align = c("lccc"),
coef_map = c( # renames + selects + orders coefs in the table
"beauty" = "Beauty Score",
"female" = "Female",
"tenure" = "Rank: Tenure Track",
"tenured" = "Rank: Tenured",
"(Intercept)" = "Intercept"
),
notes = "Notes: Standard errors in parentheses. Baseline category for Rank variable is Teaching.",
fmt = 2, # round to 3 digits after zero
statistic = NULL, # have no extra columns for SE
estimate = "{estimate} ({std.error})", # print coef and SE in one line
)
```
### Regression equation
While writing papers (in this class and beyond), you will be asked to specify your regression' systematic component. It is fairly easy to do with Latex \[Lah-tech\], a special typesetting environment perfect for writing up formulas and equations. Here is how you would write the systematic component of the last model we worked with. You would write something like...
> I estimate an OLS model with the following systematic component: $$
> \operatorname{\widehat{Evaluation}} = \hat \beta_{0} + \hat\beta_{1}\text{Beauty Score}
> $$
This usually goes in the *Analysis* section of the research paper.
Good news: some smart people also wrote a package `equatiomatic` that extracts the model equations from model objects directly:
```{r}
#| results: asis
#| echo: fenced
equatiomatic::extract_eq(m_eval1, intercept = "beta", ital_vars = T)
```
Note a few things:
- this equation is different from what we wrote above: it includes the error term, so it covers both the systematic and the stochastic components of the model
- we need to tell `R` to treat the output as raw text by specifying chunk option `#| results: asis` for the equation to appear correctly
- this works well with HTML output, but sometimes can cause troubles in PDF. A quick fix would be to start with the output provided, edit it, and insert it manually
## Resources
- Bivariate Regression Analysis and Diagnostics in <https://openintro-ims2.netlify.app/07-model-slr>
- Multiple Linear Regression Analysis in <https://openintro-ims2.netlify.app/08-model-mlr>
## Data Prep Example: Course Evaluations Data
```{r}
read_csv("http://www.stat.ucla.edu/~handcock/216/examples/beauty/ProfEvaltnsBeautyPublic.csv") %>%
rowwise() %>%
mutate(
class0 = (sum(c_across(class1:class30))),
class0 = ifelse(class0 == 1, 0, 1)
) %>%
ungroup() %>%
relocate(class0, .before = class1) %>%
gather(course_id, val, class0:class30) %>%
filter(val != 0) %>%
dplyr::select(-val) %>%
mutate(course_id = as.numeric(str_extract(course_id, "\\d.*"))) %>%
rowwise() %>%
mutate(beauty = mean(c_across(starts_with("beauty")))) %>%
ungroup() %>%
mutate(
rank = case_when(
tenuretrack == 1 & tenured == 1 ~ "tenured",
tenuretrack == 1 & tenured == 0 ~ "tenure track",
tenuretrack == 0 & tenured == 0 ~ "teaching"
)
) %>%
rename(
prof = profnumber,
eval = courseevaluation,
tenure = tenuretrack
) %>%
dplyr::select(
-starts_with("bty"),
-didevaluation,
-fulldept,
-multipleclass,
-onecredit,
-percentevaluating,
-profevaluation,
) %>%
rename_with(
.fn = ~ paste0("bty_", str_remove(.x, "beauty")),
.cols = beautyf2upper:beautymupperdiv
) %T>%
write_csv("data/evals.csv")
```