-
Notifications
You must be signed in to change notification settings - Fork 66
/
01-regression.Rmd
163 lines (107 loc) · 5.43 KB
/
01-regression.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
# Linear regression warm-up
# Goals
- Make sure everyone is comfortable with dplyr and ggplot2
- Practice interpreting basic linear regression models
# The Gapminder data set
We're going to work with some data from the Gapminder project. This is from a project led by Han Rosling to bring data to issues of health, economics, and global development. If you haven't heard of Han Rosling, you should see some of his TED talks, e.g. <http://www.ted.com/talks/hans_rosling_and_the_magic_washing_machine>.
We'll work with a few variables from their full data sets (<https://www.gapminder.org/>), which have been included in an R package, `gapminder`, put together by Jenny Bryan at UBC <https://github.com/jennybc/gapminder>.
```{r, message=FALSE}
library(tidyverse)
library(broom) # for tidy()
d <- gapminder::gapminder # load the data into `d`
```
Let's look at the data:
```{r}
d
```
The column headers should be self-explanatory.
# Modeling life expectancy over time
Let's start with a simple model asking how global life expectancy has changed over time.
Use ggplot2 to visualize life expectancy (`lifeExp`) on the y-axis and time (`year`) on the x-axis.
Bonus: If you want to get fancy, you might consider lowering the `alpha` argument in `geom_point()` or adding some jittering to help with the overplotting (see `?position_jitter`).
```{r}
ggplot(d, aes(year, lifeExp)) + # exercise
geom_point(alpha = 0.5, position = position_jitter(width = 0.6)) # exercise
```
Try fitting a linear regression to predict `lifeExp` based on `year`:
```{r}
m1 <- lm(
lifeExp ~ year, data = d) # exercise
summary(m1)
coef(m1)
tidy(m1)
```
How do we interpret the intercept?
How do we interpret the slope? How would you fill in the following blank? With the passing of each year, life expectancy increases by `___` years.
## Working with a transformed predictor and a log-transformed response variable
What if we created a new variable `decade_1990` as the following:
```{r}
d <- mutate(d, decade_1990 = (year - 1990)/10)
ggplot(d, aes(decade_1990, log(lifeExp))) +
geom_point(alpha = 0.5, position = position_jitter(width = 0.1))
```
The choice of the year 1990 was completely arbitrary simply to make the intercept in the following models more meaningful. We divided by 10 so that the effects are per decade instead of per year. This keeps us from working with really tiny numbers.
And then we fit the following model. Note that this time, for the purposes of this exercise, we are modeling the *log* of life expectancy as the response.
```{r}
m2 <- lm(log(lifeExp) ~ decade_1990, data = d)
summary(m2)
coef(m2)
```
What do the intercept and slope represent now?
How would you answer the following questions based on the coefficients?
In 1990, the global expected life expectancy was:
```{r}
exp(coef(m2)[1]) # exercise
```
For every decade, global life expectancy increased on average `___` fold or `___`%.
```{r}
round(exp(coef(m2)[2]), 2) # exercise
round(exp(coef(m2)[2]) * 100 - 100, 0) # exercise
```
# Adding an interaction
Run the following model. I have added an interaction term with the factor variable `continent`:
```{r}
m3 <- lm(log(lifeExp) ~ decade_1990 * continent, data = d)
summary(m3)
```
The continent Africa is the base level of the factor variable `continent`. How would you answer the following questions? You don't need to fill in the actual R code, unless you'd like to. Just know how you would obtain these values.
What was the mean life expectancy in Africa in 1990?
```{r}
exp(coef(m3)["(Intercept)"]) # exercise
```
What was the mean life expectancy in Asia in 1990?
```{r}
exp(coef(m3)["(Intercept)"] + coef(m3)["continentAsia"]) # exercise
```
Over a decade in Europe, what is the expected fold increase in life expectancy?
```{r}
exp(coef(m3)["decade_1990"] + coef(m3)["decade_1990:continentEurope"]) # exercise
```
# Modeling life expectancy as a function of GDP per capita
Use functions from the dplyr package combined with the pipe `%>%` symbol to calculate the mean life expectancy and the mean GDP per capita for each country. Call your new columns `lifeExp` and `gdpPercap`. I've started it for you: (hint, you'll need to use the `summarise()` function)
```{r}
life <- d %>%
group_by(country) %>%
summarise(lifeExp = mean(lifeExp), gdpPercap = mean(gdpPercap)) # exercise
```
Let's look at the summarized data:
```{r}
ggplot(life, aes(log(gdpPercap), log(lifeExp))) + geom_point()
```
Fit a linear regression predicting log life expectancy from log GDP per capita with our new data set `life`.
```{r}
m4 <- lm(log(lifeExp) ~ log(gdpPercap), data = life) # exercise
summary(m4)
coef(m4)
```
How do we interpret the intercept and slope in this model?
How would you fill in the following blank?
For a 1% increase in GDP per capita, we expect a `___`% increase in life expectancy.
```{r}
round(coef(m4)[2], 2) # exercise %>%
```
# Addendum
These models are very basic and likely violate a number of statistical assumptions. (This exercise was focused only on interpretation, and we never even checked model fit.) We will come back to this data set later and fit some more realistic models.
For now, what are some of the major assumptions in the above models that are probably not valid or not ideal? How would you start checking those assumptions?
# Bonus
If you got this far, great! Try plotting the data creatively with ggplot2. Dig into the multiple dimensions. There are many patterns to be found in the data set and we will be coming back to it again later.