forked from rstudio-education/stat545
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path10_factors.Rmd
276 lines (197 loc) · 10.7 KB
/
10_factors.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
# (PART) Data analysis 2 {-}
# Be the boss of your factors {#factors-boss}
```{r include = FALSE}
source("common.R")
```
<!--Original content: https://stat545.com/block029_factors.html-->
## Factors: where they fit in
We've spent a lot of time working with big, beautiful data frames, like the Gapminder data. But we also need to manage the individual variables housed within.
Factors are the variable type that useRs love to hate. It is how we store truly categorical information in R. The values a factor can take on are called the **levels**. For example, the levels of the factor `continent` in Gapminder are are "Africa", "Americas", etc. and this is what's usually presented to your eyeballs by R. In general, the levels are friendly human-readable character strings, like "male/female" and "control/treated". But *never ever ever* forget that, under the hood, R is really storing integer codes 1, 2, 3, etc.
This [Janus][wiki-janus]-like nature of factors means they are rich with booby traps for the unsuspecting but they are a necessary evil. I recommend you learn how to be the boss of your factors. The pros far outweigh the cons. Specifically in modelling and figure-making, factors are anticipated and accommodated by the functions and packages you will want to exploit.
**The worst kind of factor is the stealth factor.** The variable that you think of as character, but that is actually a factor (numeric!!). This is a classic R gotcha. Check your variable types explicitly when things seem weird. It happens to the best of us.
Where do stealth factors come from? Base R has a burning desire to turn character information into factor. The happens most commonly at data import via `read.table()` and friends. But `data.frame()` and other functions are also eager to convert character to factor. To shut this down, use `stringsAsFactors = FALSE` in `read.table()` and `data.frame()` or -- even better -- **use the tidyverse**! For data import, use `readr::read_csv()`, `readr::read_tsv()`, etc. For data frame creation, use `tibble::tibble()`. And so on.
Good articles about how the factor fiasco came to be:
* [stringsAsFactors: An unauthorized biography][bio-strings-as-factors] by Roger Peng
* [stringsAsFactors = \<sigh\>][blog-strings-as-factors] by Thomas Lumley
## The forcats package
[forcats][forcats-web] is a core package in the tidyverse. It is installed via `install.packages("tidyverse")`, and loaded with `library(tidyverse)`. You can also install via `install.packages("forcats")`and load it yourself separately as needed via `library(forcats)`. Main functions start with `fct_`. There really is no coherent family of base functions that forcats replaces -- that's why it's such a welcome addition.
Currently this lesson will be mostly code vs prose. See the previous lesson for more discussion during the transition.
## Load forcats and gapminder
I choose to load the tidyverse, which will load forcats, among other packages we use incidentally below.
```{r start_factors}
library(tidyverse)
```
Also load gapminder.
```{r message = FALSE, warning = FALSE}
library(gapminder)
```
## Factor inspection
Get to know your factor before you start touching it! It's polite. Let's use `gapminder$continent` as our example.
```{r}
str(gapminder$continent)
levels(gapminder$continent)
nlevels(gapminder$continent)
class(gapminder$continent)
```
To get a frequency table as a tibble, from a tibble, use `dplyr::count()`. To get similar from a free-range factor, use `forcats::fct_count()`.
```{r}
gapminder %>%
count(continent)
fct_count(gapminder$continent)
```
## Dropping unused levels
Just because you drop all the rows corresponding to a specific factor level, the levels of the factor itself do not change. Sometimes all these unused levels can come back to haunt you later, e.g., in figure legends.
Watch what happens to the levels of `country` (= nothing) when we filter Gapminder to a handful of countries.
```{r}
nlevels(gapminder$country)
h_countries <- c("Egypt", "Haiti", "Romania", "Thailand", "Venezuela")
h_gap <- gapminder %>%
filter(country %in% h_countries)
nlevels(h_gap$country)
```
Even though `h_gap` only has data for a handful of countries, we are still schlepping around all `r nlevels(gapminder$country)` levels from the original `gapminder` tibble.
How can you get rid of them? The base function `droplevels()` operates on all the factors in a data frame or on a single factor. The function `forcats::fct_drop()` operates on a factor.
```{r}
h_gap_dropped <- h_gap %>%
droplevels()
nlevels(h_gap_dropped$country)
## use forcats::fct_drop() on a free-range factor
h_gap$country %>%
fct_drop() %>%
levels()
```
__Exercise:__ Filter the gapminder data down to rows where population is less than a quarter of a million, i.e. 250,000. Get rid of the unused factor levels for `country` and `continent` in different ways, such as:
* `droplevels()`
* `fct_drop()` inside `mutate()`
* `fct_dopr()` with `mutate_at()` or `mutate_if()`
```{r, eval = FALSE, include = FALSE}
low_pop <- gapminder %>%
filter(pop < 250000)
low_pop %>%
droplevels() %>%
group_by(country, continent) %>%
summarize(max_pop = max(pop))
low_pop %>%
mutate(country = fct_drop(country),
continent = fct_drop(continent)) %>%
group_by(country, continent) %>%
summarize(max_pop = max(pop))
```
## Change order of the levels, principled {#reorder-factors}
By default, factor levels are ordered alphabetically. Which might as well be random, when you think about it! It is preferable to order the levels according to some principle:
* Frequency. Make the most common level the first and so on.
* Another variable. Order factor levels according to a summary statistic for another variable. __Example:__ order Gapminder countries by life expectancy.
First, let's order continent by frequency, forwards and backwards. This is often a great idea for tables and figures, esp. frequency barplots.
```{r}
## default order is alphabetical
gapminder$continent %>%
levels()
## order by frequency
gapminder$continent %>%
fct_infreq() %>%
levels()
## backwards!
gapminder$continent %>%
fct_infreq() %>%
fct_rev() %>%
levels()
```
These two barcharts of frequency by continent differ only in the order of the continents. Which do you prefer?
```{r, fig.show = 'hold', out.width = '49%', echo = FALSE}
p <- ggplot(gapminder, aes(x = continent)) +
geom_bar() +
coord_flip()
p
gap_tmp <- gapminder %>%
mutate(continent = continent %>% fct_infreq() %>% fct_rev())
p <- ggplot(gap_tmp, aes(x = continent)) +
geom_bar() +
coord_flip()
p
```
Now we order `country` by another variable, forwards and backwards. This other variable is usually quantitative and you will order the factor according to a grouped summary. The factor is the grouping variable and the default summarizing function is `median()` but you can specify something else.
```{r}
## order countries by median life expectancy
fct_reorder(gapminder$country, gapminder$lifeExp) %>%
levels() %>% head()
## order accoring to minimum life exp instead of median
fct_reorder(gapminder$country, gapminder$lifeExp, min) %>%
levels() %>% head()
## backwards!
fct_reorder(gapminder$country, gapminder$lifeExp, .desc = TRUE) %>%
levels() %>% head()
```
Example of why we reorder factor levels: often makes plots much better! When a factor is mapped to x or y, it should almost always be reordered by the quantitative variable you are mapping to the other one.
```{r include = FALSE, eval = FALSE}
boxplot(Sepal.Width ~ Species, data = iris)
boxplot(Sepal.Width ~ fct_reorder(Species, Sepal.Width), data = iris)
boxplot(Sepal.Width ~ fct_reorder(Species, Sepal.Width, .desc = TRUE), data = iris)
```
Compare the interpretability of these two plots of life expectancy in Asian countries in 2007. The *only difference* is the order of the `country` factor. Which one do you find easier to learn from?
```{r alpha-order-silly, fig.show = 'hold', out.width = '49%'}
gap_asia_2007 <- gapminder %>% filter(year == 2007, continent == "Asia")
ggplot(gap_asia_2007, aes(x = lifeExp, y = country)) + geom_point()
ggplot(gap_asia_2007, aes(x = lifeExp, y = fct_reorder(country, lifeExp))) +
geom_point()
```
Use `fct_reorder2()` when you have a line chart of a quantitative x against another quantitative y and your factor provides the color. This way the legend appears in some order as the data! Contrast the legend on the left with the one on the right.
```{r legends-made-for-humans, fig.show = 'hold', out.width = '49%'}
h_countries <- c("Egypt", "Haiti", "Romania", "Thailand", "Venezuela")
h_gap <- gapminder %>%
filter(country %in% h_countries) %>%
droplevels()
ggplot(h_gap, aes(x = year, y = lifeExp, color = country)) +
geom_line()
ggplot(h_gap, aes(x = year, y = lifeExp,
color = fct_reorder2(country, year, lifeExp))) +
geom_line() +
labs(color = "country")
```
## Change order of the levels, "because I said so"
Sometimes you just want to hoist one or more levels to the front. Why? Because I said so. This resembles what we do when we move variables to the front with `dplyr::select(special_var, everything())`.
```{r}
h_gap$country %>% levels()
h_gap$country %>% fct_relevel("Romania", "Haiti") %>% levels()
```
This might be useful if you are preparing a report for, say, the Romanian government. The reason for always putting Romania first has nothing to do with the *data*, it is important for external reasons and you need a way to express this.
## Recode the levels
Sometimes you have better ideas about what certain levels should be. This is called recoding.
```{r}
i_gap <- gapminder %>%
filter(country %in% c("United States", "Sweden", "Australia")) %>%
droplevels()
i_gap$country %>% levels()
i_gap$country %>%
fct_recode("USA" = "United States", "Oz" = "Australia") %>% levels()
```
__Exercise:__ Isolate the data for `"Australia"`, `"Korea, Dem. Rep."`, and `"Korea, Rep."` in the 2000x. Revalue the country factor levels to `"Oz"`, `"North Korea"`, and `"South Korea"`.
## Grow a factor
Let's create two data frames, each with data from two countries, dropping unused factor levels.
```{r}
df1 <- gapminder %>%
filter(country %in% c("United States", "Mexico"), year > 2000) %>%
droplevels()
df2 <- gapminder %>%
filter(country %in% c("France", "Germany"), year > 2000) %>%
droplevels()
```
The `country` factors in `df1` and `df2` have different levels.
```{r}
levels(df1$country)
levels(df2$country)
```
Can you just combine them?
```{r}
c(df1$country, df2$country)
```
Umm, no. That is wrong on many levels! Use `fct_c()` to do this.
```{r}
fct_c(df1$country, df2$country)
```
__Exercise:__ Explore how different forms of row binding work behave here, in terms of the `country` variable in the result.
```{r end_factors}
bind_rows(df1, df2)
rbind(df1, df2)
```
```{r links, child="links.md"}
```